# Karamoja Crop Yield Insights



   - By: ***Israel Wasike Kahayi***

## Problem Statement

In the Karamoja region, NGOs face challenges in assessing agricultural productivity due to limited visibility into crop yields across various districts and sub-counties. Despite providing technical support and farm inputs to farmers, there is a significant gap in understanding the spatial distribution and variation of crop yields, particularly for sorghum and maize. This lack of comprehensive data impedes the effective prioritization of interventions and resource allocation. To address this issue, the goal of this project is to develop an interactive visualization tool that leverages data from the 2017 crop season to offer clear insights into crop yield patterns. This tool will help NGOs make informed decisions, optimize their support strategies, and ultimately improve food security in the region.

## Introduction

In this notebook, I will clean and prepare data on crop yields and population in Karamoja for the 2017 crop season. This includes handling missing values, correcting data types, and standardizing formats to ensure the data is ready for analysis and visualization.

I will use data from two CSV files:
- **District Crop Yield Population Data:** Contains information on crop yields and population by district.
- **Subcounty Crop Yield Population Data:** Contains information on crop yields and population by sub-county.



## Objectives

The goal of this analysis is to develop an interactive visualization tool that provides insights into crop yields across different districts and sub-counties in Karamoja for the 2017 crop season. By analyzing and visualizing the data, we aim to:
1. **Identify Patterns:** Understand the distribution of crop yields across various regions.
2. **Highlight Trends:** Detect any significant trends or anomalies in crop yields.
3. **Support Decision-Making:** Provide actionable insights that can help NGOs prioritize their activities and allocate resources more effectively.


## Research questions

1. What are the spatial patterns of population across different districts and sub-counties in Karamoja during the 2017 crop season?
2. Which districts in Karamoja are experiencing the lowest crop yields, and how do these areas compare with regions of higher productivity?
3. How does the population size of a district or sub-county correlate with the crop yield in Karamoja?
4. What is the proportion of land dedicated to Sorghum and Maize in different districts?
5. What trends can be observed in crop production relative to changes in crop area over the specified period?

### 1. Importing the necessary libraries

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Loading the files

In [2]:
# Load the CSV files
district = pd.read_csv('C:/Users/DELL/Documents/course_materials/phase_1/Karamoja-Project/Data/TABLES/Uganda_Karamoja_District_Crop_Yield_Population.csv')
subcounty = pd.read_csv('C:/Users/DELL/Documents/course_materials/phase_1/Karamoja-Project/Data/TABLES/Uganda_Karamoja_Subcounty_Crop_Yield_Population.csv')

# Display the first few rows of each DataFrame
district.head(3)




Unnamed: 0,OBJECTID,NAME,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
0,92,ABIM,90385,2771977106,449,1040,5470.068394,3277.295971,1848.621855,1471506,1922567
1,96,AMUDAT,101790,1643582836,205,1297,5765.443719,2973.42386,2733.661014,609552,3545558
2,20,KAABONG,627057,7373606003,279,945,28121.67253,20544.19496,7394.416334,5731830,6987723
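The first rows suggest that each production total equals yield per hectare multiplied by the cropped area for that crop. The source does not state this relationship, so the sketch below is only a consistency check on the three rows shown above, not a definition of the columns. The names `sample`, `S_Prod_Check`, and `M_Prod_Check` are introduced here for illustration.

```python
import pandas as pd

# First three district rows, copied from the head() output above.
sample = pd.DataFrame({
    "NAME": ["ABIM", "AMUDAT", "KAABONG"],
    "S_Yield_Ha": [449, 205, 279],
    "M_Yield_Ha": [1040, 1297, 945],
    "S_Area_Ha": [3277.295971, 2973.42386, 20544.19496],
    "M_Area_Ha": [1848.621855, 2733.661014, 7394.416334],
    "S_Prod_Tot": [1471506, 609552, 5731830],
    "M_Prod_Tot": [1922567, 3545558, 6987723],
})

# Assumed relationship: total production = yield per hectare * cropped area.
sample["S_Prod_Check"] = sample["S_Yield_Ha"] * sample["S_Area_Ha"]
sample["M_Prod_Check"] = sample["M_Yield_Ha"] * sample["M_Area_Ha"]

# Largest absolute gap between the reported and the recomputed totals.
s_gap = (sample["S_Prod_Tot"] - sample["S_Prod_Check"]).abs().max()
m_gap = (sample["M_Prod_Tot"] - sample["M_Prod_Check"]).abs().max()
print("Largest sorghum gap:", s_gap)
print("Largest maize gap:", m_gap)
```

For these three rows the gaps stay below one unit, which is consistent with the totals having been rounded to whole numbers. The same check could be run on the full `district` DataFrame before relying on the relationship.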


In [3]:
subcounty.head(3)

Unnamed: 0,OBJECTID,SUBCOUNTY_NAME,DISTRICT_NAME,POP,Area,Karamoja,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
0,263,KACHERI,KOTIDO,17244,1067176155,Y,354.207411,1137.467019,7023.533691,6434.342449,528.124229,2279092.0,600723.8929
1,264,KOTIDO,KOTIDO,52771,597575188,Y,367.890523,1162.996687,13587.99076,12455.59264,824.767081,4582294.0,959201.3825
2,265,KOTIDO TOWN COUNCIL,KOTIDO,27389,23972401,Y,369.314177,1167.005832,1656.531855,1520.322052,8.561644,561476.5,9991.488268


Displaying a structural summary (column names, non-null counts, and data types) of each DataFrame

In [4]:
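# .info() prints its report directly and returns None,
# so wrapping it in print() also prints "None" after each report.
print(district.info(),"\n")
print(subcounty.info())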

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   OBJECTID      7 non-null      int64  
 1   NAME          7 non-null      object 
 2   POP           7 non-null      int64  
 3   Area          7 non-null      int64  
 4   S_Yield_Ha    7 non-null      int64  
 5   M_Yield_Ha    7 non-null      int64  
 6   Crop_Area_Ha  7 non-null      float64
 7   S_Area_Ha     7 non-null      float64
 8   M_Area_Ha     7 non-null      float64
 9   S_Prod_Tot    7 non-null      int64  
 10  M_Prod_Tot    7 non-null      int64  
dtypes: float64(3), int64(7), object(1)
memory usage: 744.0+ bytes
None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   OBJECTID        52 non-null     int64  
 1   SUBCOUNTY_NAME  52 non-

### Checking for duplicate rows

To find duplicate rows in a pandas DataFrame, use the ***.duplicated()*** method. It flags every row that repeats an earlier row, and chaining ***.sum()*** counts the flagged rows.

In [5]:
# checking whether there are duplicates in the dataframe
print("Number of duplicate rows in district is:",district.duplicated().sum(),"\n")
print("Number of duplicate rows in subcounty is:",subcounty.duplicated().sum())

Number of duplicate rows in district is: 0 

Number of duplicate rows in subcounty is: 0
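A full-row check only catches rows that are identical in every column. As a minimal sketch, assuming a sub-county should be unique by name within its district, the `subset` argument of `.duplicated()` can look for repeated keys instead. The `toy` frame below is hypothetical and only illustrates the difference; its third row is invented.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the sub-county name
# but carries a slightly different population value.
toy = pd.DataFrame({
    "SUBCOUNTY_NAME": ["KACHERI", "KOTIDO", "KOTIDO"],
    "DISTRICT_NAME": ["KOTIDO", "KOTIDO", "KOTIDO"],
    "POP": [17244, 52771, 52770],
})

key_columns = ["SUBCOUNTY_NAME", "DISTRICT_NAME"]

# Full-row comparison: no row is identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: the repeated (sub-county, district) pair is flagged.
key_dupes = int(toy.duplicated(subset=key_columns).sum())

# Keep the first occurrence of each key and drop the rest.
deduped = toy.drop_duplicates(subset=key_columns, keep="first")

print("Full-row duplicates:", full_row_dupes)
print("Key duplicates:", key_dupes)
print("Rows after de-duplication:", len(deduped))
```

In this notebook the `describe(include=object)` output already reports 52 unique sub-county names for 52 rows, so a key-based check like this would be expected to find nothing.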


### Shape of the dataframes

In [6]:
print('The shape of the district data frame is:', district.shape)
print('The shape of the subcounty data frame is:', subcounty.shape)

The shape of the district data frame is: (7, 11)
The shape of the subcounty data frame is: (52, 13)


### Checking for missing values in the DataFrames

***.isnull()*** is a DataFrame method (not a standalone function) that checks for missing values. Missing (null) values are shown as NaN, which stands for "not a number".

In [7]:
# .isnull() checks every cell and returns a boolean:
# True where the value is missing and False where it is present.
# Here I want the total number of missing values in each column, so I chain .sum(),
# which counts the True values per column and returns a number, e.g. 5, 10, 12674...

missing_values = district.isnull().sum()
print('missing values in district:')
print(missing_values,"\n")


missing_values = subcounty.isnull().sum()
print('missing values in subcounty:')
print(missing_values,"\n")

# The output below shows there are no missing values


missing values in district:
OBJECTID        0
NAME            0
POP             0
Area            0
S_Yield_Ha      0
M_Yield_Ha      0
Crop_Area_Ha    0
S_Area_Ha       0
M_Area_Ha       0
S_Prod_Tot      0
M_Prod_Tot      0
dtype: int64 

missing values in subcounty:
OBJECTID          0
SUBCOUNTY_NAME    0
DISTRICT_NAME     0
POP               0
Area              0
Karamoja          0
S_Yield_Ha        0
M_Yield_Ha        0
Crop_Area_Ha      0
S_Area_Ha         0
M_Area_Ha         0
S_Prod_Tot        0
M_Prod_Tot        0
dtype: int64 
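Both DataFrames here are complete, so the counts are all zero. To illustrate what the same check looks like when values are missing, here is a small sketch on a hypothetical `toy` frame (invented values). It also uses `.mean()` on the boolean mask to get the share of missing values per column, which is often easier to read than raw counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps, for illustration only.
toy = pd.DataFrame({
    "NAME": ["A", "B", "C", "D"],
    "POP": [100.0, np.nan, 300.0, 400.0],
    "S_Yield_Ha": [np.nan, np.nan, 250.0, 300.0],
})

# Count of missing values per column (True counts as 1).
missing_count = toy.isnull().sum()

# Share of missing values per column (mean of the boolean mask).
missing_share = toy.isnull().mean()

# Combine both views into one small report.
report = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(report)
```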



### Descriptive statistics

In [8]:
district.describe() #generates descriptive statistics of a DataFrame 
#it returns a table containing various summary statistics for each numerical column.


Unnamed: 0,OBJECTID,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
count,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0
mean,61.714286,214943.571429,3960853000.0,269.285714,986.142857,21094.520379,16737.636651,3983.947082,4873098.0,4085632.0
std,36.481567,188604.280916,1781860000.0,119.243049,321.5667,17363.854165,16625.96346,2678.911441,5743724.0,2877188.0
min,5.0,90385.0,1643583000.0,128.0,355.0,5470.068394,2973.42386,1190.050606,606944.0,422468.0
25%,37.0,114800.5,3171069000.0,171.0,899.5,5860.128883,4009.522373,1799.99707,1040529.0,1966571.0
50%,80.0,146780.0,3641540000.0,279.0,1040.0,22944.29602,16142.01588,2733.661014,2211456.0,3545558.0
75%,88.5,205391.0,4362553000.0,343.5,1206.0,27247.18551,19890.764085,6484.75374,6290160.0,6288030.0
max,96.0,627057.0,7373606000.0,449.0,1297.0,53032.64945,50247.4439,7394.416334,16631900.0,8122197.0


In [9]:
subcounty.describe()

Unnamed: 0,OBJECTID,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
count,52.0,52.0,52.0,52.0,52.0,52.0,52.0,52.0,52.0,52.0
mean,787.865385,28934.692308,533191300.0,274.165405,940.259552,2839.646974,2253.143395,536.300569,655744.3,550073.0
std,280.101314,20865.122974,491330800.0,118.569907,321.641901,3110.505917,2954.355858,724.092288,991583.9,793970.7
min,263.0,1418.0,2121209.0,108.156411,0.0,0.17139,0.130941,0.0,17.28126,0.0
25%,597.75,16558.5,156892300.0,173.034066,743.075879,964.876031,405.394759,79.821743,121055.5,60870.12
50%,810.5,23053.5,384835600.0,277.255206,1016.684002,1654.265138,1231.824455,326.479336,254368.7,289623.9
75%,982.25,39461.0,774902900.0,368.246437,1203.548665,3267.564651,2429.985069,740.296675,604094.2,811457.4
max,1320.0,100919.0,2069555000.0,560.31307,1396.991494,13587.99076,12964.49973,3840.698081,4582294.0,4365058.0
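The sub-county summary shows a minimum `M_Yield_Ha` of 0.0 and a minimum `Crop_Area_Ha` of about 0.17, which may deserve a closer look before visualization. A minimal sketch of how such rows could be flagged follows. The `toy` rows and the `MIN_AREA_HA` threshold are assumptions made for illustration, not values from the analysis; only the two minima are taken from the table above.

```python
import pandas as pd

# Hypothetical rows; only 0.0 and 0.17139 come from the describe() output.
toy = pd.DataFrame({
    "SUBCOUNTY_NAME": ["EXAMPLE_A", "EXAMPLE_B", "EXAMPLE_C"],
    "M_Yield_Ha": [1137.5, 0.0, 980.0],
    "Crop_Area_Ha": [7023.5, 0.17139, 1654.3],
})

# Assumed threshold: treat less than one hectare of crop area as suspicious.
MIN_AREA_HA = 1.0

# Flag rows with zero maize yield or a very small cropped area.
is_suspect = (toy["M_Yield_Ha"] == 0) | (toy["Crop_Area_Ha"] < MIN_AREA_HA)
suspect = toy[is_suspect]
print(suspect)
```

Applied to the real `subcounty` DataFrame, a filter like this would list the sub-counties behind the extreme minimum values so they can be checked rather than silently plotted.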


In [10]:
print(district.describe(include=object), "\n")
print(subcounty.describe(include=object))

# district.describe(include=object) returns descriptive statistics for the object-type
# (usually string or categorical) columns of the district DataFrame:
# count, number of unique values, the most frequent value (top), and its frequency (freq).
# subcounty.describe(include=object) does the same for the subcounty DataFrame.


          NAME
count        7
unique       7
top     KOTIDO
freq         1 

       SUBCOUNTY_NAME DISTRICT_NAME Karamoja
count              52            52       52
unique             52             7        1
top            MATANY       KAABONG        Y
freq                1            14       52
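The summary above reports only the most frequent district (`top`) and its count (`freq`). To see a figure for every district, the sub-county rows can be grouped by `DISTRICT_NAME`. The sketch below uses only the three sub-county rows shown earlier, all of which belong to Kotido, so its totals are partial and purely illustrative. The output column names `subcounties` and `population` are introduced here.

```python
import pandas as pd

# First three sub-county rows, copied from the head() output above.
sample = pd.DataFrame({
    "SUBCOUNTY_NAME": ["KACHERI", "KOTIDO", "KOTIDO TOWN COUNCIL"],
    "DISTRICT_NAME": ["KOTIDO", "KOTIDO", "KOTIDO"],
    "POP": [17244, 52771, 27389],
})

# Number of sub-counties and total population per district.
per_district = sample.groupby("DISTRICT_NAME").agg(
    subcounties=("SUBCOUNTY_NAME", "count"),
    population=("POP", "sum"),
)
print(per_district)
```

Run on the full `subcounty` DataFrame, the same grouping would give one row per district, which could then be compared against the `POP` column of the `district` DataFrame.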


### Checking the column names of the DataFrames

In [11]:
# Check column names for the district DataFrame
print("Column names in the District DataFrame:")
print(district.columns, "\n")
# Check column names for the subcounty DataFrame
print("Column names in the Subcounty DataFrame:")
print(subcounty.columns)

Column names in the District DataFrame:
Index(['OBJECTID', 'NAME', 'POP', 'Area', 'S_Yield_Ha', 'M_Yield_Ha',
       'Crop_Area_Ha', 'S_Area_Ha', 'M_Area_Ha', 'S_Prod_Tot', 'M_Prod_Tot'],
      dtype='object') 

Column names in the Subcounty DataFrame:
Index(['OBJECTID', 'SUBCOUNTY_NAME', 'DISTRICT_NAME', 'POP', 'Area',
       'Karamoja', 'S_Yield_Ha', 'M_Yield_Ha', 'Crop_Area_Ha', 'S_Area_Ha',
       'M_Area_Ha', 'S_Prod_Tot', 'M_Prod_Tot'],
      dtype='object')
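With the column names confirmed, research question 4 (the proportion of land dedicated to sorghum and maize) can be sketched directly from the area columns. This assumes `S_Area_Ha` and `M_Area_Ha` are parts of `Crop_Area_Ha`; the column names suggest this, but the source does not state it. The rows are the first three district rows shown earlier, and `S_Share` and `M_Share` are names introduced for illustration.

```python
import pandas as pd

# First three district rows, copied from the head() output earlier.
sample = pd.DataFrame({
    "NAME": ["ABIM", "AMUDAT", "KAABONG"],
    "Crop_Area_Ha": [5470.068394, 5765.443719, 28121.67253],
    "S_Area_Ha": [3277.295971, 2973.42386, 20544.19496],
    "M_Area_Ha": [1848.621855, 2733.661014, 7394.416334],
})

# Share of the total crop area planted with each crop (assumed relationship).
sample["S_Share"] = sample["S_Area_Ha"] / sample["Crop_Area_Ha"]
sample["M_Share"] = sample["M_Area_Ha"] / sample["Crop_Area_Ha"]

print(sample[["NAME", "S_Share", "M_Share"]].round(3))
```

If the assumption holds, the two shares for a district should never add up to more than 1, which is a quick sanity check to run on the full `district` DataFrame.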


## Findings

1. ***Kaabong*** is the most populous district, with more than double the population of ***Kotido***. ***Abim*** and ***Amudat*** have notably lower populations, highlighting a significant disparity across the districts.
2. Yield per hectare varies widely across districts: sorghum yield (`S_Yield_Ha`) ranges from 128 to 449, with ***Abim*** highest, and maize yield (`M_Yield_Ha`) ranges from 355 to 1,297, with ***Amudat*** highest. This indicates varying agricultural strengths across the region.
3. In some districts, the population may rely more on alternative livelihoods (e.g., pastoralism, trade, or mining), leading to a weaker connection between population size and agricultural land use.
4. ***Nakapiripirit*** and ***Kaabong*** excel in maize production, while ***Kotido*** is key for sorghum, highlighting distinct agricultural strengths among the districts.
5. Production generally increases with crop area, but in some cases production is lower despite a larger cultivated area, which points to potential inefficiencies or challenges in crop yield.


## Recommendations

1. Focus resources on Kaabong and Nakapiripirit to boost maize production and on Kotido for sorghum, leveraging their strengths.
2. In districts with weaker agriculture, invest in alternative livelihoods like pastoralism, trade, or mining to balance economic reliance.
3. Examine why some districts with more cultivation area have lower yields and implement strategies to enhance productivity.
4. Factor in population disparities when distributing resources to ensure equitable support for agriculture and livelihoods.

## Conclusion

***Kaabong*** and ***Nakapiripirit*** should get a boost to improve maize production, and ***Kotido*** needs to focus on enhancing sorghum yields. For areas with weaker agriculture, exploring alternatives like pastoralism or trade could help. It’s also important to address why some districts aren't seeing better yields with more land and keep population differences in mind when planning resource support.
