
# KARAMOJA PROJECT

# Business Understanding

Karamoja is the most food-insecure region of Uganda. One of the main reasons is low crop productivity, caused by intense droughts as well as pest and disease outbreaks.

In Karamoja, several NGOs provide technical support as well as farm inputs to farmers experiencing extremely low yields. However, they lack visibility into the overall state of the region and often have to rely on very local sources of information to prioritize their activities.
Dalberg Data Insights (DDI) has been asked to develop a new food security monitoring tool to support the decision-making of one of the NGOs active in Karamoja.

To do so, Dalberg Data Insights developed a methodology to remotely measure the yield of the two main staple crops of the region (i.e. sorghum and maize) based on satellite images. The agri-tech team just ran the model for the 2017 crop season.



# Data Understanding

I will be dealing with the following data:

1. Uganda Karamoja District Crop Yield
2. Uganda Karamoja Subcounty Crop Yield

The columns in each data set are described below.

Yield and Population per Subcounty

POP: total population for the subcounty

S_Yield_Ha: average yield for sorghum for the subcounty (Kg/Ha)

M_Yield_Ha: average yield for maize for the subcounty (Kg/Ha)

Crop_Area_Ha: total crop area for the subcounty (Ha)

S_Area_Ha: total sorghum crop area for the subcounty (Ha)

M_Area_Ha: total maize crop area for the subcounty (Ha)

S_Prod_Tot: total sorghum production for the subcounty (Kg)

M_Prod_Tot: total maize production for the subcounty (Kg)

Yield and Population per District

The district file contains the same fields, reported per district instead of per subcounty.
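The units above suggest that total production (Kg) is yield (Kg/Ha) multiplied by crop area (Ha). The sketch below checks that assumption for sorghum, using the five district rows that appear in the `karamoja_district.head()` output later in this notebook. The `S_Prod_Check` and `rel_diff` column names are illustrative and not part of the data set.

```python
import pandas as pd

# Five district rows copied from the karamoja_district.head() output in this notebook.
district = pd.DataFrame({
    "NAME": ["ABIM", "AMUDAT", "KAABONG", "KOTIDO", "MOROTO"],
    "S_Yield_Ha": [449, 205, 279, 331, 128],
    "S_Area_Ha": [3277.295971, 2973.42386, 20544.19496, 50247.4439, 4741.748776],
    "S_Prod_Tot": [1471506, 609552, 5731830, 16631904, 606944],
})

# Assumed relationship: production (Kg) = yield (Kg/Ha) * area (Ha).
district["S_Prod_Check"] = district["S_Yield_Ha"] * district["S_Area_Ha"]

# Relative difference between the reported and the recomputed totals.
district["rel_diff"] = (
    (district["S_Prod_Check"] - district["S_Prod_Tot"]).abs() / district["S_Prod_Tot"]
)

print(district[["NAME", "S_Prod_Tot", "S_Prod_Check", "rel_diff"]])
```

For these five rows the recomputed totals agree with `S_Prod_Tot` to within rounding, which supports reading the `*_Prod_Tot` columns as production totals rather than as a separate productivity measure.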

**Research Question**

As a Data Analyst, you have been asked by the agri-tech team to develop an interactive visualization tool of the results for this first crop season. The tool will be used as a first mockup of the Food Security Monitoring tool that DDI will develop for the NGO.
Based on your experience, the team expects a first draft within the coming 3 working days. They give you carte blanche in terms of structure and functionality, but they know that the client wants:

1. At least one map in the dashboard.

2. The possibility of visualizing the results by district or sub-county (the two administrative levels used by the NGO).

**Objectives**

1. The population distribution across the districts and their subcounties.

2. Crop productivity in each district.

3. Crop yield in each district.

4. Crop productivity versus population.

# Importing of Libraries

In [1]:
# Essential libraries to handle data manipulation and visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Loading of the data sets

1. District data set

In [2]:
# Load the district data
karamoja_district = pd.read_csv('/content/Uganda_Karamoja_District_Crop_Yield_Population.csv')

2. Subcounty data set

In [3]:
# Load the subcounty data
karamoja_subcounty = pd.read_csv('/content/Uganda_Karamoja_Subcounty_Crop_Yield_Population.csv')


# Data Shapes

How many rows and columns does each data set have?

In [4]:
karamoja_district.shape

(7, 11)

In [5]:
karamoja_subcounty.shape

(52, 13)

# Understand the structure and summary statistics of the data.

1. Looking at the top five rows of each data set

In [6]:
karamoja_district.head()

Unnamed: 0,OBJECTID,NAME,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
0,92,ABIM,90385,2771977106,449,1040,5470.068394,3277.295971,1848.621855,1471506,1922567
1,96,AMUDAT,101790,1643582836,205,1297,5765.443719,2973.42386,2733.661014,609552,3545558
2,20,KAABONG,627057,7373606003,279,945,28121.67253,20544.19496,7394.416334,5731830,6987723
3,85,KOTIDO,243157,3641539808,331,1148,53032.64945,50247.4439,1751.372284,16631904,2010575
4,5,MOROTO,127811,3570160948,128,355,5954.814048,4741.748776,1190.050606,606944,422468


In [7]:
karamoja_subcounty.head()

Unnamed: 0,OBJECTID,SUBCOUNTY_NAME,DISTRICT_NAME,POP,Area,Karamoja,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
0,263,KACHERI,KOTIDO,17244,1067176155,Y,354.207411,1137.467019,7023.533691,6434.342449,528.124229,2279092.0,600723.8929
1,264,KOTIDO,KOTIDO,52771,597575188,Y,367.890523,1162.996687,13587.99076,12455.59264,824.767081,4582294.0,959201.3825
2,265,KOTIDO TOWN COUNCIL,KOTIDO,27389,23972401,Y,369.314177,1167.005832,1656.531855,1520.322052,8.561644,561476.5,9991.488268
3,266,NAKAPERIMORU,KOTIDO,38775,419111591,Y,283.324569,852.366578,7087.823334,6761.488901,45.721712,1915696.0,38971.65908
4,267,PANYANGARA,KOTIDO,65704,880955930,Y,373.836926,1283.859882,10398.24939,10111.19813,172.611914,3779939.0,221609.5114


2. Understanding the summary of our data sets

In [8]:
karamoja_district.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   OBJECTID      7 non-null      int64  
 1   NAME          7 non-null      object 
 2   POP           7 non-null      int64  
 3   Area          7 non-null      int64  
 4   S_Yield_Ha    7 non-null      int64  
 5   M_Yield_Ha    7 non-null      int64  
 6   Crop_Area_Ha  7 non-null      float64
 7   S_Area_Ha     7 non-null      float64
 8   M_Area_Ha     7 non-null      float64
 9   S_Prod_Tot    7 non-null      int64  
 10  M_Prod_Tot    7 non-null      int64  
dtypes: float64(3), int64(7), object(1)
memory usage: 744.0+ bytes


In [9]:
karamoja_subcounty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   OBJECTID        52 non-null     int64  
 1   SUBCOUNTY_NAME  52 non-null     object 
 2   DISTRICT_NAME   52 non-null     object 
 3   POP             52 non-null     int64  
 4   Area            52 non-null     int64  
 5   Karamoja        52 non-null     object 
 6   S_Yield_Ha      52 non-null     float64
 7   M_Yield_Ha      52 non-null     float64
 8   Crop_Area_Ha    52 non-null     float64
 9   S_Area_Ha       52 non-null     float64
 10  M_Area_Ha       52 non-null     float64
 11  S_Prod_Tot      52 non-null     float64
 12  M_Prod_Tot      52 non-null     float64
dtypes: float64(7), int64(3), object(3)
memory usage: 5.4+ KB


3. Understanding the statistical summary of the Data sets

In [10]:
karamoja_district.describe()

Unnamed: 0,OBJECTID,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
count,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0
mean,61.714286,214943.571429,3960853000.0,269.285714,986.142857,21094.520379,16737.636651,3983.947082,4873098.0,4085632.0
std,36.481567,188604.280916,1781860000.0,119.243049,321.5667,17363.854165,16625.96346,2678.911441,5743724.0,2877188.0
min,5.0,90385.0,1643583000.0,128.0,355.0,5470.068394,2973.42386,1190.050606,606944.0,422468.0
25%,37.0,114800.5,3171069000.0,171.0,899.5,5860.128883,4009.522373,1799.99707,1040529.0,1966571.0
50%,80.0,146780.0,3641540000.0,279.0,1040.0,22944.29602,16142.01588,2733.661014,2211456.0,3545558.0
75%,88.5,205391.0,4362553000.0,343.5,1206.0,27247.18551,19890.764085,6484.75374,6290160.0,6288030.0
max,96.0,627057.0,7373606000.0,449.0,1297.0,53032.64945,50247.4439,7394.416334,16631900.0,8122197.0


At district level, the average population is approximately 214,944, with a standard deviation of 188,604. The minimum population is 90,385, and the maximum is 627,057.

The average sorghum yield is 269.29 Kg/Ha, with a standard deviation of 119.24. The minimum yield is 128 Kg/Ha, and the maximum is 449 Kg/Ha.

The average maize yield is 986.14 Kg/Ha, with a standard deviation of 321.57. The minimum yield is 355 Kg/Ha, and the maximum is 1297 Kg/Ha.

In [11]:
karamoja_subcounty.describe()

Unnamed: 0,OBJECTID,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
count,52.0,52.0,52.0,52.0,52.0,52.0,52.0,52.0,52.0,52.0
mean,787.865385,28934.692308,533191300.0,274.165405,940.259552,2839.646974,2253.143395,536.300569,655744.3,550073.0
std,280.101314,20865.122974,491330800.0,118.569907,321.641901,3110.505917,2954.355858,724.092288,991583.9,793970.7
min,263.0,1418.0,2121209.0,108.156411,0.0,0.17139,0.130941,0.0,17.28126,0.0
25%,597.75,16558.5,156892300.0,173.034066,743.075879,964.876031,405.394759,79.821743,121055.5,60870.12
50%,810.5,23053.5,384835600.0,277.255206,1016.684002,1654.265138,1231.824456,326.479336,254368.7,289623.9
75%,982.25,39461.0,774902900.0,368.246437,1203.548665,3267.564651,2429.985069,740.296675,604094.2,811457.4
max,1320.0,100919.0,2069555000.0,560.31307,1396.991494,13587.99076,12964.49973,3840.698081,4582294.0,4365058.0


At subcounty level, the average population is approximately 28,935, with a standard deviation of 20,865. The minimum population is 1,418, and the maximum is 100,919.

The average sorghum yield is 274.17 Kg/Ha, with a standard deviation of 118.57. The minimum yield is 108.16 Kg/Ha, and the maximum is 560.31 Kg/Ha.

The average maize yield is 940.26 Kg/Ha, with a standard deviation of 321.64. The minimum yield is 0, which coincides with a minimum maize area and production of 0, so at least one subcounty has no maize recorded. The maximum is 1396.99 Kg/Ha.
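A zero minimum can pull the average down, so it can help to list the rows behind it. The following is a minimal sketch on hypothetical toy rows (the values are made up and are not taken from the real file); only the column names match the notebook.

```python
import pandas as pd

# Hypothetical toy rows (NOT real data), using the notebook's column names.
toy = pd.DataFrame({
    "SUBCOUNTY_NAME": ["A", "B", "C"],
    "M_Yield_Ha": [1137.5, 0.0, 852.4],
    "M_Area_Ha": [528.1, 0.0, 45.7],
    "M_Prod_Tot": [600723.9, 0.0, 38971.7],
})

# Rows where no maize yield was recorded.
zero_maize = toy[toy["M_Yield_Ha"] == 0]
print(zero_maize)

# Mean maize yield over the rows that actually report maize.
mean_nonzero = toy.loc[toy["M_Yield_Ha"] > 0, "M_Yield_Ha"].mean()
print(mean_nonzero)
```

Applied to `karamoja_subcounty`, the same filter would show which subcounties produce the zero in the `describe()` output.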

# Missing Values and Duplicates



Checking for missing values in our data

In [12]:
karamoja_district.isnull().sum()

Unnamed: 0,0
OBJECTID,0
NAME,0
POP,0
Area,0
S_Yield_Ha,0
M_Yield_Ha,0
Crop_Area_Ha,0
S_Area_Ha,0
M_Area_Ha,0
S_Prod_Tot,0


In [13]:
karamoja_subcounty.isnull().sum()

Unnamed: 0,0
OBJECTID,0
SUBCOUNTY_NAME,0
DISTRICT_NAME,0
POP,0
Area,0
Karamoja,0
S_Yield_Ha,0
M_Yield_Ha,0
Crop_Area_Ha,0
S_Area_Ha,0


In [15]:
karamoja_district.duplicated().sum()

0

In [14]:
karamoja_subcounty.duplicated().sum()

0

From the results above, neither data set has any missing values or duplicate rows.
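The two checks can be wrapped in a small helper so that both data sets are inspected the same way. This is a sketch only: `quality_report` is a hypothetical helper name, and the toy frame below is made up for illustration.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarise row count, missing cells, and fully duplicated rows."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical toy frame: one missing POP value and one repeated row.
toy = pd.DataFrame({
    "NAME": ["X", "Y", "Y", "Z"],
    "POP": [10, 20, 20, None],
})

report = quality_report(toy)
print(report)
```

In the notebook this would be called as `quality_report(karamoja_district)` and `quality_report(karamoja_subcounty)`.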

# Combining the two data sets

In [16]:
combined_df = pd.merge(karamoja_district, karamoja_subcounty, how='outer')

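
When `pd.merge` is called without `on=`, it joins on every column name the two frames share. The district and subcounty files share all their numeric columns, so a row only matches if all of those values are identical. The sketch below reproduces the effect on hypothetical toy frames (made-up values); `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical toy frames (NOT real data) that share only the POP column.
district = pd.DataFrame({"NAME": ["D1"], "POP": [100]})
subcounty = pd.DataFrame({
    "SUBCOUNTY_NAME": ["S1", "S2"],
    "DISTRICT_NAME": ["D1", "D1"],
    "POP": [60, 40],
})

# With no `on=`, merge joins on all shared column names (here: POP).
merged = pd.merge(district, subcounty, how="outer", indicator=True)
print(merged)

# No POP value matches, so every row comes from exactly one side
# and the label columns of the other side are left missing.
print(merged["_merge"].value_counts())
```

No keys match, so the result is the rows of both frames stacked, with missing labels. This is consistent with the combined shape reported further down: 59 rows, which is 7 district rows plus 52 subcounty rows.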


In [17]:
# Fill missing values in the numeric columns with the column mean
numeric_columns = combined_df.select_dtypes(include=[np.number]).columns
combined_df[numeric_columns] = combined_df[numeric_columns].fillna(combined_df[numeric_columns].mean())

In [18]:
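# Fill missing values in the non-numeric columns with the column mode (most frequent value).
# Caution: rows that had no label receive the most frequent one (see the district rows in the output below).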
non_numeric_columns = combined_df.select_dtypes(exclude=[np.number]).columns
combined_df[non_numeric_columns] = combined_df[non_numeric_columns].fillna(combined_df[non_numeric_columns].mode().iloc[0])
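
An alternative that avoids imputing labels is to stack the two tables and record which administrative level each row belongs to. The sketch below uses hypothetical toy frames; the `LEVEL` column and the `"(district total)"` placeholder are illustrative choices, not part of the data set.

```python
import pandas as pd

# Hypothetical toy frames (NOT real data).
district = pd.DataFrame({"NAME": ["D1"], "POP": [100]})
subcounty = pd.DataFrame({
    "SUBCOUNTY_NAME": ["S1", "S2"],
    "DISTRICT_NAME": ["D1", "D1"],
    "POP": [60, 40],
})

# Give both frames a common DISTRICT_NAME column and tag the level.
district_tagged = district.rename(columns={"NAME": "DISTRICT_NAME"}).assign(LEVEL="district")
subcounty_tagged = subcounty.assign(LEVEL="subcounty")

# Stack the rows; district rows have no subcounty name.
stacked = pd.concat([district_tagged, subcounty_tagged], ignore_index=True)

# Use an explicit placeholder instead of the most frequent subcounty name.
stacked["SUBCOUNTY_NAME"] = stacked["SUBCOUNTY_NAME"].fillna("(district total)")
print(stacked)
```

With the real files, the `Karamoja` flag exists only in the subcounty table and would also need an explicit decision for the district rows.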

In [19]:
combined_df.head()

Unnamed: 0,OBJECTID,NAME,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot,SUBCOUNTY_NAME,DISTRICT_NAME,Karamoja
0,92,ABIM,90385,2771977106,449.0,1040.0,5470.068394,3277.295971,1848.621855,1471506.0,1922567.0,ABIM,KAABONG,Y
1,96,AMUDAT,101790,1643582836,205.0,1297.0,5765.443719,2973.42386,2733.661014,609552.0,3545558.0,ABIM,KAABONG,Y
2,20,KAABONG,627057,7373606003,279.0,945.0,28121.67253,20544.19496,7394.416334,5731830.0,6987723.0,ABIM,KAABONG,Y
3,85,KOTIDO,243157,3641539808,331.0,1148.0,53032.64945,50247.4439,1751.372284,16631904.0,2010575.0,ABIM,KAABONG,Y
4,5,MOROTO,127811,3570160948,128.0,355.0,5954.814048,4741.748776,1190.050606,606944.0,422468.0,ABIM,KAABONG,Y
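
In the output above, every district row carries `SUBCOUNTY_NAME` = ABIM and `DISTRICT_NAME` = KAABONG. Those labels come from the mode fill, not from the source files. If the goal is to relate the two levels, one option is to aggregate the subcounty table per district and join on the district name. The sketch below assumes that `DISTRICT_NAME` in the subcounty file matches `NAME` in the district file; the toy values and the `SUB_*` and `N_SUBCOUNTIES` column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames (NOT real data).
district = pd.DataFrame({"NAME": ["D1", "D2"], "POP": [100, 50]})
subcounty = pd.DataFrame({
    "SUBCOUNTY_NAME": ["S1", "S2", "S3"],
    "DISTRICT_NAME": ["D1", "D1", "D2"],
    "POP": [60, 40, 50],
    "S_Prod_Tot": [1000.0, 500.0, 200.0],
})

# Aggregate subcounty figures to one row per district.
totals = subcounty.groupby("DISTRICT_NAME", as_index=False).agg(
    SUB_POP=("POP", "sum"),
    SUB_S_Prod_Tot=("S_Prod_Tot", "sum"),
    N_SUBCOUNTIES=("SUBCOUNTY_NAME", "count"),
)

# Join on the district name; validate guards against duplicated keys.
joined = district.merge(
    totals,
    left_on="NAME",
    right_on="DISTRICT_NAME",
    how="left",
    validate="one_to_one",
)
print(joined)
```

Comparing `POP` with `SUB_POP` in the joined table is also a quick consistency check between the two files.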


In [20]:
combined_df.shape

(59, 14)

In [21]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   OBJECTID        59 non-null     int64  
 1   NAME            59 non-null     object 
 2   POP             59 non-null     int64  
 3   Area            59 non-null     int64  
 4   S_Yield_Ha      59 non-null     float64
 5   M_Yield_Ha      59 non-null     float64
 6   Crop_Area_Ha    59 non-null     float64
 7   S_Area_Ha       59 non-null     float64
 8   M_Area_Ha       59 non-null     float64
 9   S_Prod_Tot      59 non-null     float64
 10  M_Prod_Tot      59 non-null     float64
 11  SUBCOUNTY_NAME  59 non-null     object 
 12  DISTRICT_NAME   59 non-null     object 
 13  Karamoja        59 non-null     object 
dtypes: float64(7), int64(3), object(4)
memory usage: 6.6+ KB


In [23]:
combined_df.isnull().sum()

Unnamed: 0,0
OBJECTID,0
NAME,0
POP,0
Area,0
S_Yield_Ha,0
M_Yield_Ha,0
Crop_Area_Ha,0
S_Area_Ha,0
M_Area_Ha,0
S_Prod_Tot,0


In [25]:
combined_df.describe()

Unnamed: 0,OBJECTID,POP,Area,S_Yield_Ha,M_Yield_Ha,Crop_Area_Ha,S_Area_Ha,M_Area_Ha,S_Prod_Tot,M_Prod_Tot
count,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0
mean,701.711864,51003.542373,939863000.0,273.586459,945.703334,5005.479412,3971.642595,945.343375,1156108.0,969546.1
std,353.856198,87994.178609,1338079000.0,117.624497,319.200199,8668.59376,7654.205284,1570.917869,2483804.0,1655393.0
min,5.0,1418.0,2121209.0,108.156411,0.0,0.17139,0.130941,0.0,17.28126,0.0
25%,592.5,16721.0,190148800.0,170.240064,757.082394,998.266305,533.066508,92.508396,146657.4,78628.14
50%,767.0,26644.0,499776900.0,279.0,1030.064093,2008.068169,1550.94457,358.550335,339760.6,306951.8
75%,980.5,43618.0,1048900000.0,361.945262,1211.009291,5616.53129,3216.917338,957.467889,966430.7,970042.2
max,1320.0,627057.0,7373606000.0,560.31307,1396.991494,53032.64945,50247.4439,7394.416334,16631900.0,8122197.0


In [26]:
from google.colab import files
combined_df.to_csv('combined_df.csv')
files.download('combined_df.csv')


# Conclusion

1. Population is fairly evenly distributed across districts and subcounties, except for Kaabong, whose population is much higher than the rest.

2. Kaabong has the highest population at both district and subcounty level, but its crop productivity is low relative to that population.

3. Nakapiripirit appears to have a balance between population, crop productivity, and crop yield.

4. Moroto has the lowest productivity, which is out of proportion to its comparatively larger population.
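
The comparison of productivity with population can be made concrete as production per person. The sketch below uses only the five district rows visible in the `karamoja_district.head()` output of this notebook, so two of the seven districts are missing from it. The `Prod_Tot` and `Prod_Per_Capita` column names are illustrative.

```python
import pandas as pd

# Five district rows copied from the karamoja_district.head() output in this notebook.
district = pd.DataFrame({
    "NAME": ["ABIM", "AMUDAT", "KAABONG", "KOTIDO", "MOROTO"],
    "POP": [90385, 101790, 627057, 243157, 127811],
    "S_Prod_Tot": [1471506, 609552, 5731830, 16631904, 606944],
    "M_Prod_Tot": [1922567, 3545558, 6987723, 2010575, 422468],
})

# Combined sorghum and maize production (Kg) and production per person (Kg/person).
district["Prod_Tot"] = district["S_Prod_Tot"] + district["M_Prod_Tot"]
district["Prod_Per_Capita"] = district["Prod_Tot"] / district["POP"]

# Lowest production per person first.
ranked = district.sort_values("Prod_Per_Capita")
print(ranked[["NAME", "POP", "Prod_Tot", "Prod_Per_Capita"]].round(2))
```

Among these five districts, Moroto and Kaabong have the lowest combined production per person, which is in line with points 2 and 4 above.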


# Recommendation

1. Kaabong district needs more support to raise its productivity to a level that matches its very high population.

2. Support is also needed in Moroto to improve its productivity.

3. The same applies to Amudat, which has very low sorghum production.

4. Farmers should try crop diversification.

# Future Work

In the future, I would like to work with this data set extended with the following information:
1. Soil type in each region.
2. Climate seasons.
3. Whether the seed and fertilizer used are the same across all districts.