# Project 2: More Data, More Visualizations

For this project, the students will:
* find a data set of their choosing
* get approval from the instructor to use that data set
* upon approval, find another (related) data set
* join the new data set with the original one to create an "enriched" data set
* perform an open-ended Exploratory Data Analysis (EDA) on the enriched data set

Regarding the last bullet, "open-ended" means the student chooses the EDA that is performed. The student should include at least three types of data analysis (e.g., mean, standard deviation) and at least three types of graphs (e.g., histogram, bar graph). The student will explain why those variables were chosen for numerical or graphical analysis. Finally, the student will note any unusual values in any variable that is analyzed.
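
The join and summary-statistics steps described above can be sketched with pandas. This is a minimal illustration, not the notebook's own method: the function names `enrich` and `summarize`, and the idea of a single shared key column, are assumptions made for the example.

```python
import pandas as pd


def enrich(original: pd.DataFrame, extra: pd.DataFrame, key: str) -> pd.DataFrame:
    """Left-join `extra` onto `original` using a shared key column."""
    # validate="one_to_one" raises an error if the key is duplicated on either side.
    # indicator=True adds a "_merge" column showing which rows found a match.
    return original.merge(
        extra, on=key, how="left", validate="one_to_one", indicator=True
    )


def summarize(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Mean, standard deviation, and median for the chosen numeric columns."""
    return frame[columns].agg(["mean", "std", "median"])
```

Rows marked `left_only` in the `_merge` column had no partner in the second data set and are worth inspecting before any analysis.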

### Step 1
Using the data provided, I will load and clean the yield and population data.

## Research Questions
##### 1) Which subcounties in Karamoja had the lowest maize and sorghum productivity in the 2017 crop season, and how does this correlate with their population density?
(Specific: I will focus on subcounties with low productivity, Measurable: Both productivity and population density can be quantified, Achievable: Data is available in the dataset, Relevant: Helps NGOs prioritize interventions, Time-bound: Uses the 2017 crop season data.)
##### 2) How does the total crop area per subcounty impact the average yield of maize in Karamoja during the 2017 crop season?
(Specific: I will examine the relationship between crop area and yield, Measurable: Crop area and yield are numerical variables, Achievable: Data is present in the dataset, Relevant: Helps optimize land use strategies, Time-bound: Limited to the 2017 season.)
##### 3) Which districts in Karamoja have the highest food security risk based on the 2017 crop yield per capita, and how does this vary at the subcounty level?
(Specific: I will identify food security risk based on crop yield per capita, Measurable: Yield per capita can be calculated, Achievable: Population and yield data are available, Relevant: Directly informs food security interventions, Time-bound: Uses 2017 season data.)
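
The quantities named in the three research questions could be derived along the following lines. This is a hedged sketch: the helper names are hypothetical, the `Area` column is assumed to be in square metres, and "yield per capita" is assumed to mean total sorghum plus maize production divided by population. Neither assumption is confirmed by the data set.

```python
import pandas as pd


def add_indicators(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with two derived columns for the research questions."""
    out = frame.copy()
    # ASSUMPTION: "Area" is in square metres, so dividing by 1e6 gives square kilometres.
    out["pop_density_km2"] = out["POP"] / (out["Area"] / 1e6)
    # ASSUMPTION: "yield per capita" means total sorghum + maize production per person.
    out["prod_per_capita"] = (out["S_Prod_Tot"] + out["M_Prod_Tot"]) / out["POP"]
    return out


def lowest_yield(frame: pd.DataFrame, column: str, n: int = 5) -> pd.DataFrame:
    """The n subcounties with the smallest value in `column` (e.g. "M_Yield_Ha")."""
    return frame.nsmallest(n, column)[["SUBCOUNTY_NAME", "DISTRICT_NAME", column]]


def district_per_capita(frame: pd.DataFrame) -> pd.Series:
    """District-level production per person, lowest (highest risk) first."""
    totals = frame.groupby("DISTRICT_NAME")[["S_Prod_Tot", "M_Prod_Tot", "POP"]].sum()
    return ((totals["S_Prod_Tot"] + totals["M_Prod_Tot"]) / totals["POP"]).sort_values()
```

Once the data is loaded, `Series.corr` gives a Pearson correlation between any two of these columns, for example `df["Crop_Area_Ha"].corr(df["M_Yield_Ha"])` for question 2.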


In [3]:
import pandas as pd
import numpy as np

# Load the Karamoja subcounty crop yield and population table
df = pd.read_csv(r"C:\Users\peter\Downloads\DATA\DATA\TABLES\Uganda_Karamoja_Subcounty_Crop_Yield_Population.csv")

In [4]:
print(df.isnull().sum())  # Count missing values in each column


OBJECTID          0
SUBCOUNTY_NAME    0
DISTRICT_NAME     0
POP               0
Area              0
Karamoja          0
S_Yield_Ha        0
M_Yield_Ha        0
Crop_Area_Ha      0
S_Area_Ha         0
M_Area_Ha         0
S_Prod_Tot        0
M_Prod_Tot        0
dtype: int64


In [5]:
df.dropna(inplace=True)  # Remove rows with any missing values
df.dropna(axis=1, inplace=True)  # Remove columns with missing values


In [7]:
print(df.columns)  # Lists all column names


Index(['OBJECTID', 'SUBCOUNTY_NAME', 'DISTRICT_NAME', 'POP', 'Area',
       'Karamoja', 'S_Yield_Ha', 'M_Yield_Ha', 'Crop_Area_Ha', 'S_Area_Ha',
       'M_Area_Ha', 'S_Prod_Tot', 'M_Prod_Tot'],
      dtype='object')


In [10]:
# POP is numeric, so fill any gaps with the column mean rather than a placeholder string
df["POP"] = df["POP"].fillna(df["POP"].mean())


In [13]:
df["SUBCOUNTY_NAME"] = df["SUBCOUNTY_NAME"].fillna("Unknown")  # Label any missing subcounty names
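
The table printed later in this notebook shows a subcounty name with a doubled space (`KAABONG  EAST`), which can cause mismatches when joining on names. One possible cleanup, offered as a sketch with the hypothetical helper name `normalize_names`, is to collapse runs of whitespace and trim the ends:

```python
import pandas as pd


def normalize_names(series: pd.Series) -> pd.Series:
    """Collapse repeated whitespace to one space and strip leading/trailing spaces."""
    return series.str.replace(r"\s+", " ", regex=True).str.strip()
```

It could be applied as `df["SUBCOUNTY_NAME"] = normalize_names(df["SUBCOUNTY_NAME"])` before saving the cleaned file.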


In [14]:
df.drop_duplicates(inplace=True)


In [None]:
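# Remove outliers with the 1.5 * IQR rule.
# "M_Yield_Ha" is used here only as an example; substitute any numeric column from df.columns.
col = "M_Yield_Ha"
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
df = df[(df[col] >= (Q1 - 1.5 * IQR)) & (df[col] <= (Q3 + 1.5 * IQR))]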
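
Because the project brief asks for unusual values to be noted, an alternative to filtering them out is to list them for inspection. The following sketch applies the same 1.5 × IQR rule but returns the flagged rows instead of dropping them. The function name `iqr_outliers` and the parameter `k` are illustrative assumptions.

```python
import pandas as pd


def iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = (frame[column] < q1 - k * iqr) | (frame[column] > q3 + k * iqr)
    return frame[mask]
```

Calling it once per numeric column gives a short list of subcounties to discuss in the write-up, without changing `df` itself.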


In [19]:
df.to_csv("cleaned_data.csv", index=False)  # Saves without the index column


In [None]:
import pandas as pd

# Load the cleaned CSV file
df = pd.read_csv("cleaned_data.csv")

# Display the first few rows
print(df.head())


In [20]:
import os
print(os.path.exists("cleaned_data.csv"))  # Should return True if the file exists


True


In [21]:
print(df)


    OBJECTID              SUBCOUNTY_NAME  DISTRICT_NAME     POP        Area  \
0        263                     KACHERI         KOTIDO   17244  1067176155   
1        264                      KOTIDO         KOTIDO   52771   597575188   
2        265         KOTIDO TOWN COUNCIL         KOTIDO   27389    23972401   
3        266                NAKAPERIMORU         KOTIDO   38775   419111591   
4        267                  PANYANGARA         KOTIDO   65704   880955930   
5        268                      RENGEN         KOTIDO   41273   652744859   
6        591               KAABONG  EAST        KAABONG   42221    60801942   
7        592        KAABONG TOWN COUNCIL        KAABONG   38857    13071455   
8        593                KAABONG WEST        KAABONG   41454    67612362   
9        594                    KALAPATA        KAABONG   99203   223116860   
10       595                      KAMION        KAABONG   60070  1199409465   
11       596                      KAPEDO        KAAB

In [22]:
import os

file_path = os.path.abspath("cleaned_data.csv")  # Get the full path
df.to_csv(file_path, index=False)  # Save CSV

print("File saved at:", file_path)  # Print file location


File saved at: C:\Users\peter\Documents\Flatiron\dsc-dvp-project02\cleaned_data.csv


### Tableau presentation link 
[here](https://public.tableau.com/views/Project2_17391229215050/KaramojaFoodSecurityMonitor?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)

### PowerPoint presentation link
[here](https://drive.google.com/file/d/1ar4XiE5wW1LJw14vOXoSbCcNQrc6Rtje/view?usp=sharing)  
