# Cleaning of the geographical enrichment variables

This notebook aims to bring the geographical enrichment variables (https://www.ers.usda.gov/data-products/county-level-data-sets/) into a format that is useful to work with for the purpose of this thesis.

## Set-Up

The goal is to work with pandas dataframes, so numpy and pandas are required.

In [21]:
import pandas as pd 
import numpy as np

## Raw Files

All data (separate datasets on education, population, poverty, and unemployment, each covering states and counties) is available in CSV format, so the built-in pandas function `read_csv` is used to load each file into a pandas dataframe.

In [22]:
educ_raw = pd.read_csv("C:/Users/Hauke/OneDrive - ucp.pt/04_Thesis/00_GitHub/Thesis/data/raw/California_Geographic/Education.csv", encoding = "latin1")
pop_raw = pd.read_csv("C:/Users/Hauke/OneDrive - ucp.pt/04_Thesis/00_GitHub/Thesis/data/raw/California_Geographic/PopulationEstimates.csv", encoding = "latin1")
pov_raw = pd.read_csv("C:/Users/Hauke/OneDrive - ucp.pt/04_Thesis/00_GitHub/Thesis/data/raw/California_Geographic/PovertyEstimates.csv", encoding = "latin1")
unemp_raw = pd.read_csv("C:/Users/Hauke/OneDrive - ucp.pt/04_Thesis/00_GitHub/Thesis/data/raw/California_Geographic/Unemployment.csv", encoding = "latin1")
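
One detail worth keeping in mind when loading these files is how the FIPS code is typed. The sketch below uses a small in-memory CSV as a stand-in for a raw file (the column layout is an assumption modelled on the unemployment file, and it assumes the codes are stored with a leading zero) to show that `read_csv` infers integers by default, while passing `dtype` keeps the code as text. Which variant is appropriate depends on how the main dataset stores its FIPS codes.

```python
import io

import pandas as pd

# Small in-memory stand-in for a raw file; the layout is an assumption.
csv_text = """FIPS_Code,State,Area_Name,Attribute,Value
06001,CA,"Alameda County, CA",Unemployment_rate_2022,3.3
06003,CA,"Alpine County, CA",Unemployment_rate_2022,5.4
"""

# Default parsing infers an integer column, so a leading zero is lost.
fips_as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the code exactly as written in the file.
fips_as_str = pd.read_csv(io.StringIO(csv_text), dtype={"FIPS_Code": str})

print(fips_as_int["FIPS_Code"].tolist())  # [6001, 6003]
print(fips_as_str["FIPS_Code"].tolist())  # ['06001', '06003']
```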

## Reshaping

The downloadable CSVs are in a different format than the one displayed on the USDA homepage. As the aim of these dataframes is to enrich the main dataset, which contains FIPS codes, the data is reshaped so that the FIPS code (together with the corresponding area name) is used to index each dataframe. This facilitates merging and county-level analysis.

### Education

The raw dataframe imported above is filtered to include only the relevant rows (i.e. counties in California, but not the summary row for the state of California itself). Only the percentage values relevant for the thesis and only the newest datapoints (years 2017-2021) are kept; everything else is dropped. As the data is in long format, with the year and KPI both encoded in the "Attribute" column, the dataframe is pivoted so that each KPI becomes its own column, which makes the values easier to compare.

In [62]:
educ_ca = educ_raw[(educ_raw["State"] == "CA") & 
                   (educ_raw["Attribute"].str.contains("Percent")) & 
                   (educ_raw["Attribute"].str.contains("2017-21")) & 
                   (educ_raw["Area name"].str.contains("County"))]

educ_ca_t = educ_ca.pivot(index = ["Federal Information Processing Standard (FIPS) Code", "Area name"],columns="Attribute", values="Value")

In [63]:
educ_ca_t.head()

Federal Information Processing Standard (FIPS) Code,Area name,"Percent of adults completing some college or associate's degree, 2017-21","Percent of adults with a bachelor's degree or higher, 2017-21","Percent of adults with a high school diploma only, 2017-21","Percent of adults with less than a high school diploma, 2017-21"
6001,Alameda County,22.707463,49.643049,16.653921,10.995566
6003,Alpine County,28.68937,39.318885,25.696594,6.29515
6005,Amador County,41.141056,19.51773,30.650906,8.690307
6007,Butte County,38.278925,29.760071,22.300534,9.66047
6009,Calaveras County,39.950216,19.910732,30.943321,9.195731
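
To make the long-to-wide step easier to follow in isolation, here is a minimal sketch on a four-row frame. It reuses two of the attribute names and the first two counties from the output above; the variable names `long_df` and `wide_df` are illustrative only and not part of the notebook.

```python
import pandas as pd

some_college = "Percent of adults completing some college or associate's degree, 2017-21"
bachelor = "Percent of adults with a bachelor's degree or higher, 2017-21"

# Long format: one row per (county, attribute) pair.
long_df = pd.DataFrame({
    "FIPS": [6001, 6001, 6003, 6003],
    "Area name": ["Alameda County", "Alameda County", "Alpine County", "Alpine County"],
    "Attribute": [some_college, bachelor, some_college, bachelor],
    "Value": [22.707463, 49.643049, 28.68937, 39.318885],
})

# Wide format: one row per county, one column per attribute.
wide_df = long_df.pivot(index=["FIPS", "Area name"], columns="Attribute", values="Value")
print(wide_df.shape)  # (2, 2)
```

Note that `pivot` raises a `ValueError` if the same index/attribute combination occurs more than once, which is one reason to filter down to a single period before reshaping.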


### Population

The only relevant KPI from this dataframe is the most recent (2022) population estimate. Apart from that and the different column names, the procedure is the same as for the education dataframe.

In [59]:
pop_ca = pop_raw[(pop_raw["State"] == "CA") &
                 (pop_raw["Attribute"].str.contains("POP_ESTIMATE_2022")) &
                 (pop_raw["Area_Name"].str.contains("County"))]

pop_ca_t = pop_ca.pivot(index = ["FIPStxt", "Area_Name"],columns="Attribute", values="Value")

In [64]:
pop_ca_t.head()

FIPStxt,Area_Name,POP_ESTIMATE_2022
6001,Alameda County,1628997.0
6003,Alpine County,1190.0
6005,Amador County,41412.0
6007,Butte County,207303.0
6009,Calaveras County,46563.0


### Poverty

The variable of interest in this dataframe is "PCTPOVALL_2021", the estimated percentage of people of all ages in the county who live in poverty. The procedure is the same as for the education dataframe.

In [73]:
pov_ca = pov_raw[(pov_raw["Stabr"] == "CA") &
                 (pov_raw["Attribute"].str.contains("PCTPOVALL_2021")) &
                 (pov_raw["Area_name"].str.contains("County"))]

pov_ca_t = pov_ca.pivot(index = ["FIPS_Code", "Area_name"],columns="Attribute", values="Value")

In [74]:
pov_ca_t.head()

FIPS_Code,Area_name,PCTPOVALL_2021
6001,Alameda County,9.4
6003,Alpine County,15.8
6005,Amador County,11.1
6007,Butte County,16.6
6009,Calaveras County,13.5


### Unemployment

The unemployment dataframe contains two values of interest per county: the unemployment rate (for this analysis, the most recent figure, "Unemployment_rate_2022", is used) and the median household income ("Median_Household_Income_2021"). The procedure is the same as for the previous dataframes.

In [97]:
unemp_ca = unemp_raw[(unemp_raw["State"] == "CA") &
                     (unemp_raw["Attribute"].isin(["Unemployment_rate_2022", "Median_Household_Income_2021"])) &
                     (unemp_raw["Area_Name"].str.contains("County"))]

unemp_ca_t = unemp_ca.pivot(index = ["FIPS_Code", "Area_Name"],columns="Attribute", values="Value")

In [98]:
unemp_ca_t.head()

FIPS_Code,Area_Name,Median_Household_Income_2021,Unemployment_rate_2022
6001,"Alameda County, CA",108971.0,3.3
6003,"Alpine County, CA",87570.0,5.4
6005,"Amador County, CA",68159.0,4.4
6007,"Butte County, CA",62982.0,4.5
6009,"Calaveras County, CA",68298.0,3.6
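
Since the same filter-and-pivot procedure is repeated for each dataset with only the column names changing, it could be wrapped in a small helper. The sketch below is one possible shape for it; `filter_and_pivot` is a hypothetical name, not something the notebook defines, and it matches attributes exactly via `isin` (as in the unemployment cell) rather than by substring. The toy frame reuses the Alameda and Alpine values shown above; the value in the state-level row is a placeholder.

```python
import pandas as pd

def filter_and_pivot(raw, state_col, area_col, fips_col, attributes, state="CA"):
    """Keep the county rows of one state for the given attributes, then pivot to wide format."""
    mask = (
        (raw[state_col] == state)
        & raw["Attribute"].isin(attributes)
        & raw[area_col].str.contains("County", na=False)
    )
    return raw[mask].pivot(index=[fips_col, area_col], columns="Attribute", values="Value")

# Toy stand-in for the raw unemployment data (first row is the state summary).
toy_raw = pd.DataFrame({
    "FIPS_Code": [6000, 6001, 6001, 6003, 6003],
    "State": ["CA"] * 5,
    "Area_Name": ["California", "Alameda County, CA", "Alameda County, CA",
                  "Alpine County, CA", "Alpine County, CA"],
    "Attribute": ["Unemployment_rate_2022", "Unemployment_rate_2022",
                  "Median_Household_Income_2021", "Unemployment_rate_2022",
                  "Median_Household_Income_2021"],
    "Value": [0.0, 3.3, 108971.0, 5.4, 87570.0],  # 0.0 is a placeholder, not real data
})

toy_wide = filter_and_pivot(
    toy_raw, "State", "Area_Name", "FIPS_Code",
    ["Unemployment_rate_2022", "Median_Household_Income_2021"],
)
print(toy_wide)
```

The state summary row is dropped because its area name does not contain "County", mirroring the filter used in the cells above.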


## Check completeness

Each dataframe should contain one row for each of California's 58 counties and no missing values. Both are checked below.

In [99]:
print(educ_ca_t.shape, pop_ca_t.shape, pov_ca_t.shape, unemp_ca_t.shape)

(58, 4) (58, 1) (58, 1) (58, 2)


In [104]:
print(educ_ca_t.isna().sum(), "\n", pop_ca_t.isna().sum(), "\n", pov_ca_t.isna().sum(), "\n", unemp_ca_t.isna().sum())

Attribute
Percent of adults completing some college or associate's degree, 2017-21    0
Percent of adults with a bachelor's degree or higher, 2017-21               0
Percent of adults with a high school diploma only, 2017-21                  0
Percent of adults with less than a high school diploma, 2017-21             0
dtype: int64 
 Attribute
POP_ESTIMATE_2022    0
dtype: int64 
 Attribute
PCTPOVALL_2021    0
dtype: int64 
 Attribute
Median_Household_Income_2021    0
Unemployment_rate_2022          0
dtype: int64


## Merging

To create one dataframe containing all information, the dataframes are merged on their indices. As the index levels are named differently across the dataframes (and the unemployment data spells the area names with a ", CA" suffix), the indices of all dataframes are overwritten with the index of the education dataframe. This is only safe because the completeness check above shows 58 counties in every dataframe and the rows line up one to one by FIPS code; only the naming differs. The resulting dataframe holds all information in a unified format, indexed by FIPS code and county name.

In [109]:
pop_ca_t.index = educ_ca_t.index
pov_ca_t.index = educ_ca_t.index
unemp_ca_t.index = educ_ca_t.index
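
Overwriting an index like this silently misaligns rows if two frames ever list the counties in a different order, and matching shapes alone would not reveal that. A minimal sketch of a guard that could run before the assignment is shown below; `fips_aligned` is a hypothetical helper, and the two toy frames only mimic the structure of `educ_ca_t` and `unemp_ca_t` using values from the outputs above.

```python
import pandas as pd

def fips_aligned(reference, other):
    """Return True if both frames hold the same FIPS codes (index level 0) in the same order."""
    return list(reference.index.get_level_values(0)) == list(other.index.get_level_values(0))

educ_demo = pd.DataFrame(
    {"bachelor_or_higher": [49.643049, 39.318885]},
    index=pd.MultiIndex.from_tuples(
        [(6001, "Alameda County"), (6003, "Alpine County")],
        names=["Federal Information Processing Standard (FIPS) Code", "Area name"],
    ),
)
unemp_demo = pd.DataFrame(
    {"Unemployment_rate_2022": [3.3, 5.4]},
    index=pd.MultiIndex.from_tuples(
        [(6001, "Alameda County, CA"), (6003, "Alpine County, CA")],
        names=["FIPS_Code", "Area_Name"],
    ),
)

# Only overwrite the index once the FIPS codes are known to line up.
assert fips_aligned(educ_demo, unemp_demo)
unemp_demo.index = educ_demo.index

# A frame in a different row order would be rejected by the guard.
reordered = educ_demo.iloc[::-1]
print(fips_aligned(educ_demo, reordered))  # False
```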

In [114]:
# Merge the dataframes on their shared two-level index (FIPS code and area name)
ca_geo = educ_ca_t.merge(pop_ca_t, left_index = True, right_index = True, how = "inner")
ca_geo = ca_geo.merge(pov_ca_t, left_index = True, right_index = True, how = "inner")
ca_geo = ca_geo.merge(unemp_ca_t, left_index = True, right_index = True, how = "inner")

In [115]:
ca_geo.head()

Federal Information Processing Standard (FIPS) Code,Area name,"Percent of adults completing some college or associate's degree, 2017-21","Percent of adults with a bachelor's degree or higher, 2017-21","Percent of adults with a high school diploma only, 2017-21","Percent of adults with less than a high school diploma, 2017-21",POP_ESTIMATE_2022,PCTPOVALL_2021,Median_Household_Income_2021,Unemployment_rate_2022
6001,Alameda County,22.707463,49.643049,16.653921,10.995566,1628997.0,9.4,108971.0,3.3
6003,Alpine County,28.68937,39.318885,25.696594,6.29515,1190.0,15.8,87570.0,5.4
6005,Amador County,41.141056,19.51773,30.650906,8.690307,41412.0,11.1,68159.0,4.4
6007,Butte County,38.278925,29.760071,22.300534,9.66047,207303.0,16.6,62982.0,4.5
6009,Calaveras County,39.950216,19.910732,30.943321,9.195731,46563.0,13.5,68298.0,3.6


## Renaming

The index levels and the columns are renamed to short, descriptive names that reflect the KPIs they contain.

In [120]:
ca_geo.index.names = ["FIPS", "County"]
ca_geo.columns = ["percentage_college_degree", "percentage_bachelor_or_higher", "percentage_high_school", "percentage_less_than_high_school", "population", "poverty_rate", "median_household_income", "unemployment_rate"]

In [121]:
ca_geo.head()

FIPS,County,percentage_college_degree,percentage_bachelor_or_higher,percentage_high_school,percentage_less_than_high_school,population,poverty_rate,median_household_income,unemployment_rate
6001,Alameda County,22.707463,49.643049,16.653921,10.995566,1628997.0,9.4,108971.0,3.3
6003,Alpine County,28.68937,39.318885,25.696594,6.29515,1190.0,15.8,87570.0,5.4
6005,Amador County,41.141056,19.51773,30.650906,8.690307,41412.0,11.1,68159.0,4.4
6007,Butte County,38.278925,29.760071,22.300534,9.66047,207303.0,16.6,62982.0,4.5
6009,Calaveras County,39.950216,19.910732,30.943321,9.195731,46563.0,13.5,68298.0,3.6
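
Assigning a list to `columns` depends on the columns being in exactly the expected order. As an alternative sketch, the same renaming can be expressed as an explicit mapping passed to `DataFrame.rename`, which works regardless of column order. The mapping below pairs the original and new names used in this notebook; the one-row frame `demo` is an illustrative stand-in for `ca_geo` with its columns deliberately shuffled.

```python
import pandas as pd

rename_map = {
    "Percent of adults completing some college or associate's degree, 2017-21": "percentage_college_degree",
    "Percent of adults with a bachelor's degree or higher, 2017-21": "percentage_bachelor_or_higher",
    "Percent of adults with a high school diploma only, 2017-21": "percentage_high_school",
    "Percent of adults with less than a high school diploma, 2017-21": "percentage_less_than_high_school",
    "POP_ESTIMATE_2022": "population",
    "PCTPOVALL_2021": "poverty_rate",
    "Median_Household_Income_2021": "median_household_income",
    "Unemployment_rate_2022": "unemployment_rate",
}

# One-row stand-in for ca_geo (Alameda County), columns in a different order.
demo = pd.DataFrame(
    {
        "Unemployment_rate_2022": [3.3],
        "POP_ESTIMATE_2022": [1628997.0],
        "Percent of adults with a bachelor's degree or higher, 2017-21": [49.643049],
    },
    index=pd.MultiIndex.from_tuples([(6001, "Alameda County")], names=["FIPS", "County"]),
)

# Fail early if a column has no entry in the mapping.
unmapped = set(demo.columns) - set(rename_map)
assert not unmapped, f"No new name defined for: {unmapped}"

renamed = demo.rename(columns=rename_map)
print(list(renamed.columns))
# ['unemployment_rate', 'population', 'percentage_bachelor_or_higher']
```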
