# Cleaning of the geographical enrichment variables

This notebook brings the geographical enrichment variables (https://www.ers.usda.gov/data-products/county-level-data-sets/) into a format that is convenient to work with for the purposes of this thesis.

## Set-Up

The goal is to work with pandas dataframes, so numpy and pandas are required.

In [21]:
import pandas as pd 
import numpy as np

## Raw Files

All data (separate datasets on education, population, poverty, and unemployment per state) is available in CSV format, so the built-in pandas function `read_csv` is used to load each file into a pandas dataframe.

In [22]:
# Directory that holds the raw USDA CSV files
raw_dir = "C:/Users/Hauke/OneDrive - ucp.pt/04_Thesis/00_GitHub/Thesis/data/raw/California_Geographic/"

# The files are read as latin1
educ_raw = pd.read_csv(raw_dir + "Education.csv", encoding="latin1")
pop_raw = pd.read_csv(raw_dir + "PopulationEstimates.csv", encoding="latin1")
pov_raw = pd.read_csv(raw_dir + "PovertyEstimates.csv", encoding="latin1")
unemp_raw = pd.read_csv(raw_dir + "Unemployment.csv", encoding="latin1")

## Reshaping

The downloadable CSVs are in a different format from the one displayed on the USDA homepage. Because these dataframes are meant to enrich the main dataset, which contains FIPS codes, the data is reshaped so that the FIPS code (together with the corresponding area name) indexes each dataframe. This facilitates merging and county-level analysis.
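One detail worth checking before merging on FIPS codes: when a CSV is read with default settings, pandas typically parses the codes as integers, so the California code 06001 becomes 6001. Whether the main dataset stores FIPS codes as integers or as zero-padded strings is not stated here, so the following is only a minimal sketch of how the two forms could be aligned. The column names `FIPS` and `FIPS_str` are hypothetical.

```python
import pandas as pd

# Toy frame; the column name "FIPS" is a hypothetical stand-in
toy = pd.DataFrame({"FIPS": [6001, 6003, 6005]})

# Convert to the five-character form (2-digit state + 3-digit county),
# restoring the leading zero that integer parsing drops
toy["FIPS_str"] = toy["FIPS"].astype(str).str.zfill(5)

print(toy)
```

If both datasets already store the code as an integer, this step is unnecessary; the point is only that both sides of a merge need the same type and format.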

### Education

The raw dataframe imported above is filtered to include only the relevant rows (i.e. counties in California, but not the summary row for the state of California itself). Only the percentage values relevant for the thesis and only the newest datapoints (years 2017-2021) are kept; everything else is dropped. Because the data stores both the year and the KPI in the "Attribute" column, the dataframe is pivoted so that each KPI becomes its own column, which makes the values easier to compare.

In [62]:
educ_ca = educ_raw[(educ_raw["State"] == "CA") & 
                   (educ_raw["Attribute"].str.contains("Percent")) & 
                   (educ_raw["Attribute"].str.contains("2017-21")) & 
                   (educ_raw["Area name"].str.contains("County"))]

# Long to wide: one row per county, one column per KPI
educ_ca_t = educ_ca.pivot(
    index=["Federal Information Processing Standard (FIPS) Code", "Area name"],
    columns="Attribute",
    values="Value",
)

In [63]:
educ_ca_t.head()

| Federal Information Processing Standard (FIPS) Code | Area name | Percent of adults completing some college or associate's degree, 2017-21 | Percent of adults with a bachelor's degree or higher, 2017-21 | Percent of adults with a high school diploma only, 2017-21 | Percent of adults with less than a high school diploma, 2017-21 |
|---|---|---|---|---|---|
| 6001 | Alameda County | 22.707463 | 49.643049 | 16.653921 | 10.995566 |
| 6003 | Alpine County | 28.68937 | 39.318885 | 25.696594 | 6.29515 |
| 6005 | Amador County | 41.141056 | 19.51773 | 30.650906 | 8.690307 |
| 6007 | Butte County | 38.278925 | 29.760071 | 22.300534 | 9.66047 |
| 6009 | Calaveras County | 39.950216 | 19.910732 | 30.943321 | 9.195731 |
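To make the long-to-wide step easier to follow, here is a small sketch on a toy frame. The column names `fips` and `area`, the attribute labels, and all values are made up for illustration and are not taken from the USDA files.

```python
import pandas as pd

# Toy long-format frame: one row per (county, attribute) pair
long_df = pd.DataFrame({
    "fips": [1, 1, 2, 2],
    "area": ["A County", "A County", "B County", "B County"],
    "Attribute": ["Percent X, 2017-21", "Percent Y, 2017-21"] * 2,
    "Value": [10.0, 90.0, 40.0, 60.0],
})

# Pivot so that each attribute becomes a column and
# each county (identified by code and name) becomes a row
wide_df = long_df.pivot(index=["fips", "area"], columns="Attribute", values="Value")

print(wide_df)
```

Note that `pivot` raises an error if the same index/attribute combination occurs more than once, which doubles as a check that the filter left exactly one value per county and KPI.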


### Population

The only relevant KPI in this dataframe is the most recent population estimate (2022). Apart from that and the different column names, the procedure is the same as for the education dataframe.

In [59]:
pop_ca = pop_raw[(pop_raw["State"] == "CA") &
                 (pop_raw["Attribute"].str.contains("POP_ESTIMATE_2022")) &
                 (pop_raw["Area_Name"].str.contains("County"))]

# Long to wide: one row per county, one column for the 2022 estimate
pop_ca_t = pop_ca.pivot(
    index=["FIPStxt", "Area_Name"],
    columns="Attribute",
    values="Value",
)

In [64]:
pop_ca_t.head()

| FIPStxt | Area_Name | POP_ESTIMATE_2022 |
|---|---|---|
| 6001 | Alameda County | 1628997.0 |
| 6003 | Alpine County | 1190.0 |
| 6005 | Amador County | 41412.0 |
| 6007 | Butte County | 207303.0 |
| 6009 | Calaveras County | 46563.0 |
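Because the education and population frames use different index level names, they cannot be combined on the index directly. The following is a minimal sketch of one possible way to align them, not necessarily the approach used later in the thesis. It rebuilds two rows of each frame from the outputs shown above; the shared level names `fips` and `area_name` and the variable names are hypothetical.

```python
import pandas as pd

# Two-row stand-ins for educ_ca_t and pop_ca_t (values copied from the outputs above)
educ = pd.DataFrame(
    {"Percent of adults with a bachelor's degree or higher, 2017-21": [49.643049, 39.318885]},
    index=pd.MultiIndex.from_tuples(
        [(6001, "Alameda County"), (6003, "Alpine County")],
        names=["Federal Information Processing Standard (FIPS) Code", "Area name"],
    ),
)
pop = pd.DataFrame(
    {"POP_ESTIMATE_2022": [1628997.0, 1190.0]},
    index=pd.MultiIndex.from_tuples(
        [(6001, "Alameda County"), (6003, "Alpine County")],
        names=["FIPStxt", "Area_Name"],
    ),
)

# Give both indexes the same (hypothetical) level names
common_names = ["fips", "area_name"]
educ.index = educ.index.set_names(common_names)
pop.index = pop.index.set_names(common_names)

# Join on the shared index; "inner" keeps only counties present in both frames
geo = educ.join(pop, how="inner")

print(geo)
```

An inner join silently drops counties that appear in only one frame, so comparing the row counts before and after the join is a cheap sanity check.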
