# Socioeconomic Statistics - Data Processing Steps

This Jupyter notebook documents the repeatable steps for processing CSVs of Census socioeconomic data (2012-2016 American Community Survey) by block group. The workflow is summarized below:

1. Import socioeconomic data from CSVs
2. Merge data variable dataframes into one dataframe
3. Calculate metrics and complete formatting of the pandas dataframe
4. Load final socioeconomic statistics data to AWS

Once the violent crime data and the socioeconomic data are both available in AWS, they can be joined remotely and used for machine learning analysis.

## 1. Import socioeconomic data from CSVs

In [1]:
# SQL Alchemy
from sqlalchemy import create_engine

# PyMySQL 
import pymysql
pymysql.install_as_MySQLdb()

# Config variables
from config import remote_db_endpoint, remote_db_port
from config import remote_dccrime_dbname, remote_dccrime_dbuser, remote_dccrime_dbpwd

# Import Pandas
import pandas as pd

In [2]:
# Read in the CSV files for each socioeconomic indicator and rename the relevant columns to human-readable names.
total_pop_raw = pd.read_csv("SocioEcon_Data_Raw/aff_download/ACS_16_5YR_B01003_with_ann.csv")
total_pop_df = total_pop_raw.rename(columns={"HD01_VD01":"total_pop"})

poverty_raw = pd.read_csv("SocioEcon_Data_Raw/aff_download/ACS_16_5YR_B17021_with_ann.csv")
poverty_df = poverty_raw.rename(columns={"HD01_VD02":"num_in_pov"})

employment_raw = pd.read_csv("SocioEcon_Data_Raw/aff_download/ACS_16_5YR_B23025_with_ann.csv")
employment_df = employment_raw.rename(columns={"HD01_VD05":"num_unemp", "HD01_VD03":"labor_force"})

vacancy_raw = pd.read_csv("SocioEcon_Data_Raw/aff_download/ACS_16_5YR_B25002_with_ann.csv")
vacancy_df = vacancy_raw.rename(columns={"HD01_VD01":"total_units", "HD01_VD03":"vacant_units"})
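
The four read-and-rename steps above follow one pattern, so they could be folded into a small helper. The sketch below is one possible variant, not the notebook's own method: `load_acs_table` is a hypothetical name, the inline sample is a made-up miniature of an annotated ACS download, and it assumes the second line of each file is a descriptive label row (which is what the later drop of row 0 suggests). It skips that row at read time and keeps only the key and the renamed columns.

```python
import io

import pandas as pd

# Made-up miniature of an annotated ACS download (assumption): a code header row,
# a descriptive label row, then one row per block group.
SAMPLE_CSV = """GEO.id,GEO.id2,GEO.display-label,HD01_VD01,HD02_VD01
Id,Id2,Geography,Estimate; Total,Margin of Error; Total
1500000US110010001001,110010001001,"Block Group 1",1382,312
1500000US110010001002,110010001002,"Block Group 2",1463,304
"""


def load_acs_table(source, renames, key="GEO.id2"):
    """Read one ACS table and return the key column plus the renamed columns.

    `source` is a path or file-like object; `renames` maps ACS column codes
    to human-readable names. This helper is hypothetical, for illustration only.
    """
    # skiprows=[1] skips the second line of the file (the label row);
    # dtype=str keeps the block group ID as text rather than a number.
    raw = pd.read_csv(source, dtype=str, skiprows=[1])
    missing = [col for col in renames if col not in raw.columns]
    if missing:
        raise KeyError(f"Columns not found in source: {missing}")
    return raw[[key, *renames]].rename(columns=renames)


total_pop_df = load_acs_table(io.StringIO(SAMPLE_CSV), {"HD01_VD01": "total_pop"})
print(total_pop_df)
```

With real files, the `io.StringIO(...)` argument would be replaced by the CSV path. Note that a frame loaded this way no longer contains the label row, so the row-0 drop after the merge would not be needed for it.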


## 2. Merge multiple dataframes

In [3]:
# Merge multiple dataframes of socioeconomic data on block group ID.
from functools import reduce
df_list = [total_pop_df, poverty_df, employment_df, vacancy_df]
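df_merged = reduce(
    lambda left, right: pd.merge(left, right, on=["GEO.id2"], how="outer"),
    df_list,
)

# Drop row 0, which holds the descriptive column labels from the annotated CSVs, not block group data.
df_merged.drop(df_merged.index[0], inplace=True)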
df_merged.head()

| | GEO.id_x | GEO.id2 | GEO.display-label_x | total_pop | HD02_VD01_x | GEO.id_y | GEO.display-label_y | HD01_VD01_x | HD02_VD01_y | num_in_pov | ... | HD01_VD07_y | HD02_VD07_y | GEO.id_y.1 | GEO.display-label_y.1 | total_units | HD02_VD01_y.1 | HD01_VD02_y | HD02_VD02 | vacant_units | HD02_VD03 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1500000US110010001001 | 110010001001 | Block Group 1, Census Tract 1, District of Col... | 1382 | 312 | 1500000US110010001001 | Block Group 1, Census Tract 1, District of Col... | 1359 | 312 | 0 | ... | 201 | 113 | 1500000US110010001001 | Block Group 1, Census Tract 1, District of Col... | 728 | 133 | 607 | 123 | 121 | 105 |
| 2 | 1500000US110010001002 | 110010001002 | Block Group 2, Census Tract 1, District of Col... | 1463 | 304 | 1500000US110010001002 | Block Group 2, Census Tract 1, District of Col... | 1463 | 304 | 45 | ... | 346 | 155 | 1500000US110010001002 | Block Group 2, Census Tract 1, District of Col... | 1089 | 164 | 918 | 161 | 171 | 118 |
| 3 | 1500000US110010001003 | 110010001003 | Block Group 3, Census Tract 1, District of Col... | 972 | 217 | 1500000US110010001003 | Block Group 3, Census Tract 1, District of Col... | 972 | 217 | 14 | ... | 160 | 70 | 1500000US110010001003 | Block Group 3, Census Tract 1, District of Col... | 524 | 120 | 437 | 90 | 87 | 96 |
| 4 | 1500000US110010001004 | 110010001004 | Block Group 4, Census Tract 1, District of Col... | 1188 | 337 | 1500000US110010001004 | Block Group 4, Census Tract 1, District of Col... | 1188 | 337 | 106 | ... | 297 | 195 | 1500000US110010001004 | Block Group 4, Census Tract 1, District of Col... | 601 | 127 | 492 | 118 | 109 | 93 |
| 5 | 1500000US110010002011 | 110010002011 | Block Group 1, Census Tract 2.01, District of ... | 3733 | 361 | 1500000US110010002011 | Block Group 1, Census Tract 2.01, District of ... | 83 | 53 | 55 | ... | 2081 | 329 | 1500000US110010002011 | Block Group 1, Census Tract 2.01, District of ... | 3 | 4 | 3 | 4 | 0 | 12 |
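
The merged output above repeats the `GEO.id` and `GEO.display-label` columns with `_x`/`_y` suffixes because every input frame carries them. A minimal sketch of one way to avoid that, assuming each frame is first trimmed to `GEO.id2` plus the columns actually needed: the frames below reuse the first two block groups shown above, and `merge_on_key` is a hypothetical helper name. `validate="one_to_one"` asks pandas to raise an error if a block group ID appears more than once in either frame.

```python
from functools import reduce

import pandas as pd

# Trimmed frames holding only the join key and the needed columns.
# Values are copied from the first two block groups in the output above.
total_pop_df = pd.DataFrame(
    {"GEO.id2": ["110010001001", "110010001002"], "total_pop": ["1382", "1463"]}
)
poverty_df = pd.DataFrame(
    {"GEO.id2": ["110010001001", "110010001002"], "num_in_pov": ["0", "45"]}
)
vacancy_df = pd.DataFrame(
    {
        "GEO.id2": ["110010001001", "110010001002"],
        "total_units": ["728", "1089"],
        "vacant_units": ["121", "171"],
    }
)


def merge_on_key(frames, key="GEO.id2"):
    """Outer-merge a list of dataframes on `key`, checking that keys are unique."""
    return reduce(
        lambda left, right: pd.merge(
            left, right, on=key, how="outer", validate="one_to_one"
        ),
        frames,
    )


merged = merge_on_key([total_pop_df, poverty_df, vacancy_df])
print(merged)
```

Because no non-key column name is shared between the trimmed frames, the result has one column per variable and no suffixes.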


## 3. Calculate metrics and format final dataframe

In [4]:
# Keep the full GEOID and a shortened block group ID with the 5-character state and county prefix removed.
BG_ID = df_merged["GEO.id2"].str[5:]
GEOID = df_merged["GEO.id2"]

# Cast relevant dataframe fields as float to allow for calculation
numeric_cols = ["total_pop", "num_in_pov", "labor_force", "num_unemp", "total_units", "vacant_units"]
df_merged[numeric_cols] = df_merged[numeric_cols].astype(float)

In [5]:
# Calculate total population, percent in poverty, percent unemployed, and percent housing vacancy.

total_pop = df_merged["total_pop"]
pct_poverty = (df_merged["num_in_pov"]/df_merged["total_pop"])*100
pct_unemployed = (df_merged["num_unemp"]/df_merged["labor_force"])*100
pct_vacancy = (df_merged["vacant_units"]/df_merged["total_units"])*100

In [6]:
# Write socioeconomic variable calculations to a new dataframe
socioecon_data_df = pd.DataFrame({"Block Group ID": BG_ID,
                                  "GEOID": GEOID,
                                  "Total Population": total_pop,
                                  "Pct Poverty": pct_poverty,
                                  "Pct Unemployed": pct_unemployed,
                                  "Pct Vacant": pct_vacancy})

socioecon_data_df.head(10)

| | Block Group ID | GEOID | Total Population | Pct Poverty | Pct Unemployed | Pct Vacant |
|---|---|---|---|---|---|---|
| 1 | 1001 | 110010001001 | 1382.0 | 0.0 | 0.0 | 16.620879 |
| 2 | 1002 | 110010001002 | 1463.0 | 3.075871 | 5.433746 | 15.702479 |
| 3 | 1003 | 110010001003 | 972.0 | 1.440329 | 9.256449 | 16.603053 |
| 4 | 1004 | 110010001004 | 1188.0 | 8.922559 | 3.095975 | 18.136439 |
| 5 | 2011 | 110010002011 | 3733.0 | 1.473346 | 6.604938 | 0.0 |
| 6 | 2021 | 110010002021 | 1224.0 | 15.686275 | 2.124431 | 16.824197 |
| 7 | 2022 | 110010002022 | 566.0 | 7.243816 | 0.0 | 6.329114 |
| 8 | 2023 | 110010002023 | 895.0 | 20.782123 | 10.46729 | 5.135135 |
| 9 | 2024 | 110010002024 | 1732.0 | 18.591224 | 4.782609 | 21.698113 |
| 10 | 3001 | 110010003001 | 1114.0 | 11.220826 | 3.059805 | 8.206107 |
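
The percentages above are plain ratios, so a block group with a zero denominator (no population, labor force, or housing units) would produce `inf` or `NaN`. The sketch below shows one hedged way to make that explicit: `percent` is a hypothetical helper, the first two rows reuse counts from the merged output above, and the third row is an invented zero-population case added only to show the behavior.

```python
import pandas as pd


def percent(numerator, denominator):
    """Return numerator / denominator * 100, with NaN where the denominator is zero or missing."""
    num = pd.to_numeric(numerator, errors="coerce")
    den = pd.to_numeric(denominator, errors="coerce")
    # Replace zero denominators with NaN so the result is NaN instead of inf.
    return num / den.where(den != 0) * 100


sample = pd.DataFrame(
    {
        "num_in_pov": ["0", "45", "10"],
        "total_pop": ["1382", "1463", "0"],  # last row is hypothetical
    }
)

pct_poverty = percent(sample["num_in_pov"], sample["total_pop"])
print(pct_poverty.round(6).tolist())
```

The second value reproduces the 3.075871 shown for block group 110010001002, and the hypothetical third row comes back as `NaN` rather than `inf`, which is easier to filter or impute before modeling.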


## 4. Load Socioeconomic Data to AWS

In [9]:
# Load socioeconomic data to AWS
engine = create_engine(f"mysql://{remote_dccrime_dbuser}:{remote_dccrime_dbpwd}@{remote_db_endpoint}:{remote_db_port}/{remote_dccrime_dbname}")
conn = engine.connect()


socioecon_data_df.to_sql(name='socioecon_data_update', if_exists='replace', con=conn, chunksize=1000, index=False)
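
Before pointing the load at the remote database, the `to_sql` call can be rehearsed locally. The sketch below is an illustration under assumptions, not the notebook's deployment path: it swaps the MySQL engine for an in-memory SQLite connection from the standard library and uses a two-row stand-in for `socioecon_data_df`, then reads the row count back to confirm the write.

```python
import sqlite3

import pandas as pd

# Two-row stand-in for socioecon_data_df, using values shown in the output above.
sample_df = pd.DataFrame(
    {
        "Block Group ID": ["1001", "1002"],
        "GEOID": ["110010001001", "110010001002"],
        "Total Population": [1382.0, 1463.0],
    }
)

# In-memory SQLite stands in for the remote MySQL database (assumption for local testing).
conn = sqlite3.connect(":memory:")
try:
    sample_df.to_sql(
        name="socioecon_data_update",
        con=conn,
        if_exists="replace",  # drop and recreate the table if it already exists
        chunksize=1000,
        index=False,
    )
    # Read the row count back to confirm that every record was written.
    row_count = pd.read_sql(
        "SELECT COUNT(*) AS n FROM socioecon_data_update", conn
    )["n"].iloc[0]
finally:
    conn.close()

print(row_count)
```

The same count check can be run against the remote table after the real load to confirm that the number of rows matches `len(socioecon_data_df)`.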