# Data ingestion, pt. 2

In this workbook, I ingest and clean an additional dataset from the California Department of Public Health: the number of people in each county living below 200% of the federal poverty level.

In 2018, the federal poverty level was approximately $12,100 for one person, $16,500 for a two-person household, and $25,100 for a four-person household. Because of California’s high cost of living, the state considers 200% of the federal poverty level to be a more realistic measure of financial hardship.
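To make the 200% cutoff concrete, here is a small sketch that doubles the approximate 2018 figures quoted above. The variable names are my own, and the inputs are the rounded values from the text, not official guideline amounts.

```python
# Approximate 2018 federal poverty level in US dollars, keyed by household size
# (rounded values as quoted in the text above).
federal_poverty_level = {1: 12_100, 2: 16_500, 4: 25_100}

# The 200% threshold is twice the federal poverty level.
threshold_200pct = {size: 2 * level for size, level in federal_poverty_level.items()}

for size, threshold in threshold_200pct.items():
    print(f"{size}-person household: ${threshold:,}")
```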

This data was derived from the U.S. Census Bureau American Community Survey, 2011-2015 Selected Population Tables table C17002 (overall poverty), and can be found online at:

https://www.cdph.ca.gov/Programs/OHE/Pages/HCI-Search.aspx

---

In [1]:
import pandas as pd
import numpy as np
import os

To start, I imported the data from a sheet in the Excel file.

In [2]:
ca_poverty_data = pd.read_excel('../data/ca_hhs_poverty_rate_2011-2015.xlsx', sheet_name = 'Data')

Not surprisingly, it contains a lot of extraneous columns.

In [3]:
ca_poverty_data.shape

(32005, 26)

In [4]:
ca_poverty_data.columns

Index(['ind_id', 'ind_definition', 'reportyear', 'race_eth_code',
       'race_eth_name', 'geotype', 'geotypevalue', 'geoname', 'county_name',
       'county_fips', 'region_name', 'region_code', 'strata_one_code',
       'strata_one_name', 'strata_two_code', 'strata_two_name', 'numerator',
       'denominator', 'estimate', 'LL_95CI', 'UL_95CI', 'SE', 'RSE',
       'CA_decile', 'CA_RR', 'version'],
      dtype='object')

First, I kept only the county-level rows, then dropped the columns I don't need.

In [5]:
# Keep county-level rows only (geotype 'CO'); the report year is filtered in a later step
ca_poverty_counties = ca_poverty_data[ca_poverty_data['geotype']=='CO']

In [6]:
# Drop unnecessary columns
ca_poverty_counties = ca_poverty_counties.drop(['ind_id', 'ind_definition', 'geotype', 'geotypevalue', 'geoname',
                                                'county_fips', 'region_name', 'region_code', 'strata_one_code',
                                                'strata_one_name', 'strata_two_code', 'strata_two_name', 'version'
                                               ], axis = 1)

Then, I filtered further to the total across all races and ethnicities, for the 2011-2015 American Community Survey report year.

In [7]:
ca_poverty_counties_2015 = ca_poverty_counties[
                     (ca_poverty_counties['race_eth_name']=='Total') & #pull total value for all race/ethnicities
                     (ca_poverty_counties['reportyear']=='2011-2015') ] #from report year 2011-2015
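After a filter like this one, each county should appear exactly once. The sketch below shows one way to check that on a made-up stand-in table. The `toy` frame and its values are hypothetical; only the column names and filter values come from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the county-level table (values are made up).
toy = pd.DataFrame({
    "county_name": ["Alameda", "Alameda", "Alpine", "Alpine"],
    "race_eth_name": ["Total", "Asian", "Total", "Asian"],
    "reportyear": ["2011-2015"] * 4,
    "numerator": [10, 4, 2, 1],
})

# Same two-condition boolean mask as in the cell above.
filtered = toy[(toy["race_eth_name"] == "Total") &
               (toy["reportyear"] == "2011-2015")]

# True only if no county name is repeated in the filtered rows.
one_row_per_county = filtered["county_name"].is_unique
print(len(filtered), one_row_per_county)
```

On the real data, the same `is_unique` check could be run on `ca_poverty_counties_2015['county_name']` before that column is renamed.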

In [8]:
ca_poverty_counties_2015 = ca_poverty_counties_2015.drop(['reportyear',
                                                          'race_eth_code', 'race_eth_name',
                                                          'CA_decile'], axis = 1)

Next, I renamed all the columns for readability and descriptiveness according to the source's data dictionary.

In [9]:
ca_poverty_counties_2015.columns = ['county', 'below_200pct_poverty', 'total_pop', 'pct_estimate',
                                    'lower_bound_95CI', 'upper_bound_95CI', 'st_error', 'rel_error',
                                    'pct_above_below_state_est']
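Assigning a list to `.columns` relies on the columns being in exactly the expected order. A possible alternative, sketched here on an empty frame that has the nine column names left after the drops above, is to rename through an explicit mapping. The `toy` frame and the `column_names` dictionary are illustrative names of my own.

```python
import pandas as pd

# Empty frame carrying the nine source column names that remain after the drops.
toy = pd.DataFrame(columns=["county_name", "numerator", "denominator", "estimate",
                            "LL_95CI", "UL_95CI", "SE", "RSE", "CA_RR"])

# Explicit old-name -> new-name mapping, so the result does not depend on column order.
column_names = {
    "county_name": "county",
    "numerator": "below_200pct_poverty",
    "denominator": "total_pop",
    "estimate": "pct_estimate",
    "LL_95CI": "lower_bound_95CI",
    "UL_95CI": "upper_bound_95CI",
    "SE": "st_error",
    "RSE": "rel_error",
    "CA_RR": "pct_above_below_state_est",
}

# errors="raise" makes pandas raise a KeyError if a source column is missing.
renamed = toy.rename(columns=column_names, errors="raise")
print(list(renamed.columns))
```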

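Finally, I reset the index so the rows are numbered consecutively from 0. This step only renumbers the rows; it does not sort them, so the counties stay in the order they have in the source file.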

In [10]:
# drop=True discards the old index instead of keeping it as an 'index' column
ca_poverty_counties_2015 = ca_poverty_counties_2015.reset_index(drop=True)
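If alphabetical order needs to be guaranteed rather than inherited from the source file, one option is to sort explicitly before resetting the index. A minimal sketch on a made-up three-row frame (the `toy` data and its index values are hypothetical):

```python
import pandas as pd

# Hypothetical rows, out of order and with gaps in the index as after filtering.
toy = pd.DataFrame(
    {"county": ["Yolo", "Alameda", "Marin"], "total_pop": [3, 1, 2]},
    index=[40, 7, 21],
)

# Sort by county name, then renumber the rows from 0 without keeping the old index.
sorted_counties = toy.sort_values("county").reset_index(drop=True)
print(sorted_counties)
```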

Luckily, the columns already have appropriate data types, so no conversion is needed.

In [11]:
ca_poverty_counties_2015.dtypes

county                        object
below_200pct_poverty           int64
total_pop                      int64
pct_estimate                 float64
lower_bound_95CI             float64
upper_bound_95CI             float64
st_error                     float64
rel_error                    float64
pct_above_below_state_est    float64
dtype: object

Time to save to a new .csv!

In [12]:
#ca_poverty_counties_2015.to_csv('../data/ca_poverty_counties_2015.csv')
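By default, `to_csv` also writes the row index as an extra unnamed column. Since the index here is just a row counter, passing `index=False` may be preferable. The sketch below shows the round trip on a made-up frame, writing to an in-memory buffer instead of a file so nothing on disk is touched.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned county table.
toy = pd.DataFrame({"county": ["Alameda", "Alpine"], "total_pop": [100, 10]})

# Write without the index column, then read the CSV text back in.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(list(round_trip.columns))
```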