## Children Household Count Per County File Processing

In this file, we build a dataframe that holds the total number of children across all households in each US county.

### Import statements

In [1]:
import pandas as pd
import geopandas as gpd
import os



### Setting ```DATA_DIR```
To read files from this repository, ```DATA_DIR``` must point to the data folder within it. This requires ```os.getcwd()``` to return the path to the processing folder of this repository, ```xxx/codeplus-celine-dcc-package/processing```, where ```xxx``` is the path to where you cloned the repository. If it does not, call ```os.chdir(path)``` with ```path``` set to ```xxx/codeplus-celine-dcc-package/processing``` before running ```DATA_DIR = os.getcwd()```.

In [2]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
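
The `replace` call above rewrites every occurrence of `processing` in the path, so a parent directory that happens to contain that word would be changed too. Below is a minimal sketch of an alternative that swaps only the last path component. It assumes the `data` folder sits next to `processing`, as described above; `sibling_data_dir` is a hypothetical helper, not something that exists in this repository.

```python
from pathlib import Path


def sibling_data_dir(cwd, notebook_dir="processing", data_dir="data"):
    """Return the data folder that sits next to the notebook folder.

    Hypothetical helper: assumes `cwd` ends in `notebook_dir` and that
    `data_dir` is a sibling of it.
    """
    cwd = Path(cwd)
    if cwd.name != notebook_dir:
        raise ValueError(f"expected to run from '{notebook_dir}', got '{cwd}'")
    # Only the final path component is swapped; parent folders are untouched.
    return cwd.with_name(data_dir)


# Made-up path whose parent folder also contains the word "processing".
example = sibling_data_dir("/tmp/processing-repo/processing")
print(example)
```

In this notebook the equivalent call would be `DATA_DIR = str(sibling_data_dir(os.getcwd()))`, which keeps `DATA_DIR` a string so the later `DATA_DIR + '/...'` concatenations still work.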

### Importing county-level shapefile

This shapefile of US counties was taken from the United States Census Bureau's Cartographic Boundary Files (available [here](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html)). The file provides the polygon geometry of each county in the US, which allows us to make visualizations of the US broken down by county. In this case, we are making this processing file so that we can map the US with children counts by county.

In [3]:
df_counties = gpd.read_file(DATA_DIR + '/source_files/county_shapefiles/counties.shp')
df_counties['county_fips'] = df_counties['STATEFP'] + df_counties['COUNTYFP']
# Drop Alaska (02), Hawaii (15), Puerto Rico (72), the US Virgin Islands (78),
# American Samoa (60), Guam (66), and the Northern Mariana Islands (69).
excluded_statefp = ['02', '15', '72', '78', '60', '66', '69']
df_counties = df_counties[~df_counties['STATEFP'].isin(excluded_statefp)]
df_counties.head(n=3)

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,NAMELSAD,STUSPS,STATE_NAME,LSAD,ALAND,AWATER,geometry,county_fips
0,20,161,485044,0500000US20161,20161,Riley,Riley County,KS,Kansas,6,1579077672,32047392,"POLYGON ((-96.96095 39.28670, -96.96106 39.288...",20161
1,19,159,465268,0500000US19159,19159,Ringgold,Ringgold County,IA,Iowa,6,1386932347,8723135,"POLYGON ((-94.47167 40.81255, -94.47166 40.819...",19159
2,30,9,1720111,0500000US30009,30009,Carbon,Carbon County,MT,Montana,6,5303728455,35213028,"POLYGON ((-109.79867 45.16734, -109.68779 45.1...",30009


### Reading InfoUSA data

This pre-processed InfoUSA household file combines all of the InfoUSA files merged in processing notebook **01_merging_files**. It contains the children count and the county FIPS code of each household, along with the transformed latitude and longitude coordinates.

In this dataset, the ```county_fips``` column identifies each county; this is the column we will group the children counts by. After reading in the data, we keep only the columns needed for our visualizations.

In [4]:
df_hh = pd.read_parquet(DATA_DIR + '/infousa_merged.parquet')
df_hh.head()

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,16965,42269,IN,3,0,C,39.230097,-76.864096,-8556472.0,4754685.0
1,79667,8484,NV,5,1,C,44.024061,-96.665285,-10760730.0,5469166.0
2,88819,35578,ID,1,1,I,34.490381,-112.402712,-12512610.0,4094840.0
3,16748,25538,PA,10,1,K,34.74522,-88.55372,-9857755.0,4129311.0
4,43449,11049,NJ,1,1,C,44.178941,-83.250028,-9267351.0,5493176.0


In [5]:
df_hh = df_hh[['county_fips', 'state', 'child_num', 'has_child', 'age_code']]
df_hh.head()

Unnamed: 0,county_fips,state,child_num,has_child,age_code
0,42269,IN,3,0,C
1,8484,NV,5,1,C
2,35578,ID,1,1,I
3,25538,PA,10,1,K
4,11049,NJ,1,1,C


### Using ```.groupby()``` to find the number of children per county

Here, we take the dataframe from above and calculate the number of children per county. The ```.groupby()``` method groups ```df_hh``` by ```county_fips```, and the ```.sum()``` method sums the values of ```child_num``` within each ```county_fips``` group. The resulting dataframe, ```df_child_count```, holds the total number of children in each county.

In [6]:
df_child_count = df_hh.groupby('county_fips')['child_num'].sum().reset_index()
df_child_count.head()

Unnamed: 0,county_fips,child_num
0,0,7
1,2,11
2,4,28
3,5,16
4,6,23
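
The aggregation can be checked by hand on a small table. The sketch below uses made-up values (not InfoUSA data) and hypothetical names prefixed with `toy_`; it applies the same `.groupby()`, `.sum()`, and `.reset_index()` chain as the cell above.

```python
import pandas as pd

# Toy household table with made-up values: one row per household.
toy_hh = pd.DataFrame({
    "county_fips": ["01001", "01001", "01003", "01003", "01003"],
    "child_num": [2, 0, 1, 3, 1],
})

# Sum child_num within each county, then turn the group key back into a column.
toy_child_count = (
    toy_hh.groupby("county_fips")["child_num"]
    .sum()
    .reset_index()
)
print(toy_child_count)
```

County `01001` ends up with 2 children (2 + 0) and county `01003` with 5 (1 + 3 + 1), one row per county.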


### Using ```.merge()``` to get processed dataframe

In order to map this dataframe visually, we must have a geometry for each county. To do this, we merge ```df_child_count``` from above with ```df_counties``` ```on``` the ```county_fips``` column. Afterwards, we select only the columns we want to keep and rename them for standardization purposes before exporting the result as a shapefile to be used in future visualizations.

In [7]:
df = df_counties.merge(df_child_count, on = ['county_fips'], how = 'left')
df.head(n=3)

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,NAMELSAD,STUSPS,STATE_NAME,LSAD,ALAND,AWATER,geometry,county_fips,child_num
0,20,161,485044,0500000US20161,20161,Riley,Riley County,KS,Kansas,6,1579077672,32047392,"POLYGON ((-96.96095 39.28670, -96.96106 39.288...",20161,4.0
1,19,159,465268,0500000US19159,19159,Ringgold,Ringgold County,IA,Iowa,6,1386932347,8723135,"POLYGON ((-94.47167 40.81255, -94.47166 40.819...",19159,28.0
2,30,9,1720111,0500000US30009,30009,Carbon,Carbon County,MT,Montana,6,5303728455,35213028,"POLYGON ((-109.79867 45.16734, -109.68779 45.1...",30009,22.0
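
Two details of this merge are easy to miss. The shapefile key is built by concatenating the `STATEFP` and `COUNTYFP` strings, so it is a five-character string, while the household preview shows values such as 8484 that may be stored as integers. In addition, a left merge keeps counties that have no matching households and leaves their `child_num` empty. The sketch below illustrates both points on made-up data; the `toy_` names are hypothetical, and whether the real `county_fips` column actually needs zero-padding is an assumption to verify against the data.

```python
import pandas as pd

# Toy stand-ins with made-up values: string keys on the county side,
# integer keys on the household-count side.
toy_counties = pd.DataFrame({"county_fips": ["01001", "01003", "01005"]})
toy_counts = pd.DataFrame({"county_fips": [1001, 1003], "child_num": [2, 5]})

# Zero-pad the integer keys to five characters so both sides share one dtype.
toy_counts["county_fips"] = toy_counts["county_fips"].astype(str).str.zfill(5)

# indicator=True adds a _merge column showing which rows found a match.
toy_merged = toy_counties.merge(
    toy_counts, on="county_fips", how="left", indicator=True
)

# Unmatched counties come back as NaN; here they are treated as zero children.
toy_merged["child_num"] = toy_merged["child_num"].fillna(0).astype(int)
print(toy_merged)
```

Filling unmatched counties with 0 is a judgment call: it fits if a missing county truly has no recorded children, but leaving the value empty may be more honest if the county is simply absent from the source data.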


In [8]:
df = df[['STATEFP', 'NAME', 'county_fips','child_num', 'geometry']]
df.head()

Unnamed: 0,STATEFP,NAME,county_fips,child_num,geometry
0,20,Riley,20161,4.0,"POLYGON ((-96.96095 39.28670, -96.96106 39.288..."
1,19,Ringgold,19159,28.0,"POLYGON ((-94.47167 40.81255, -94.47166 40.819..."
2,30,Carbon,30009,22.0,"POLYGON ((-109.79867 45.16734, -109.68779 45.1..."
3,16,Bear Lake,16007,12.0,"POLYGON ((-111.63452 42.57034, -111.63010 42.5..."
4,55,Buffalo,55011,33.0,"POLYGON ((-92.08384 44.41200, -92.08310 44.414..."


In [9]:
# Assign the result instead of renaming in place, since df is a column subset of the merged dataframe.
df = df.rename(columns = {'STATEFP': 'state', 'NAME': 'county'})

In [10]:
df.to_file(DATA_DIR + '/children_count_by_county.shp')
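
One caveat with this export: the Shapefile format stores attributes in a dBASE table whose field names are limited to 10 characters, so a longer column name such as `county_fips` (11 characters) would likely be shortened when the file is written. The sketch below is a small check that can be run before exporting; `overlong_fields` and `SHAPEFILE_FIELD_LIMIT` are hypothetical names introduced here for illustration.

```python
# The dBASE attribute table of a Shapefile caps field names at 10 characters.
SHAPEFILE_FIELD_LIMIT = 10


def overlong_fields(columns, limit=SHAPEFILE_FIELD_LIMIT):
    """Return the column names that exceed the Shapefile field-name limit."""
    return [name for name in columns if len(name) > limit]


# Column names of the final dataframe in this notebook.
columns = ["state", "county", "county_fips", "child_num", "geometry"]
print(overlong_fields(columns))
```

If the check flags a column, one option is to rename it to something shorter before calling `to_file`; another is to write to a format without this limit, provided the downstream visualization notebooks are updated to read it.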

