## Children Household Count Per County File Processing

In this file, we build a dataframe that contains the total number of children across all households in each county of the US.

In [1]:
import pandas as pd
import geopandas as gpd
import os



### Importing county level shape file

This shapefile of US counties was also found online. It provides the polygon geometry of each county in the US, which allows us to make visualizations of the US broken down by county. In this case, the processing file prepares the data for a map of the US showing the children count by county.

In [2]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
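The path above is built with a string replacement, which would also rewrite any other `processing` substring earlier in the path. As a minimal sketch, assuming the notebook is run from a `processing` folder that sits next to the `data` folder, the same directory could be derived with `pathlib`. The helper name `sibling_data_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def sibling_data_dir(working_dir):
    """Return the 'data' folder next to working_dir (hypothetical helper)."""
    # Only the final path component is swapped, so a 'processing'
    # substring elsewhere in the path is left untouched.
    return Path(working_dir).parent / "data"


# Assumption: the current working directory is the 'processing' folder.
data_dir = sibling_data_dir(Path.cwd())

# Path objects join with '/', which avoids manual string concatenation.
shapefile_path = data_dir / "source_files" / "county_shapefiles" / "counties.shp"
```

Both `gpd.read_file` and `pd.read_parquet` accept path objects, so `shapefile_path` could be passed to them directly.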

In [3]:
df_counties = gpd.read_file(DATA_DIR + '/source_files/county_shapefiles/counties.shp')
df_counties['county_fips'] = df_counties['STATEFP'] + df_counties['COUNTYFP']
df_counties

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,NAMELSAD,STUSPS,STATE_NAME,LSAD,ALAND,AWATER,geometry,county_fips
0,20,161,00485044,0500000US20161,20161,Riley,Riley County,KS,Kansas,06,1579077672,32047392,"POLYGON ((-96.96095 39.28670, -96.96106 39.288...",20161
1,19,159,00465268,0500000US19159,19159,Ringgold,Ringgold County,IA,Iowa,06,1386932347,8723135,"POLYGON ((-94.47167 40.81255, -94.47166 40.819...",19159
2,30,009,01720111,0500000US30009,30009,Carbon,Carbon County,MT,Montana,06,5303728455,35213028,"POLYGON ((-109.79867 45.16734, -109.68779 45.1...",30009
3,16,007,00395090,0500000US16007,16007,Bear Lake,Bear Lake County,ID,Idaho,06,2527123155,191364281,"POLYGON ((-111.63452 42.57034, -111.63010 42.5...",16007
4,55,011,01581065,0500000US55011,55011,Buffalo,Buffalo County,WI,Wisconsin,06,1750290818,87549529,"POLYGON ((-92.08384 44.41200, -92.08310 44.414...",55011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3229,53,003,01533502,0500000US53003,53003,Asotin,Asotin County,WA,Washington,06,1647427905,11291731,"POLYGON ((-117.47999 46.12199, -117.41948 46.1...",53003
3230,13,043,00342852,0500000US13043,13043,Candler,Candler County,GA,Georgia,06,629520841,15018189,"POLYGON ((-82.25457 32.35150, -82.25276 32.353...",13043
3231,48,451,01384011,0500000US48451,48451,Tom Green,Tom Green County,TX,Texas,06,3941965409,48077315,"POLYGON ((-101.26763 31.55646, -101.25039 31.5...",48451
3232,39,089,01074057,0500000US39089,39089,Licking,Licking County,OH,Ohio,06,1767478831,12761090,"POLYGON ((-82.78181 39.94698, -82.78126 39.955...",39089
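The `county_fips` column is built by concatenating the two-digit state code and the three-digit county code, which only works because both columns are stored as zero-padded strings. The sketch below illustrates this on a toy table with values copied from the output above. The second half covers a case that is an assumption for illustration only: codes that were read in as integers and need padding first.

```python
import pandas as pd

# Toy stand-in for the shapefile's attribute table.
toy = pd.DataFrame({"STATEFP": ["20", "30"], "COUNTYFP": ["161", "009"]})

# String columns concatenate element-wise, keeping the leading zeros.
toy["county_fips"] = toy["STATEFP"] + toy["COUNTYFP"]

# Assumed scenario: the codes were parsed as integers, so the zeros are lost.
as_int = pd.DataFrame({"STATEFP": [20, 30], "COUNTYFP": [161, 9]})

# Pad to 2 and 3 digits before joining to recover the 5-character code.
as_int["county_fips"] = (
    as_int["STATEFP"].astype(str).str.zfill(2)
    + as_int["COUNTYFP"].astype(str).str.zfill(3)
)
```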


### Importing infousa data

This preprocessed infousa household data contains the children count of each household, as well as the household's zip code and county FIPS code. This file also includes the transformed latitude and longitude coordinates.

In this dataset, the county_fips column identifies the county each household belongs to; this is the column we will be grouping the children count by.

In [4]:
df_hh = pd.read_parquet(DATA_DIR + '/infousa_merged.parquet')
df_hh

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,84606,37041,NH,12,0,H,47.536477,-123.299151,-1.372560e+07,6.030085e+06
1,66723,23407,OR,3,0,A,43.121187,-123.473248,-1.374498e+07,5.330436e+06
2,59965,50536,NV,7,1,I,38.462998,-111.014933,-1.235813e+07,4.645040e+06
3,38676,38340,SD,3,1,B,48.433631,-80.411200,-8.951334e+06,6.179301e+06
4,75640,24383,OR,15,1,J,42.203405,-73.802723,-8.215682e+06,5.191497e+06
...,...,...,...,...,...,...,...,...,...,...
99995,16202,52106,AL,5,1,M,41.305474,-71.274986,-7.934295e+06,5.057504e+06
99996,85007,53083,RI,6,1,B,40.467252,-90.764017,-1.010380e+07,4.934076e+06
99997,64030,18524,AZ,7,0,I,36.134337,-97.876163,-1.089552e+07,4.319122e+06
99998,32071,12458,LA,1,1,J,37.978958,-106.804916,-1.188947e+07,4.576454e+06


In [5]:
df_hh = df_hh[['county_fips', 'state', 'child_num', 'has_child', 'age_code']]
df_hh

Unnamed: 0,county_fips,state,child_num,has_child,age_code
0,37041,NH,12,0,H
1,23407,OR,3,0,A
2,50536,NV,7,1,I
3,38340,SD,3,1,B
4,24383,OR,15,1,J
...,...,...,...,...,...
99995,52106,AL,5,1,M
99996,53083,RI,6,1,B
99997,18524,AZ,7,0,I
99998,12458,LA,1,1,J


### Groupby

Here, we take the preprocessed infousa file and calculate the number of children per county. We do this with the ```groupby``` function, which groups the rows by ```county_fips``` and sums the corresponding ```child_num``` values. The resulting dataframe gives the total number of children for each ```county_fips```.

In [6]:
df_child_count = df_hh.groupby('county_fips')['child_num'].sum().reset_index()
df_child_count

Unnamed: 0,county_fips,child_num
0,00000,191
1,00001,148
2,00002,198
3,00003,180
4,00004,155
...,...,...
35995,59595,197
35996,59596,252
35997,59597,146
35998,59598,249
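To make the aggregation concrete, here is a small sketch of the same `groupby` and `sum` pattern on a toy table. The household rows are invented for illustration and are not taken from the infousa data.

```python
import pandas as pd

# Five made-up households spread over three counties.
toy_hh = pd.DataFrame({
    "county_fips": ["20161", "19159", "20161", "19159", "30009"],
    "child_num": [2, 0, 1, 5, 4],
})

# One row per county: child_num summed over that county's households.
# reset_index() turns the group key back into a regular column.
toy_child_count = toy_hh.groupby("county_fips")["child_num"].sum().reset_index()
```

By default `groupby` sorts the result by the group key, which is why the county codes in the output appear in ascending order.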


### Merging child number counts with county geometries

Now that we have ```child_num``` calculated for each ```county_fips```, we merge the county geometry shapefile with the child count dataframe on the ```county_fips``` column. Afterwards, we select only the columns we want to keep in this dataframe, before exporting it as a shapefile to be used in future visualizations.

In [7]:
df_final = df_counties.merge(df_child_count, on = ['county_fips'], how = 'left')
df_final

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,NAMELSAD,STUSPS,STATE_NAME,LSAD,ALAND,AWATER,geometry,county_fips,child_num
0,20,161,00485044,0500000US20161,20161,Riley,Riley County,KS,Kansas,06,1579077672,32047392,"POLYGON ((-96.96095 39.28670, -96.96106 39.288...",20161,111.0
1,19,159,00465268,0500000US19159,19159,Ringgold,Ringgold County,IA,Iowa,06,1386932347,8723135,"POLYGON ((-94.47167 40.81255, -94.47166 40.819...",19159,239.0
2,30,009,01720111,0500000US30009,30009,Carbon,Carbon County,MT,Montana,06,5303728455,35213028,"POLYGON ((-109.79867 45.16734, -109.68779 45.1...",30009,138.0
3,16,007,00395090,0500000US16007,16007,Bear Lake,Bear Lake County,ID,Idaho,06,2527123155,191364281,"POLYGON ((-111.63452 42.57034, -111.63010 42.5...",16007,112.0
4,55,011,01581065,0500000US55011,55011,Buffalo,Buffalo County,WI,Wisconsin,06,1750290818,87549529,"POLYGON ((-92.08384 44.41200, -92.08310 44.414...",55011,143.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3229,53,003,01533502,0500000US53003,53003,Asotin,Asotin County,WA,Washington,06,1647427905,11291731,"POLYGON ((-117.47999 46.12199, -117.41948 46.1...",53003,213.0
3230,13,043,00342852,0500000US13043,13043,Candler,Candler County,GA,Georgia,06,629520841,15018189,"POLYGON ((-82.25457 32.35150, -82.25276 32.353...",13043,206.0
3231,48,451,01384011,0500000US48451,48451,Tom Green,Tom Green County,TX,Texas,06,3941965409,48077315,"POLYGON ((-101.26763 31.55646, -101.25039 31.5...",48451,237.0
3232,39,089,01074057,0500000US39089,39089,Licking,Licking County,OH,Ohio,06,1767478831,12761090,"POLYGON ((-82.78181 39.94698, -82.78126 39.955...",39089,150.0
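Because the merge uses `how = 'left'`, every county in the shapefile is kept even when it has no matching row in the child count dataframe. Those counties receive `NaN`, and a column holding `NaN` is stored as float, which would explain why `child_num` now displays as `111.0` rather than `111`. The sketch below shows this on toy data and uses the `indicator` option of `merge` to list unmatched counties. Filling the gaps with zero is an assumption that may not suit every analysis.

```python
import pandas as pd

# Toy county table: 30009 has no row in the child count table.
toy_counties = pd.DataFrame({"county_fips": ["20161", "19159", "30009"]})
toy_child_count = pd.DataFrame({
    "county_fips": ["20161", "19159"],
    "child_num": [111, 239],
})

# indicator=True adds a '_merge' column recording where each row matched.
merged = toy_counties.merge(
    toy_child_count, on="county_fips", how="left", indicator=True
)

# Counties present in the shapefile but absent from the household data.
unmatched = merged.loc[merged["_merge"] == "left_only", "county_fips"].tolist()

# Assumption: a missing count is treated as zero children.
merged["child_num_filled"] = merged["child_num"].fillna(0).astype(int)
```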


In [8]:
df_final = df_final[['STATEFP', 'NAME', 'county_fips','child_num', 'geometry']]
df_final

Unnamed: 0,STATEFP,NAME,county_fips,child_num,geometry
0,20,Riley,20161,111.0,"POLYGON ((-96.96095 39.28670, -96.96106 39.288..."
1,19,Ringgold,19159,239.0,"POLYGON ((-94.47167 40.81255, -94.47166 40.819..."
2,30,Carbon,30009,138.0,"POLYGON ((-109.79867 45.16734, -109.68779 45.1..."
3,16,Bear Lake,16007,112.0,"POLYGON ((-111.63452 42.57034, -111.63010 42.5..."
4,55,Buffalo,55011,143.0,"POLYGON ((-92.08384 44.41200, -92.08310 44.414..."
...,...,...,...,...,...
3229,53,Asotin,53003,213.0,"POLYGON ((-117.47999 46.12199, -117.41948 46.1..."
3230,13,Candler,13043,206.0,"POLYGON ((-82.25457 32.35150, -82.25276 32.353..."
3231,48,Tom Green,48451,237.0,"POLYGON ((-101.26763 31.55646, -101.25039 31.5..."
3232,39,Licking,39089,150.0,"POLYGON ((-82.78181 39.94698, -82.78126 39.955..."


### Converting to a GeoDataFrame

In order to export this dataframe as a shapefile, it must be a GeoDataFrame. We pass it through the ```gpd.GeoDataFrame()``` constructor to make sure it is one before writing the file.

In [9]:
gpd_df = gpd.GeoDataFrame(df_final)
gpd_df

Unnamed: 0,STATEFP,NAME,county_fips,child_num,geometry
0,20,Riley,20161,111.0,"POLYGON ((-96.96095 39.28670, -96.96106 39.288..."
1,19,Ringgold,19159,239.0,"POLYGON ((-94.47167 40.81255, -94.47166 40.819..."
2,30,Carbon,30009,138.0,"POLYGON ((-109.79867 45.16734, -109.68779 45.1..."
3,16,Bear Lake,16007,112.0,"POLYGON ((-111.63452 42.57034, -111.63010 42.5..."
4,55,Buffalo,55011,143.0,"POLYGON ((-92.08384 44.41200, -92.08310 44.414..."
...,...,...,...,...,...
3229,53,Asotin,53003,213.0,"POLYGON ((-117.47999 46.12199, -117.41948 46.1..."
3230,13,Candler,13043,206.0,"POLYGON ((-82.25457 32.35150, -82.25276 32.353..."
3231,48,Tom Green,48451,237.0,"POLYGON ((-101.26763 31.55646, -101.25039 31.5..."
3232,39,Licking,39089,150.0,"POLYGON ((-82.78181 39.94698, -82.78126 39.955..."


In [10]:
gpd_df.to_file(DATA_DIR + '/children_count_by_county.shp')

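One thing to keep in mind when exporting: the shapefile format stores attributes in a dBASE table whose field names are limited to 10 characters, so a column such as `county_fips` (11 characters) is likely to be truncated in the written file. The sketch below checks the column names against that limit. The helper `long_column_names` and the replacement name `cnty_fips` are hypothetical choices for illustration.

```python
def long_column_names(columns, limit=10):
    """Return the column names that exceed the shapefile field-name limit."""
    return [name for name in columns if len(name) > limit]


# Column names of the dataframe exported above.
columns = ["STATEFP", "NAME", "county_fips", "child_num", "geometry"]

too_long = long_column_names(columns)

# Hypothetical shorter name that fits within 10 characters.
rename_map = {"county_fips": "cnty_fips"}
```

Applying `gpd_df.rename(columns=rename_map)` before calling `to_file` would give the field a predictable name, at the cost of having to rename it back after reading the shapefile.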
