## Children Household Count Per County File Processing

In this file, we will be making a dataframe that includes the number of children in all of the households in each county of the US.

In [12]:
import pandas as pd
import geopandas as gpd

'/hpc/home/sf282/ondemand/code-plus-celine/processing'

### Importing zipcode file

This zipcode file we found online. This file provides a list of zipcodes as well as the associated county each zipcode is in. This is important because we ultimately want a dataframe that contains children household counts by county. In this dataframe, a lot of the columns are unnecessary; what we will be using and focusing on are the zipcode and county columns because we want to use this data to classify zipcode by county.

In [13]:
zipcodes = pd.read_csv('/hpc/group/codeplus22-vis/us_zipcode_shp_files/zip_code_database.csv')
zipcodes.head()

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population
0,501,UNIQUE,0,Holtsville,,Internal Revenue Service,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Internal Revenue Service,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0


### Importing county level shape file

This shapefile of US counties was also found online. This file provides the polygon geometries of the zipcodes in the US. The rest of the columns we will drop because the purpose of including this dataframe is to be able to correlate a polygon geometry to each zipcode. As you can see, this file and the file above are all for the purpose of being able to provide additional classification information for each zipcode; this will allow us to do file merging to make visualizations on a county level. 

In [14]:
shp_file = gpd.read_file('/hpc/group/codeplus22-vis/us_zipcode_shp_files/cb_2018_us_zcta510_500k.shp')
shp_file.rename(columns = {'ZCTA5CE10' : 'zip'}, inplace = True)
shp_file = shp_file.astype({"zip": int})
shp_file


Unnamed: 0,zip,AFFGEOID10,GEOID10,ALAND10,AWATER10,geometry
0,36083,8600000US36083,36083,659750662,5522919,"MULTIPOLYGON (((-85.63225 32.28098, -85.62439 ..."
1,35441,8600000US35441,35441,172850429,8749105,"MULTIPOLYGON (((-87.83287 32.84437, -87.83184 ..."
2,35051,8600000US35051,35051,280236456,5427285,"POLYGON ((-86.74384 33.25002, -86.73802 33.251..."
3,35121,8600000US35121,35121,372736030,5349303,"POLYGON ((-86.58527 33.94743, -86.58033 33.948..."
4,35058,8600000US35058,35058,178039922,3109259,"MULTIPOLYGON (((-86.87884 34.21196, -86.87649 ..."
...,...,...,...,...,...,...
33139,10983,8600000US10983,10983,5267037,16676,"POLYGON ((-73.96564 41.02787, -73.96612 41.029..."
33140,50460,8600000US50460,50460,93166133,0,"POLYGON ((-92.80629 43.23026, -92.80354 43.232..."
33141,40870,8600000US40870,40870,18226594,201441,"POLYGON ((-83.19264 36.91650, -83.19086 36.916..."
33142,40914,8600000US40914,40914,32269366,419039,"POLYGON ((-83.62748 37.07419, -83.62455 37.073..."


### Joining zipcode file and county shp file by zipcode

We are now joining the zipcode file with county classifications with the zipcode files with the geometries. The column specified in the ```on``` parameter indicates which column the two dataframes have in column, and how they should be merged. Afterwards, we are just keeping the columns we want.

In [15]:

zip_shp = pd.merge(zipcodes, shp_file, on = ['zip'], how = 'outer')

zip_shp = zip_shp[['zip', 'county', 'geometry']]
zip_shp

Unnamed: 0,zip,county,geometry
0,501,Suffolk County,
1,544,Suffolk County,
2,601,Adjuntas Municipio,"POLYGON ((-66.83526 18.20998, -66.83287 18.214..."
3,602,Aguada Municipio,"POLYGON ((-67.23935 18.37626, -67.23810 18.377..."
4,603,Aguadilla Municipio,"POLYGON ((-67.16965 18.47511, -67.16909 18.477..."
...,...,...,...
42719,99926,Prince of Wales-Outer Ketchikan Borough,"MULTIPOLYGON (((-131.33528 55.18820, -131.3336..."
42720,99927,Prince of Wales-Hyder Census Area,"MULTIPOLYGON (((-133.12865 56.26789, -133.1261..."
42721,99928,Ketchikan Gateway Borough,
42722,99929,Wrangell City and Borough,"MULTIPOLYGON (((-132.14655 56.15122, -132.1453..."


### Importing infousa data

This preprocessed infousa household data provides the children counts of households as well as the zipcode and county of the household. This file also includes the transformed latitude and longitude coordinates.

We will then get rid of certain columns that are unnecessary because we will be merging the dataframes by the zipcode column.

In [16]:

df_HH = pd.read_parquet('/hpc/group/codeplus22-vis/infousa_copy/zip_00_99_final.parquet')
df_HH


Unnamed: 0,zip,county,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,18833,113,PA,0,0,K,41.546738,-76.540436,-8.520442e+06,5.093323e+06
1,18833,15,PA,0,0,H,41.590800,-76.424200,-8.507503e+06,5.099879e+06
2,18833,15,PA,1,1,C,41.600392,-76.441724,-8.509454e+06,5.101307e+06
3,18833,15,PA,0,0,L,41.592483,-76.437832,-8.509021e+06,5.100129e+06
4,18833,15,PA,1,1,H,41.566196,-76.347977,-8.499018e+06,5.096218e+06
...,...,...,...,...,...,...,...,...,...,...
190987608,92003,73,CA,0,0,C,33.285885,-117.240445,-1.305115e+07,3.933312e+06
190987609,92003,73,CA,0,0,E,33.284700,-117.210800,-1.304785e+07,3.933154e+06
190987610,92003,73,CA,0,0,G,33.282869,-117.183963,-1.304486e+07,3.932911e+06
190987611,92003,73,CA,0,0,H,33.278284,-117.181181,-1.304455e+07,3.932300e+06


In [17]:
df_HH = df_HH[['zip', 'county', 'state', 'child_num', 'has_child', 'age_code']]
df_HH

Unnamed: 0,zip,county,state,child_num,has_child,age_code
0,18833,113,PA,0,0,K
1,18833,15,PA,0,0,H
2,18833,15,PA,1,1,C
3,18833,15,PA,0,0,L
4,18833,15,PA,1,1,H
...,...,...,...,...,...,...
190987608,92003,73,CA,0,0,C
190987609,92003,73,CA,0,0,E
190987610,92003,73,CA,0,0,G
190987611,92003,73,CA,0,0,H


### Merge with infousa data with zip and county data

Now, we are merging the household data with the already merged zipcode dataframe so that the resulting dataframe has the zipcode, county, state, of each household along with the child count and geometries.

This merge function will merge the 2 specified dataframes on the column that is specified in the ```on``` parameter. in this example, we are merging the two dataframes on the ```zip``` column.

In [18]:
df_merged = df_HH.merge(zip_shp, on = ['zip'], how = 'left')
df_merged

Unnamed: 0,zip,county_x,state,child_num,has_child,age_code,county_y,geometry
0,18833,113,PA,0,0,K,Bradford County,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,15,PA,0,0,H,Bradford County,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
2,18833,15,PA,1,1,C,Bradford County,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
3,18833,15,PA,0,0,L,Bradford County,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
4,18833,15,PA,1,1,H,Bradford County,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
...,...,...,...,...,...,...,...,...
190987608,92003,73,CA,0,0,C,San Diego County,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987609,92003,73,CA,0,0,E,San Diego County,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987610,92003,73,CA,0,0,G,San Diego County,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987611,92003,73,CA,0,0,H,San Diego County,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."


In [19]:
df_merged = df_merged[['zip', 'county_y', 'state', 'child_num', 'has_child', 'age_code',  'geometry']]

df_merged

Unnamed: 0,zip,county_y,state,child_num,has_child,age_code,geometry
0,18833,Bradford County,PA,0,0,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,Bradford County,PA,0,0,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
2,18833,Bradford County,PA,1,1,C,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
3,18833,Bradford County,PA,0,0,L,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
4,18833,Bradford County,PA,1,1,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
...,...,...,...,...,...,...,...
190987608,92003,San Diego County,CA,0,0,C,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987609,92003,San Diego County,CA,0,0,E,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987610,92003,San Diego County,CA,0,0,G,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987611,92003,San Diego County,CA,0,0,H,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."


### Renaming

This is an example of code to rename a column name. the first parameter is the original column name, and the second is new name. By setting the ```inplace``` parameter to true, the dataframe

In [20]:
df_merged.rename(columns = {'county_y' : 'county'}, inplace = False)
df_merged

Unnamed: 0,zip,county_y,state,child_num,has_child,age_code,geometry
0,18833,Bradford County,PA,0,0,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,Bradford County,PA,0,0,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
2,18833,Bradford County,PA,1,1,C,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
3,18833,Bradford County,PA,0,0,L,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
4,18833,Bradford County,PA,1,1,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
...,...,...,...,...,...,...,...
190987608,92003,San Diego County,CA,0,0,C,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987609,92003,San Diego County,CA,0,0,E,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987610,92003,San Diego County,CA,0,0,G,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987611,92003,San Diego County,CA,0,0,H,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."


### Groupby zipcode to find child counts (per zipcode)

The ```groupby``` function takes in the column parameter you want to group your variable by (all same values in this column will be grouped together). The second parameter, as seen in this example of ```child_num```, this is the variable you want to perform some action on. In this case, we are using the ```sum``` function to sum up the number of children in each zipcode. The ```reset_index``` function will rearrange the index of the rows in the dataframe so that they are in the order they were orignally in (this just prevents the order from being messed up when using the groupby function.

In [21]:
child_count = df_merged.groupby('zip')['child_num'].sum().reset_index()
child_count

Unnamed: 0,zip,child_num
0,1001,5116
1,1002,3335
2,1004,93
3,1005,1311
4,1007,4755
...,...,...
37819,99363,16
37820,99371,47
37821,99401,31
37822,99402,322


### More merging

Now, we are remerging this dataframe back with the original ```child_count``` dataframe so that we can get the county classfications.

In [22]:
df = df_merged.merge(child_count, on = ['zip'],how = 'left')
df = df[['zip', 'county', 'child_num_y']]
df

KeyError: "['county'] not in index"

In [None]:
df = df.drop_duplicates(keep = 'first')
df

### Dropping duplicates

To drop duplicate rows in a dataframe, you can specify the column in which duplicate rows are found as the parameter to ```subset``` in the ```drop_duplicates``` function. By specifying ```keep = 'first'```, the dataframe will keep the first instance of the duplicate row, but will drop any later occurrence of that row. If you would like to drop all instances of a row that appears more than once, you can specify ```keep = 'false'```.

In [None]:
df_merged = df_merged.drop_duplicates(subset = 'zip', keep = 'first')
df_merged

We are now merging the completed merged dataframe with the dataframe with child counts so that we have not only the child counts by county, but also additional information such as the geometries so that we can use this dataframe in making later visualizations. In the code chunk below, we are specifying the two columns we want the two dataframes to merge on. This can be any number of columns that you choose as long as the corresponding columns in the two dataframes have matching names.

In [None]:
df_final = df.merge(df_merged, on = ['zip', 'county'],how = 'left')
df_final

In [None]:
df_final = df_final[['zip', 'county', 'state', 'has_child', 'child_num_y', 'age_code', 'geometry']]
df_final.rename(columns = {'child_num_y' : 'child_num'}, inplace = True)
df_final

### Convert the dataframe into a geodataframe

Convert the merged dataframe to a geodataframe so that it can export convert as a shapefile.

In [None]:
gpd_df = gpd.GeoDataFrame(df_final)
gpd_df

Unnamed: 0,zip,county,state,has_child,child_num,age_code,geometry
0,18833,Bradford County,PA,0,319,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18079,Lehigh County,PA,0,76,G,"POLYGON ((-75.66384 40.74535, -75.65693 40.745..."
2,18350,Monroe County,PA,1,277,L,"POLYGON ((-75.52129 41.14508, -75.48068 41.135..."
3,23183,Gloucester County,VA,1,102,K,
4,16652,Huntingdon County,PA,1,3715,I,"MULTIPOLYGON (((-77.93462 40.43937, -77.93272 ..."
...,...,...,...,...,...,...,...
37819,85023,Maricopa County,AZ,0,6312,K,"POLYGON ((-112.11629 33.62960, -112.11435 33.6..."
37820,76305,Wichita County,TX,0,1358,M,"POLYGON ((-98.60335 33.99502, -98.56224 33.994..."
37821,97369,Lincoln County,OR,0,27,I,"POLYGON ((-124.07285 44.77613, -124.07136 44.7..."
37822,98632,Cowlitz County,WA,0,11544,I,"POLYGON ((-123.23787 46.17862, -123.23636 46.1..."


### Export to shp file

Use the ```to_file``` command to export the given dataframe to the file path indicated in the parenthesis.

In [36]:
gpd_df.to_file('/hpc/group/codeplus22-vis/infousa_copy/children_count_by_county.shp')

In [2]:
df_test = gpd.read_file('/hpc/group/codeplus22-vis/infousa_copy/children_count_by_county.shp')