## Children Household Count Per County File Processing

In this file, we build a dataframe that holds the total number of children across all households in each US county.

In [3]:
import pandas as pd
import geopandas as gpd



### Importing county level shape file

This shapefile of US counties was also found online. It provides the polygon geometry of each county in the US, which allows us to make visualizations of the US broken down by county. In this case, we are making this processing file so that we can map the child count of each US county.

In [4]:
df_counties = gpd.read_file('/hpc/group/codeplus22-vis/county_shp_files/us_counties.shp')
df_counties['county_fips'] = df_counties['STATEFP'] + df_counties['COUNTYFP']
df_counties

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,LSAD,ALAND,AWATER,geometry,county_fips
0,21,007,00516850,0500000US21007,21007,Ballard,06,639387454,69473325,"POLYGON ((-89.18137 37.04630, -89.17938 37.053...",21007
1,21,017,00516855,0500000US21017,21017,Bourbon,06,750439351,4829777,"POLYGON ((-84.44266 38.28324, -84.44114 38.283...",21017
2,21,031,00516862,0500000US21031,21031,Butler,06,1103571974,13943044,"POLYGON ((-86.94486 37.07341, -86.94346 37.074...",21031
3,21,065,00516879,0500000US21065,21065,Estill,06,655509930,6516335,"POLYGON ((-84.12662 37.64540, -84.12483 37.646...",21065
4,21,069,00516881,0500000US21069,21069,Fleming,06,902727151,7182793,"POLYGON ((-83.98428 38.44549, -83.98246 38.450...",21069
...,...,...,...,...,...,...,...,...,...,...,...
3228,31,073,00835858,0500000US31073,31073,Gosper,06,1186616237,11831826,"POLYGON ((-100.09510 40.43866, -100.08937 40.4...",31073
3229,39,075,01074050,0500000US39075,39075,Holmes,06,1094405866,3695230,"POLYGON ((-82.22066 40.66758, -82.19327 40.667...",39075
3230,48,171,01383871,0500000US48171,48171,Gillespie,06,2740719114,9012764,"POLYGON ((-99.30400 30.49983, -99.28234 30.499...",48171
3231,55,079,01581100,0500000US55079,55079,Milwaukee,06,625440563,2455383635,"POLYGON ((-88.06959 42.86726, -88.06959 42.872...",55079


### Importing infousa data

This preprocessed infousa household data contains the children counts of households as well as the zipcode and county fips of each household. This file also includes the transformed latitude and longitude coordinates.

In this dataset, the ```county_fips``` column identifies each county; it is the column we group the child counts by.

In [5]:
df_hh = pd.read_parquet('/hpc/group/codeplus22-vis/infousa_copy/zip_00_99_final_fixed.parquet')
df_hh

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,18833,42113,PA,0,0,K,41.546738,-76.540436,-8.520442e+06,5.093323e+06
1,18833,42015,PA,0,0,H,41.590800,-76.424200,-8.507503e+06,5.099879e+06
2,18833,42015,PA,1,1,C,41.600392,-76.441724,-8.509454e+06,5.101307e+06
3,18833,42015,PA,0,0,L,41.592483,-76.437832,-8.509021e+06,5.100129e+06
4,18833,42015,PA,1,1,H,41.566196,-76.347977,-8.499018e+06,5.096218e+06
...,...,...,...,...,...,...,...,...,...,...
190987608,92003,06073,CA,0,0,C,33.285885,-117.240445,-1.305115e+07,3.933312e+06
190987609,92003,06073,CA,0,0,E,33.284700,-117.210800,-1.304785e+07,3.933154e+06
190987610,92003,06073,CA,0,0,G,33.282869,-117.183963,-1.304486e+07,3.932911e+06
190987611,92003,06073,CA,0,0,H,33.278284,-117.181181,-1.304455e+07,3.932300e+06


In [6]:
df_hh = df_hh[['county_fips', 'state', 'child_num', 'has_child', 'age_code']]
df_hh

Unnamed: 0,county_fips,state,child_num,has_child,age_code
0,42113,PA,0,0,K
1,42015,PA,0,0,H
2,42015,PA,1,1,C
3,42015,PA,0,0,L
4,42015,PA,1,1,H
...,...,...,...,...,...
190987608,06073,CA,0,0,C
190987609,06073,CA,0,0,E
190987610,06073,CA,0,0,G
190987611,06073,CA,0,0,H


### Groupby

Here, we take the preprocessed infousa file and calculate the number of children per county. We do so with the ```groupby``` function, which gathers all rows that share a ```county_fips``` value and sums their ```child_num``` values. The resulting dataframe has one row per ```county_fips``` with the total number of children in that county.

In [7]:
df_child_count = df_hh.groupby('county_fips')['child_num'].sum().reset_index()
df_child_count

Unnamed: 0,county_fips,child_num
0,01001,19566
1,01003,60951
2,01005,6527
3,01007,6087
4,01009,18963
...,...,...
3104,56037,10878
3105,56039,2754
3106,56041,4950
3107,56043,1841
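
The grouping step above can be sketched on a tiny table. This is a minimal illustration, not the real infousa data: the `households` frame and its values are made up, and only the `county_fips` and `child_num` column names come from this file.

```python
import pandas as pd

# Hypothetical toy household table; FIPS codes are strings so leading zeros survive.
households = pd.DataFrame({
    "county_fips": ["01001", "01001", "01003", "01003", "01003"],
    "child_num": [0, 2, 1, 0, 3],
})

# Sum child_num within each county, then turn county_fips back into a column.
child_totals = households.groupby("county_fips")["child_num"].sum().reset_index()
print(child_totals)
```

Each county appears once in `child_totals`, with the sum of its households' `child_num` values.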


### Merging child number counts with county geometries

Now that we have ```child_num``` calculated for each ```county_fips```, we merge the child count dataframe with the county geometry shapefile on the ```county_fips``` column. Afterwards, we select only the columns we want to keep before exporting the result as a shapefile to be used in future visualizations.

In [8]:
df_final = df_child_count.merge(df_counties, on = ['county_fips'], how = 'left')
df_final

Unnamed: 0,county_fips,child_num,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,LSAD,ALAND,AWATER,geometry
0,01001,19566,01,001,00161526,0500000US01001,01001,Autauga,06,1.539602e+09,2.570696e+07,"POLYGON ((-86.92120 32.65754, -86.92035 32.658..."
1,01003,60951,01,003,00161527,0500000US01003,01003,Baldwin,06,4.117547e+09,1.133056e+09,"POLYGON ((-88.02858 30.22676, -88.02399 30.230..."
2,01005,6527,01,005,00161528,0500000US01005,01005,Barbour,06,2.292145e+09,5.053870e+07,"POLYGON ((-85.74803 31.61918, -85.74544 31.618..."
3,01007,6087,01,007,00161529,0500000US01007,01007,Bibb,06,1.612167e+09,9.602089e+06,"POLYGON ((-87.42194 33.00338, -87.31854 33.006..."
4,01009,18963,01,009,00161530,0500000US01009,01009,Blount,06,1.670104e+09,1.501542e+07,"POLYGON ((-86.96336 33.85822, -86.95967 33.857..."
...,...,...,...,...,...,...,...,...,...,...,...,...
3104,56037,10878,56,037,01609192,0500000US56037,56037,Sweetwater,06,2.700575e+10,1.662303e+08,"POLYGON ((-110.05438 42.01103, -110.05436 42.0..."
3105,56039,2754,56,039,01605083,0500000US56039,56039,Teton,06,1.035178e+10,5.708649e+08,"POLYGON ((-111.05361 44.66627, -110.75076 44.6..."
3106,56041,4950,56,041,01605084,0500000US56041,56041,Uinta,06,5.391632e+09,1.662582e+07,"POLYGON ((-111.04662 41.15604, -111.04659 41.2..."
3107,56043,1841,56,043,01605085,0500000US56043,56043,Washakie,06,5.798139e+09,1.042960e+07,"POLYGON ((-108.55056 44.16845, -108.50652 44.1..."
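
One way to check a left merge like the one above is to ask pandas which keys found no partner. The sketch below uses made-up frames (`counts`, `shapes`) and a placeholder code `99999` that is assumed not to exist in the shape table; `indicator` and `validate` are standard `merge` parameters, but using them here is a suggestion, not part of the original processing.

```python
import pandas as pd

# Hypothetical stand-ins for the child count table and the county shape table.
counts = pd.DataFrame({"county_fips": ["01001", "01003", "99999"], "child_num": [2, 4, 1]})
shapes = pd.DataFrame({"county_fips": ["01001", "01003"], "NAME": ["Autauga", "Baldwin"]})

# indicator=True adds a _merge column; validate raises if a key repeats on either side.
checked = counts.merge(shapes, on="county_fips", how="left",
                       indicator=True, validate="one_to_one")

# Counties present in the counts but missing from the shape table.
unmatched = checked.loc[checked["_merge"] == "left_only", "county_fips"].tolist()
print(unmatched)
```

Rows listed in `unmatched` would end up with empty geometry after the merge, so they are worth inspecting before mapping.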


In [9]:
df_final = df_final[['STATEFP', 'NAME', 'county_fips','child_num', 'geometry']]
df_final

Unnamed: 0,STATEFP,NAME,county_fips,child_num,geometry
0,01,Autauga,01001,19566,"POLYGON ((-86.92120 32.65754, -86.92035 32.658..."
1,01,Baldwin,01003,60951,"POLYGON ((-88.02858 30.22676, -88.02399 30.230..."
2,01,Barbour,01005,6527,"POLYGON ((-85.74803 31.61918, -85.74544 31.618..."
3,01,Bibb,01007,6087,"POLYGON ((-87.42194 33.00338, -87.31854 33.006..."
4,01,Blount,01009,18963,"POLYGON ((-86.96336 33.85822, -86.95967 33.857..."
...,...,...,...,...,...
3104,56,Sweetwater,56037,10878,"POLYGON ((-110.05438 42.01103, -110.05436 42.0..."
3105,56,Teton,56039,2754,"POLYGON ((-111.05361 44.66627, -110.75076 44.6..."
3106,56,Uinta,56041,4950,"POLYGON ((-111.04662 41.15604, -111.04659 41.2..."
3107,56,Washakie,56043,1841,"POLYGON ((-108.55056 44.16845, -108.50652 44.1..."


### Converting to a GeoDataFrame

In order to export this dataframe as a shapefile, the pandas dataframe must be converted to a GeoDataFrame using the ```gpd.GeoDataFrame()``` function. 

In [10]:
gpd_df = gpd.GeoDataFrame(df_final)
gpd_df

Unnamed: 0,STATEFP,NAME,county_fips,child_num,geometry
0,01,Autauga,01001,19566,"POLYGON ((-86.92120 32.65754, -86.92035 32.658..."
1,01,Baldwin,01003,60951,"POLYGON ((-88.02858 30.22676, -88.02399 30.230..."
2,01,Barbour,01005,6527,"POLYGON ((-85.74803 31.61918, -85.74544 31.618..."
3,01,Bibb,01007,6087,"POLYGON ((-87.42194 33.00338, -87.31854 33.006..."
4,01,Blount,01009,18963,"POLYGON ((-86.96336 33.85822, -86.95967 33.857..."
...,...,...,...,...,...
3104,56,Sweetwater,56037,10878,"POLYGON ((-110.05438 42.01103, -110.05436 42.0..."
3105,56,Teton,56039,2754,"POLYGON ((-111.05361 44.66627, -110.75076 44.6..."
3106,56,Uinta,56041,4950,"POLYGON ((-111.04662 41.15604, -111.04659 41.2..."
3107,56,Washakie,56043,1841,"POLYGON ((-108.55056 44.16845, -108.50652 44.1..."
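
As a small sketch of the conversion, the example below builds a plain pandas dataframe with a column of shapely polygons and wraps it in a GeoDataFrame. The square polygons, the FIPS codes, and the `EPSG:4326` coordinate reference system are all assumptions for illustration; the CRS of the county shapefile used in this file is not stated here.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical two-row table with made-up square "counties".
df = pd.DataFrame({
    "county_fips": ["00001", "00002"],
    "child_num": [10, 20],
    "geometry": [
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
})

# Name the geometry column explicitly and attach a CRS (assumed here).
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
print(type(gdf).__name__, gdf.crs)
```

Passing `geometry` and `crs` explicitly makes it clear which column holds the shapes and which coordinate system they are in.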


In [12]:
gpd_df.to_file('/hpc/group/codeplus22-vis/infousa_copy/children_count_by_county_fixed.shp')


### Merge infousa data with zip and county data

Now, we merge the household data with the already merged zipcode dataframe so that the resulting dataframe has the zipcode, county, and state of each household along with the child count and geometries.

The ```merge``` function joins the two specified dataframes on the column given in the ```on``` parameter. In this example, we merge the two dataframes on the ```zip``` column.

In [15]:
df_merged = df_merged[['zip', 'county_fips', 'state', 'child_num', 'has_child', 'age_code',  'geometry']]

df_merged

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,geometry
0,18833,42113,PA,0,0,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,42015,PA,0,0,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
2,18833,42015,PA,1,1,C,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
3,18833,42015,PA,0,0,L,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
4,18833,42015,PA,1,1,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
...,...,...,...,...,...,...,...
190987608,92003,06073,CA,0,0,C,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987609,92003,06073,CA,0,0,E,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987610,92003,06073,CA,0,0,G,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
190987611,92003,06073,CA,0,0,H,"POLYGON ((-117.26225 33.28622, -117.25940 33.2..."
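
A merge on ```zip``` can be sketched as follows. Both frames are made up for illustration, and `zip_label` is a hypothetical stand-in for the per-zipcode geometry column.

```python
import pandas as pd

# Hypothetical household rows and a per-zipcode lookup table.
households = pd.DataFrame({"zip": ["18833", "18833", "92003"], "child_num": [0, 1, 0]})
zip_info = pd.DataFrame({"zip": ["18833", "92003"], "zip_label": ["area A", "area B"]})

# Every household row is kept and receives the values of its zipcode.
merged = households.merge(zip_info, on="zip", how="left")
print(merged)
```

Because many households share one zipcode, the zipcode values are repeated on every matching household row, which is why the geometry repeats in the output above.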


### Renaming

This is an example of code to rename a column. In the ```columns``` dictionary, each key is the original column name and each value is the new name. By setting the ```inplace``` parameter to ```True```, the dataframe is modified directly instead of a renamed copy being returned.

In [16]:
# df_merged.rename(columns = {'county_y' : 'county'}, inplace = False)
# df_merged
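
The difference between the two ```inplace``` settings can be seen on a one-row frame. The frame below is made up; only the `county_y` to `county` rename mirrors the commented-out cell above.

```python
import pandas as pd

# Hypothetical frame with a column name left over from a merge.
df = pd.DataFrame({"county_y": ["Autauga"], "child_num": [2]})

# Default (inplace=False): a renamed copy is returned and df is untouched.
renamed = df.rename(columns={"county_y": "county"})
original_columns = list(df.columns)

# inplace=True: df itself is changed and the call returns None.
result = df.rename(columns={"county_y": "county"}, inplace=True)
print(original_columns, list(df.columns), result)
```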

### Groupby county to find child counts (per county)

The ```groupby``` function takes the column to group by; all rows sharing a value in that column form one group. The column selected afterwards, here ```child_num```, is the variable the aggregation is applied to. In this case, we use the ```sum``` function to add up the number of children in each county. After the aggregation, ```county_fips``` is the index of the result; ```reset_index``` moves it back into a regular column and restores a default integer index.

In [18]:
child_count = df_merged.groupby('county_fips')['child_num'].sum().reset_index()
child_count

Unnamed: 0,county_fips,child_num
0,01001,19566
1,01003,60951
2,01005,6527
3,01007,6087
4,01009,18963
...,...,...
3104,56037,10878
3105,56039,2754
3106,56041,4950
3107,56043,1841
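
To see what ```reset_index``` does here, the sketch below compares the grouped result before and after the call, using a made-up `households` frame. The `as_index=False` variant is shown only as an equivalent alternative, not as the approach used in this file.

```python
import pandas as pd

# Hypothetical household rows, deliberately not sorted by county.
households = pd.DataFrame({
    "county_fips": ["01003", "01001", "01003"],
    "child_num": [1, 2, 3],
})

# The sum is a Series whose index holds the county_fips group keys.
summed = households.groupby("county_fips")["child_num"].sum()

# reset_index turns the keys into a column and adds a 0..n-1 index.
as_table = summed.reset_index()

# Alternative: keep the keys as a column from the start.
same = households.groupby("county_fips", as_index=False)["child_num"].sum()
print(as_table)
```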


### More merging

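Now, we merge the household-level ```df_merged``` dataframe with the ```child_count``` dataframe so that every row carries the total child count of its county. Because both dataframes have a ```child_num``` column, pandas adds the suffix ```_x``` to the household value and ```_y``` to the county total.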

In [20]:
df = df_merged.merge(child_count, on = ['county_fips'],how = 'left')

df

Unnamed: 0,zip,county_fips,state,child_num_x,has_child,age_code,geometry,child_num_y
0,18833,42113,PA,0,0,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605...",1101
1,18833,42015,PA,0,0,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605...",13048
2,18833,42015,PA,1,1,C,"POLYGON ((-76.68205 41.60605, -76.68016 41.605...",13048
3,18833,42015,PA,0,0,L,"POLYGON ((-76.68205 41.60605, -76.68016 41.605...",13048
4,18833,42015,PA,1,1,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605...",13048
...,...,...,...,...,...,...,...,...
190987608,92003,06073,CA,0,0,C,"POLYGON ((-117.26225 33.28622, -117.25940 33.2...",574508
190987609,92003,06073,CA,0,0,E,"POLYGON ((-117.26225 33.28622, -117.25940 33.2...",574508
190987610,92003,06073,CA,0,0,G,"POLYGON ((-117.26225 33.28622, -117.25940 33.2...",574508
190987611,92003,06073,CA,0,0,H,"POLYGON ((-117.26225 33.28622, -117.25940 33.2...",574508
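
The ```_x``` and ```_y``` names come from the default suffixes of ```merge```. As a sketch on made-up rows, the ```suffixes``` parameter can give the county total a more descriptive name; `child_num_county` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Hypothetical household rows.
households = pd.DataFrame({
    "county_fips": ["42015", "42015", "42113"],
    "child_num": [0, 1, 0],
})
county_totals = households.groupby("county_fips", as_index=False)["child_num"].sum()

# Keep the household column name as is and label the county total explicitly.
labeled = households.merge(county_totals, on="county_fips", how="left",
                           suffixes=("", "_county"))
print(labeled)
```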


In [21]:
df = df[['zip', 'county_fips', 'child_num_y']]
df

In [22]:
df = df.drop_duplicates(keep = 'first')
df

Unnamed: 0,zip,county_fips,child_num_y
0,18833,42113,1101
1,18833,42015,13048
1255,18079,42077,103995
1527,18350,42089,32811
2934,23183,51073,11727
...,...,...,...
190951631,76305,48077,2863
190954613,97369,41041,7925
190954796,98632,53015,25517
190957242,98632,53069,767


### Dropping duplicates

To drop duplicate rows in a dataframe, pass the column (or columns) used to identify duplicates to the ```subset``` parameter of the ```drop_duplicates``` function. With ```keep = 'first'```, the dataframe keeps the first occurrence of each duplicate and drops any later occurrence. If you would like to drop every row that appears more than once, specify ```keep = False``` (the boolean, not a string).

In [23]:
df_merged = df_merged.drop_duplicates(subset = 'county_fips', keep = 'first')
df_merged

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,geometry
0,18833,42113,PA,0,0,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,42015,PA,0,0,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1255,18079,42077,PA,0,0,G,"POLYGON ((-75.66384 40.74535, -75.65693 40.745..."
1527,18350,42089,PA,2,1,L,"POLYGON ((-75.52129 41.14508, -75.48068 41.135..."
2934,23183,51073,VA,0,1,K,
...,...,...,...,...,...,...,...
176865670,83204,16077,ID,0,0,M,"POLYGON ((-112.73756 42.87278, -112.73294 42.8..."
183139710,76380,48023,TX,0,0,L,"POLYGON ((-99.64993 33.64690, -99.64739 33.646..."
185895409,76951,48431,TX,0,1,E,"POLYGON ((-101.26698 31.64419, -101.26712 31.6..."
186285078,78385,48261,TX,3,1,G,"POLYGON ((-97.98589 27.20931, -97.98333 27.211..."
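
The two ```keep``` settings can be compared on a small made-up frame; the zip and county codes are borrowed from the output above, but the row layout is invented.

```python
import pandas as pd

# Hypothetical rows; county 42015 appears twice.
df = pd.DataFrame({
    "zip": ["18833", "18833", "18079", "18833"],
    "county_fips": ["42113", "42015", "42077", "42015"],
})

# keep='first': one row per county, the first one encountered.
first_per_county = df.drop_duplicates(subset="county_fips", keep="first")

# keep=False: counties that occur more than once are removed entirely.
unique_only = df.drop_duplicates(subset="county_fips", keep=False)
print(first_per_county.index.tolist(), unique_only.index.tolist())
```

Note that the original row labels are preserved, which is why the index in the output above jumps from 1 to 1255.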


We now merge the deduplicated zip and county pairs in ```df``` with ```df_merged```, so that we have not only the child counts by county but also additional information such as the geometries for later visualizations. In the code chunks below, the ```on``` parameter lists the columns to merge on: first ```county_fips``` alone, then both ```county_fips``` and ```zip```. Any number of columns can be listed as long as they have matching names in both dataframes.

In [32]:
df_final = df.merge(df_merged, on = ['county_fips'],how = 'left')
df_final

Unnamed: 0,zip_x,county_fips,child_num_y,zip_y,state,child_num,has_child,age_code,geometry
0,18833,42113,1101,18833,PA,0,0,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,42015,13048,18833,PA,0,0,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
2,18079,42077,103995,18079,PA,0,0,G,"POLYGON ((-75.66384 40.74535, -75.65693 40.745..."
3,18350,42089,32811,18350,PA,2,1,L,"POLYGON ((-75.52129 41.14508, -75.48068 41.135..."
4,23183,51073,11727,23183,VA,0,1,K,
...,...,...,...,...,...,...,...,...,...
52320,76305,48077,2863,76310,TX,0,0,L,"POLYGON ((-98.70745 33.83478, -98.70717 33.855..."
52321,97369,41041,7925,97380,OR,0,0,K,"POLYGON ((-124.05322 44.79720, -124.05011 44.7..."
52322,98632,53015,25517,98674,WA,0,0,M,"POLYGON ((-122.81222 45.95446, -122.80484 45.9..."
52323,98632,53069,767,98621,WA,0,0,K,"POLYGON ((-123.64298 46.37761, -123.64328 46.3..."


In [34]:
df_final = df.merge(df_merged, on = ['county_fips', 'zip'],how = 'left')
df_final

Unnamed: 0,zip,county_fips,child_num_y,state,child_num,has_child,age_code,geometry
0,18833,42113,1101,PA,0.0,0.0,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,42015,13048,PA,0.0,0.0,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
2,18079,42077,103995,PA,0.0,0.0,G,"POLYGON ((-75.66384 40.74535, -75.65693 40.745..."
3,18350,42089,32811,PA,2.0,1.0,L,"POLYGON ((-75.52129 41.14508, -75.48068 41.135..."
4,23183,51073,11727,VA,0.0,1.0,K,
...,...,...,...,...,...,...,...,...
52320,76305,48077,2863,,,,,
52321,97369,41041,7925,,,,,
52322,98632,53015,25517,,,,,
52323,98632,53069,767,,,,,
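
The empty rows at the end of the output above are what a left merge produces when a key pair has no partner. A plausible reading, sketched below on two rows taken from the outputs, is that ```df_merged``` kept only one zipcode per county, so a pair with a different zipcode finds no match. The frame names and the `county_total` column are hypothetical.

```python
import pandas as pd

# Zip/county pairs with county totals (values copied from the outputs above).
pairs = pd.DataFrame({
    "zip": ["18833", "76305"],
    "county_fips": ["42113", "48077"],
    "county_total": [1101, 2863],
})
# One retained household row per county; county 48077 kept zip 76310.
firsts = pd.DataFrame({
    "zip": ["18833", "76310"],
    "county_fips": ["42113", "48077"],
    "has_child": [0, 0],
})

# Both keys must match; unmatched left rows are filled with NaN.
both = pairs.merge(firsts, on=["county_fips", "zip"], how="left")
print(both)
print(both["has_child"].dtype)
```

The missing values also explain why integer columns such as ```has_child``` are displayed as floats after the merge.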


In [33]:
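# This cell needs the 'zip' column produced by merging on ['county_fips', 'zip'];
# after merging on 'county_fips' alone the columns are 'zip_x'/'zip_y' and the selection raises a KeyError.
df_final = df_final[['zip', 'county_fips', 'state', 'has_child', 'child_num_y', 'age_code', 'geometry']]
# Assign the renamed result instead of renaming a slice in place
df_final = df_final.rename(columns = {'child_num_y' : 'child_num'})
df_final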

KeyError: "['zip'] not in index"

### Convert the dataframe into a geodataframe

Convert the merged dataframe to a geodataframe so that it can be exported as a shapefile.

In [31]:
gpd_df = gpd.GeoDataFrame(df_final)
gpd_df

Unnamed: 0,zip,county_fips,state,has_child,child_num,age_code,geometry
0,18833,42113,PA,0.0,1101,K,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
1,18833,42015,PA,0.0,13048,H,"POLYGON ((-76.68205 41.60605, -76.68016 41.605..."
2,18079,42077,PA,0.0,103995,G,"POLYGON ((-75.66384 40.74535, -75.65693 40.745..."
3,18350,42089,PA,1.0,32811,L,"POLYGON ((-75.52129 41.14508, -75.48068 41.135..."
4,23183,51073,VA,1.0,11727,K,
...,...,...,...,...,...,...,...
52320,76305,48077,,,2863,,
52321,97369,41041,,,7925,,
52322,98632,53015,,,25517,,
52323,98632,53069,,,767,,


### Export to shp file

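Use the ```to_file``` method to export the geodataframe to the file path given in the parentheses. The shapefile format limits column names to 10 characters, so a longer name such as ```county_fips``` is truncated in the exported file.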

In [36]:
gpd_df.to_file('/hpc/group/codeplus22-vis/infousa_copy/children_count_by_county.shp')

In [2]:
df_test = gpd.read_file('/hpc/group/codeplus22-vis/infousa_copy/children_count_by_county.shp')