# Calculating the shortest distance from every household with children in the United States to the nearest storage tank

### Import statements
This notebook also requires dask-geopandas and pygeos. After installing these packages once, you can leave the install cells below commented out (you don't need to reinstall them).

In [1]:
# pip install dask-geopandas==0.1.0a4

In [2]:
# pip install pygeos

In [1]:
import dask
import pandas as pd
from dask import dataframe as dd
import dask_geopandas
import geopandas as gpd
import numpy as np




### Reading InfoUSA data

In [2]:
%%time
df = pd.read_parquet('/hpc/group/codeplus22-vis/celine_data/zip_00_99.parquet')
df

CPU times: user 41.3 s, sys: 23.7 s, total: 1min 5s
Wall time: 29.6 s


Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,18833,42113,PA,0,0,K,41.546738,-76.540436,-8.520442e+06,5.093323e+06
1,18833,42015,PA,0,0,H,41.590800,-76.424200,-8.507503e+06,5.099879e+06
2,18833,42015,PA,1,1,C,41.600392,-76.441724,-8.509454e+06,5.101307e+06
3,18833,42015,PA,0,0,L,41.592483,-76.437832,-8.509021e+06,5.100129e+06
4,18833,42015,PA,1,1,H,41.566196,-76.347977,-8.499018e+06,5.096218e+06
...,...,...,...,...,...,...,...,...,...,...
190987608,92003,06073,CA,0,0,C,33.285885,-117.240445,-1.305115e+07,3.933312e+06
190987609,92003,06073,CA,0,0,E,33.284700,-117.210800,-1.304785e+07,3.933154e+06
190987610,92003,06073,CA,0,0,G,33.282869,-117.183963,-1.304486e+07,3.932911e+06
190987611,92003,06073,CA,0,0,H,33.278284,-117.181181,-1.304455e+07,3.932300e+06


Filter for only households with children. Households in Alaska, Hawaii and Puerto Rico should also be excluded (there are no storage tanks in those areas); note that the cell below filters only on ```has_child```.

In [3]:
%%time
df_hh = df[(df['has_child'] == 1)]
df_hh

CPU times: user 4.96 s, sys: 5.52 s, total: 10.5 s
Wall time: 8.18 s


Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
2,18833,42015,PA,1,1,C,41.600392,-76.441724,-8.509454e+06,5.101307e+06
4,18833,42015,PA,1,1,H,41.566196,-76.347977,-8.499018e+06,5.096218e+06
8,18833,42015,PA,0,1,E,41.587904,-76.324061,-8.496356e+06,5.099448e+06
16,18833,42015,PA,1,1,G,41.612450,-76.446301,-8.509963e+06,5.103102e+06
17,18833,42015,PA,1,1,G,41.585339,-76.431989,-8.508370e+06,5.099066e+06
...,...,...,...,...,...,...,...,...,...,...
190987519,92003,06073,CA,0,1,H,33.263291,-117.229201,-1.304989e+07,3.930304e+06
190987545,92003,06073,CA,0,1,F,33.295585,-117.189475,-1.304547e+07,3.934604e+06
190987548,92003,06073,CA,2,1,L,33.292877,-117.212471,-1.304803e+07,3.934243e+06
190987552,92003,06073,CA,1,1,D,33.284700,-117.210800,-1.304785e+07,3.933154e+06


### Use Dask to transform pandas dataframe to a geopandas dataframe
The code we use to calculate the shortest distance from each household to a tank requires a GeoDataFrame, so we must convert our dataframe ```df_hh```. However, this dataframe has 53 million rows and 10 columns, and converting it without Dask was not feasible: our attempt ran for three hours without finishing. Hence, we turned to Dask, an open-source Python library for parallel computing, which lets us carry out the conversion efficiently even with over 53 million rows.

To use Dask, we first convert our pandas dataframe to a Dask dataframe with Dask's ```from_pandas()``` function. This function takes our pandas dataframe along with the ```npartitions``` parameter, which specifies the number of partitions ('sections') the Dask dataframe is split into.

In [4]:
df_dask = dd.from_pandas(df_hh, npartitions = 500)
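
To build intuition for what ```npartitions``` controls, here is a minimal sketch in plain pandas and numpy rather than Dask. It splits a toy dataframe into contiguous row blocks, applies the same transformation to each block independently, and reassembles the result. The helper ```split_into_partitions``` and the toy data are hypothetical and only illustrate the idea; Dask additionally schedules the per-partition work in parallel and lazily.

```python
import numpy as np
import pandas as pd


def split_into_partitions(df, npartitions):
    """Split df into npartitions contiguous row blocks (a stand-in for Dask partitions)."""
    # Split the row positions, then slice the frame by position
    position_blocks = np.array_split(np.arange(len(df)), npartitions)
    return [df.iloc[block] for block in position_blocks]


# Hypothetical toy data, not the InfoUSA table
toy = pd.DataFrame({
    "lat_h_4326": [41.60, 41.57, 41.59, 33.26, 33.30],
    "lon_h_4326": [-76.44, -76.35, -76.32, -117.23, -117.19],
})

partitions = split_into_partitions(toy, npartitions=2)

# Apply the same transformation to every partition independently, then reassemble
processed = [part.assign(lat_rad=np.deg2rad(part["lat_h_4326"])) for part in partitions]
result = pd.concat(processed)

print([len(part) for part in partitions])
print(result.shape)
```

More partitions mean smaller blocks and more scheduling overhead, while fewer partitions mean larger blocks that each need more memory, so the value of 500 used above is a tuning choice rather than a required setting.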

Then, we specify which manipulation of the Dask dataframe ```df_dask``` to compute. In this case, we use Dask-GeoPandas' ```points_from_xy()``` function to build a point ```geometry``` column from the longitude and latitude columns.

In [5]:
%%time
df_dask['geometry'] = dask_geopandas.points_from_xy(df_dask, 'lon_h_4326', 'lat_h_4326')

CPU times: user 1min 5s, sys: 20.8 s, total: 1min 25s
Wall time: 1min 26s


Next, we convert the Dask dataframe into a Dask-GeoPandas GeoDataFrame. This step is lazy, so it only takes a few milliseconds:

In [6]:
%%time
gdf = dask_geopandas.from_dask_dataframe(df_dask)

CPU times: user 10.1 ms, sys: 1.07 ms, total: 11.2 ms
Wall time: 10.3 ms


Calling ```compute()``` puts all the above code into action. Dask executes the commands on each partition, as specified above. This returns the GeoDataFrame ```gdf_hh```, with over 53 million rows, in about 22 seconds.

In [7]:
%%time
gdf_hh = gdf.compute()

CPU times: user 27.4 s, sys: 17.6 s, total: 45 s
Wall time: 21.7 s


In [8]:
gdf_hh = gdf_hh.reset_index(drop = True)
gdf_hh

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857,geometry
0,18833,42015,PA,1,1,C,41.600392,-76.441724,-8.509454e+06,5.101307e+06,POINT (-76.44172 41.60039)
1,18833,42015,PA,1,1,H,41.566196,-76.347977,-8.499018e+06,5.096218e+06,POINT (-76.34798 41.56620)
2,18833,42015,PA,0,1,E,41.587904,-76.324061,-8.496356e+06,5.099448e+06,POINT (-76.32406 41.58790)
3,18833,42015,PA,1,1,G,41.612450,-76.446301,-8.509963e+06,5.103102e+06,POINT (-76.44630 41.61245)
4,18833,42015,PA,1,1,G,41.585339,-76.431989,-8.508370e+06,5.099066e+06,POINT (-76.43199 41.58534)
...,...,...,...,...,...,...,...,...,...,...,...
53067356,92003,06073,CA,0,1,H,33.263291,-117.229201,-1.304989e+07,3.930304e+06,POINT (-117.22920 33.26329)
53067357,92003,06073,CA,0,1,F,33.295585,-117.189475,-1.304547e+07,3.934604e+06,POINT (-117.18948 33.29559)
53067358,92003,06073,CA,2,1,L,33.292877,-117.212471,-1.304803e+07,3.934243e+06,POINT (-117.21247 33.29288)
53067359,92003,06073,CA,1,1,D,33.284700,-117.210800,-1.304785e+07,3.933154e+06,POINT (-117.21080 33.28470)


We keep only the ```geometry``` column, as it is the only one we need to run the code below.

In [9]:
gdf_hh = gdf_hh[['geometry']]
gdf_hh

Unnamed: 0,geometry
0,POINT (-76.44172 41.60039)
1,POINT (-76.34798 41.56620)
2,POINT (-76.32406 41.58790)
3,POINT (-76.44630 41.61245)
4,POINT (-76.43199 41.58534)
...,...
53067356,POINT (-117.22920 33.26329)
53067357,POINT (-117.18948 33.29559)
53067358,POINT (-117.21247 33.29288)
53067359,POINT (-117.21080 33.28470)


### Reading AST data
We read the tank data and convert it into a GeoDataFrame with point geometries built from the center latitude and longitude of each tank. Then, we keep only the columns we need.

### Importing risk of tanks

This dataframe contains the coordinates of the tanks as well as the six risk scores associated with each tank. Below, we also drop unused columns.

In [10]:
df_tanks = gpd.read_file('/hpc/group/codeplus22-vis/celine_data/tanks_risk_score.shp')
df_tanks

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,on_floodpl,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,adj_risk,geometry
0,New York,closed_roof_tank,39.6,40.625572,-73.745231,-8.209282e+06,4.957270e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,"POLYGON ((-73.74547 40.62575, -73.74500 40.625..."
1,New York,closed_roof_tank,19.8,40.624761,-73.744420,-8.209191e+06,4.957151e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,"POLYGON ((-73.74465 40.62485, -73.74419 40.624..."
2,New York,closed_roof_tank,12.6,40.626086,-73.746257,-8.209396e+06,4.957345e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,"POLYGON ((-73.74633 40.62615, -73.74618 40.626..."
3,New York,closed_roof_tank,30.6,40.625786,-73.746203,-8.209390e+06,4.957301e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,"POLYGON ((-73.74639 40.62593, -73.74601 40.625..."
4,New York,closed_roof_tank,24.0,40.625781,-73.745813,-8.209346e+06,4.957300e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,"POLYGON ((-73.74595 40.62590, -73.74567 40.625..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98164,Colorado,narrow_closed_roof_tank,5.4,39.777431,-104.920718,-1.167972e+07,4.833652e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,"POLYGON ((-104.92075 39.77746, -104.92069 39.7..."
98165,Colorado,narrow_closed_roof_tank,4.8,39.777301,-104.920631,-1.167971e+07,4.833633e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,"POLYGON ((-104.92066 39.77732, -104.92060 39.7..."
98166,Colorado,narrow_closed_roof_tank,3.6,39.777701,-104.920609,-1.167971e+07,4.833691e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,"POLYGON ((-104.92064 39.77772, -104.92058 39.7..."
98167,Colorado,narrow_closed_roof_tank,4.8,39.776628,-104.920617,-1.167971e+07,4.833535e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,"POLYGON ((-104.92065 39.77665, -104.92059 39.7..."


### Converting pandas dataframe to GeoDataFrame

In [11]:
gdf_tanks = gpd.GeoDataFrame(
    df_tanks, geometry=gpd.points_from_xy(df_tanks.lon_t_4326, df_tanks.lat_t_4326))

In [12]:
gdf_tanks = gdf_tanks.drop(['state', 'tank_type', 'diameter', 'county', 'on_floodpl', 'adj_risk'], axis = 1)
gdf_tanks

Unnamed: 0,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,geometry
0,40.625572,-73.745231,-8.209282e+06,4.957270e+06,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,POINT (-73.74523 40.62557)
1,40.624761,-73.744420,-8.209191e+06,4.957151e+06,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,POINT (-73.74442 40.62476)
2,40.626086,-73.746257,-8.209396e+06,4.957345e+06,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,POINT (-73.74626 40.62609)
3,40.625786,-73.746203,-8.209390e+06,4.957301e+06,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,POINT (-73.74620 40.62579)
4,40.625781,-73.745813,-8.209346e+06,4.957300e+06,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,POINT (-73.74581 40.62578)
...,...,...,...,...,...,...,...,...,...,...,...,...
98164,39.777431,-104.920718,-1.167972e+07,4.833652e+06,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,POINT (-104.92072 39.77743)
98165,39.777301,-104.920631,-1.167971e+07,4.833633e+06,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,POINT (-104.92063 39.77730)
98166,39.777701,-104.920609,-1.167971e+07,4.833691e+06,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,POINT (-104.92061 39.77770)
98167,39.776628,-104.920617,-1.167971e+07,4.833535e+06,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,POINT (-104.92062 39.77663)


### Finding the closest tank to each household
To find the tank nearest to each household, we use an algorithm developed by the University of Helsinki. This code is copyrighted and licensed under the Creative Commons Attribution-ShareAlike 4.0 International licence, and is available to the public to share and adapt, as long as it is attributed correctly and re-shared under the same licence if edits are made. The material can be found [here](https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html). We do not use this algorithm's own distance calculation between the two points. The reasoning for this is explained in further detail below.

These functions use the ```BallTree``` class from the sklearn neighbors module, a spatial index for fast nearest-neighbor search, to identify the closest tank to each household. ```nearest_neighbor``` returns a GeoDataFrame with the same number of rows as the input households GeoDataFrame, where each row corresponds to the row with the same index in the households GeoDataFrame. It also retains all the original columns of the input tanks GeoDataFrame.

In [13]:
from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15)

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)


def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * (np.pi / 180), geom.y * (np.pi / 180))).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * (np.pi / 180), geom.y * (np.pi / 180))).to_list())

    # Find the nearest points
    # -----------------------
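    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance to that point in the tree's metric. BallTree is built above without
    #          metric='haversine', so this is Euclidean distance on (lon, lat) radians, not meters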

    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)

    # Add distance if requested
    if return_dist:
        # Convert to meters from radians (only meaningful when the tree uses the haversine metric)
        earth_radius = 6371009  # meters
        closest_points['distance'] = dist * earth_radius

    return closest_points
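
Before running the tree-based search on all 53 million households, it can help to check its answers on a small sample. The sketch below is a brute-force check in numpy, not the ```BallTree``` approach used above: it computes the great-circle (haversine) distance from every sample household to every sample tank and takes the minimum. The function names are hypothetical, and the sample coordinates are copied from the tables in this notebook.

```python
import numpy as np

EARTH_RADIUS_M = 6371009  # mean Earth radius in meters, as in nearest_neighbor above


def haversine_matrix_m(src_deg, cand_deg):
    """Great-circle distances in meters between every source and every candidate.

    Both inputs are sequences of (lat, lon) pairs in degrees.
    """
    src = np.deg2rad(np.asarray(src_deg, dtype=float))
    cand = np.deg2rad(np.asarray(cand_deg, dtype=float))
    # Broadcast sources down the rows and candidates across the columns
    lat1, lon1 = src[:, 0][:, None], src[:, 1][:, None]
    lat2, lon2 = cand[:, 0][None, :], cand[:, 1][None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def brute_force_nearest(src_deg, cand_deg):
    """Return (index of nearest candidate, distance in meters) for each source point."""
    dist = haversine_matrix_m(src_deg, cand_deg)
    closest = dist.argmin(axis=1)
    return closest, dist[np.arange(len(closest)), closest]


# (lat, lon) pairs copied from the household and tank tables in this notebook
households = [(41.600392, -76.441724), (33.263291, -117.229201)]
tanks = [(40.625572, -73.745231), (41.263274, -76.905421), (32.794111, -117.114344)]

closest, dist_m = brute_force_nearest(households, tanks)
print(closest, dist_m)
```

This check builds a full households-by-tanks distance matrix, so it is only practical for samples of a few thousand points; the tree exists precisely to avoid that cost at full scale.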

Below, you can see that the output dataframe has 53,067,360 rows, the same number of rows as the input ```gdf_hh``` GeoDataFrame, and the same columns as the input ```gdf_tanks``` GeoDataFrame. The tank at index 0 in ```df_closest_tanks``` is the tank nearest to the household at index 0 in ```gdf_hh```, which is in the same order as ```df_hh```, and so on.

In [14]:
%%time
df_closest_tanks = nearest_neighbor(gdf_hh, gdf_tanks)
df_closest_tanks

CPU times: user 1h 21min 15s, sys: 31.2 s, total: 1h 21min 46s
Wall time: 1h 22min 5s


Unnamed: 0,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,geometry
0,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,POINT (-76.90542 41.26327)
1,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,POINT (-75.92870 41.29908)
2,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,POINT (-75.92870 41.29908)
3,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,POINT (-76.90542 41.26327)
4,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,POINT (-76.90542 41.26327)
...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,POINT (-117.11434 32.79411)
53067357,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,POINT (-117.11434 32.79411)
53067358,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,POINT (-117.11434 32.79411)
53067359,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,POINT (-117.11434 32.79411)


In [22]:
df_closest_tanks = df_closest_tanks.drop(['geometry'], axis = 1)
df_closest_tanks

Therefore, merging ```df_closest_tanks``` and ```df_hh_lat_lon``` creates a new dataframe, ```df_closest_tanks_hh```, in which the coordinates of each household sit alongside those of the tank nearest to it. This information is what we use to calculate distance. We create the new dataframe ```df_hh_lat_lon``` from ```df_hh``` and keep only the latitude and longitude of each household, as these are the only two columns we need to merge with ```df_closest_tanks``` in order to compute the distance between each household and its nearest tank.

In [23]:
df_closest_tanks = df_closest_tanks.reset_index()
df_hh = df_hh.reset_index()

In [24]:
df_hh_lat_lon = df_hh[['lat_h_4326', 'lon_h_4326']]
df_hh_lat_lon = df_hh_lat_lon.reset_index()
df_hh_lat_lon

Unnamed: 0,index,lat_h_4326,lon_h_4326
0,0,41.600392,-76.441724
1,1,41.566196,-76.347977
2,2,41.587904,-76.324061
3,3,41.612450,-76.446301
4,4,41.585339,-76.431989
...,...,...,...
53067356,53067356,33.263291,-117.229201
53067357,53067357,33.295585,-117.189475
53067358,53067358,33.292877,-117.212471
53067359,53067359,33.284700,-117.210800


In [25]:
%%time
df_closest_tanks_hh = df_hh_lat_lon.merge(df_closest_tanks, left_index=True, right_index = True)
df_closest_tanks_hh = df_closest_tanks_hh.drop(['index_x', 'index_y'], axis = 1)
df_closest_tanks_hh

CPU times: user 6.74 s, sys: 10.4 s, total: 17.1 s
Wall time: 17.1 s


Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk
0,41.600392,-76.441724,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660
1,41.566196,-76.347977,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825
2,41.587904,-76.324061,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825
3,41.612450,-76.446301,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660
4,41.585339,-76.431989,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660
...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,33.263291,-117.229201,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785
53067357,33.295585,-117.189475,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785
53067358,33.292877,-117.212471,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785
53067359,33.284700,-117.210800,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785
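
The merge above pairs rows by index label, which is why both dataframes are reset to a fresh 0..n-1 index first. The sketch below, using hypothetical toy frames, shows what can go wrong if a filtered dataframe keeps its original labels: an index merge then matches labels, not row positions.

```python
import pandas as pd

# Hypothetical toy frames standing in for df_hh_lat_lon and df_closest_tanks.
# The household frame keeps the labels left over from filtering (2, 4, 8).
hh = pd.DataFrame({"lat_h_4326": [41.60, 41.57, 33.26]}, index=[2, 4, 8])
tanks = pd.DataFrame({"lat_t_4326": [41.26, 41.30, 32.79]})  # labels 0, 1, 2

# Without resetting, only label 2 exists on both sides, and it pairs
# the first household with the third tank
misaligned = hh.merge(tanks, left_index=True, right_index=True)

# After resetting, row i of one frame pairs with row i of the other
aligned = hh.reset_index(drop=True).merge(tanks, left_index=True, right_index=True)

print(len(misaligned), len(aligned))
```

Comparing the row count of the merged dataframe with the row count of the inputs is a quick way to confirm that no rows were silently dropped.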


To compute the distance between the two sets of coordinates (those of the household and those of its nearest tank), we use the haversine library. Its ```haversine()``` function returns the great-circle distance between two (latitude, longitude) pairs in EPSG 4326 coordinates, in kilometers. We multiply the value by 1,000 to get the distance in meters.

In [26]:
import haversine as hs

In [27]:
%%time

def distancer(row):
    coords_1 = (row['lat_h_4326'], row['lon_h_4326'])
    coords_2 = (row['lat_t_4326'], row['lon_t_4326'])
    return (hs.haversine(coords_1, coords_2) * 1000)

df_closest_tanks_hh['distance_m'] = df_closest_tanks_hh.apply(distancer, axis=1)
df_closest_tanks_hh



CPU times: user 19min 58s, sys: 8.54 s, total: 20min 7s
Wall time: 20min 11s


Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m
0,41.600392,-76.441724,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53847.632898
1,41.566196,-76.347977,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,45869.438119
2,41.587904,-76.324061,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,46015.805516
3,41.612450,-76.446301,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,54518.419780
4,41.585339,-76.431989,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53297.730315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,33.263291,-117.229201,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,53258.019576
53067357,33.295585,-117.189475,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56199.475559
53067358,33.292877,-117.212471,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56209.490501
53067359,33.284700,-117.210800,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,55287.090852
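
The row-wise ```apply``` above took about 20 minutes. As a possible alternative, the haversine formula can be written directly in numpy so that it operates on whole columns at once. This is a sketch, not the code used in this notebook: the function name ```haversine_m``` is hypothetical, and the Earth radius of 6,371,008.8 m is an assumed mean value, so results may differ from the haversine library by a fraction of a meter.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance in meters; inputs are in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.deg2rad(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# First two household/tank pairs from the table above
pairs = pd.DataFrame({
    "lat_h_4326": [41.600392, 41.566196],
    "lon_h_4326": [-76.441724, -76.347977],
    "lat_t_4326": [41.263274, 41.299076],
    "lon_t_4326": [-76.905421, -75.928697],
})

# One call computes every row; no Python-level loop over rows
pairs["distance_m"] = haversine_m(
    pairs["lat_h_4326"], pairs["lon_h_4326"],
    pairs["lat_t_4326"], pairs["lon_t_4326"],
)
print(pairs["distance_m"])
```

The two sample rows should land within a few meters of the ```distance_m``` values shown in the table above, which makes them a convenient spot check before applying the function to the full dataframe.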


In [30]:
df_closest_tanks_hh['distance_mi']  = df_closest_tanks_hh['distance_m'] / 1609.344
df_closest_tanks_hh

Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m,distance_mi
0,41.600392,-76.441724,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53847.632898,33.459368
1,41.566196,-76.347977,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,45869.438119,28.501947
2,41.587904,-76.324061,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,46015.805516,28.592896
3,41.612450,-76.446301,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,54518.419780,33.876175
4,41.585339,-76.431989,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53297.730315,33.117674
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,33.263291,-117.229201,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,53258.019576,33.092999
53067357,33.295585,-117.189475,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56199.475559,34.920735
53067358,33.292877,-117.212471,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56209.490501,34.926958
53067359,33.284700,-117.210800,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,55287.090852,34.353806


Then, we categorize each household by its distance from the nearest tank. These boundaries were set by our researcher. Using numpy's ```select()``` function, we assign a value to each category. Households within 0.5 miles of a tank are marked as ```1```, households more than 0.5 miles and up to one mile away are marked as ```2```, and households more than one mile and up to five miles away are marked as ```3```. All other households are marked as ```4```.

In [31]:
import numpy as np
conditions = [(df_closest_tanks_hh['distance_mi'] <= 0.5),
              ((df_closest_tanks_hh['distance_mi'] > 0.5) & (df_closest_tanks_hh['distance_mi'] <= 1)),
              ((df_closest_tanks_hh['distance_mi'] > 1) & (df_closest_tanks_hh['distance_mi'] <= 5)),
              (df_closest_tanks_hh['distance_mi'] > 5)]



values = [1, 2, 3, 4]


df_closest_tanks_hh['distance_category'] = np.select(conditions, values)
df_closest_tanks_hh

Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m,distance_mi,distance_category
0,41.600392,-76.441724,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53847.632898,33.459368,4
1,41.566196,-76.347977,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,45869.438119,28.501947,4
2,41.587904,-76.324061,41.299076,-75.928697,-8.452344e+06,5.056556e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,46015.805516,28.592896,4
3,41.612450,-76.446301,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,54518.419780,33.876175,4
4,41.585339,-76.431989,41.263274,-76.905421,-8.561072e+06,5.051252e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53297.730315,33.117674,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,33.263291,-117.229201,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,53258.019576,33.092999,4
53067357,33.295585,-117.189475,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56199.475559,34.920735,4
53067358,33.292877,-117.212471,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56209.490501,34.926958,4
53067359,33.284700,-117.210800,32.794111,-117.114344,-1.303711e+07,3.868007e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,55287.090852,34.353806,4
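
To make the boundary behavior concrete, the sketch below applies the same conditions to a handful of hypothetical distances, including values that fall exactly on a boundary. It also shows ```pd.cut``` as one possible equivalent; this alternative is not used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical distances in miles, including the exact boundary values
distance_mi = pd.Series([0.2, 0.5, 0.75, 1.0, 3.0, 5.0, 33.46])

conditions = [
    distance_mi <= 0.5,
    (distance_mi > 0.5) & (distance_mi <= 1),
    (distance_mi > 1) & (distance_mi <= 5),
    distance_mi > 5,
]
via_select = np.select(conditions, [1, 2, 3, 4])

# pd.cut uses right-closed bins (lower, upper] by default,
# which matches the <= comparisons above
via_cut = pd.cut(
    distance_mi,
    bins=[-np.inf, 0.5, 1, 5, np.inf],
    labels=[1, 2, 3, 4],
).astype(int)

print(via_select.tolist())
print(via_cut.tolist())
```

A household exactly 0.5, 1, or 5 miles from a tank falls into the closer category in both versions.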


Then, we merge the ```df_closest_tanks_hh``` dataframe with the ```df_hh``` dataframe to add back in the demographic data for each household, which we will use in our visualizations, and drop unnecessary columns to prepare for GPU visualizations.

In [32]:
df_hh = df_hh.drop(['index'], axis = 1)
df_hh = df_hh.reset_index()
df_closest_tanks_hh = df_closest_tanks_hh.reset_index()

In [36]:
df = df_hh.merge(df_closest_tanks_hh, left_index = True, right_index = True)
df

Unnamed: 0,index_x,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326_x,lon_h_4326_x,lat_h_3857,...,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m,distance_mi,distance_category
0,0,18833,42015,PA,1,1,C,41.600392,-76.441724,-8.509454e+06,...,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53847.632898,33.459368,4
1,1,18833,42015,PA,1,1,H,41.566196,-76.347977,-8.499018e+06,...,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,45869.438119,28.501947,4
2,2,18833,42015,PA,0,1,E,41.587904,-76.324061,-8.496356e+06,...,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,46015.805516,28.592896,4
3,3,18833,42015,PA,1,1,G,41.612450,-76.446301,-8.509963e+06,...,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,54518.419780,33.876175,4
4,4,18833,42015,PA,1,1,G,41.585339,-76.431989,-8.508370e+06,...,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,53297.730315,33.117674,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,53067356,92003,06073,CA,0,1,H,33.263291,-117.229201,-1.304989e+07,...,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,53258.019576,33.092999,4
53067357,53067357,92003,06073,CA,0,1,F,33.295585,-117.189475,-1.304547e+07,...,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56199.475559,34.920735,4
53067358,53067358,92003,06073,CA,2,1,L,33.292877,-117.212471,-1.304803e+07,...,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,56209.490501,34.926958,4
53067359,53067359,92003,06073,CA,1,1,D,33.284700,-117.210800,-1.304785e+07,...,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,55287.090852,34.353806,4


Dropping unnecessary columns and renaming

In [37]:
df = df.drop(['index_x', 'index_y', 'has_child', 'lat_h_4326_x', 'lon_h_4326_x', 'lat_t_3857', 'lon_t_3857', 'state', 'county_fips', 'zip','distance_m'], axis = 1)
df = df.rename(columns = {'lat_h_4326_y': 'lat_h_4326', 'lon_h_4326_y': 'lon_h_4326'})
df

Unnamed: 0,child_num,age_code,lat_h_3857,lon_h_3857,lat_h_4326,lon_h_4326,lat_t_4326,lon_t_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category
0,1,C,-8.509454e+06,5.101307e+06,41.600392,-76.441724,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.459368,4
1,1,H,-8.499018e+06,5.096218e+06,41.566196,-76.347977,41.299076,-75.928697,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.501947,4
2,0,E,-8.496356e+06,5.099448e+06,41.587904,-76.324061,41.299076,-75.928697,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.592896,4
3,1,G,-8.509963e+06,5.103102e+06,41.612450,-76.446301,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.876175,4
4,1,G,-8.508370e+06,5.099066e+06,41.585339,-76.431989,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.117674,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,0,H,-1.304989e+07,3.930304e+06,33.263291,-117.229201,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,33.092999,4
53067357,0,F,-1.304547e+07,3.934604e+06,33.295585,-117.189475,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.920735,4
53067358,2,L,-1.304803e+07,3.934243e+06,33.292877,-117.212471,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.926958,4
53067359,1,D,-1.304785e+07,3.933154e+06,33.284700,-117.210800,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.353806,4


In [3]:


# df = pd.read_parquet('/hpc/group/codeplus22-vis/infousa_copy/distances_temp.parquet')
# df

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category
0,1,C,-8.509454e+06,5.101307e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.459368,4
1,1,H,-8.499018e+06,5.096218e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.501947,4
2,0,E,-8.496356e+06,5.099448e+06,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.592896,4
3,1,G,-8.509963e+06,5.103102e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.876175,4
4,1,G,-8.508370e+06,5.099066e+06,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.117674,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,0,H,-1.304989e+07,3.930304e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,33.092999,4
53067357,0,F,-1.304547e+07,3.934604e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.920735,4
53067358,2,L,-1.304803e+07,3.934243e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.926958,4
53067359,1,D,-1.304785e+07,3.933154e+06,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.353806,4


### Renaming household latitude and longitude coordinate columns

Since we want to plot both tanks and households together in the GPU visualization, we need to append the tank latitude and longitude coordinates to the same columns as the household latitude and longitude coordinates. To do so, we first rename the household columns so that both sets of coordinates share the same column names. Then, we can use the ```append``` function to combine them into the same columns.

In [38]:
# Renaming: drop the "_h" (household) marker from the coordinate columns
df.rename(columns = {'lat_h_3857': 'lat_3857', 'lon_h_3857': 'lon_3857',
                     'lat_h_4326': 'lat_4326', 'lon_h_4326': 'lon_4326'}, inplace = True)

df

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,lat_t_4326,lon_t_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category
0,1,C,-8.509454e+06,5.101307e+06,41.600392,-76.441724,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.459368,4
1,1,H,-8.499018e+06,5.096218e+06,41.566196,-76.347977,41.299076,-75.928697,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.501947,4
2,0,E,-8.496356e+06,5.099448e+06,41.587904,-76.324061,41.299076,-75.928697,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.592896,4
3,1,G,-8.509963e+06,5.103102e+06,41.612450,-76.446301,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.876175,4
4,1,G,-8.508370e+06,5.099066e+06,41.585339,-76.431989,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.117674,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,0,H,-1.304989e+07,3.930304e+06,33.263291,-117.229201,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,33.092999,4
53067357,0,F,-1.304547e+07,3.934604e+06,33.295585,-117.189475,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.920735,4
53067358,2,L,-1.304803e+07,3.934243e+06,33.292877,-117.212471,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.926958,4
53067359,1,D,-1.304785e+07,3.933154e+06,33.284700,-117.210800,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.353806,4


### Defining ```is_elderly```

This code flags whether or not a household has elderly. We define elderly as 65 years old and up, so we look for rows where ```age_code``` is ```J```, ```K```, ```L```, or ```M```. Those rows are assigned a 1 in the ```is_elderly``` column. Every other household (the condition is false) is assigned a 2.

In [39]:
df['is_elderly'] = np.where(df['age_code'].isin(['J', 'K', 'L', 'M']), 1, 2)

df

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,lat_t_4326,lon_t_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,1,C,-8.509454e+06,5.101307e+06,41.600392,-76.441724,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.459368,4,2
1,1,H,-8.499018e+06,5.096218e+06,41.566196,-76.347977,41.299076,-75.928697,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.501947,4,2
2,0,E,-8.496356e+06,5.099448e+06,41.587904,-76.324061,41.299076,-75.928697,4.881886,15.876431,4.895073,24.892845,-1.000000,30.218719,13.460825,28.592896,4,2
3,1,G,-8.509963e+06,5.103102e+06,41.612450,-76.446301,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.876175,4,2
4,1,G,-8.508370e+06,5.099066e+06,41.585339,-76.431989,41.263274,-76.905421,2.050670,15.375901,5.380037,14.512438,-1.000000,17.062917,9.063660,33.117674,4,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53067356,0,H,-1.304989e+07,3.930304e+06,33.263291,-117.229201,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,33.092999,4,2
53067357,0,F,-1.304547e+07,3.934604e+06,33.295585,-117.189475,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.920735,4,2
53067358,2,L,-1.304803e+07,3.934243e+06,33.292877,-117.212471,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.926958,4,1
53067359,1,D,-1.304785e+07,3.933154e+06,33.284700,-117.210800,32.794111,-117.114344,34.617855,11.334705,1.771182,19.203448,2.036342,18.929178,14.648785,34.353806,4,2


For the tanks dataset, we set the ```is_elderly``` column to 0; 0 will represent a tank in the GPU visualizations.

In [40]:
df_tanks['is_elderly'] = 0
# df_tanks

### Renaming tank dataframe latitude and longitude columns

We rename the tank latitude and longitude columns so that they match the lat/lon column names of the household data; the two dataframes can only be stacked into shared columns if the names agree. We also drop the ```lat_t_4326``` and ```lon_t_4326``` columns from ```df```: these are the tank coordinates in the 4326 projection stored on the household rows, and they are not used in our GPU visualizations.

In [41]:
# Drop the "_t" (tank) marker so the names match the household columns
df_tanks.rename(columns = {'lat_t_3857': 'lat_3857', 'lon_t_3857': 'lon_3857',
                           'lat_t_4326': 'lat_4326', 'lon_t_4326': 'lon_4326'}, inplace = True)

# The household rows no longer need the tank coordinates in the 4326 projection
df = df.drop(['lat_t_4326', 'lon_t_4326'], axis = 1)

df_tanks

Unnamed: 0,state,tank_type,diameter,lat_4326,lon_4326,lat_3857,lon_3857,county,on_floodpl,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,adj_risk,geometry,is_elderly
0,New York,closed_roof_tank,39.6,40.625572,-73.745231,-8.209282e+06,4.957270e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,POINT (-73.74523 40.62557),0
1,New York,closed_roof_tank,19.8,40.624761,-73.744420,-8.209191e+06,4.957151e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,POINT (-73.74442 40.62476),0
2,New York,closed_roof_tank,12.6,40.626086,-73.746257,-8.209396e+06,4.957345e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,POINT (-73.74626 40.62609),0
3,New York,closed_roof_tank,30.6,40.625786,-73.746203,-8.209390e+06,4.957301e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,POINT (-73.74620 40.62579),0
4,New York,closed_roof_tank,24.0,40.625781,-73.745813,-8.209346e+06,4.957300e+06,36059,0,6.887656,14.447002,4.095282,13.081208,6.959016,14.834784,10.050825,10.050825,POINT (-73.74581 40.62578),0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98164,Colorado,narrow_closed_roof_tank,5.4,39.777431,-104.920718,-1.167972e+07,4.833652e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,POINT (-104.92072 39.77743),0
98165,Colorado,narrow_closed_roof_tank,4.8,39.777301,-104.920631,-1.167971e+07,4.833633e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,POINT (-104.92063 39.77730),0
98166,Colorado,narrow_closed_roof_tank,3.6,39.777701,-104.920609,-1.167971e+07,4.833691e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,POINT (-104.92061 39.77770),0
98167,Colorado,narrow_closed_roof_tank,4.8,39.776628,-104.920617,-1.167971e+07,4.833535e+06,08031,0,7.743007,12.625942,-1.000000,45.758161,-1.000000,6.179840,12.051158,12.051158,POINT (-104.92062 39.77663),0
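
Before stacking the two dataframes, it can help to confirm which columns they actually share. The following is a minimal sketch on toy data, not part of the original pipeline: the frames ```households``` and ```tanks``` and their values are made up for illustration, and only the column names are borrowed from the dataframes above.

```python
import pandas as pd

# Toy stand-ins for the household and tank dataframes (values are illustrative).
households = pd.DataFrame({
    "lat_3857": [-8.509454e06],
    "lon_3857": [5.101307e06],
    "age_code": ["C"],
    "is_elderly": [2],
})
tanks = pd.DataFrame({
    "lat_t_3857": [-8.209282e06],
    "lon_t_3857": [4.957270e06],
    "tank_type": ["closed_roof_tank"],
    "is_elderly": [0],
})

# Strip the "_t" marker so the tank coordinates use the household column names.
tanks = tanks.rename(columns={"lat_t_3857": "lat_3857", "lon_t_3857": "lon_3857"})

# Columns present in both frames line up when stacked; the rest are filled with NaN.
shared = sorted(set(households.columns) & set(tanks.columns))
only_households = sorted(set(households.columns) - set(tanks.columns))
only_tanks = sorted(set(tanks.columns) - set(households.columns))

print("shared:", shared)
print("households only:", only_households)
print("tanks only:", only_tanks)
```

Any column listed as "households only" or "tanks only" will contain missing values for the other kind of row after the merge, so it either needs a fill value or should be dropped.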


### Merging tanks and households 

Here we append the tank rows to the household rows. Once the data is merged, we use the ```.drop()``` function to remove the tank-only columns, passing ```axis = 1``` because we are dropping columns rather than rows. Once the merged dataframe is complete, we will export it as a parquet file.

In [42]:
# pd.concat is the supported replacement for DataFrame.append,
# which was deprecated in pandas 1.4 and removed in pandas 2.0
df_merged = pd.concat([df, df_tanks], ignore_index=True)
df_merged


Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,...,distance_mi,distance_category,is_elderly,state,tank_type,diameter,county,on_floodpl,adj_risk,geometry
0,1.0,C,-8.509454e+06,5.101307e+06,41.600392,-76.441724,2.050670,15.375901,5.380037,14.512438,...,33.459368,4.0,2,,,,,,,
1,1.0,H,-8.499018e+06,5.096218e+06,41.566196,-76.347977,4.881886,15.876431,4.895073,24.892845,...,28.501947,4.0,2,,,,,,,
2,0.0,E,-8.496356e+06,5.099448e+06,41.587904,-76.324061,4.881886,15.876431,4.895073,24.892845,...,28.592896,4.0,2,,,,,,,
3,1.0,G,-8.509963e+06,5.103102e+06,41.612450,-76.446301,2.050670,15.375901,5.380037,14.512438,...,33.876175,4.0,2,,,,,,,
4,1.0,G,-8.508370e+06,5.099066e+06,41.585339,-76.431989,2.050670,15.375901,5.380037,14.512438,...,33.117674,4.0,2,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53165525,,,-1.167972e+07,4.833652e+06,39.777431,-104.920718,7.743007,12.625942,-1.000000,45.758161,...,,,0,Colorado,narrow_closed_roof_tank,5.4,08031,0.0,12.051158,POINT (-104.92072 39.77743)
53165526,,,-1.167971e+07,4.833633e+06,39.777301,-104.920631,7.743007,12.625942,-1.000000,45.758161,...,,,0,Colorado,narrow_closed_roof_tank,4.8,08031,0.0,12.051158,POINT (-104.92063 39.77730)
53165527,,,-1.167971e+07,4.833691e+06,39.777701,-104.920609,7.743007,12.625942,-1.000000,45.758161,...,,,0,Colorado,narrow_closed_roof_tank,3.6,08031,0.0,12.051158,POINT (-104.92061 39.77770)
53165528,,,-1.167971e+07,4.833535e+06,39.776628,-104.920617,7.743007,12.625942,-1.000000,45.758161,...,,,0,Colorado,narrow_closed_roof_tank,4.8,08031,0.0,12.051158,POINT (-104.92062 39.77663)
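
The output above shows two side effects of stacking dataframes whose columns only partly overlap: columns missing from one side are filled with NaN, and integer columns that receive NaN (such as ```child_num``` and ```distance_category```) are shown as floats. The sketch below reproduces this on toy data; the frames and values are illustrative assumptions, not the real datasets.

```python
import pandas as pd

# Toy household and tank frames sharing only the is_elderly column.
households = pd.DataFrame({
    "child_num": [1, 0],
    "age_code": ["C", "E"],
    "is_elderly": [2, 2],
})
tanks = pd.DataFrame({
    "tank_type": ["closed_roof_tank"],
    "is_elderly": [0],
})

# Stack the rows; ignore_index=True renumbers the result 0..n-1.
merged = pd.concat([households, tanks], ignore_index=True)

print(merged)
print(merged.dtypes)
```

In this toy case ```child_num``` becomes a float column because NaN has no integer representation in the default NumPy-backed dtypes, while ```is_elderly``` stays an integer because both frames supply it.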


In [43]:
df_merged = df_merged.drop(['state', 'tank_type', 'diameter', 'county', 'on_floodpl', 'adj_risk', 'geometry'], axis = 1)
df_merged

### Filling in the NAs in the ```age_code``` column

We fill the NAs in ```age_code``` with 'Z', which marks a row as a tank. We chose 'Z' because it is far from the letters used for actual age codes and won't be mistaken for an age category.

In [44]:
df_merged['age_code'] = df_merged['age_code'].fillna('Z')
df_merged

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,1.0,C,-8.509454e+06,5.101307e+06,41.600392,-76.441724,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.459368,4.0,2
1,1.0,H,-8.499018e+06,5.096218e+06,41.566196,-76.347977,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.501947,4.0,2
2,0.0,E,-8.496356e+06,5.099448e+06,41.587904,-76.324061,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.592896,4.0,2
3,1.0,G,-8.509963e+06,5.103102e+06,41.612450,-76.446301,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.876175,4.0,2
4,1.0,G,-8.508370e+06,5.099066e+06,41.585339,-76.431989,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.117674,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53165525,,Z,-1.167972e+07,4.833652e+06,39.777431,-104.920718,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,,0
53165526,,Z,-1.167971e+07,4.833633e+06,39.777301,-104.920631,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,,0
53165527,,Z,-1.167971e+07,4.833691e+06,39.777701,-104.920609,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,,0
53165528,,Z,-1.167971e+07,4.833535e+06,39.776628,-104.920617,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,,0


### Fill ```distance_category```

We fill the NAs in this column with 0 to represent the tanks. The other numbers denote the distance range between the household and a tank.

In [45]:
df_merged['distance_category'] = df_merged['distance_category'].fillna(0)
df_merged


Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,1.0,C,-8.509454e+06,5.101307e+06,41.600392,-76.441724,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.459368,4.0,2
1,1.0,H,-8.499018e+06,5.096218e+06,41.566196,-76.347977,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.501947,4.0,2
2,0.0,E,-8.496356e+06,5.099448e+06,41.587904,-76.324061,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.592896,4.0,2
3,1.0,G,-8.509963e+06,5.103102e+06,41.612450,-76.446301,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.876175,4.0,2
4,1.0,G,-8.508370e+06,5.099066e+06,41.585339,-76.431989,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.117674,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53165525,,Z,-1.167972e+07,4.833652e+06,39.777431,-104.920718,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,0.0,0
53165526,,Z,-1.167971e+07,4.833633e+06,39.777301,-104.920631,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,0.0,0
53165527,,Z,-1.167971e+07,4.833691e+06,39.777701,-104.920609,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,0.0,0
53165528,,Z,-1.167971e+07,4.833535e+06,39.776628,-104.920617,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,,0.0,0


### Filling in ```distance_mi``` column for the tanks dataframe

The distance column in the final merged dataframe holds the distance between a household and a tank. Tank rows have no such distance, and the range slider for distance in the visualization should only ever filter households. We therefore fill the tank rows with a value that cannot collide with a real household distance: we find the maximum distance and use a number slightly above it (the maximum is about 213 miles, so we fill the column with 215 miles).

In [55]:
df_merged['distance_mi'].max()

213.4276172929592

In [47]:
df_merged['distance_mi'] = df_merged['distance_mi'].fillna(215)
df_merged

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,1.0,C,-8.509454e+06,5.101307e+06,41.600392,-76.441724,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.459368,4.0,2
1,1.0,H,-8.499018e+06,5.096218e+06,41.566196,-76.347977,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.501947,4.0,2
2,0.0,E,-8.496356e+06,5.099448e+06,41.587904,-76.324061,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.592896,4.0,2
3,1.0,G,-8.509963e+06,5.103102e+06,41.612450,-76.446301,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.876175,4.0,2
4,1.0,G,-8.508370e+06,5.099066e+06,41.585339,-76.431989,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.117674,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53165525,,Z,-1.167972e+07,4.833652e+06,39.777431,-104.920718,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
53165526,,Z,-1.167971e+07,4.833633e+06,39.777301,-104.920631,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
53165527,,Z,-1.167971e+07,4.833691e+06,39.777701,-104.920609,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
53165528,,Z,-1.167971e+07,4.833535e+06,39.776628,-104.920617,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
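
The three fill steps above can also be written as a single ```fillna``` call with a per-column dictionary. The sketch below does this on toy data and derives the distance sentinel from the data instead of hard-coding it. The rule used here (round the maximum up, then add one mile) is an assumption chosen for illustration; it happens to give 215 for a maximum of about 213.43, but it is not necessarily how the value was picked in this notebook.

```python
import math

import pandas as pd

# Toy merged frame: two household rows and one tank row with missing values.
merged = pd.DataFrame({
    "age_code": ["C", "H", None],
    "distance_mi": [33.459368, 213.427617, None],
    "distance_category": [4.0, 4.0, None],
    "is_elderly": [2, 2, 0],
})

# Hypothetical sentinel rule: one mile above the rounded-up maximum distance.
sentinel = math.ceil(merged["distance_mi"].max()) + 1

# Fill each column with its own placeholder for tank rows.
filled = merged.fillna({
    "age_code": "Z",
    "distance_category": 0,
    "distance_mi": sentinel,
})

print(sentinel)
print(filled)
```

Deriving the sentinel from the data keeps it above every household distance even if the input changes, at the cost of the placeholder no longer being a fixed, well-known number.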


### Exporting to parquet file
Finally, we export this dataframe as a parquet file. It will be used in our visualizations.

In [48]:
df_merged.to_parquet('/hpc/group/codeplus22-vis/celine_data/dist_all_hh_with_children.parquet')

In [18]:
df = pd.read_parquet('/hpc/group/codeplus22-vis/celine_data/dist_all_hh_with_children.parquet')
df

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,1.0,C,-8.509454e+06,5.101307e+06,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.459368,4.0,2
1,1.0,H,-8.499018e+06,5.096218e+06,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.501947,4.0,2
2,2.0,E,-8.496356e+06,5.099448e+06,4.881886,15.876431,4.895073,24.892845,-1.0,30.218719,13.460825,28.592896,4.0,2
3,1.0,G,-8.509963e+06,5.103102e+06,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.876175,4.0,2
4,1.0,G,-8.508370e+06,5.099066e+06,2.050670,15.375901,5.380037,14.512438,-1.0,17.062917,9.063660,33.117674,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53165525,0.0,Z,-1.167972e+07,4.833652e+06,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
53165526,0.0,Z,-1.167971e+07,4.833633e+06,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
53165527,0.0,Z,-1.167971e+07,4.833691e+06,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
53165528,0.0,Z,-1.167971e+07,4.833535e+06,7.743007,12.625942,-1.000000,45.758161,-1.0,6.179840,12.051158,215.000000,0.0,0
