# Using Machine Learning to Calculate Shortest Distance Between Two Points
### Calculating the shortest distances between households and storage tanks in Harris and Charleston Counties

### Import libraries

In [5]:
import geopandas as gpd
import pandas as pd
import numpy as np
import haversine as hs
import os

### Setting ```DATA_DIR```
To read files from this repository, we set ```DATA_DIR``` to the data folder within it. This requires ```os.getcwd()``` to return the path to the processing folder of this repository, ```xxx/codeplus-celine-dcc-package/processing```, where ```xxx``` is the path to where you cloned the repository. If it returns a different path, call ```os.chdir(path)``` with ```path``` set to ```xxx/codeplus-celine-dcc-package/processing``` before running ```DATA_DIR = os.getcwd()```.

In [6]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
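
Note that ```str.replace``` substitutes every occurrence of ```'processing'``` in the path, so a clone location that itself contains that word would be altered too. Below is a minimal sketch of a stricter alternative, assuming the notebook runs from the ```processing``` folder. The helper name ```data_dir_from``` and the example path are hypothetical and not part of this repository:

```python
from pathlib import Path


def data_dir_from(cwd):
    """Return the sibling 'data' folder of a 'processing' working directory."""
    cwd = Path(cwd)
    if cwd.name != "processing":
        # Fail loudly instead of silently building a wrong path
        raise ValueError(f"expected to run from a 'processing' folder, got {cwd}")
    return cwd.parent / "data"


# Made-up clone location whose parent folder also contains the word 'processing'
example = data_dir_from("/home/user/processing_projects/codeplus-celine-dcc-package/processing")
print(example)
```

In the notebook itself, this would be called as ```data_dir_from(os.getcwd())```.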

### Reading in Harris County and Charleston County InfoUSA Data
When working with the original InfoUSA data, this step reads in the merged InfoUSA dataset created in processing notebook **01_merging_files**. However, since this repository uses synthetic test data, we created two separate datasets: one with synthetic InfoUSA data for Harris County and one with synthetic InfoUSA data for Charleston County. This was done in processing notebook **synthetic_infousa**.

The commented-out code in the cells below shows how you would read in the file produced by processing notebook **01_merging_files**, then filter for households within Harris County and then for those within Charleston County. Since we are using synthetic test dataframes for each county, we read them in separately instead.

In [3]:
# df = pd.read_parquet('/hpc/group/codeplus22-vis/infousa_copy/zip_00_99_final.parquet')
# df.head()

Unnamed: 0,zip,county,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,18833,113,PA,0,0,K,41.546738,-76.540436,-8520442.0,5093323.0
1,18833,15,PA,0,0,H,41.5908,-76.4242,-8507503.0,5099879.0
2,18833,15,PA,1,1,C,41.600392,-76.441724,-8509454.0,5101307.0
3,18833,15,PA,0,0,L,41.592483,-76.437832,-8509021.0,5100129.0
4,18833,15,PA,1,1,H,41.566196,-76.347977,-8499018.0,5096218.0


Since the InfoUSA dataframe above contains information from all zip codes, we filter by county FIPS code, which identifies both the state and the county, to select only observations for Harris County, Texas. We then drop the columns that we will not be working with.

In [12]:
# df_harris = df[df['county_fips'] == '48201']

df_harris = pd.read_parquet(DATA_DIR + '/source_files/infousa_files/harris_households.parquet')
df_harris = df_harris.drop(['zip', 'county_fips', 'state', 'child_num'], axis = 1)
df_harris

Unnamed: 0,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,0,H,29.808625,-95.165572,-1.059378e+07,3.478974e+06
1,1,A,29.759987,-95.051747,-1.058111e+07,3.472736e+06
2,0,J,30.061842,-95.007067,-1.057614e+07,3.511502e+06
3,0,G,30.101734,-95.229624,-1.060091e+07,3.516634e+06
4,1,I,29.666211,-95.619547,-1.064432e+07,3.460716e+06
...,...,...,...,...,...,...
499995,1,E,29.905114,-95.675370,-1.065053e+07,3.491359e+06
499996,1,M,30.028496,-95.755345,-1.065944e+07,3.507213e+06
499997,1,A,29.913911,-95.466206,-1.062725e+07,3.492489e+06
499998,1,F,30.069060,-95.759268,-1.065987e+07,3.512430e+06
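
The comparison against ```'48201'``` only matches if ```county_fips``` is stored as a zero-padded string. Below is a minimal sketch on a made-up toy dataframe that normalizes the column before filtering. The values and the integer dtype are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for the merged InfoUSA dataframe; all values are made up
toy = pd.DataFrame({
    "county_fips": [48201, 45019, 48201, 1001],
    "zip": [77001, 29401, 77002, 36003],
    "has_child": [0, 1, 1, 0],
})

# Normalize FIPS codes to five-character, zero-padded strings
toy["county_fips"] = toy["county_fips"].astype(str).str.zfill(5)

# Filter to Harris County, then drop columns not needed downstream
toy_harris = toy[toy["county_fips"] == "48201"].drop(columns=["zip", "county_fips"])
print(toy_harris)
```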


We do the same for Charleston County, South Carolina.

In [10]:
# df_charleston = df[df['county_fips'] == '45019']

df_charleston = pd.read_parquet(DATA_DIR + '/source_files/infousa_files/charleston_households.parquet')
df_charleston = df_charleston.drop(['zip', 'county_fips', 'state', 'child_num'], axis = 1)
df_charleston

Unnamed: 0,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,0,H,32.853065,-79.677524,-8.869661e+06,3.875817e+06
1,1,A,32.817618,-79.557081,-8.856254e+06,3.871121e+06
2,1,L,32.904196,-79.640338,-8.865522e+06,3.882594e+06
3,0,A,32.653958,-79.778913,-8.880948e+06,3.849462e+06
4,0,I,32.838292,-80.168261,-8.924290e+06,3.873860e+06
...,...,...,...,...,...,...
99995,1,G,32.859898,-79.983927,-8.903770e+06,3.876722e+06
99996,1,I,32.976982,-80.108903,-8.917682e+06,3.892249e+06
99997,1,C,33.027838,-79.829499,-8.886579e+06,3.899000e+06
99998,0,M,32.805037,-79.584401,-8.859295e+06,3.869454e+06


### Reading in AST data
To calculate the shortest distance between each household and tank, we must also read in the processed AST file. This file was processed in processing notebook **02_processing_tanks**.

In [13]:
df_tanks = gpd.read_file(DATA_DIR + '/ast_master.shp')
df_tanks.head(n=3)

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-10151030.0,3568241.0,22033,POINT (-91.18830 30.50199)
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-10062820.0,3502289.0,22089,POINT (-90.39588 29.99019)
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9326761.0,4058617.0,13139,POINT (-83.78372 34.22175)


Since this dataframe contains information for tanks across the US, we filter for tanks in Harris County and Charleston County only, then drop all irrelevant columns. The tanks dataframes for Harris and Charleston County will be used at the end of our data processing.

In [14]:
df_tanks_harris = df_tanks[df_tanks['county'] == '48201']
df_tanks_charleston = df_tanks[df_tanks['county'] == '45019']
df_tanks = df_tanks.drop(['state', 'county'], axis = 1)

### Processing county data separately
Next, we will process each county's distances separately, as they will be saved in separate files for our visualizations. 

#### Harris County:

##### Finding the distance between each household and the nearest tank
The first step in finding the shortest distance between each household and a tank is converting the Harris households dataframe, ```df_harris```, into a GeoDataFrame. The code we run to find the distances relies on geometries, which are a property of GeoDataFrames. To do this, specify the name of the pandas dataframe to convert, then specify which columns to use for the ```POINT``` geometry. In this case, we use ```lon_h_4326``` and ```lat_h_4326```, which are the longitude and latitude coordinates of the household in EPSG 4326.

In [15]:
gdf_harris = gpd.GeoDataFrame(
    df_harris, geometry=gpd.points_from_xy(df_harris.lon_h_4326, df_harris.lat_h_4326))
gdf_harris = gdf_harris[['geometry']]
gdf_harris

Unnamed: 0,geometry
0,POINT (-95.16557 29.80862)
1,POINT (-95.05175 29.75999)
2,POINT (-95.00707 30.06184)
3,POINT (-95.22962 30.10173)
4,POINT (-95.61955 29.66621)
...,...
499995,POINT (-95.67537 29.90511)
499996,POINT (-95.75535 30.02850)
499997,POINT (-95.46621 29.91391)
499998,POINT (-95.75927 30.06906)


We then convert ```df_tanks``` to a GeoDataFrame. Here, we use ```df_tanks``` instead of ```df_tanks_harris``` because in edge cases, a household may be closest to a tank in another county. We will use ```df_tanks_harris``` later.

In [16]:
gdf_tanks = gpd.GeoDataFrame(
    df_tanks, geometry=gpd.points_from_xy(df_tanks.lon_t_4326, df_tanks.lat_t_4326))
gdf_tanks

Unnamed: 0,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,geometry
0,closed_roof_tank,4.8,30.501991,-91.188296,-1.015103e+07,3.568241e+06,POINT (-91.18830 30.50199)
1,closed_roof_tank,30.0,29.990189,-90.395876,-1.006282e+07,3.502289e+06,POINT (-90.39588 29.99019)
2,closed_roof_tank,20.4,34.221754,-83.783722,-9.326761e+06,4.058617e+06,POINT (-83.78372 34.22175)
3,narrow_closed_roof_tank,4.8,37.906023,-87.926250,-9.787905e+06,4.566158e+06,POINT (-87.92625 37.90602)
4,closed_roof_tank,16.2,35.045340,-106.648430,-1.187205e+07,4.170044e+06,POINT (-106.64843 35.04534)
...,...,...,...,...,...,...,...
977,closed_roof_tank,19.2,42.411899,-90.732966,-1.010035e+07,5.222881e+06,POINT (-90.73297 42.41190)
978,sedimentation_tank,24.0,42.862335,-106.293070,-1.183249e+07,5.291041e+06,POINT (-106.29307 42.86233)
979,closed_roof_tank,8.4,36.608666,-89.573830,-9.971313e+06,4.384699e+06,POINT (-89.57383 36.60867)
980,closed_roof_tank,43.8,41.831766,-71.371080,-7.944992e+06,5.135812e+06,POINT (-71.37108 41.83177)


To find the tanks nearest to each household, we use an algorithm developed by the University of Helsinki. This code is copyrighted and licensed under the Creative Commons Attribution-ShareAlike 4.0 International licence and is available to the public to share and adapt, as long as it is attributed correctly and re-shared if edits are made. The material can be found [here](https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html). From this algorithm, we removed the code that calculates the distance between the two points. The reasoning for this is explained in further detail below.

These functions use the sklearn neighbors module, specifically the ```BallTree``` class, a nearest-neighbor search structure from the scikit-learn machine learning library, to identify the closest tank to each household. ```nearest_neighbor``` returns a GeoDataFrame with the same number of rows as the inputted households GeoDataFrame, where each row corresponds to the row with the same index in the households GeoDataFrame. It also retains all the original columns of the inputted tanks GeoDataFrame.

In [17]:
from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric = 'euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)


def nearest_neighbor(left_gdf, right_gdf):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * (np.pi / 180), geom.y * (np.pi / 180))).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * (np.pi / 180), geom.y * (np.pi / 180))).to_list())

    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> Euclidean distance in (lon, lat) radian space, not meters; unused below

    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)
    
    return closest_points
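
The tree above is built with ```metric = 'euclidean'``` on longitude/latitude values in radians, which approximates great-circle proximity rather than measuring it. One way to check the result on a small sample is an exhaustive great-circle search. The sketch below is not part of the original algorithm. The function names and the Earth-radius constant are assumptions, and the coordinates are copied from the tables in this notebook:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are degrees and may be broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def brute_force_nearest(house_lat, house_lon, tank_lat, tank_lon):
    """For each household, return the index of and distance (km) to the nearest tank."""
    # Distance matrix of shape (n_households, n_tanks)
    d = haversine_km(house_lat[:, None], house_lon[:, None],
                     tank_lat[None, :], tank_lon[None, :])
    return d.argmin(axis=1), d.min(axis=1)


# One Harris and one Charleston household, and three tanks, from the tables above
house_lat = np.array([29.808625, 32.853065])
house_lon = np.array([-95.165572, -79.677524])
tank_lat = np.array([29.666204, 32.830928, 30.501991])
tank_lon = np.array([-95.200325, -79.944823, -91.188296])

nearest_idx, nearest_km = brute_force_nearest(house_lat, house_lon, tank_lat, tank_lon)
print(nearest_idx, nearest_km)
```

Because this builds a full households-by-tanks distance matrix, it is only practical on a sample of households, not on the full dataframe.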

Here, you can see the outputted dataframe has 500,000 rows, the same number as the inputted ```gdf_harris``` GeoDataFrame, and the same columns as the inputted ```gdf_tanks``` GeoDataFrame. The tank at index 0 in ```df_closest_tanks_harris``` is the tank nearest to the household at index 0 in ```df_harris``` (which is in the same order as ```gdf_harris```), and so on.

Note: Using the original InfoUSA dataset and AST dataset, the outputted dataframe should have around 2 million rows.

In [18]:
%%time
df_closest_tanks_harris = nearest_neighbor(gdf_harris, gdf_tanks)
df_closest_tanks_harris.head()

CPU times: user 29.8 s, sys: 154 ms, total: 29.9 s
Wall time: 30 s


Unnamed: 0,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,geometry
0,closed_roof_tank,29.4,29.666204,-95.200325,-1.059765e+07,3.460715e+06,POINT (-95.20033 29.66620)
1,narrow_closed_roof_tank,6.0,29.626542,-95.040781,-1.057989e+07,3.455635e+06,POINT (-95.04078 29.62654)
2,closed_roof_tank,9.6,29.859656,-94.908778,-1.056520e+07,3.485523e+06,POINT (-94.90878 29.85966)
3,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,POINT (-95.38185 29.93697)
4,external_floating_roof_tank,10.2,29.760722,-95.340470,-1.061325e+07,3.472830e+06,POINT (-95.34047 29.76072)
...,...,...,...,...,...,...,...
499995,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,POINT (-95.38185 29.93697)
499996,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,POINT (-95.38185 29.93697)
499997,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,POINT (-95.38185 29.93697)
499998,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,POINT (-95.38185 29.93697)


Therefore, merging ```df_closest_tanks_harris``` and ```df_harris``` on their indices creates a new dataframe, ```df_harris_dist```, in which each row holds the coordinates of a household alongside those of the tank nearest to it. This information is what we use to calculate distance.

In [19]:
df_closest_tanks_harris = df_closest_tanks_harris.reset_index(drop = True)
df_harris = df_harris.reset_index(drop = True)

In [20]:
df_harris_dist = df_harris.merge(df_closest_tanks_harris, left_index=True, right_index = True)
df_harris_dist = df_harris_dist.drop(['geometry_x', 'geometry_y'], axis = 1)
df_harris_dist.head()

Unnamed: 0,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857
0,0,H,29.808625,-95.165572,-10593780.0,3478974.0,closed_roof_tank,29.4,29.666204,-95.200325,-10597650.0,3460715.0
1,1,A,29.759987,-95.051747,-10581110.0,3472736.0,narrow_closed_roof_tank,6.0,29.626542,-95.040781,-10579890.0,3455635.0
2,0,J,30.061842,-95.007067,-10576140.0,3511502.0,closed_roof_tank,9.6,29.859656,-94.908778,-10565200.0,3485523.0
3,0,G,30.101734,-95.229624,-10600910.0,3516634.0,closed_roof_tank,23.4,29.936972,-95.381851,-10617860.0,3495451.0
4,1,I,29.666211,-95.619547,-10644320.0,3460716.0,external_floating_roof_tank,10.2,29.760722,-95.34047,-10613250.0,3472830.0
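
Merging on the index only pairs the right rows if both dataframes share the same index labels, which is why both are reset first. Below is a minimal sketch on made-up toy data showing the pairing, and how mismatched labels would silently drop rows under the default inner join:

```python
import pandas as pd

# Toy frames; the households frame deliberately carries non-default index labels
households = pd.DataFrame({"lat_h_4326": [29.81, 29.76, 30.06]}, index=[10, 11, 12])
nearest_tanks = pd.DataFrame({"lat_t_4326": [29.67, 29.63, 29.86]})  # index 0, 1, 2

# With mismatched labels, the default inner join finds no common index values
mismatched = households.merge(nearest_tanks, left_index=True, right_index=True)

# After resetting, row i of one frame lines up with row i of the other
aligned = households.reset_index(drop=True).merge(
    nearest_tanks, left_index=True, right_index=True
)
print(len(mismatched), len(aligned))
```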


To compute the distance between the two sets of coordinates (those of the household and those of the nearest tank), we use the haversine library. This library calculates the great-circle distance between two latitude/longitude coordinate pairs (EPSG 4326), in kilometers by default. We multiply the value by 1,000 to get the distance in meters.
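
For reference, the haversine formula gives the great-circle distance $d$ between two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (in radians) on a sphere of radius $r$. The symbols are introduced here for illustration and do not appear elsewhere in this notebook:

$$
d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)
$$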

In [21]:
%%time

def distancer(row):
    coords_1 = (row['lat_h_4326'], row['lon_h_4326'])
    coords_2 = (row['lat_t_4326'], row['lon_t_4326'])
    return (hs.haversine(coords_1, coords_2) * 1000)

df_harris_dist['distance_m'] = df_harris_dist.apply(distancer, axis=1)
df_harris_dist

CPU times: user 7.43 s, sys: 101 ms, total: 7.53 s
Wall time: 7.55 s


Unnamed: 0,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,distance_m
0,0,H,29.808625,-95.165572,-1.059378e+07,3.478974e+06,closed_roof_tank,29.4,29.666204,-95.200325,-1.059765e+07,3.460715e+06,16188.088325
1,1,A,29.759987,-95.051747,-1.058111e+07,3.472736e+06,narrow_closed_roof_tank,6.0,29.626542,-95.040781,-1.057989e+07,3.455635e+06,14876.225412
2,0,J,30.061842,-95.007067,-1.057614e+07,3.511502e+06,closed_roof_tank,9.6,29.859656,-94.908778,-1.056520e+07,3.485523e+06,24394.718559
3,0,G,30.101734,-95.229624,-1.060091e+07,3.516634e+06,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,23461.770356
4,1,I,29.666211,-95.619547,-1.064432e+07,3.460716e+06,external_floating_roof_tank,10.2,29.760722,-95.340470,-1.061325e+07,3.472830e+06,28928.203189
...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,1,E,29.905114,-95.675370,-1.065053e+07,3.491359e+06,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,28508.660523
499996,1,M,30.028496,-95.755345,-1.065944e+07,3.507213e+06,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,37384.787174
499997,1,A,29.913911,-95.466206,-1.062725e+07,3.492489e+06,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,8524.111836
499998,1,F,30.069060,-95.759268,-1.065987e+07,3.512430e+06,closed_roof_tank,23.4,29.936972,-95.381851,-1.061786e+07,3.495451e+06,39198.942987


Next, we drop the household latitude and longitude coordinates in EPSG 4326, which are not used in the GPU visualizations this data is processed for. We also drop the coordinates of the nearest tanks, because this dataframe is used for plotting households. Then, we calculate the distance in miles, as requested by our researcher.

In [22]:
df_harris_dist = df_harris_dist.drop(['lat_h_4326', 'lon_h_4326', 'lat_t_4326', 'lon_t_4326', 'lat_t_3857', 'lon_t_3857'], axis = 1)

In [23]:
df_harris_dist['distance_mi']  = df_harris_dist['distance_m'] / 1609.344
df_harris_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi
0,0,H,-1.059378e+07,3.478974e+06,closed_roof_tank,29.4,16188.088325,10.058812
1,1,A,-1.058111e+07,3.472736e+06,narrow_closed_roof_tank,6.0,14876.225412,9.243658
2,0,J,-1.057614e+07,3.511502e+06,closed_roof_tank,9.6,24394.718559,15.158175
3,0,G,-1.060091e+07,3.516634e+06,closed_roof_tank,23.4,23461.770356,14.578468
4,1,I,-1.064432e+07,3.460716e+06,external_floating_roof_tank,10.2,28928.203189,17.975152
...,...,...,...,...,...,...,...,...
499995,1,E,-1.065053e+07,3.491359e+06,closed_roof_tank,23.4,28508.660523,17.714460
499996,1,M,-1.065944e+07,3.507213e+06,closed_roof_tank,23.4,37384.787174,23.229830
499997,1,A,-1.062725e+07,3.492489e+06,closed_roof_tank,23.4,8524.111836,5.296638
499998,1,F,-1.065987e+07,3.512430e+06,closed_roof_tank,23.4,39198.942987,24.357094


Then, we categorize each household by its distance from the nearest tank. These boundaries were set by our researcher. Using the numpy library's ```.select()``` function, we can assign a different value to each category. Households within 0.5 miles of a tank are marked as ```1```, households between 0.5 miles and one mile are marked as ```2```, and households between one and five miles from a tank are marked as ```3```. All other households are marked as ```4```.

In [24]:
conditions_harris = [(df_harris_dist['distance_mi'] <= 0.5),
              ((df_harris_dist['distance_mi'] > 0.5) & (df_harris_dist['distance_mi'] <= 1)),
              ((df_harris_dist['distance_mi'] > 1) & (df_harris_dist['distance_mi'] <= 5)),
              (df_harris_dist['distance_mi'] > 5)]

values_harris = [1, 2, 3, 4]

df_harris_dist['distance_category'] = np.select(conditions_harris, values_harris)
df_harris_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi,distance_category
0,0,H,-1.059378e+07,3.478974e+06,closed_roof_tank,29.4,16188.088325,10.058812,4
1,1,A,-1.058111e+07,3.472736e+06,narrow_closed_roof_tank,6.0,14876.225412,9.243658,4
2,0,J,-1.057614e+07,3.511502e+06,closed_roof_tank,9.6,24394.718559,15.158175,4
3,0,G,-1.060091e+07,3.516634e+06,closed_roof_tank,23.4,23461.770356,14.578468,4
4,1,I,-1.064432e+07,3.460716e+06,external_floating_roof_tank,10.2,28928.203189,17.975152,4
...,...,...,...,...,...,...,...,...,...
499995,1,E,-1.065053e+07,3.491359e+06,closed_roof_tank,23.4,28508.660523,17.714460,4
499996,1,M,-1.065944e+07,3.507213e+06,closed_roof_tank,23.4,37384.787174,23.229830,4
499997,1,A,-1.062725e+07,3.492489e+06,closed_roof_tank,23.4,8524.111836,5.296638,4
499998,1,F,-1.065987e+07,3.512430e+06,closed_roof_tank,23.4,39198.942987,24.357094,4
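
One detail worth knowing: ```np.select``` assigns its ```default``` value (```0``` unless specified) to any row that matches none of the conditions, such as a missing distance, and ```0``` is the code used for tanks later in this notebook. Below is a minimal sketch on made-up distances, assuming ```-1``` as a hypothetical marker for unmatched rows:

```python
import numpy as np

# Made-up distances in miles, including boundary values and one missing value
distance_mi = np.array([0.25, 0.5, 0.75, 1.0, 3.0, 5.0, 12.0, np.nan])

conditions = [
    distance_mi <= 0.5,
    (distance_mi > 0.5) & (distance_mi <= 1),
    (distance_mi > 1) & (distance_mi <= 5),
    distance_mi > 5,
]
values = [1, 2, 3, 4]

# Comparisons with NaN are all False, so the NaN row falls through to `default`
categories = np.select(conditions, values, default=-1)
print(categories)
```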


##### Processing the data for GPU visualizations
Next, we process this data specifically for GPU-accelerated visualization with the Cuxfilter library.

The Datashader plotting library, which Cuxfilter uses to render our visualization on Graphics Processing Units (GPUs), is optimized for working with large dataframes. This comes with a couple of constraints, however. One is that Datashader only takes numerical inputs when creating the custom charts the user can interact with, like the multiselect chart or the range slider. This means that instead of labelling each household with a string such as ```'Elderly'``` or ```'No elderly'``` to indicate whether its head of household is elderly, we must label it numerically. Therefore, we convert each age code to a number that indicates whether or not that household has an elderly head of household.

This is done with the numpy library's ```.where()``` function, which uses if-else conditions to assign values in a new column. In the code below, if the age_code is ```J```, ```K```, ```L``` or ```M```, the household is marked as ```1```, meaning elderly (this is based on the InfoUSA data dictionary), and marked as ```2```, not elderly, for all other values. 

In [27]:
df_harris_dist['is_elderly'] = np.where(((df_harris_dist['age_code'] == 'J') | (df_harris_dist['age_code'] == 'K') |
                                       (df_harris_dist['age_code'] == 'L') | (df_harris_dist['age_code'] == 'M')), 1, 2)
df_harris_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi,distance_category,is_elderly
0,0,H,-1.059378e+07,3.478974e+06,closed_roof_tank,29.4,16188.088325,10.058812,4,2
1,1,A,-1.058111e+07,3.472736e+06,narrow_closed_roof_tank,6.0,14876.225412,9.243658,4,2
2,0,J,-1.057614e+07,3.511502e+06,closed_roof_tank,9.6,24394.718559,15.158175,4,1
3,0,G,-1.060091e+07,3.516634e+06,closed_roof_tank,23.4,23461.770356,14.578468,4,2
4,1,I,-1.064432e+07,3.460716e+06,external_floating_roof_tank,10.2,28928.203189,17.975152,4,2
...,...,...,...,...,...,...,...,...,...,...
499995,1,E,-1.065053e+07,3.491359e+06,closed_roof_tank,23.4,28508.660523,17.714460,4,2
499996,1,M,-1.065944e+07,3.507213e+06,closed_roof_tank,23.4,37384.787174,23.229830,4,1
499997,1,A,-1.062725e+07,3.492489e+06,closed_roof_tank,23.4,8524.111836,5.296638,4,2
499998,1,F,-1.065987e+07,3.512430e+06,closed_roof_tank,23.4,39198.942987,24.357094,4,2
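
The four chained equality checks can also be written with pandas' ```Series.isin```. Below is a minimal equivalent sketch on made-up toy data. The ```ELDERLY_CODES``` name is introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Age codes treated as elderly in the cell above
ELDERLY_CODES = ["J", "K", "L", "M"]

# Made-up toy data, including one missing age code
toy = pd.DataFrame({"age_code": ["H", "A", "J", "M", None]})

# 1 = elderly head of household, 2 = not elderly (missing codes also fall into 2)
toy["is_elderly"] = np.where(toy["age_code"].isin(ELDERLY_CODES), 1, 2)
print(toy)
```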


To stay consistent with the structure above, even though the ```has_child``` column is already numerical, we change its values so that ```1``` indicates that the household has children, ```2``` indicates that it has no children, and ```0``` indicates that the point is a tank. Previously, ```0``` indicated no children and ```1``` indicated children. In all our categorical variable columns, ```0``` indicates that the point is a tank.

In [28]:
df_harris_dist['has_child'] = np.where(df_harris_dist['has_child'] == 1, 1, 2)
df_harris_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi,distance_category,is_elderly
0,2,H,-1.059378e+07,3.478974e+06,closed_roof_tank,29.4,16188.088325,10.058812,4,2
1,1,A,-1.058111e+07,3.472736e+06,narrow_closed_roof_tank,6.0,14876.225412,9.243658,4,2
2,2,J,-1.057614e+07,3.511502e+06,closed_roof_tank,9.6,24394.718559,15.158175,4,1
3,2,G,-1.060091e+07,3.516634e+06,closed_roof_tank,23.4,23461.770356,14.578468,4,2
4,1,I,-1.064432e+07,3.460716e+06,external_floating_roof_tank,10.2,28928.203189,17.975152,4,2
...,...,...,...,...,...,...,...,...,...,...
499995,1,E,-1.065053e+07,3.491359e+06,closed_roof_tank,23.4,28508.660523,17.714460,4,2
499996,1,M,-1.065944e+07,3.507213e+06,closed_roof_tank,23.4,37384.787174,23.229830,4,1
499997,1,A,-1.062725e+07,3.492489e+06,closed_roof_tank,23.4,8524.111836,5.296638,4,2
499998,1,F,-1.065987e+07,3.512430e+06,closed_roof_tank,23.4,39198.942987,24.357094,4,2


In addition, the Cuxfilter library only pulls coordinates from two columns: one latitude column and one longitude column. This means that all the points displayed in the dashboard must share the same pair of columns. Therefore, to plot tanks and households on the same dashboard, we append the dataframe with the coordinates for each tank to the dataframe with the coordinates for each household. To do so, the column names must be the same across both dataframes. Therefore, we rename the ```lat_h_3857``` and ```lon_h_3857``` columns in the ```df_harris_dist``` dataframe to ```lat_3857``` and ```lon_3857```. When the ```df_tanks_harris``` dataframe is appended to this one, we will have general latitude and longitude columns including coordinate information for all the households and tanks in Harris County.

In [29]:
df_harris_dist.rename(columns = {'lat_h_3857': 'lat_3857', 'lon_h_3857': 'lon_3857'}, inplace = True)

For tanks to remain visible on the Cuxfilter dashboard when users adjust the distance range slider, each tank needs a value in the distance column. In the final merged dataframe, that column holds the distance between a household and the tank nearest to it, so tanks have no associated distance and would otherwise disappear when the slider is moved. We get around this by setting the tanks' distance to a value just above the maximum distance between a household and its nearest tank. This is a limited solution; it could potentially be improved by calculating the distance from each tank to the nearest household and using those values.

We add the ```has_child```, ```distance_category``` and ```is_elderly``` columns to the ```df_tanks_harris``` dataframe, setting all their values to ```0``` to indicate that the point is a tank when plotted on the dashboard.

In [30]:
df_harris_dist['distance_mi'].max()

34.365988346819115
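
The value ```35``` assigned to the tanks in the next cell sits just above this maximum. Rather than hard-coding it, the value can be derived from the data. Below is a minimal sketch, assuming that rounding up to the next whole mile is the intent. The variable names are hypothetical:

```python
import math

# Value of df_harris_dist['distance_mi'].max() shown above
max_household_distance_mi = 34.365988346819115

# Round up so every tank sits at or beyond the top of the household distance range
tank_sentinel_distance_mi = math.ceil(max_household_distance_mi)
print(tank_sentinel_distance_mi)
```

Deriving the value this way would also carry over to Charleston County, where the maximum household distance may differ.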

In [31]:
df_tanks_harris = df_tanks_harris.drop(['state', 'county', 'lat_t_4326', 'lon_t_4326', 'geometry'], axis = 1)
df_tanks_harris['has_child'] = 0
df_tanks_harris['distance_category'] = 0
df_tanks_harris['is_elderly'] = 0
df_tanks_harris['distance_mi'] = 35
df_tanks_harris.rename(columns = {'lat_t_3857': 'lat_3857', 'lon_t_3857': 'lon_3857'}, inplace = True)
df_tanks_harris

Unnamed: 0,tank_type,diameter,lat_3857,lon_3857,has_child,distance_category,is_elderly,distance_mi
59,closed_roof_tank,23.4,-10617860.0,3495451.0,0,0,0,35
195,closed_roof_tank,17.4,-10577760.0,3453122.0,0,0,0,35
214,external_floating_roof_tank,10.2,-10613250.0,3472830.0,0,0,0,35
650,narrow_closed_roof_tank,5.4,-10613740.0,3477134.0,0,0,0,35
699,closed_roof_tank,10.2,-10577710.0,3453203.0,0,0,0,35
765,narrow_closed_roof_tank,6.0,-10579890.0,3455635.0,0,0,0,35
831,closed_roof_tank,29.4,-10597650.0,3460715.0,0,0,0,35
876,closed_roof_tank,25.2,-10566680.0,3480927.0,0,0,0,35


In [32]:
df_harris_merged = pd.concat([df_harris_dist, df_tanks_harris], ignore_index = True)
df_harris_merged


Unnamed: 0,has_child,age_code,lat_3857,lon_3857,tank_type,diameter,distance_m,distance_mi,distance_category,is_elderly
0,2,H,-1.059378e+07,3.478974e+06,closed_roof_tank,29.4,16188.088325,10.058812,4,2
1,1,A,-1.058111e+07,3.472736e+06,narrow_closed_roof_tank,6.0,14876.225412,9.243658,4,2
2,2,J,-1.057614e+07,3.511502e+06,closed_roof_tank,9.6,24394.718559,15.158175,4,1
3,2,G,-1.060091e+07,3.516634e+06,closed_roof_tank,23.4,23461.770356,14.578468,4,2
4,1,I,-1.064432e+07,3.460716e+06,external_floating_roof_tank,10.2,28928.203189,17.975152,4,2
...,...,...,...,...,...,...,...,...,...,...
500003,0,,-1.061374e+07,3.477134e+06,narrow_closed_roof_tank,5.4,,35.000000,0,0
500004,0,,-1.057771e+07,3.453203e+06,closed_roof_tank,10.2,,35.000000,0,0
500005,0,,-1.057989e+07,3.455635e+06,narrow_closed_roof_tank,6.0,,35.000000,0,0
500006,0,,-1.059765e+07,3.460715e+06,closed_roof_tank,29.4,,35.000000,0,0


Finally, we save this as a parquet file so we can use it in our visualizations.

In [34]:
df_harris_merged.to_parquet(DATA_DIR + '/harris_dist.parquet')

#### Charleston County
The same process from above is repeated for Charleston County.

##### Finding the distance between each household and the nearest tank

In [36]:
gdf_charleston = gpd.GeoDataFrame(
    df_charleston, geometry=gpd.points_from_xy(df_charleston.lon_h_4326, df_charleston.lat_h_4326))
gdf_charleston = gdf_charleston[['geometry']]
gdf_charleston

Unnamed: 0,geometry
0,POINT (-79.67752 32.85307)
1,POINT (-79.55708 32.81762)
2,POINT (-79.64034 32.90420)
3,POINT (-79.77891 32.65396)
4,POINT (-80.16826 32.83829)
...,...
99995,POINT (-79.98393 32.85990)
99996,POINT (-80.10890 32.97698)
99997,POINT (-79.82950 33.02784)
99998,POINT (-79.58440 32.80504)


In [37]:
gdf_tanks = gpd.GeoDataFrame(
    df_tanks, geometry=gpd.points_from_xy(df_tanks.lon_t_4326, df_tanks.lat_t_4326))
gdf_tanks

Unnamed: 0,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,geometry
0,closed_roof_tank,4.8,30.501991,-91.188296,-1.015103e+07,3.568241e+06,POINT (-91.18830 30.50199)
1,closed_roof_tank,30.0,29.990189,-90.395876,-1.006282e+07,3.502289e+06,POINT (-90.39588 29.99019)
2,closed_roof_tank,20.4,34.221754,-83.783722,-9.326761e+06,4.058617e+06,POINT (-83.78372 34.22175)
3,narrow_closed_roof_tank,4.8,37.906023,-87.926250,-9.787905e+06,4.566158e+06,POINT (-87.92625 37.90602)
4,closed_roof_tank,16.2,35.045340,-106.648430,-1.187205e+07,4.170044e+06,POINT (-106.64843 35.04534)
...,...,...,...,...,...,...,...
977,closed_roof_tank,19.2,42.411899,-90.732966,-1.010035e+07,5.222881e+06,POINT (-90.73297 42.41190)
978,sedimentation_tank,24.0,42.862335,-106.293070,-1.183249e+07,5.291041e+06,POINT (-106.29307 42.86233)
979,closed_roof_tank,8.4,36.608666,-89.573830,-9.971313e+06,4.384699e+06,POINT (-89.57383 36.60867)
980,closed_roof_tank,43.8,41.831766,-71.371080,-7.944992e+06,5.135812e+06,POINT (-71.37108 41.83177)


In [38]:
%%time
df_closest_tanks_charleston = nearest_neighbor(gdf_charleston, gdf_tanks)
df_closest_tanks_charleston

CPU times: user 6.25 s, sys: 33.7 ms, total: 6.28 s
Wall time: 6.3 s


Unnamed: 0,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,geometry
0,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
1,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
2,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
3,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
4,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
...,...,...,...,...,...,...,...
99995,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
99996,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
99997,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)
99998,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,POINT (-79.94482 32.83093)


In [39]:
df_closest_tanks_charleston = df_closest_tanks_charleston.reset_index(drop = True)
df_charleston = df_charleston.reset_index(drop = True)

In [40]:
df_charleston_dist = df_charleston.merge(df_closest_tanks_charleston, left_index=True, right_index = True)
df_charleston_dist = df_charleston_dist.drop(['geometry_x', 'geometry_y'], axis = 1)
df_charleston_dist

Unnamed: 0,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857
0,0,H,32.853065,-79.677524,-8.869661e+06,3.875817e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
1,1,A,32.817618,-79.557081,-8.856254e+06,3.871121e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
2,1,L,32.904196,-79.640338,-8.865522e+06,3.882594e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
3,0,A,32.653958,-79.778913,-8.880948e+06,3.849462e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
4,0,I,32.838292,-80.168261,-8.924290e+06,3.873860e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1,G,32.859898,-79.983927,-8.903770e+06,3.876722e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
99996,1,I,32.976982,-80.108903,-8.917682e+06,3.892249e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
99997,1,C,33.027838,-79.829499,-8.886579e+06,3.899000e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
99998,0,M,32.805037,-79.584401,-8.859295e+06,3.869454e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06
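The index-based merge above pairs row *i* of the household frame with row *i* of the closest-tank frame, so it only works when both frames share the same default index. The sketch below illustrates this with small toy frames; the frame names, values, and the `validate` check are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): households carry a non-default index,
# as can happen after filtering; the closest-tank frame has a default index.
households = pd.DataFrame(
    {"has_child": [0, 1, 1], "geometry": ["h0", "h1", "h2"]},
    index=[10, 11, 12],
)
closest = pd.DataFrame(
    {"tank_type": ["spherical_tank"] * 3, "geometry": ["t0", "t0", "t0"]}
)

# Without aligned indexes, the default inner merge finds no matching keys.
unaligned = households.merge(closest, left_index=True, right_index=True)

# Resetting both indexes makes row i of one frame line up with row i of the other.
households = households.reset_index(drop=True)
closest = closest.reset_index(drop=True)

# validate="one_to_one" raises if either index contains duplicate keys.
merged = households.merge(
    closest, left_index=True, right_index=True, validate="one_to_one"
)

# The shared column name gets the suffixes _x (left) and _y (right).
merged = merged.drop(["geometry_x", "geometry_y"], axis=1)
```

Because both frames have a `geometry` column, the merge produces `geometry_x` and `geometry_y`, which is why those two names are dropped in the cell above.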


In [41]:
%%time

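def distancer(row):
    # Great-circle (haversine) distance between a household and its closest tank.
    # hs.haversine takes (lat, lon) pairs in degrees and returns kilometers by default,
    # so multiply by 1000 to get meters.
    coords_1 = (row['lat_h_4326'], row['lon_h_4326'])
    coords_2 = (row['lat_t_4326'], row['lon_t_4326'])
    return hs.haversine(coords_1, coords_2) * 1000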

df_charleston_dist['distance_m'] = df_charleston_dist.apply(distancer, axis=1)
df_charleston_dist

CPU times: user 1.48 s, sys: 24.6 ms, total: 1.51 s
Wall time: 1.51 s


Unnamed: 0,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,distance_m
0,0,H,32.853065,-79.677524,-8.869661e+06,3.875817e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,25092.832386
1,1,A,32.817618,-79.557081,-8.856254e+06,3.871121e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,36261.311576
2,1,L,32.904196,-79.640338,-8.865522e+06,3.882594e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,29581.640586
3,0,A,32.653958,-79.778913,-8.880948e+06,3.849462e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,25060.142859
4,0,I,32.838292,-80.168261,-8.924290e+06,3.873860e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,20891.898394
...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1,G,32.859898,-79.983927,-8.903770e+06,3.876722e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,4870.441508
99996,1,I,32.976982,-80.108903,-8.917682e+06,3.892249e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,22324.804878
99997,1,C,33.027838,-79.829499,-8.886579e+06,3.899000e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,24397.885235
99998,0,M,32.805037,-79.584401,-8.859295e+06,3.869454e+06,spherical_tank,14.4,32.830928,-79.944823,-8.899417e+06,3.872884e+06,33803.525512
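Applying `distancer` row by row calls a Python function once per household. As a possible alternative, the haversine formula can be evaluated on whole columns at once with NumPy. The sketch below is a minimal version under stated assumptions: `haversine_m` and `EARTH_RADIUS_M` are hypothetical names introduced here, and the radius is an assumed mean Earth radius, so results may differ very slightly from the `haversine` package.

```python
import numpy as np

# Assumed mean Earth radius in meters (not taken from the notebook).
EARTH_RADIUS_M = 6371008.8

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; inputs are degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# First two household rows from the table above, paired with the single Charleston tank.
lat_h = np.array([32.853065, 32.817618])
lon_h = np.array([-79.677524, -79.557081])
dist_m = haversine_m(lat_h, lon_h, 32.830928, -79.944823)
```

On the full frame, the same function could be called with the four coordinate columns (`lat_h_4326`, `lon_h_4326`, `lat_t_4326`, `lon_t_4326`) as arguments, and the result should agree closely with the `distance_m` column computed above.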


In [42]:
df_charleston_dist = df_charleston_dist.drop(['lat_h_4326', 'lon_h_4326', 'lat_t_4326', 'lon_t_4326', 'lat_t_3857', 'lon_t_3857'], axis = 1)

In [43]:
# Convert meters to miles (1 mi = 1609.344 m)
df_charleston_dist['distance_mi'] = df_charleston_dist['distance_m'] / 1609.344
df_charleston_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi
0,0,H,-8.869661e+06,3.875817e+06,spherical_tank,14.4,25092.832386,15.591963
1,1,A,-8.856254e+06,3.871121e+06,spherical_tank,14.4,36261.311576,22.531734
2,1,L,-8.865522e+06,3.882594e+06,spherical_tank,14.4,29581.640586,18.381179
3,0,A,-8.880948e+06,3.849462e+06,spherical_tank,14.4,25060.142859,15.571651
4,0,I,-8.924290e+06,3.873860e+06,spherical_tank,14.4,20891.898394,12.981624
...,...,...,...,...,...,...,...,...
99995,1,G,-8.903770e+06,3.876722e+06,spherical_tank,14.4,4870.441508,3.026352
99996,1,I,-8.917682e+06,3.892249e+06,spherical_tank,14.4,22324.804878,13.871991
99997,1,C,-8.886579e+06,3.899000e+06,spherical_tank,14.4,24397.885235,15.160143
99998,0,M,-8.859295e+06,3.869454e+06,spherical_tank,14.4,33803.525512,21.004537


In [44]:
# Distance bands in miles: 1 = within 0.5 mi, 2 = over 0.5 up to 1 mi,
# 3 = over 1 up to 5 mi, 4 = more than 5 mi
conditions_charleston = [(df_charleston_dist['distance_mi'] <= 0.5),
                         ((df_charleston_dist['distance_mi'] > 0.5) & (df_charleston_dist['distance_mi'] <= 1)),
                         ((df_charleston_dist['distance_mi'] > 1) & (df_charleston_dist['distance_mi'] <= 5)),
                         (df_charleston_dist['distance_mi'] > 5)]

values_charleston = [1, 2, 3, 4]

df_charleston_dist['distance_category'] = np.select(conditions_charleston, values_charleston)
df_charleston_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi,distance_category
0,0,H,-8.869661e+06,3.875817e+06,spherical_tank,14.4,25092.832386,15.591963,4
1,1,A,-8.856254e+06,3.871121e+06,spherical_tank,14.4,36261.311576,22.531734,4
2,1,L,-8.865522e+06,3.882594e+06,spherical_tank,14.4,29581.640586,18.381179,4
3,0,A,-8.880948e+06,3.849462e+06,spherical_tank,14.4,25060.142859,15.571651,4
4,0,I,-8.924290e+06,3.873860e+06,spherical_tank,14.4,20891.898394,12.981624,4
...,...,...,...,...,...,...,...,...,...
99995,1,G,-8.903770e+06,3.876722e+06,spherical_tank,14.4,4870.441508,3.026352,3
99996,1,I,-8.917682e+06,3.892249e+06,spherical_tank,14.4,22324.804878,13.871991,4
99997,1,C,-8.886579e+06,3.899000e+06,spherical_tank,14.4,24397.885235,15.160143,4
99998,0,M,-8.859295e+06,3.869454e+06,spherical_tank,14.4,33803.525512,21.004537,4
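The same four distance bands can also be expressed with `pd.cut`, which takes the bin edges directly instead of one boolean condition per band. This is a sketch on a small hypothetical series, not the notebook's own method; the sample values are chosen to include the band edges.

```python
import numpy as np
import pandas as pd

# Hypothetical sample distances in miles, including the band edges 0.5, 1, and 5.
distance_mi = pd.Series([0.3, 0.5, 0.75, 1.0, 3.026352, 5.0, 15.591963])

# Right-closed bins: [0, 0.5], (0.5, 1], (1, 5], (5, inf)
bins = [0, 0.5, 1, 5, np.inf]
labels = [1, 2, 3, 4]

distance_category = pd.cut(
    distance_mi, bins=bins, labels=labels, include_lowest=True
).astype(int)
```

One difference to keep in mind: `np.select` assigns its default of 0 to rows that match no condition (for example, missing distances), whereas `pd.cut` leaves them as `NaN`, and the `astype(int)` step would then raise an error.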


##### Processing the data for GPU visualizations

In [46]:
# Flag households whose age_code is J, K, L, or M as elderly (1); all others get 2
df_charleston_dist['is_elderly'] = np.where(df_charleston_dist['age_code'].isin(['J', 'K', 'L', 'M']), 1, 2)
df_charleston_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi,distance_category,is_elderly
0,0,H,-8.869661e+06,3.875817e+06,spherical_tank,14.4,25092.832386,15.591963,4,2
1,1,A,-8.856254e+06,3.871121e+06,spherical_tank,14.4,36261.311576,22.531734,4,2
2,1,L,-8.865522e+06,3.882594e+06,spherical_tank,14.4,29581.640586,18.381179,4,1
3,0,A,-8.880948e+06,3.849462e+06,spherical_tank,14.4,25060.142859,15.571651,4,2
4,0,I,-8.924290e+06,3.873860e+06,spherical_tank,14.4,20891.898394,12.981624,4,2
...,...,...,...,...,...,...,...,...,...,...
99995,1,G,-8.903770e+06,3.876722e+06,spherical_tank,14.4,4870.441508,3.026352,3,2
99996,1,I,-8.917682e+06,3.892249e+06,spherical_tank,14.4,22324.804878,13.871991,4,2
99997,1,C,-8.886579e+06,3.899000e+06,spherical_tank,14.4,24397.885235,15.160143,4,2
99998,0,M,-8.859295e+06,3.869454e+06,spherical_tank,14.4,33803.525512,21.004537,4,1


In [47]:
# Recode has_child to the same scheme as is_elderly: 1 = has a child, 2 = no child
df_charleston_dist['has_child'] = np.where(df_charleston_dist['has_child'] == 1, 1, 2)
df_charleston_dist

Unnamed: 0,has_child,age_code,lat_h_3857,lon_h_3857,tank_type,diameter,distance_m,distance_mi,distance_category,is_elderly
0,2,H,-8.869661e+06,3.875817e+06,spherical_tank,14.4,25092.832386,15.591963,4,2
1,1,A,-8.856254e+06,3.871121e+06,spherical_tank,14.4,36261.311576,22.531734,4,2
2,1,L,-8.865522e+06,3.882594e+06,spherical_tank,14.4,29581.640586,18.381179,4,1
3,2,A,-8.880948e+06,3.849462e+06,spherical_tank,14.4,25060.142859,15.571651,4,2
4,2,I,-8.924290e+06,3.873860e+06,spherical_tank,14.4,20891.898394,12.981624,4,2
...,...,...,...,...,...,...,...,...,...,...
99995,1,G,-8.903770e+06,3.876722e+06,spherical_tank,14.4,4870.441508,3.026352,3,2
99996,1,I,-8.917682e+06,3.892249e+06,spherical_tank,14.4,22324.804878,13.871991,4,2
99997,1,C,-8.886579e+06,3.899000e+06,spherical_tank,14.4,24397.885235,15.160143,4,2
99998,2,M,-8.859295e+06,3.869454e+06,spherical_tank,14.4,33803.525512,21.004537,4,1


In [48]:
df_charleston_dist.rename(columns = {'lat_h_3857': 'lat_3857', 'lon_h_3857': 'lon_3857'}, inplace = True)

In [49]:
df_charleston_dist['distance_mi'].max()

32.90482833288073

In [50]:
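df_tanks_charleston = df_tanks_charleston.drop(['state', 'county', 'lat_t_4326', 'lon_t_4326', 'geometry'], axis = 1)
# Placeholder values for tank rows: 0 in the categorical columns (households use 1 and 2)
df_tanks_charleston['has_child'] = 0
df_tanks_charleston['distance_category'] = 0
df_tanks_charleston['is_elderly'] = 0
# 35 sits just above the largest household distance (about 32.9 mi)
df_tanks_charleston['distance_mi'] = 35
# Match the household column names so the two frames can be stacked
df_tanks_charleston.rename(columns = {'lat_t_3857': 'lat_3857', 'lon_t_3857': 'lon_3857'}, inplace = True)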
df_tanks_charleston

Unnamed: 0,tank_type,diameter,lat_3857,lon_3857,has_child,distance_category,is_elderly,distance_mi
962,spherical_tank,14.4,-8899417.0,3872884.0,0,0,0,35


In [51]:
# DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat is the replacement
df_charleston_merged = pd.concat([df_charleston_dist, df_tanks_charleston], ignore_index = True)
df_charleston_merged


Unnamed: 0,has_child,age_code,lat_3857,lon_3857,tank_type,diameter,distance_m,distance_mi,distance_category,is_elderly
0,2,H,-8.869661e+06,3.875817e+06,spherical_tank,14.4,25092.832386,15.591963,4,2
1,1,A,-8.856254e+06,3.871121e+06,spherical_tank,14.4,36261.311576,22.531734,4,2
2,1,L,-8.865522e+06,3.882594e+06,spherical_tank,14.4,29581.640586,18.381179,4,1
3,2,A,-8.880948e+06,3.849462e+06,spherical_tank,14.4,25060.142859,15.571651,4,2
4,2,I,-8.924290e+06,3.873860e+06,spherical_tank,14.4,20891.898394,12.981624,4,2
...,...,...,...,...,...,...,...,...,...,...
99996,1,I,-8.917682e+06,3.892249e+06,spherical_tank,14.4,22324.804878,13.871991,4,2
99997,1,C,-8.886579e+06,3.899000e+06,spherical_tank,14.4,24397.885235,15.160143,4,2
99998,2,M,-8.859295e+06,3.869454e+06,spherical_tank,14.4,33803.525512,21.004537,4,1
99999,2,J,-8.921631e+06,3.886782e+06,spherical_tank,14.4,21980.681980,13.658163,4,1
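Stacking rows with `pd.concat` (the replacement for `DataFrame.append`, which was removed in pandas 2.0) keeps the union of the columns from both frames. Columns that exist in only one frame, such as `age_code` for households, are filled with `NaN` in the rows from the other frame. The toy frames below are hypothetical and only illustrate that behavior.

```python
import pandas as pd

# Toy household rows (values borrowed from the first two rows shown above).
households = pd.DataFrame({
    "has_child": [2, 1],
    "age_code": ["H", "A"],
    "distance_mi": [15.591963, 22.531734],
    "distance_category": [4, 4],
})

# Toy tank row: no age_code column, placeholder values elsewhere.
tanks = pd.DataFrame(
    {"has_child": [0], "distance_mi": [35], "distance_category": [0]},
    index=[962],
)

# ignore_index=True discards the original labels and renumbers rows 0..n-1.
merged = pd.concat([households, tanks], ignore_index=True)

# The tank row ends up last, with NaN in the column it did not have.
tank_row = merged.iloc[-1]
```

Because the tank rows carry `distance_category` 0 while households use 1 through 4, filtering on that column is one way to separate the two kinds of rows again after stacking.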


In [52]:
df_charleston_merged.to_parquet(DATA_DIR + '/charleston_dist.parquet')

In [53]:
df = pd.read_parquet(DATA_DIR + '/charleston_dist.parquet')
df

Unnamed: 0,has_child,age_code,lat_3857,lon_3857,tank_type,diameter,distance_m,distance_mi,distance_category,is_elderly
0,2,H,-8.869661e+06,3.875817e+06,spherical_tank,14.4,25092.832386,15.591963,4,2
1,1,A,-8.856254e+06,3.871121e+06,spherical_tank,14.4,36261.311576,22.531734,4,2
2,1,L,-8.865522e+06,3.882594e+06,spherical_tank,14.4,29581.640586,18.381179,4,1
3,2,A,-8.880948e+06,3.849462e+06,spherical_tank,14.4,25060.142859,15.571651,4,2
4,2,I,-8.924290e+06,3.873860e+06,spherical_tank,14.4,20891.898394,12.981624,4,2
...,...,...,...,...,...,...,...,...,...,...
99996,1,I,-8.917682e+06,3.892249e+06,spherical_tank,14.4,22324.804878,13.871991,4,2
99997,1,C,-8.886579e+06,3.899000e+06,spherical_tank,14.4,24397.885235,15.160143,4,2
99998,2,M,-8.859295e+06,3.869454e+06,spherical_tank,14.4,33803.525512,21.004537,4,1
99999,2,J,-8.921631e+06,3.886782e+06,spherical_tank,14.4,21980.681980,13.658163,4,1
