# Using Machine Learning to Calculate Shortest Distance Between Two Points
### Calculating the shortest distances between all households with children and storage tanks in the United States 

### Import statements
Need to install dask-geopandas and pygeos as well. Once you install these packages once, you can comment them out (you don't need to reinstall again).

In [1]:
# pip install dask-geopandas==0.1.0a4

In [2]:
# pip install pygeos

In [1]:
import dask
import pandas as pd
from dask import dataframe as dd
import dask_geopandas
import geopandas as gpd
import numpy as np
import os



### Setting ```DATA_DIR```
In order to read in files from this repository, we must set ```DATA_DIR``` to be the data folder within this repository. This requires ```os.getcwd()``` to return the path to the processing notebook of this repository, so ```xxx/codeplus-celine-dcc-package/procesing```, where ```xxx``` is the path to where you cloned this repository. If it is not, use ```os.chdir(path)``` to change the current working directory to ```xxx/codeplus-celine-dcc-package/procesing``` before getting the current working directory in ```DATA_DIR = os.getcwd()```, where ```path``` is ```xxx/codeplus-celine-dcc-package/procesing```.

In [2]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'

### Reading InfoUSA data
This is a pre-processed file including the demographic information for all households provided in the InfoUSA dataset. To understand this processing in more detail, visit processing notebook **01_merging_files**.

In [3]:
%%time
df_hh = pd.read_parquet(DATA_DIR + '/infousa_merged.parquet')
df_hh.head()

CPU times: user 54 ms, sys: 54.7 ms, total: 109 ms
Wall time: 139 ms


Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857
0,16965,42269,IN,3,0,C,39.230097,-76.864096,-8556472.0,4754685.0
1,79667,8484,NV,5,1,C,44.024061,-96.665285,-10760730.0,5469166.0
2,88819,35578,ID,1,1,I,34.490381,-112.402712,-12512610.0,4094840.0
3,16748,25538,PA,10,1,K,34.74522,-88.55372,-9857755.0,4129311.0
4,43449,11049,NJ,1,1,C,44.178941,-83.250028,-9267351.0,5493176.0


Next, we filter for only households with children. This is because the original InfoUSA data has over 192 million rows, and we needed to narrow it down- our researcher focuses on areas dense in children, so this made sense. However, since the test, synthetic data we are using in this notebook has under 1 million observations, we do not need to filter it. We left the original code chunk commented-out below.

In [4]:
# %%time
# df_hh = df_hh[(df_hh['has_child'] == 1)]
# df_hh

### Use Dask to transform pandas dataframe to a geopandas dataframe
For the code we use to calculate the shortest distance from each household to a tank, we must convert our dataframe ```df_hh``` to a GeoDataFrame. However, as the original InfoUSA dataframe has 53 million rows and 10 columns, converting it without using Dask is not feasible. We attempted it, and ran the code for three hours and it was still not done. Hence, we turned to Dask, an open-source Python library for parallel computing. It allows us to efficiently execute the transformation of our dataframe to a GeoDataFrame, even when working with over 53 million rows. 

To use Dask, we first converted our dataframe to a Dask dataframe, using Dask's ```.from_pandas()``` method. This method takes in our pandas dataframe along with the ```npartitions``` parameter, which is used to specify the number of 'sections' the dask dataframe will be split into.

Note: our synthetic data has around 1 million rows, not 53 million.

In [5]:
df_dask = dd.from_pandas(df_hh, npartitions = 500)

Then, we specify what manipulation to the dask dataframe ```df_dask``` to compute. In this case, we use Dask Geopandas' ```.points_from_xy()``` method to convert the pandas dask dataframe into a geopandas dask dataframe.

In [6]:
%%time
df_dask['geometry'] = dask_geopandas.points_from_xy(df_dask, 'lon_h_4326', 'lat_h_4326')

CPU times: user 119 ms, sys: 16 ms, total: 135 ms
Wall time: 134 ms


After, we convert the dask geodataframe into a geopandas dataframe:

In [7]:
%%time
gdf = dask_geopandas.from_dask_dataframe(df_dask)

CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 9.63 ms


Calling compute puts all the above code into action. Dask executes each set of commands on each partition, as specified above. This returns GeoDataFrame ```gdf_hh```, with over 53 million rows, in less than 20 seconds.

Note: using the synthetic data we get less than 1 million rows.

In [8]:
%%time
gdf_hh = gdf.compute()

CPU times: user 2.19 s, sys: 204 ms, total: 2.39 s
Wall time: 2.3 s


In [9]:
gdf_hh = gdf_hh.reset_index(drop = True)
gdf_hh

Unnamed: 0,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326,lon_h_4326,lat_h_3857,lon_h_3857,geometry
0,16965,42269,IN,3,0,C,39.230097,-76.864096,-8.556472e+06,4.754685e+06,POINT (-76.86410 39.23010)
1,30012,21146,KY,13,1,F,48.614617,-112.948037,-1.257332e+07,6.209721e+06,POINT (-112.94804 48.61462)
2,17646,37432,TX,5,0,D,27.867549,-99.856049,-1.111592e+07,3.232285e+06,POINT (-99.85605 27.86755)
3,84890,35036,DC,10,1,D,38.594432,-96.557246,-1.074870e+07,4.663743e+06,POINT (-96.55725 38.59443)
4,21340,52238,KS,6,0,G,46.269855,-84.546093,-9.411628e+06,5.823700e+06,POINT (-84.54609 46.26985)
...,...,...,...,...,...,...,...,...,...,...,...
72104,84204,08082,MI,2,0,L,36.850431,-101.866311,-1.133971e+07,4.418279e+06,POINT (-101.86631 36.85043)
72105,54558,13455,IN,5,0,F,47.798602,-91.304507,-1.016397e+07,6.073415e+06,POINT (-91.30451 47.79860)
72106,28494,22004,NJ,14,0,K,28.708379,-98.583914,-1.097431e+07,3.338581e+06,POINT (-98.58391 28.70838)
72107,88376,52039,VT,15,0,E,43.975508,-90.549634,-1.007994e+07,5.461653e+06,POINT (-90.54963 43.97551)


Filtering for only the ```geometry``` column, as it is the only one we need to run the code below.

In [10]:
gdf_hh = gdf_hh[['geometry']]
gdf_hh

Unnamed: 0,geometry
0,POINT (-76.86410 39.23010)
1,POINT (-112.94804 48.61462)
2,POINT (-99.85605 27.86755)
3,POINT (-96.55725 38.59443)
4,POINT (-84.54609 46.26985)
...,...
72104,POINT (-101.86631 36.85043)
72105,POINT (-91.30451 47.79860)
72106,POINT (-98.58391 28.70838)
72107,POINT (-90.54963 43.97551)


### Reading AST data
We read in our AST data. This is a pre-processed file containing the coordinates of each tank and the national risk index associated to each tank for six different natural hazards. This was done in processing notebook **04_risk_per_tank**. We also filter for only the columns we need.

In [11]:
df_tanks = gpd.read_file(DATA_DIR + '/tanks_risk_score.shp')
df_tanks.head()

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,on_floodpl,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,adj_risk,geometry
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-10151030.0,3568241.0,22033,0,4.149297,9.661013,14.415955,43.776313,9.471153,39.822684,20.216069,20.216069,POINT (-91.18830 30.50199)
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-10062820.0,3502289.0,22089,0,1.208395,6.264728,13.189863,13.190995,17.68582,12.877608,10.736235,10.736235,POINT (-90.39588 29.99019)
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9326761.0,4058617.0,13139,0,5.628088,12.104342,5.312985,31.912282,-1.0,7.696209,10.442318,10.442318,POINT (-83.78372 34.22175)
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.92625,-9787905.0,4566158.0,18129,0,4.926164,10.959311,2.206652,12.846449,-1.0,8.284501,6.53718,6.53718,POINT (-87.92625 37.90602)
4,New Mexico,closed_roof_tank,16.2,35.04534,-106.64843,-11872050.0,4170044.0,35001,0,18.185426,9.373074,-1.0,15.079099,-1.0,14.347347,9.497491,9.497491,POINT (-106.64843 35.04534)


### Converting pandas dataframe to GeoDataFrame
This time we do it without using Dask, as the original tanks dataset has under 100,000 observations.

In [12]:
gdf_tanks = gpd.GeoDataFrame(
    df_tanks, geometry=gpd.points_from_xy(df_tanks.lon_t_4326, df_tanks.lat_t_4326))

In [13]:
gdf_tanks = gdf_tanks.drop(['state', 'tank_type', 'diameter', 'county', 'on_floodpl', 'adj_risk'], axis = 1)
gdf_tanks.head()

Unnamed: 0,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,geometry
0,30.501991,-91.188296,-10151030.0,3568241.0,4.149297,9.661013,14.415955,43.776313,9.471153,39.822684,20.216069,POINT (-91.18830 30.50199)
1,29.990189,-90.395876,-10062820.0,3502289.0,1.208395,6.264728,13.189863,13.190995,17.68582,12.877608,10.736235,POINT (-90.39588 29.99019)
2,34.221754,-83.783722,-9326761.0,4058617.0,5.628088,12.104342,5.312985,31.912282,-1.0,7.696209,10.442318,POINT (-83.78372 34.22175)
3,37.906023,-87.92625,-9787905.0,4566158.0,4.926164,10.959311,2.206652,12.846449,-1.0,8.284501,6.53718,POINT (-87.92625 37.90602)
4,35.04534,-106.64843,-11872050.0,4170044.0,18.185426,9.373074,-1.0,15.079099,-1.0,14.347347,9.497491,POINT (-106.64843 35.04534)


### Finding the closest tank to each household
To find the tanks nearest to each household, we use an algorithm developed by the University of Helsinki. This code is copyrighted and licensed under the Creative Commons Attribution-ShareAlike 4.0 International licence and is available to the public to share and adapt, as long as it is attributed correctly and re-shared if edits are made. The material can be found [here](https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html). From this algorithm, we removed the code that calculates the distance between the two points. The reasoning for this is explained in further detail below.

These functions use the sklearn neighbors module, specifically the ```BallTree``` method, to use machine learning to identify the closest tank to each household. It returns a GeoDataFrame with the same number of indices inputted households GeoDataFrame, where each row corresponds to the row with the same index in the households GeoDataFrame. It also retains all the original columns in the inputted tanks GeoDataFrame.

In [14]:
from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15)

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)


def nearest_neighbor(left_gdf, right_gdf):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * (np.pi / 180), geom.y * (np.pi / 180))).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * (np.pi / 180), geom.y * (np.pi / 180))).to_list())

    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance between the nearest neighbors (in meters)

    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)

    return closest_points

Here, you can see the outputted dataframe has 72,109 rows- the same number of rows as the inputted ```gdf_hh``` GeoDataFrame, and the same columns as the inputted ```df_tanks``` GeoDataFrame. Tank at index 0 in ```df_closest_tanks``` is the tank nearest to household at index 0 in ```df_hh```, which is in the same order as ```gdf_hh``` and so on. 

Note: the original InfoUSA data frame outputted below should have 2,335,208 million rows.

In [15]:
%%time
df_closest_tanks = nearest_neighbor(gdf_hh, gdf_tanks)
df_closest_tanks

CPU times: user 4.35 s, sys: 9.06 ms, total: 4.36 s
Wall time: 4.37 s


Unnamed: 0,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,geometry
0,39.274244,-76.590036,-8.525964e+06,4.761031e+06,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,POINT (-76.59004 39.27424)
1,48.648099,-112.369065,-1.250887e+07,6.215361e+06,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,POINT (-112.36906 48.64810)
2,28.454539,-98.189295,-1.093038e+07,3.306403e+06,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,POINT (-98.18930 28.45454)
3,37.799390,-96.876770,-1.078427e+07,4.551125e+06,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,POINT (-96.87677 37.79939)
4,43.768952,-84.049854,-9.356387e+06,5.429756e+06,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,POINT (-84.04985 43.76895)
...,...,...,...,...,...,...,...,...,...,...,...,...
72104,36.903803,-101.602346,-1.131032e+07,4.425707e+06,1.449422,13.905716,-1.000000,17.652499,-1.000000,6.549785,6.592904,POINT (-101.60235 36.90380)
72105,44.937631,-93.049842,-1.035826e+07,5.611708e+06,1.804142,34.801849,-1.000000,37.084831,-1.000000,9.058043,13.791478,POINT (-93.04984 44.93763)
72106,28.462872,-98.188533,-1.093030e+07,3.307458e+06,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,POINT (-98.18853 28.46287)
72107,43.835515,-91.258938,-1.015890e+07,5.440023e+06,1.120444,14.250152,-1.000000,12.685188,-1.000000,10.498259,6.425674,POINT (-91.25894 43.83552)


In [16]:
df_closest_tanks = df_closest_tanks.drop(['geometry'], axis = 1)
df_closest_tanks.head()

Unnamed: 0,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk
0,39.274244,-76.590036,-8525964.0,4761031.0,7.975933,32.552888,23.8258,45.694335,3.543941,17.889439,21.913723
1,48.648099,-112.369065,-12508870.0,6215361.0,6.89767,11.518197,-1.0,4.25265,-1.0,9.396389,5.344151
2,28.454539,-98.189295,-10930380.0,3306403.0,0.509994,8.385226,5.089364,9.677147,-1.0,10.415246,5.679496
3,37.79939,-96.87677,-10784270.0,4551125.0,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414
4,43.768952,-84.049854,-9356387.0,5429756.0,2.205648,26.709422,5.171142,17.753031,0.0,8.982879,10.13702


Therefore, merging the two ```df_closest_tanks``` and ```df_hh_lat_lon``` will create a new dataframe, ```df_dist``` with the coordinates of each household corresponding to that of the tank nearest to it. This information is what we use to calculate distance. We create new dataframe ```df_hh_lat_lon``` from ```df_hh``` and only keep the latitude and longitude of each household, as these are the only two columns necessary to merge with ```df_closest_tanks``` in order to compute the distance between the household coordinates and the tank coordinates for each household.

In [17]:
df_closest_tanks = df_closest_tanks.reset_index()

In [18]:
df_hh_lat_lon = df_hh[['lat_h_4326', 'lon_h_4326']]
df_hh_lat_lon = df_hh_lat_lon.reset_index()
df_hh_lat_lon

Unnamed: 0,index,lat_h_4326,lon_h_4326
0,0,39.230097,-76.864096
1,1,44.024061,-96.665285
2,2,34.490381,-112.402712
3,3,34.745220,-88.553720
4,4,44.178941,-83.250028
...,...,...,...
72104,14995,48.505775,-95.731777
72105,14996,35.782166,-86.150851
72106,14997,35.332055,-111.938138
72107,14998,35.731249,-99.680925


In [19]:
%%time
df_dist = df_hh_lat_lon.merge(df_closest_tanks, left_index=True, right_index = True)
df_dist = df_dist.drop(['index_x', 'index_y'], axis = 1)
df_dist

CPU times: user 7.11 ms, sys: 6.06 ms, total: 13.2 ms
Wall time: 12 ms


Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk
0,39.230097,-76.864096,39.274244,-76.590036,-8.525964e+06,4.761031e+06,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723
1,44.024061,-96.665285,48.648099,-112.369065,-1.250887e+07,6.215361e+06,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151
2,34.490381,-112.402712,28.454539,-98.189295,-1.093038e+07,3.306403e+06,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496
3,34.745220,-88.553720,37.799390,-96.876770,-1.078427e+07,4.551125e+06,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414
4,44.178941,-83.250028,43.768952,-84.049854,-9.356387e+06,5.429756e+06,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72104,48.505775,-95.731777,36.903803,-101.602346,-1.131032e+07,4.425707e+06,1.449422,13.905716,-1.000000,17.652499,-1.000000,6.549785,6.592904
72105,35.782166,-86.150851,44.937631,-93.049842,-1.035826e+07,5.611708e+06,1.804142,34.801849,-1.000000,37.084831,-1.000000,9.058043,13.791478
72106,35.332055,-111.938138,28.462872,-98.188533,-1.093030e+07,3.307458e+06,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496
72107,35.731249,-99.680925,43.835515,-91.258938,-1.015890e+07,5.440023e+06,1.120444,14.250152,-1.000000,12.685188,-1.000000,10.498259,6.425674


To compute the distance between the two sets of coordinates (the household ones and the ones of the nearest tank), we use the haversine library. This library calculates the distance between two coordinates in EPSG 4326 projection, in kilometers. We multiplied the value by 1,000 to find the distance in meters.

In [20]:
import haversine as hs

In [21]:
%%time
import pandas as pd
from geopy import distance

def distancer(row):
    coords_1 = (row['lat_h_4326'], row['lon_h_4326'])
    coords_2 = (row['lat_t_4326'], row['lon_t_4326'])
    return (hs.haversine(coords_1, coords_2) * 1000)

df_dist['distance_m'] = df_dist.apply(distancer, axis=1)
df_dist.head()



CPU times: user 1.35 s, sys: 28.5 ms, total: 1.38 s
Wall time: 2.17 s


Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m
0,39.230097,-76.864096,39.274244,-76.590036,-8525964.0,4761031.0,7.975933,32.552888,23.8258,45.694335,3.543941,17.889439,21.913723,24103.37
1,44.024061,-96.665285,48.648099,-112.369065,-12508870.0,6215361.0,6.89767,11.518197,-1.0,4.25265,-1.0,9.396389,5.344151,1307560.0
2,34.490381,-112.402712,28.454539,-98.189295,-10930380.0,3306403.0,0.509994,8.385226,5.089364,9.677147,-1.0,10.415246,5.679496,1503771.0
3,34.74522,-88.55372,37.79939,-96.87677,-10784270.0,4551125.0,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414,819369.6
4,44.178941,-83.250028,43.768952,-84.049854,-9356387.0,5429756.0,2.205648,26.709422,5.171142,17.753031,0.0,8.982879,10.13702,78579.4


Dropping latitude and longitude for nearest tanks, because this is the data for plotting households. Then, calculating distance in miles, as stipulated by our researcher.

In [22]:
df_dist = df_dist.drop(['lat_t_4326', 'lon_t_4326'], axis = 1)

In [23]:
df_dist['distance_mi']  = df_dist['distance_m'] / 1609.344
df_dist.head()

Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m,distance_mi
0,39.230097,-76.864096,-8525964.0,4761031.0,7.975933,32.552888,23.8258,45.694335,3.543941,17.889439,21.913723,24103.37,14.97714
1,44.024061,-96.665285,-12508870.0,6215361.0,6.89767,11.518197,-1.0,4.25265,-1.0,9.396389,5.344151,1307560.0,812.48009
2,34.490381,-112.402712,-10930380.0,3306403.0,0.509994,8.385226,5.089364,9.677147,-1.0,10.415246,5.679496,1503771.0,934.40005
3,34.74522,-88.55372,-10784270.0,4551125.0,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414,819369.6,509.132686
4,44.178941,-83.250028,-9356387.0,5429756.0,2.205648,26.709422,5.171142,17.753031,0.0,8.982879,10.13702,78579.4,48.826977


Then, we categorize each household by its distances from the nearest tank. These boundaries were set by our researcher. Using the numpy library's ```.select()``` function, we can assign different values to each category. Households within 0.5 miles of a tank are marked as ```1```, households between 0.5 miles and one mile are marked as ```2``` and households between one and five miles from a tank are marked as ```3```. All other households are marked as ```4```.

In [24]:
import numpy as np
conditions = [(df_dist['distance_mi'] <= 0.5),
              ((df_dist['distance_mi'] > 0.5) & (df_dist['distance_mi'] <= 1)),
              ((df_dist['distance_mi'] > 1) & (df_dist['distance_mi'] <= 5)),
              (df_dist['distance_mi'] > 5)]



values = [1, 2, 3, 4]


df_dist['distance_category'] = np.select(conditions, values)
df_dist.head()

Unnamed: 0,lat_h_4326,lon_h_4326,lat_t_3857,lon_t_3857,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m,distance_mi,distance_category
0,39.230097,-76.864096,-8525964.0,4761031.0,7.975933,32.552888,23.8258,45.694335,3.543941,17.889439,21.913723,24103.37,14.97714,4
1,44.024061,-96.665285,-12508870.0,6215361.0,6.89767,11.518197,-1.0,4.25265,-1.0,9.396389,5.344151,1307560.0,812.48009,4
2,34.490381,-112.402712,-10930380.0,3306403.0,0.509994,8.385226,5.089364,9.677147,-1.0,10.415246,5.679496,1503771.0,934.40005,4
3,34.74522,-88.55372,-10784270.0,4551125.0,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414,819369.6,509.132686,4
4,44.178941,-83.250028,-9356387.0,5429756.0,2.205648,26.709422,5.171142,17.753031,0.0,8.982879,10.13702,78579.4,48.826977,4


Then, we merge the ```df_closest_tanks_hh``` dataframe with the ```df_hh``` dataframe to add back in the demographic data for each household, which we will use in our visualizations, and drop unnecessary columns to prepare for GPU visualizations.

In [25]:
df_hh = df_hh.reset_index()
df_dist = df_dist.reset_index()

In [26]:
df = df_hh.merge(df_dist, left_index = True, right_index = True)
df

Unnamed: 0,index_x,zip,county_fips,state,child_num,has_child,age_code,lat_h_4326_x,lon_h_4326_x,lat_h_3857,...,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_m,distance_mi,distance_category
0,0,16965,42269,IN,3,0,C,39.230097,-76.864096,-8.556472e+06,...,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,2.410337e+04,14.977140,4
1,1,79667,08484,NV,5,1,C,44.024061,-96.665285,-1.076073e+07,...,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,1.307560e+06,812.480090,4
2,2,88819,35578,ID,1,1,I,34.490381,-112.402712,-1.251261e+07,...,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,1.503771e+06,934.400050,4
3,3,16748,25538,PA,10,1,K,34.745220,-88.553720,-9.857755e+06,...,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,8.193696e+05,509.132686,4
4,4,43449,11049,NJ,1,1,C,44.178941,-83.250028,-9.267351e+06,...,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,7.857940e+04,48.826977,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72104,14995,54131,54228,WA,4,0,B,48.505775,-95.731777,-1.065681e+07,...,1.449422,13.905716,-1.000000,17.652499,-1.000000,6.549785,6.592904,1.375328e+06,854.589407,4
72105,14996,89254,16088,TN,12,1,A,35.782166,-86.150851,-9.590269e+06,...,1.804142,34.801849,-1.000000,37.084831,-1.000000,9.058043,13.791478,1.172872e+06,728.789107,4
72106,14997,10415,38214,AL,6,1,L,35.332055,-111.938138,-1.246090e+07,...,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,1.503874e+06,934.463905,4
72107,14998,66502,30474,ID,11,0,A,35.731249,-99.680925,-1.109643e+07,...,1.120444,14.250152,-1.000000,12.685188,-1.000000,10.498259,6.425674,1.151930e+06,715.776171,4


Dropping unnecessary columns and renaming

In [27]:
df = df.drop(['index_x', 'index_y', 'has_child', 'lat_h_4326_x', 'lon_h_4326_x', 'lat_t_3857', 'lon_t_3857', 'state', 'county_fips', 'zip','distance_m'], axis = 1)
df = df.rename(columns = {'lat_h_4326_y': 'lat_h_4326', 'lon_h_4326_y': 'lon_h_4326'})
df

Unnamed: 0,child_num,age_code,lat_h_3857,lon_h_3857,lat_h_4326,lon_h_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category
0,3,C,-8.556472e+06,4.754685e+06,39.230097,-76.864096,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,14.977140,4
1,5,C,-1.076073e+07,5.469166e+06,44.024061,-96.665285,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,812.480090,4
2,1,I,-1.251261e+07,4.094840e+06,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,934.400050,4
3,10,K,-9.857755e+06,4.129311e+06,34.745220,-88.553720,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,509.132686,4
4,1,C,-9.267351e+06,5.493176e+06,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,48.826977,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72104,4,B,-1.065681e+07,6.191414e+06,48.505775,-95.731777,1.449422,13.905716,-1.000000,17.652499,-1.000000,6.549785,6.592904,854.589407,4
72105,12,A,-9.590269e+06,4.270689e+06,35.782166,-86.150851,1.804142,34.801849,-1.000000,37.084831,-1.000000,9.058043,13.791478,728.789107,4
72106,6,L,-1.246090e+07,4.209098e+06,35.332055,-111.938138,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,934.463905,4
72107,11,A,-1.109643e+07,4.263704e+06,35.731249,-99.680925,1.120444,14.250152,-1.000000,12.685188,-1.000000,10.498259,6.425674,715.776171,4


### Processing for for GPU visualizations
Next, we process this data specifically for creating visualizations of it with the GPUs through the Cuxfilter library. 

#### Renaming ```df``` and ```df_tanks``` latitude and longitude coordinate columns

The Cuxfilter library only pulls coordinates from two columns: on latitude and one longitude column. This means that all the points displayed in the dashboard must be in the same column. Therefore, to plot tanks and households on the same dashboard, we append the dataframe with the coordinates for each tank to the dataframe with the coordinates for each household. To do so, the columns must be the same across both columns. Therefore, we renamed the ```lat_h_3857``` and ```lon_h_3857``` columns in the ```df``` dataframe to ```lat_3857``` and ```lon_3857```. We did the same to ```lat_h_4326``` and  ```lon_h_4326```, in addition to the columns in ```df_tanks```. When the ```df_tanks``` dataframe is appended to this one, we will have general latitude and longitude columns including coordinate information for all the households and tanks in Harris County.

In [28]:
# Renaming
df.rename(columns = {'lat_h_3857': 'lat_3857'}, inplace = True)
df.rename(columns = {'lon_h_3857': 'lon_3857'}, inplace = True)

df.rename(columns = {'lat_h_4326': 'lat_4326'}, inplace = True)
df.rename(columns = {'lon_h_4326': 'lon_4326'}, inplace = True)

df_tanks.rename(columns = {'lat_t_3857': 'lat_3857'}, inplace = True)
df_tanks.rename(columns = {'lon_t_3857': 'lon_3857'}, inplace = True)

df_tanks.rename(columns = {'lat_t_4326': 'lat_4326'}, inplace = True)
df_tanks.rename(columns = {'lon_t_4326': 'lon_4326'}, inplace = True)

#### Defining ```is_elderly```

In addition, the Datashader plotting library that Cuxfilter uses to create our visualization through the use of Graphical Processing Units (GPUs) is optimized for working with large dataframes. This comes with a couple constraints, however. One of these is that Datashader only takes numerical inputs when creating the custom charts the user can interact with, like the multiselect chart or the range slider. This means that instead of being able to categorize each household by whether or not its head of household is eldery by labelling it with ```strings``` as ```'Elderly'``` or ```'No elderly'```, we must label it numerically. Therefore, we must convert each age code to a number that indicates whether or not that household has an elderly head of household.

This is done with the numpy library's ```.where()``` function, which uses if-else conditions to assign values in a new column. In the code below, if the age_code is ```J```, ```K```, ```L``` or ```M```, the household is marked as ```1```, meaning elderly (this is based on the InfoUSA data dictionary), and marked as ```2```, not elderly, for all other values. 

In [29]:
df['is_elderly'] = np.where(((df['age_code'] == 'J') | (df['age_code'] == 'K') | 
                                    (df['age_code'] == 'L') | (df['age_code'] == 'M')), 1, 2)

df.head()

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,3,C,-8556472.0,4754685.0,39.230097,-76.864096,7.975933,32.552888,23.8258,45.694335,3.543941,17.889439,21.913723,14.97714,4,2
1,5,C,-10760730.0,5469166.0,44.024061,-96.665285,6.89767,11.518197,-1.0,4.25265,-1.0,9.396389,5.344151,812.48009,4,2
2,1,I,-12512610.0,4094840.0,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.0,10.415246,5.679496,934.40005,4,2
3,10,K,-9857755.0,4129311.0,34.74522,-88.55372,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414,509.132686,4,1
4,1,C,-9267351.0,5493176.0,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.0,8.982879,10.13702,48.826977,4,2


For the tanks dataset, we are setting the ```is_elderly``` column equal to ```0```; 0 will represent a tank in the GPU visualizations.

In [30]:
df_tanks['is_elderly'] = 0
# df_tanks

#### Appending ```df``` and ```df_tanks```

Here we are appending the household data and tanks data to each other. Once this data is merged, we are using the ```.drop()``` function to drop some of the columns. In this function, we specify the parameter ```axis``` equal to 1 because we are dropping columns, not index columns.

In [31]:
df_merged = df.append(df_tanks, ignore_index=True)
df_merged.head()

  df_merged = df.append(df_tanks, ignore_index=True)


Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,...,distance_mi,distance_category,is_elderly,state,tank_type,diameter,county,on_floodpl,adj_risk,geometry
0,3.0,C,-8556472.0,4754685.0,39.230097,-76.864096,7.975933,32.552888,23.8258,45.694335,...,14.97714,4.0,2,,,,,,,
1,5.0,C,-10760730.0,5469166.0,44.024061,-96.665285,6.89767,11.518197,-1.0,4.25265,...,812.48009,4.0,2,,,,,,,
2,1.0,I,-12512610.0,4094840.0,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,...,934.40005,4.0,2,,,,,,,
3,10.0,K,-9857755.0,4129311.0,34.74522,-88.55372,1.289903,15.693502,3.403131,19.224199,...,509.132686,4.0,1,,,,,,,
4,1.0,C,-9267351.0,5493176.0,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,...,48.826977,4.0,2,,,,,,,


In [32]:
df_merged = df_merged.drop(['state', 'tank_type', 'diameter', 'county', 'on_floodpl', 'adj_risk', 'geometry'], axis = 1)
df_merged.head()

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,3.0,C,-8556472.0,4754685.0,39.230097,-76.864096,7.975933,32.552888,23.8258,45.694335,3.543941,17.889439,21.913723,14.97714,4.0,2
1,5.0,C,-10760730.0,5469166.0,44.024061,-96.665285,6.89767,11.518197,-1.0,4.25265,-1.0,9.396389,5.344151,812.48009,4.0,2
2,1.0,I,-12512610.0,4094840.0,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.0,10.415246,5.679496,934.40005,4.0,2
3,10.0,K,-9857755.0,4129311.0,34.74522,-88.55372,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414,509.132686,4.0,1
4,1.0,C,-9267351.0,5493176.0,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.0,8.982879,10.13702,48.826977,4.0,2


#### Filling NaN Values

Since we appended ```df_tanks``` to ```df```, we have lots of ```NaN``` values in columns that were in ```df``` but not in ```df_tanks```, like ```age_code```, ```distance_category``` and ```distance_mi```. In the GPU visualizations we will be using this dataframe for, we would like to be able to identify which points are tanks, as this dataframe has a general ```lat_3857``` and ```lon_3857``` column for the coordinates of both households and tanks. 

First, we fill ```age_code``` NaN values, which represents tanks. We chose ```Z``` because it is far from the other letters indicating ```age_code``` and won't be mistaken for a certain age category.

In [33]:
df_merged['age_code'] = df_merged['age_code'].fillna('Z')
df_merged

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,3.0,C,-8.556472e+06,4.754685e+06,39.230097,-76.864096,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,14.977140,4.0,2
1,5.0,C,-1.076073e+07,5.469166e+06,44.024061,-96.665285,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,812.480090,4.0,2
2,1.0,I,-1.251261e+07,4.094840e+06,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,934.400050,4.0,2
3,10.0,K,-9.857755e+06,4.129311e+06,34.745220,-88.553720,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,509.132686,4.0,1
4,1.0,C,-9.267351e+06,5.493176e+06,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,48.826977,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73086,,Z,-1.010035e+07,5.222881e+06,42.411899,-90.732966,1.575536,17.648163,4.544047,21.537919,-1.000000,12.580429,9.647682,,,0
73087,,Z,-1.183249e+07,5.291041e+06,42.862335,-106.293070,3.312025,2.867939,-1.000000,10.280441,-1.000000,6.010181,3.745098,,,0
73088,,Z,-9.971313e+06,4.384699e+06,36.608666,-89.573830,17.807754,23.810359,8.253384,24.042775,-1.000000,18.432187,15.391077,,,0
73089,,Z,-7.944992e+06,5.135812e+06,41.831766,-71.371080,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062,12.418334,,,0


Next, we fill NaN values in the ```distance_category``` column with ```0``` to indicate that the point is a tank, like we did earlier with the ```is_elderly``` column. The other values indicate that the point is a household and is a certain distance category from the nearest tank.

In [34]:
df_merged['distance_category'] = df_merged['distance_category'].fillna(0)
df_merged

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,3.0,C,-8.556472e+06,4.754685e+06,39.230097,-76.864096,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,14.977140,4.0,2
1,5.0,C,-1.076073e+07,5.469166e+06,44.024061,-96.665285,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,812.480090,4.0,2
2,1.0,I,-1.251261e+07,4.094840e+06,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,934.400050,4.0,2
3,10.0,K,-9.857755e+06,4.129311e+06,34.745220,-88.553720,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,509.132686,4.0,1
4,1.0,C,-9.267351e+06,5.493176e+06,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,48.826977,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73086,,Z,-1.010035e+07,5.222881e+06,42.411899,-90.732966,1.575536,17.648163,4.544047,21.537919,-1.000000,12.580429,9.647682,,0.0,0
73087,,Z,-1.183249e+07,5.291041e+06,42.862335,-106.293070,3.312025,2.867939,-1.000000,10.280441,-1.000000,6.010181,3.745098,,0.0,0
73088,,Z,-9.971313e+06,4.384699e+06,36.608666,-89.573830,17.807754,23.810359,8.253384,24.042775,-1.000000,18.432187,15.391077,,0.0,0
73089,,Z,-7.944992e+06,5.135812e+06,41.831766,-71.371080,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062,12.418334,,0.0,0


Finally, we fill NaN values in the ```distance_mi``` column. 

The distance column in the final merged dataframe will represent the distance between a household and tank. However, for the tanks, there is no associated distance-- when we do the range slider for distance, only households in a certain distance range shown will change. Therefore, we want to insert a number into the distance column that will not actually interfere with the other distances. In this case, we are find the maximum distance and fill the ```distance_mi``` column with a number a little bit higher than that (the maximum distance is around 2807 miles, so we will fill the column in with 2810 miles).

In [35]:
df_merged['distance_mi'].max()

2806.5844393984503

In [36]:
df_merged['distance_mi'] = df_merged['distance_mi'].fillna(2810)
df_merged

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,3.0,C,-8.556472e+06,4.754685e+06,39.230097,-76.864096,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,14.977140,4.0,2
1,5.0,C,-1.076073e+07,5.469166e+06,44.024061,-96.665285,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,812.480090,4.0,2
2,1.0,I,-1.251261e+07,4.094840e+06,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,934.400050,4.0,2
3,10.0,K,-9.857755e+06,4.129311e+06,34.745220,-88.553720,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,509.132686,4.0,1
4,1.0,C,-9.267351e+06,5.493176e+06,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,48.826977,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73086,,Z,-1.010035e+07,5.222881e+06,42.411899,-90.732966,1.575536,17.648163,4.544047,21.537919,-1.000000,12.580429,9.647682,2810.000000,0.0,0
73087,,Z,-1.183249e+07,5.291041e+06,42.862335,-106.293070,3.312025,2.867939,-1.000000,10.280441,-1.000000,6.010181,3.745098,2810.000000,0.0,0
73088,,Z,-9.971313e+06,4.384699e+06,36.608666,-89.573830,17.807754,23.810359,8.253384,24.042775,-1.000000,18.432187,15.391077,2810.000000,0.0,0
73089,,Z,-7.944992e+06,5.135812e+06,41.831766,-71.371080,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062,12.418334,2810.000000,0.0,0


### Exporting to parquet file
Finally, we export this dataframe as a parquet file. It will be used in our visualizations. See visualization notebook **06_all_us_dist** for an example.

In [37]:
df_merged.to_parquet(DATA_DIR + '/distances_all_hh.parquet')

In [38]:
df = pd.read_parquet(DATA_DIR + '/distances_all_hh.parquet')
df

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,3.0,C,-8.556472e+06,4.754685e+06,39.230097,-76.864096,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,14.977140,4.0,2
1,5.0,C,-1.076073e+07,5.469166e+06,44.024061,-96.665285,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,812.480090,4.0,2
2,1.0,I,-1.251261e+07,4.094840e+06,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,934.400050,4.0,2
3,10.0,K,-9.857755e+06,4.129311e+06,34.745220,-88.553720,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,509.132686,4.0,1
4,1.0,C,-9.267351e+06,5.493176e+06,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,48.826977,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73086,,Z,-1.010035e+07,5.222881e+06,42.411899,-90.732966,1.575536,17.648163,4.544047,21.537919,-1.000000,12.580429,9.647682,2810.000000,0.0,0
73087,,Z,-1.183249e+07,5.291041e+06,42.862335,-106.293070,3.312025,2.867939,-1.000000,10.280441,-1.000000,6.010181,3.745098,2810.000000,0.0,0
73088,,Z,-9.971313e+06,4.384699e+06,36.608666,-89.573830,17.807754,23.810359,8.253384,24.042775,-1.000000,18.432187,15.391077,2810.000000,0.0,0
73089,,Z,-7.944992e+06,5.135812e+06,41.831766,-71.371080,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062,12.418334,2810.000000,0.0,0
