# Using Cuxfilter to Plot Coordinates on a Map of the United States
### A map of all United States households and tanks with a distance range slider
This visualization uses the Cuxfilter library and GPUs to plot households and tanks on a dashboard, with a range slider that shows households within a certain distance range of the nearest tank. There are also multiselect charts to view households within a certain age group and households within a specific distance category.

### Import statements

In [6]:
import pandas as pd
import numpy as np
import os

import cuxfilter
from cuxfilter.layouts import feature_and_five_edge, double_feature
import cudf

### Setting ```DATA_DIR```
To read files from this repository, ```DATA_DIR``` must point to the repository's data folder. The cell below assumes that ```os.getcwd()``` returns the path to the visualizations folder, ```xxx/codeplus-celine-dcc-package/visualizations```, where ```xxx``` is the path to where you cloned this repository. If it returns something else, call ```os.chdir(path)``` with ```path``` set to ```xxx/codeplus-celine-dcc-package/visualizations``` before running ```DATA_DIR = os.getcwd()```.

In [3]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('visualizations', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
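
The string replacement above depends on the current working directory being the visualizations folder. As a minimal sketch of a less fragile alternative, the helper below walks up from any starting folder until it reaches the repository root and returns its data folder. The function name ```find_data_dir``` and its ```repo_name``` parameter are hypothetical and not part of this repository.

```python
from pathlib import Path


def find_data_dir(start, repo_name="codeplus-celine-dcc-package"):
    """Walk up from `start` to the folder named `repo_name` and return its data folder.

    `find_data_dir` and `repo_name` are illustrative names, not part of the repository.
    """
    start = Path(start).absolute()
    for candidate in [start, *start.parents]:
        if candidate.name == repo_name:
            return candidate / "data"
    raise FileNotFoundError(f"{repo_name!r} not found at or above {start}")


# Example with a made-up clone location; in the notebook you could instead call
# DATA_DIR = str(find_data_dir(Path.cwd()))
example = find_data_dir("/xxx/codeplus-celine-dcc-package/visualizations")
```

This works from any subfolder of the clone, at the cost of assuming the repository folder keeps its original name.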

### Reading in household and proximity data (pre-processed)
This preprocessed file covers households across the whole US that have both children and elderly people, with the distance from each household to a tank already calculated in miles. The dataframe also includes information on the age group of the head of household, the latitudes and longitudes of the tanks and households, distance category, and tank_type.

The ```lat_3857``` and ```lon_3857``` coordinates will be the points we plot on our cuxfilter dashboard, and the rest of the variables are used for the range slider and multiselect tools the user can interact with.

In [9]:
df = pd.read_parquet(DATA_DIR + '/distances_all_hh.parquet')
df

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
0,3.0,C,-8.556472e+06,4.754685e+06,39.230097,-76.864096,7.975933,32.552888,23.825800,45.694335,3.543941,17.889439,21.913723,14.977140,4.0,2
1,5.0,C,-1.076073e+07,5.469166e+06,44.024061,-96.665285,6.897670,11.518197,-1.000000,4.252650,-1.000000,9.396389,5.344151,812.480090,4.0,2
2,1.0,I,-1.251261e+07,4.094840e+06,34.490381,-112.402712,0.509994,8.385226,5.089364,9.677147,-1.000000,10.415246,5.679496,934.400050,4.0,2
3,10.0,K,-9.857755e+06,4.129311e+06,34.745220,-88.553720,1.289903,15.693502,3.403131,19.224199,-1.000000,8.517747,8.021414,509.132686,4.0,1
4,1.0,C,-9.267351e+06,5.493176e+06,44.178941,-83.250028,2.205648,26.709422,5.171142,17.753031,0.000000,8.982879,10.137020,48.826977,4.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73086,,Z,-1.010035e+07,5.222881e+06,42.411899,-90.732966,1.575536,17.648163,4.544047,21.537919,-1.000000,12.580429,9.647682,2810.000000,0.0,0
73087,,Z,-1.183249e+07,5.291041e+06,42.862335,-106.293070,3.312025,2.867939,-1.000000,10.280441,-1.000000,6.010181,3.745098,2810.000000,0.0,0
73088,,Z,-9.971313e+06,4.384699e+06,36.608666,-89.573830,17.807754,23.810359,8.253384,24.042775,-1.000000,18.432187,15.391077,2810.000000,0.0,0
73089,,Z,-7.944992e+06,5.135812e+06,41.831766,-71.371080,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062,12.418334,2810.000000,0.0,0
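
The output above lists each point in both EPSG:4326 (degrees) and EPSG:3857 (metres). As a rough check of how the two relate, the sketch below applies the standard spherical Web Mercator formulas to the first row. It is an illustration only and is not how the preprocessed file was produced; the function name ```to_web_mercator``` is hypothetical. Judging by the values, the ```lat_3857``` column appears to hold the easting (derived from longitude) and ```lon_3857``` the northing (derived from latitude).

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by EPSG:3857


def to_web_mercator(lon_deg, lat_deg):
    """Convert EPSG:4326 degrees to EPSG:3857 metres as (easting, northing)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# First row of the dataframe: lat_4326 = 39.230097, lon_4326 = -76.864096
x, y = to_web_mercator(lon_deg=-76.864096, lat_deg=39.230097)
print(x, y)  # should land close to that row's lat_3857 and lon_3857 values
```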


Then, we filter for only households with elderly people. This was done because, when using the original InfoUSA and AST datasets, we needed to reduce the data in order to have enough memory to plot everything on one visualization. Note that we keep rows where ```is_elderly``` is ```1``` (households with elderly people) or ```0```, which indicates that the point is a tank (since we want to plot tanks on the visualization).

Note: this is not necessary when using the synthetic data because it is much smaller, but we will do it anyway for demonstration purposes.

In [10]:
df = df[(df['is_elderly'] == 1) | (df['is_elderly'] == 0)]
df.head()

Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly
3,10.0,K,-9857755.0,4129311.0,34.74522,-88.55372,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414,509.132686,4.0,1
6,13.0,M,-8814048.0,4606405.0,38.190735,-79.17794,1.089966,12.1688,4.443469,13.055235,13.572454,20.548916,10.81314,1315.124916,4.0,1
7,1.0,J,-9394150.0,3980900.0,33.64251,-84.389088,10.106869,20.938982,4.376207,39.504421,-1.0,25.022227,16.658118,383.766024,4.0,1
10,6.0,L,-7933237.0,5400470.0,43.578667,-71.265476,0.701085,16.941807,-1.0,13.247944,-1.0,6.62092,6.251959,1132.882631,4.0,1
13,2.0,J,-8947433.0,4439898.0,37.005674,-80.376161,1.074121,6.642421,0.699773,9.833078,-1.0,3.128118,3.562919,974.195013,4.0,1
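
The same filter can also be written with ```isin```, which may be easier to extend if more ```is_elderly``` codes need to be kept. The sketch below uses a small made-up dataframe (```toy```) rather than the real data.

```python
import pandas as pd

# Made-up rows: is_elderly is 2 for some households, 1 for households with
# elderly people, and 0 for tanks.
toy = pd.DataFrame({
    "age_code": ["C", "K", "Z", "I", "M"],
    "is_elderly": [2, 1, 0, 2, 1],
})

# Keep households with elderly people (1) and tanks (0).
kept = toy[toy["is_elderly"].isin([0, 1])]
print(kept)
```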


### Labelling age codes with ```.select()```
Cuxfilter uses the Datashader plotting library, running on graphics processing units (GPUs), to create our visualization, and Datashader is optimized for working with large dataframes. This comes with a couple of constraints, however. One is that only numerical inputs are accepted when creating the custom charts the user can interact with, like the multiselect chart or the range slider. This means that instead of categorizing each household by its age group with ```strings``` such as ```'65+'``` or ```'70-74'```, we must label it numerically. Therefore, we convert each age code to a number that Datashader can understand.

This is done with the numpy library's ```.select()``` function, which uses a list of conditions to assign values in a new column, ```age_group```. In the code below, the age codes ```J```, ```K```, ```L``` and ```M``` are each mapped to a specific numeric value in the ```values``` list. Age code ```Z``` is labelled as ```0``` to indicate that the point is a tank. We only categorize elderly age codes because earlier we filtered for only households with elderly people.

In [12]:
# Each condition picks out the rows with one age code.
conditions = [(df['age_code'] == 'Z'), (df['age_code'] == 'J'),
              (df['age_code'] == 'K'), (df['age_code'] == 'L'),
              (df['age_code'] == 'M')]

# Numeric label for each condition, in the same order (0 marks a tank).
values = [0, 1, 2, 3, 4]

# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
# when a column is added to the filtered dataframe.
df = df.copy()
df['age_group'] = np.select(conditions, values)
df.head()


Unnamed: 0,child_num,age_code,lat_3857,lon_3857,lat_4326,lon_4326,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,distance_mi,distance_category,is_elderly,age_group
3,10.0,K,-9857755.0,4129311.0,34.74522,-88.55372,1.289903,15.693502,3.403131,19.224199,-1.0,8.517747,8.021414,509.132686,4.0,1,2
6,13.0,M,-8814048.0,4606405.0,38.190735,-79.17794,1.089966,12.1688,4.443469,13.055235,13.572454,20.548916,10.81314,1315.124916,4.0,1,4
7,1.0,J,-9394150.0,3980900.0,33.64251,-84.389088,10.106869,20.938982,4.376207,39.504421,-1.0,25.022227,16.658118,383.766024,4.0,1,1
10,6.0,L,-7933237.0,5400470.0,43.578667,-71.265476,0.701085,16.941807,-1.0,13.247944,-1.0,6.62092,6.251959,1132.882631,4.0,1,3
13,2.0,J,-8947433.0,4439898.0,37.005674,-80.376161,1.074121,6.642421,0.699773,9.833078,-1.0,3.128118,3.562919,974.195013,4.0,1,1
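
One detail worth keeping in mind: ```np.select``` fills rows that match none of the conditions with its ```default``` argument, which is ```0``` unless set otherwise. Since ```0``` means "tank" here, an unexpected age code would silently be labelled as a tank. The sketch below, on a small made-up array, shows the difference a sentinel default makes; the names ```toy_codes``` and ```toy_conditions``` are illustrative only.

```python
import numpy as np

# Made-up age codes; 'C' is deliberately not covered by any condition.
toy_codes = np.array(['Z', 'J', 'K', 'L', 'M', 'C'])
toy_conditions = [toy_codes == code for code in ['Z', 'J', 'K', 'L', 'M']]
toy_values = [0, 1, 2, 3, 4]

# Default behaviour: the unmatched 'C' row gets 0, the same label as a tank.
with_default_zero = np.select(toy_conditions, toy_values)

# With a sentinel, unmatched rows stand out and can be checked or dropped.
with_sentinel = np.select(toy_conditions, toy_values, default=-1)

print(with_default_zero)
print(with_sentinel)
```

In this notebook the earlier filter is expected to leave only the five mapped codes, so the default may never be hit; the sentinel is simply a cheap way to confirm that.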


### Transforming to cuxfilter dataframe

This transforms the pandas dataframe into a cuDF dataframe, then from a cuDF dataframe into a Cuxfilter dataframe. This makes it possible to plot these dataframes using the Cuxfilter library.

In [15]:
cdf = cudf.DataFrame.from_pandas(df)

In [16]:
cux_df = cuxfilter.DataFrame.from_dataframe(cdf)

### Defining label maps
As noted above, the interactive charts only take numerical inputs, so our ```age_group``` column holds numerical indicators rather than ```strings``` like ```'70-74'```. ```0``` indicates that the point is a tank, and the other values indicate that the point is a household whose head of household is in a specific age group, as listed in the ```label_age_group``` dictionary below. The ```distance_category``` column follows the same structure: ```0``` indicates that the point is a tank, and the other values indicate that the point is a household in a specific distance category relative to a tank, as listed in the ```label_distance_category``` dictionary.

The label maps below associate each numerical value in our dataframe with a ```string``` label, which is displayed on the multiselect charts. The ```colors``` list provides the hex codes used to color each point on the map.

Note: Using the synthetic data in this repo, we are only plotting around 20 thousand points. But using the original InfoUSA and AST datasets, this visualization would plot around 12 million of them.

In [17]:
label_age_group = {0: 'Tank', 1: '65+ (inferred)', 
             2: '65-69 (reported)', 3: '70-74', 
             4: '75+ (reported)'}

In [20]:
label_distance_category = {0: 'Tank', 1: '0.5 mi away from tank', 
             2: '1 mi away from tank', 3: '5 mi away from tank', 
             4: 'More than 5 mi away from tank'}
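
Because the labels are looked up by numeric value, it can help to confirm that every value present in a column has an entry in its label map. The sketch below shows one way to do that in plain Python; ```unlabeled_values``` is a hypothetical helper and the sample values are made up.

```python
# Copied from the cell above so that this sketch runs on its own.
label_age_group = {0: 'Tank', 1: '65+ (inferred)',
                   2: '65-69 (reported)', 3: '70-74',
                   4: '75+ (reported)'}


def unlabeled_values(values, label_map):
    """Return the sorted distinct values that have no entry in label_map."""
    return sorted(set(values) - set(label_map))


# Made-up sample: 5 has no label, so it is reported.
sample_age_groups = [0, 1, 2, 3, 4, 4, 1, 5]
missing = unlabeled_values(sample_age_groups, label_age_group)
print(missing)
```

On the real data, the same helper could be called with the distinct values of ```df['age_group']``` or ```df['distance_category']```.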

In [18]:
from bokeh.palettes import Spectral11
colors = list(Spectral11)
colors.reverse()

### Defining cuxfilter charts

This code defines the charts for our interactive dashboard. The ```points``` chart is the main map with households and tanks plotted. We use coordinates in the EPSG 3857 coordinate system, as it is the one used by the Cuxfilter library. The points are colored according to the ```distance_mi``` column in our dataframe, which is passed as the ```aggregate_col``` parameter. The ```aggregate_fn``` parameter, which can be set to ```max```, ```min``` or ```mean```, specifies which aggregation of that column to perform when coloring the points; here it is set to ```mean```.

The ```distance_category``` column has four household categories, as described above, but the map itself is colored by the continuous ```distance_mi``` column: the 11 colors in the ```colors``` list shade each point by its distance to the nearest tank.

The next three charts define the interactive range slider and multiselects the user can interact with. In each of these lines, ```.multi_select``` specifies that the chart is a multiselect chart, while ```.range_slider``` specifies that the chart is a range slider. In each chart definition, we specify the column name from our dataframe that the chart should pull from and, for the multiselect charts, the label map Cuxfilter should use to label the options. For example, the ```distance_category``` chart is a multiselect chart that pulls from the ```distance_category``` column. The options on that multiselect chart are labelled according to the ```label_distance_category``` label map.

In [21]:
points = cuxfilter.charts.scatter(
    x='lat_3857', y='lon_3857', pixel_shade_type='linear', color_palette=colors,
    aggregate_fn='mean', aggregate_col='distance_mi', tile_provider="CartoDark",
    title='Households with Elderly and Children in Proximity to Tanks', legend=True)

age_group = cuxfilter.charts.multi_select('age_group', label_map=label_age_group)

distance_category = cuxfilter.charts.multi_select('distance_category', label_map=label_distance_category)

distance = cuxfilter.charts.range_slider('distance_mi')
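
To give a feel for what ```aggregate_fn = 'mean'``` does when several points fall on the same pixel, the sketch below bins a few made-up points into square pixels and averages their values. It is a conceptual illustration in plain Python, not Datashader's or Cuxfilter's actual code; ```aggregate_mean``` and ```pixel_size``` are hypothetical names.

```python
from collections import defaultdict


def aggregate_mean(points, pixel_size):
    """Group (x, y, value) points into square pixels and average each pixel's values."""
    buckets = defaultdict(list)
    for x, y, value in points:
        pixel = (int(x // pixel_size), int(y // pixel_size))
        buckets[pixel].append(value)
    return {pixel: sum(vals) / len(vals) for pixel, vals in buckets.items()}


# Made-up points: the first two share a pixel, so their distances are averaged.
toy_points = [(10.0, 10.0, 0.5), (12.0, 11.0, 1.5), (250.0, 40.0, 6.0)]
pixel_means = aggregate_mean(toy_points, pixel_size=100.0)
print(pixel_means)
```

With ```max``` or ```min``` instead of ```mean```, the pixel would take the largest or smallest distance among the points it contains.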

Finally, we use the ```.dashboard``` method to put these charts together as an interactive dashboard for the user. We first specify the main charts which will be displayed in the ```layout``` we choose, then specify the charts that will be displayed on the dashboard's ```sidebar```.

In [22]:
d = cux_df.dashboard([points], sidebar = [age_group, distance_category, distance])
# d = cux_df.dashboard([points], sidebar = [distance_category, distance])

### Displaying interactive dashboard

Running the commands below displays the interactive dashboard. The user can use the multiselect charts to view specific subsets of points, and use ```ctrl``` + click to select multiple categories in one multiselect at a time. The user can also use the range slider chart to view households within a certain distance range from a storage tank.

Using these interactive tools creates a copy of the data every time a user interacts with them, which may cause a memory allocation error. You can run ```nvidia-smi``` to see how much memory you are using on the GPUs.

In [23]:
d.show()
d.app(sidebar_width=290) # run the dashboard within the notebook cell

Dashboard running at port 44565
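
To keep an eye on GPU memory from inside the notebook rather than a terminal, ```nvidia-smi``` can be called through ```subprocess```. The sketch below assumes ```nvidia-smi``` is on the ```PATH``` and returns ```None``` when it is not; the function names are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_usage():
    """Return [(used_mib, total_mib), ...] per GPU, or None if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# In the notebook: print(gpu_memory_usage())
```

Calling ```gpu_memory_usage()``` before and after interacting with the dashboard gives a rough idea of how much memory each interaction adds.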
