# Filtering Through Data, Converting Coordinate Systems, Using ```.sjoin()```
### Processing AST data for further use

### Import statements

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import geopandas as gpd
import os

### Setting ```DATA_DIR```
To read files from this repository, we set ```DATA_DIR``` to the repository's data folder. This requires ```os.getcwd()``` to return the path to the processing folder of this repository, ```xxx/codeplus-celine-dcc-package/processing```, where ```xxx``` is the path to where you cloned the repository. If it does not, call ```os.chdir(path)``` with ```path``` set to ```xxx/codeplus-celine-dcc-package/processing``` before running ```DATA_DIR = os.getcwd()```.

In [3]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
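
The ```str.replace``` call above swaps every occurrence of ```processing``` in the path, so it would misfire if a parent folder name also contained that word. Below is a minimal sketch of an alternative, assuming the ```data``` folder sits next to ```processing``` as the output above suggests. ```data_dir_from``` is a hypothetical helper, not part of the repository, and it uses POSIX-style paths for illustration.

```python
from pathlib import PurePosixPath


def data_dir_from(cwd):
    """Return the sibling 'data' folder of a '.../processing' working directory.

    Hypothetical helper: assumes 'data' and 'processing' share a parent folder.
    """
    cwd = PurePosixPath(cwd)
    if cwd.name != 'processing':
        raise ValueError(f"expected to be run from 'processing', got {cwd}")
    return str(cwd.parent / 'data')


# Example with the path shown above; in the notebook you would pass os.getcwd().
print(data_dir_from('/hpc/home/at341/ondemand/codeplus-celine-dcc-package/processing'))
```

Raising an error when the working directory is not ```processing``` surfaces a wrong starting folder immediately, rather than later as a missing-file error.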

### Reading AST data
Here, we are reading in the synthetic AST data we created.

Note: the original AST dataset has ~98,000 observations, but our synthetic AST dataset only has 1% of that.

In [4]:
df_tanks = gpd.read_file(DATA_DIR + '/source_files/ast_files/ast_synthetic.shp')
df_tanks.head(n=3)

Unnamed: 0,tile_name,minx_polyg,miny_polyg,maxx_polyg,maxy_polyg,nw_corner_,nw_corne_1,se_corner_,se_corne_1,object_cla,diameter (,merged_bbo,bbox_withi,Category1,Category2,Category3,Category4,Category5,state,geometry
0,m_3009139_ne_15_060_20190726,283,280,291,315,30.502086,-91.188319,30.501896,-91.188273,closed_roof_tank,4.8,0,0,0.0,0.0,0.0,0.0,0.0,Louisiana,"POLYGON ((-91.18832 30.50209, -91.18827 30.502..."
1,m_2909005_ne_15_060_20190707,7130,2364,7181,2414,29.990328,-90.396031,29.990051,-90.395721,closed_roof_tank,30.0,1,0,0.0,0.0,0.0,0.0,0.0,Louisiana,"POLYGON ((-90.39603 29.99033, -90.39572 29.990..."
2,m_3408350_ne_17_060_20191023,4910,5661,4944,5696,34.221846,-83.783836,34.221662,-83.783608,closed_roof_tank,20.4,1,0,0.0,0.0,0.0,0.0,0.0,Georgia,"POLYGON ((-83.78384 34.22185, -83.78361 34.221..."


### Filtering through the data

The original dataset provided to us has columns we will not use in our visualizations. To reduce memory consumption and runtime, we keep only the columns the visualizations need.

In [5]:
df_tanks = df_tanks[['nw_corner_', 'nw_corne_1', 'se_corner_', 'se_corne_1', 'object_cla', 'diameter (', 'state', 'geometry']]
df_tanks.head(n=3)

Unnamed: 0,nw_corner_,nw_corne_1,se_corner_,se_corne_1,object_cla,diameter (,state,geometry
0,30.502086,-91.188319,30.501896,-91.188273,closed_roof_tank,4.8,Louisiana,"POLYGON ((-91.18832 30.50209, -91.18827 30.502..."
1,29.990328,-90.396031,29.990051,-90.395721,closed_roof_tank,30.0,Louisiana,"POLYGON ((-90.39603 29.99033, -90.39572 29.990..."
2,34.221846,-83.783836,34.221662,-83.783608,closed_roof_tank,20.4,Georgia,"POLYGON ((-83.78384 34.22185, -83.78361 34.221..."


### Computing average latitude and longitude coordinates for each tank
The original tank locations came as Polygon geometries; however, since we are plotting tanks across the US, plotting all ~98,000 of them as Polygon geometries through GeoViews is time-consuming and infeasible. Thus, we use the coordinates of two opposite corners of each tank geometry, the north-west corner (```nw_corner_```, ```nw_corne_1```) and the south-east corner (```se_corner_```, ```se_corne_1```), to calculate the center latitude and longitude of each tank. This lets us create a Point geometry for each tank to replace the Polygon geometry and plot all points through GeoViews without running into time issues.

In [6]:
df_tanks['avg_lat'] = (df_tanks['nw_corner_'] + df_tanks['se_corner_'])/2
df_tanks['avg_long'] = (df_tanks['nw_corne_1'] + df_tanks['se_corne_1'])/2
df_tanks.head(n=3)

Unnamed: 0,nw_corner_,nw_corne_1,se_corner_,se_corne_1,object_cla,diameter (,state,geometry,avg_lat,avg_long
0,30.502086,-91.188319,30.501896,-91.188273,closed_roof_tank,4.8,Louisiana,"POLYGON ((-91.18832 30.50209, -91.18827 30.502...",30.501991,-91.188296
1,29.990328,-90.396031,29.990051,-90.395721,closed_roof_tank,30.0,Louisiana,"POLYGON ((-90.39603 29.99033, -90.39572 29.990...",29.990189,-90.395876
2,34.221846,-83.783836,34.221662,-83.783608,closed_roof_tank,20.4,Georgia,"POLYGON ((-83.78384 34.22185, -83.78361 34.221...",34.221754,-83.783722
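
As a quick check of the arithmetic, the midpoint for the first tank can be computed by hand. This is a plain-Python sketch; ```bbox_center``` is a hypothetical helper, and the corner values are copied from row 0 of the table above.

```python
def bbox_center(nw_lat, nw_lon, se_lat, se_lon):
    """Midpoint of the north-west and south-east corners, as (lat, lon)."""
    return (nw_lat + se_lat) / 2, (nw_lon + se_lon) / 2


# Row 0 of the table above: nw_corner_, nw_corne_1, se_corner_, se_corne_1
lat, lon = bbox_center(30.502086, -91.188319, 30.501896, -91.188273)
print(round(lat, 6), round(lon, 6))  # 30.501991 -91.188296
```

The result matches the ```avg_lat``` and ```avg_long``` values shown for that row.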


We then filter again to keep only the relevant columns. We also rename the columns so that the names are standardized moving forward. The average latitude and longitude are named ```lat_t_4326``` and ```lon_t_4326```, respectively, to indicate that they are the latitude and longitude coordinates of a tank in the EPSG 4326 coordinate system. This will be important later, when we convert coordinate systems for our visualizations.

In [7]:
df_tanks = df_tanks[['state', 'object_cla', 'diameter (', 'avg_lat', 'avg_long', 'geometry']]

In [8]:
df_tanks = df_tanks.rename(columns = {'avg_lat': 'lat_t_4326', 'avg_long': 'lon_t_4326',
                                      'object_cla': 'tank_type', 'diameter (': 'diameter'})
df_tanks.head()

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,geometry
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,"POLYGON ((-91.18832 30.50209, -91.18827 30.502..."
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,"POLYGON ((-90.39603 29.99033, -90.39572 29.990..."
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,"POLYGON ((-83.78384 34.22185, -83.78361 34.221..."
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.92625,"POLYGON ((-87.92628 37.90604, -87.92622 37.906..."
4,New Mexico,closed_roof_tank,16.2,35.04534,-106.64843,"POLYGON ((-106.64852 35.04541, -106.64834 35.0..."


### Using pyproj and PROJ's transformer to convert from EPSG 4326 to EPSG 3857
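Many of our visualizations need coordinates in EPSG 3857, but the coordinates computed above are in EPSG 4326. We therefore use pyproj, the Python interface to the PROJ coordinate transformation software, to transform our EPSG 4326 coordinates to EPSG 3857. This adds two new columns with the transformed coordinates to our dataset. Note that the transformer returns values in (x, y) order, so as the cell below is written, ```lat_t_3857``` holds the x coordinate (easting, derived from longitude) and ```lon_t_3857``` holds the y coordinate (northing, derived from latitude).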

In [9]:
from pyproj import Proj, Transformer

## EPSG:4326 input is passed as (latitude, longitude); EPSG:3857 output comes back as (x, y)
transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
df_tanks['lat_t_3857'], df_tanks['lon_t_3857'] = transform_4326_to_3857.transform(
    df_tanks['lat_t_4326'], df_tanks['lon_t_4326'])
df_tanks.head()

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,geometry,lat_t_3857,lon_t_3857
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,"POLYGON ((-91.18832 30.50209, -91.18827 30.502...",-10151030.0,3568241.0
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,"POLYGON ((-90.39603 29.99033, -90.39572 29.990...",-10062820.0,3502289.0
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,"POLYGON ((-83.78384 34.22185, -83.78361 34.221...",-9326761.0,4058617.0
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.92625,"POLYGON ((-87.92628 37.90604, -87.92622 37.906...",-9787905.0,4566158.0
4,New Mexico,closed_roof_tank,16.2,35.04534,-106.64843,"POLYGON ((-106.64852 35.04541, -106.64834 35.0...",-11872050.0,4170044.0
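
To show what the transformation computes, here is a minimal sketch of the spherical Web Mercator formulas behind EPSG 3857, written in plain Python rather than with pyproj. ```to_web_mercator``` is a hypothetical helper for illustration only; the notebook itself relies on pyproj.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by EPSG:3857


def to_web_mercator(lat_deg, lon_deg):
    """Project EPSG:4326 degrees to EPSG:3857 metres, returned as (x, y)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# First tank in the table above (lat_t_4326, lon_t_4326)
x, y = to_web_mercator(30.501991, -91.188296)
print(round(x), round(y))
```

For this tank, ```x``` comes out close to the value shown under ```lat_t_3857``` and ```y``` close to the value under ```lon_t_3857```. This agrees with the (x, y) output order of the transformer: the column labelled ```lat_t_3857``` carries the easting.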


### Using ```.sjoin()``` to classify tanks by county

#### Converting ```df_tanks``` to a GeoDataFrame
For some of our later processing, we need to classify each tank by county. To do so, we use GeoPandas' ```.sjoin()``` method to identify which county each tank belongs to. Since ```.sjoin()``` takes two GeoDataFrames, we make ```df_tanks``` a GeoDataFrame with Point geometries created from the ```lat_t_4326``` and ```lon_t_4326``` columns.

To do so, we call the ```gpd.GeoDataFrame``` constructor. We first pass in ```df_tanks``` (the dataframe we will convert), then pass the geometry, which ```gpd.points_from_xy()``` builds from the x column (```lon_t_4326```) followed by the y column (```lat_t_4326```).

In [10]:
df_tanks = gpd.GeoDataFrame(
    df_tanks, geometry=gpd.points_from_xy(df_tanks.lon_t_4326, df_tanks.lat_t_4326))
df_tanks.head()

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,geometry,lat_t_3857,lon_t_3857
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,POINT (-91.18830 30.50199),-10151030.0,3568241.0
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,POINT (-90.39588 29.99019),-10062820.0,3502289.0
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,POINT (-83.78372 34.22175),-9326761.0,4058617.0
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.92625,POINT (-87.92625 37.90602),-9787905.0,4566158.0
4,New Mexico,closed_roof_tank,16.2,35.04534,-106.64843,POINT (-106.64843 35.04534),-11872050.0,4170044.0


#### Reading in county shapefile
To find which tanks are in each county, we use GeoPandas' ```.sjoin()``` method. Using this method, we perform a spatial join between each county's geometry and the dataframe of Point geometries for each tank in the US. For this, we need a dataframe with geometries for all counties in the US, which we took from the United States Census Bureau's Cartographic Boundary Files (available [here](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html)). We then exclude counties in Alaska, Hawaii, Puerto Rico, the Virgin Islands, American Samoa, Guam, and the Northern Mariana Islands, as there are no tanks in those regions in the AST dataset. We also drop unnecessary columns and rename the remaining columns in a standardized way.

In [11]:
df_counties = gpd.read_file(DATA_DIR + '/source_files/county_shapefiles/counties.shp')
excluded_statefp = ['02', '15', '72', '78', '60', '66', '69']  ## AK, HI, PR, VI, AS, GU, MP
df_counties = df_counties[~df_counties['STATEFP'].isin(excluded_statefp)]
df_counties = df_counties[['NAME', 'GEOID', 'geometry']]
df_counties.rename(columns = {'NAME': 'county', 'GEOID': 'geoid'}, inplace = True)
df_counties.head()

Unnamed: 0,county,geoid,geometry
0,Riley,20161,"POLYGON ((-96.96095 39.28670, -96.96106 39.288..."
1,Ringgold,19159,"POLYGON ((-94.47167 40.81255, -94.47166 40.819..."
2,Carbon,30009,"POLYGON ((-109.79867 45.16734, -109.68779 45.1..."
3,Bear Lake,16007,"POLYGON ((-111.63452 42.57034, -111.63010 42.5..."
4,Buffalo,55011,"POLYGON ((-92.08384 44.41200, -92.08310 44.414..."


#### Iterating through each county and finding the tanks in that county
Next, to find which tanks are in every county in the US, we iterate through every county in ```df_counties```. For each county, we perform a spatial join between a GeoDataFrame holding that county and the ```df_tanks``` GeoDataFrame. With ```how='inner'``` and ```predicate='intersects'```, ```.sjoin()``` returns a new GeoDataFrame containing only the rows of ```df_tanks``` whose geometry intersects the county geometry. For example, passing in a GeoDataFrame with all the tanks and a GeoDataFrame with the geometry for Harris County returns a new GeoDataFrame with all the tanks in Harris County, since it keeps the Point geometries that intersect the Harris County Polygon geometry. This new GeoDataFrame keeps the index each tank had in the original ```df_tanks``` dataframe. This is key: it lets us take those indices and set the ```county``` column of ```df_tanks``` at each of them to the identifier for Harris County.

Since we need to do this for all counties in the US, we use a for loop. The loop iterates over every row of ```df_counties```, finds the intersection between that county and the tanks GeoDataFrame (```df_tanks```), collects the indices of those tanks, and sets the ```county``` column in ```df_tanks``` to label each of those tanks with that county.

We intentionally label each tank with the value from the ```geoid``` column rather than the county name, because in future processing we will merge this dataframe with another dataframe based on county FIPS numbers.

This takes around three minutes, since we are looping through 3,000 counties.

In [12]:
%%time

df_tanks['county'] = ''

for i in range(len(df_counties)):
    row = df_counties.iloc[[i]] ## selecting with a list keeps the county as a one-row GeoDataFrame
    df_intersect = gpd.sjoin(df_tanks, row, how='inner', predicate='intersects') ## finding tanks in that county
    ## labelling those tanks, by index label, with the county's geoid in a single assignment
    df_tanks.loc[df_intersect.index, 'county'] = row.iloc[0]['geoid']

df_tanks.head()

CPU times: user 36.1 s, sys: 6.6 ms, total: 36.1 s
Wall time: 36.3 s


Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,geometry,lat_t_3857,lon_t_3857,county
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,POINT (-91.18830 30.50199),-10151030.0,3568241.0,22033
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,POINT (-90.39588 29.99019),-10062820.0,3502289.0,22089
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,POINT (-83.78372 34.22175),-9326761.0,4058617.0,13139
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.92625,POINT (-87.92625 37.90602),-9787905.0,4566158.0,18129
4,New Mexico,closed_roof_tank,16.2,35.04534,-106.64843,POINT (-106.64843 35.04534),-11872050.0,4170044.0,35001
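
To make concrete what the spatial join decides for each tank, here is a minimal pure-Python sketch of a point-in-polygon test using the even-odd (ray casting) rule. It is an illustration only, not how GeoPandas implements ```.sjoin()```. The square "counties", their ids, and the tank points are made up, and points lying exactly on a boundary are not handled specially.

```python
def point_in_polygon(x, y, ring):
    """Even-odd rule: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Made-up square "counties" keyed by a made-up id, as (lon, lat) rings
counties = {
    'A': [(0, 0), (1, 0), (1, 1), (0, 1)],
    'B': [(1, 0), (2, 0), (2, 1), (1, 1)],
}
tanks = [(0.25, 0.5), (1.5, 0.5), (3.0, 0.5)]  # made-up (lon, lat) points

labels = []
for lon, lat in tanks:
    match = ''  # same default as the empty string used for df_tanks['county']
    for county_id, ring in counties.items():
        if point_in_polygon(lon, lat, ring):
            match = county_id
            break
    labels.append(match)

print(labels)  # ['A', 'B', '']
```

The third point falls outside both polygons and keeps the empty label, which mirrors how a tank that intersects no county would keep the initial empty string in ```df_tanks['county']```.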


### Exporting dataframe to shapefile

Now that the data has been processed, we export the new tank data to a shapefile that will be used later on. The argument to the ```to_file``` method is the path the dataframe should be exported to, including the name of the file it will be saved in.

In [13]:
df_tanks.to_file(DATA_DIR + '/ast_master.shp')

In [14]:
df = gpd.read_file(DATA_DIR + '/ast_master.shp')
df

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-1.015103e+07,3.568241e+06,22033,POINT (-91.18830 30.50199)
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-1.006282e+07,3.502289e+06,22089,POINT (-90.39588 29.99019)
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9.326761e+06,4.058617e+06,13139,POINT (-83.78372 34.22175)
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.926250,-9.787905e+06,4.566158e+06,18129,POINT (-87.92625 37.90602)
4,New Mexico,closed_roof_tank,16.2,35.045340,-106.648430,-1.187205e+07,4.170044e+06,35001,POINT (-106.64843 35.04534)
...,...,...,...,...,...,...,...,...,...
977,Iowa,closed_roof_tank,19.2,42.411899,-90.732966,-1.010035e+07,5.222881e+06,19061,POINT (-90.73297 42.41190)
978,Wyoming,sedimentation_tank,24.0,42.862335,-106.293070,-1.183249e+07,5.291041e+06,56025,POINT (-106.29307 42.86233)
979,Missouri,closed_roof_tank,8.4,36.608666,-89.573830,-9.971313e+06,4.384699e+06,29143,POINT (-89.57383 36.60867)
980,Rhode Island,closed_roof_tank,43.8,41.831766,-71.371080,-7.944992e+06,5.135812e+06,44007,POINT (-71.37108 41.83177)
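
One detail to keep in mind when exporting: the shapefile format stores attributes in a dBASE table, which limits field names to 10 characters. The shortened names in the source file (for example ```nw_corne_1```) are consistent with that limit. The sketch below checks column names before export; ```names_too_long``` is a hypothetical helper, not a GeoPandas function.

```python
MAX_FIELD_NAME_LEN = 10  # dBASE (.dbf) attribute-name limit used by shapefiles


def names_too_long(columns):
    """Return the column names that exceed the shapefile field-name limit."""
    return [name for name in columns if len(name) > MAX_FIELD_NAME_LEN]


# Column names of the exported table above, excluding the geometry column
columns = ['state', 'tank_type', 'diameter', 'lat_t_4326', 'lon_t_4326',
           'lat_t_3857', 'lon_t_3857', 'county']
print(names_too_long(columns))  # []
```

All of the standardized names fit within the limit, so they read back unchanged, as the table above shows.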
