# Performing Spatial Joins to Find Intersecting Geometries
### Classifying tanks by whether or not they lie on floodplains, and merging with National Risk Index Data

### Import statements

In [2]:
import os
import pandas as pd
import geopandas as gpd



### Setting ```DATA_DIR```
To read files from this repository, ```DATA_DIR``` must point to the repository's data folder. The cell below assumes that ```os.getcwd()``` returns the path to this repository's processing folder, ```xxx/codeplus-celine-dcc-package/processing```, where ```xxx``` is the path to where you cloned the repository. If it does not, call ```os.chdir(path)``` with ```path``` set to ```xxx/codeplus-celine-dcc-package/processing``` before running ```DATA_DIR = os.getcwd()```.

In [3]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'
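
Because ```str.replace``` substitutes every occurrence of ```processing``` in the path, a clone location that itself contains that word would produce a wrong ```DATA_DIR```. A minimal sketch of an alternative, assuming the notebook runs from the ```processing``` folder and that ```data``` sits beside it (the helper name ```data_dir_from``` is hypothetical):

```python
from pathlib import Path

def data_dir_from(cwd):
    """Return the sibling 'data' folder of the given working directory."""
    # Only the last path component is swapped, so earlier folders that
    # happen to contain the word 'processing' are left untouched.
    return Path(cwd).parent / "data"

# In the notebook this would be: DATA_DIR = str(data_dir_from(Path.cwd()))
example = data_dir_from("/home/user/processing_projects/codeplus-celine-dcc-package/processing")
print(example)
```

The string concatenations used later in this notebook (```DATA_DIR + '/ast_master.shp'```) keep working if the result is wrapped in ```str()```.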

### Reading AST Data
This is a pre-processed AST data file, created in processing notebook **02_processing_tanks**.

In [3]:
df_tanks = gpd.read_file(DATA_DIR + '/ast_master.shp')
df_tanks.head(n=3)

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-10151030.0,3568241.0,22033,POINT (-91.18830 30.50199)
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-10062820.0,3502289.0,22089,POINT (-90.39588 29.99019)
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9326761.0,4058617.0,13139,POINT (-83.78372 34.22175)


### Using ```.sjoin()``` to classify tanks by whether or not they are on a floodplain

#### Reading floodplain data from the Federal Emergency Management Agency (FEMA)
We then keep only the column we need (geometry) to find which tanks lie on a floodplain. This file was too large to include in this repository, but you can download it [here](https://catalog.data.gov/dataset/national-flood-hazard-layer-nfhl/resource/8c879bbb-00c7-4b67-aef1-c921d4315aee). Then place the dataset in the ```/data/source_files/nat_hazard_files/floodplain_files``` folder and name it ```floodplains.shp```. Alternatively, change the filepath below to point to where you are storing the data.

In [4]:
df_floodplains = gpd.read_file(DATA_DIR + '/source_files/nat_hazard_files/floodplain_files/floodplains.shp')
df_floodplains

Unnamed: 0,DFIRM_ID,VERSION_ID,BFE_LN_ID,ELEV,LEN_UNIT,V_DATUM,SOURCE_CIT,GFID,Shape_Leng,geometry
0,37111C,1.1.1.0,37111C_5701,992.0,Feet,NAVD88,37111C_STUDY4,c5ed825d-d798-4326-ac9f-521d5474ca51,0.000402,"LINESTRING (-81.85514 35.55379, -81.85481 35.5..."
1,27139C,1.1.1.0,27139C_480,950.0,Feet,NAVD88,27139C_FIS1,3f07c33a-59bb-42b9-9c05-b07416b15e4d,0.000819,"LINESTRING (-93.60993 44.55477, -93.61023 44.5..."
2,01073C,2.3.3.3,01073C_3287,537.0,Feet,NAVD88,01073C_STUDY1,0a367dee-ddb1-479a-a068-54482dbf5059,0.001161,"LINESTRING (-86.90697 33.56901, -86.90806 33.5..."
3,22087C,1.1.1.0,22087C_215,1.0,Feet,NAVD88,22087C_STUDY13,80996f13-99f4-4826-a737-86b261605b10,0.000073,"LINESTRING (-89.85064 29.86607, -89.85065 29.8..."
4,42047C,1.1.1.0,42047C_510,1580.0,Feet,NAVD88,42047C_STUDY2,dc310fd3-dd81-4716-b01e-d7fc042a15f7,0.001016,"LINESTRING (-78.64175 41.31597, -78.64266 41.3..."
...,...,...,...,...,...,...,...,...,...,...
157390,37185C,1.1.1.0,37185C_39672,285.0,Feet,NAVD88,37185C_STUDY4,71eee819-e974-48cb-8eda-e6b254606e9f,0.002508,"LINESTRING (-78.08776 36.36467, -78.08749 36.3..."
157391,05103C,1.1.1.0,05103C_568,135.0,Feet,NAVD88,05103C_STUDY1,,0.001235,"LINESTRING (-92.84933 33.58808, -92.84990 33.5..."
157392,37193C,1.1.1.0,37193C_10884,1160.0,Feet,NAVD88,37193C_STUDY4,,0.000446,"LINESTRING (-80.91487 36.33171, -80.91481 36.3..."
157393,41017C,2.1.3.0,41017C_433,3608.0,Feet,NAVD88,41017C_LOMC5,a9aceec6-e4a8-41cc-afb3-ae12d56fa2a2,0.000664,"LINESTRING (-121.32028 44.05004, -121.32035 44..."


In [5]:
df_floodplains = df_floodplains[['geometry']]
df_floodplains

Unnamed: 0,geometry
0,"LINESTRING (-81.85514 35.55379, -81.85481 35.5..."
1,"LINESTRING (-93.60993 44.55477, -93.61023 44.5..."
2,"LINESTRING (-86.90697 33.56901, -86.90806 33.5..."
3,"LINESTRING (-89.85064 29.86607, -89.85065 29.8..."
4,"LINESTRING (-78.64175 41.31597, -78.64266 41.3..."
...,...
157390,"LINESTRING (-78.08776 36.36467, -78.08749 36.3..."
157391,"LINESTRING (-92.84933 33.58808, -92.84990 33.5..."
157392,"LINESTRING (-80.91487 36.33171, -80.91481 36.3..."
157393,"LINESTRING (-121.32028 44.05004, -121.32035 44..."


#### Using GeoPandas' ```.buffer()```
This way, tanks within 200 meters of either side of a floodplain line are marked as near a floodplain. Because the floodplain data is given as linestring or multilinestring geometries, buffering it gives a more general picture of the tanks that are near areas of flooding risk, not only the ones directly on the line of risk.

In order to buffer the geometries by 200 meters, it is necessary to convert the coordinate system of the dataframe to EPSG 3857, as the unit of measurement for this coordinate system is the meter. The final conversion back to EPSG 4326 puts the floodplain data in the same coordinate system as the tank dataframe. This consistency is key in the next few steps.

In [6]:
%%time
df_floodplains = df_floodplains.to_crs("EPSG:3857")
df_floodplains = df_floodplains.buffer(200)
df_floodplains = df_floodplains.to_crs("EPSG:4326")
df_floodplains

CPU times: user 6.73 s, sys: 341 ms, total: 7.07 s
Wall time: 7.09 s


0         POLYGON ((-81.85592 35.55515, -81.85577 35.555...
1         POLYGON ((-93.61196 44.55519, -93.61196 44.555...
2         POLYGON ((-86.90879 33.56804, -86.90895 33.568...
3         POLYGON ((-89.84902 29.86540, -89.84903 29.865...
4         POLYGON ((-78.64366 41.31530, -78.64380 41.315...
                                ...                        
157390    POLYGON ((-78.08798 36.36610, -78.08644 36.366...
157391    POLYGON ((-92.85142 33.58805, -92.85152 33.588...
157392    POLYGON ((-80.91302 36.33142, -80.91301 36.331...
157393    POLYGON ((-121.31939 44.04892, -121.31943 44.0...
157394    POLYGON ((-84.54443 38.06779, -84.54456 38.067...
Length: 157395, dtype: geometry
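
One caveat worth keeping in mind: EPSG:3857 (Web Mercator) units match ground meters only at the equator, and the projection stretches distances by roughly 1/cos(latitude) elsewhere. The following sketch, which assumes a spherical approximation, estimates how many ground meters a 200-unit buffer covers at a few latitudes taken from the tank data above. The function name is hypothetical.

```python
import math

def ground_distance_m(projected_distance, latitude_deg):
    """Approximate ground meters covered by a distance measured in EPSG:3857 units."""
    # Web Mercator inflates lengths by about 1 / cos(latitude),
    # so the ground distance is the projected distance times cos(latitude).
    return projected_distance * math.cos(math.radians(latitude_deg))

for lat in (29.99, 39.15, 42.86):
    print(f"latitude {lat:.2f}: ~{ground_distance_m(200, lat):.0f} m on the ground")
```

Under this approximation the buffer is somewhat narrower than 200 m on the ground across the contiguous US; a local projected coordinate system would be needed if an exact distance matters.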

The ```.buffer()``` function returns a GeoSeries rather than a GeoDataFrame, but ```.sjoin()``` operates on GeoDataFrames, so we convert the GeoSeries back into a GeoDataFrame before finding which tanks lie on or near floodplains.

In [7]:
# Wrap the buffered GeoSeries as the geometry column of a new GeoDataFrame
gdf_floodplains = gpd.GeoDataFrame(geometry=df_floodplains)
gdf_floodplains

Unnamed: 0,geometry
0,"POLYGON ((-81.85592 35.55515, -81.85577 35.555..."
1,"POLYGON ((-93.61196 44.55519, -93.61196 44.555..."
2,"POLYGON ((-86.90879 33.56804, -86.90895 33.568..."
3,"POLYGON ((-89.84902 29.86540, -89.84903 29.865..."
4,"POLYGON ((-78.64366 41.31530, -78.64380 41.315..."
...,...
157390,"POLYGON ((-78.08798 36.36610, -78.08644 36.366..."
157391,"POLYGON ((-92.85142 33.58805, -92.85152 33.588..."
157392,"POLYGON ((-80.91302 36.33142, -80.91301 36.331..."
157393,"POLYGON ((-121.31939 44.04892, -121.31943 44.0..."


#### Finding the tanks that lie on/near floodplains using the ```.sjoin()``` function

The GeoPandas ```.sjoin()``` function performs a spatial join of two GeoDataFrames. With ```how='inner'``` and the ```predicate``` parameter set to ```'intersects'```, the output is a new GeoDataFrame containing only the rows of the first GeoDataFrame whose geometries intersect a geometry in the second. In other words, the function outputs a GeoDataFrame containing only the tanks that lie within 200 meters of a floodplain.

In [8]:
%%time
df_intersect = gpd.sjoin(df_tanks, gdf_floodplains, how='inner', predicate='intersects')
df_intersect.head()

CPU times: user 194 ms, sys: 5.81 ms, total: 200 ms
Wall time: 199 ms


Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry,index_right
124,Kansas,external_floating_roof_tank,24.6,39.151851,-94.633246,-10534520.0,4743446.0,20209,POINT (-94.63325 39.15185),79463
163,Wyoming,narrow_closed_roof_tank,4.8,42.856477,-106.228409,-11825290.0,5290152.0,56025,POINT (-106.22841 42.85648),82843
174,Texas,sedimentation_tank,51.0,33.925536,-98.472102,-10961860.0,4018808.0,48485,POINT (-98.47210 33.92554),149567
393,Indiana,sedimentation_tank,27.0,41.67866,-86.001444,-9573637.0,5112965.0,18039,POINT (-86.00144 41.67866),3493
402,Missouri,narrow_closed_roof_tank,4.2,39.029826,-94.526448,-10522640.0,4725945.0,29095,POINT (-94.52645 39.02983),22071
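
To illustrate how the join behaves, here is a small sketch on made-up data (three points and two overlapping rectangles, none of which come from the tank or floodplain files). It shows that a point intersecting two polygons appears twice in the result, while a point intersecting none is dropped.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy data: three points and two overlapping rectangles
points = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)
zones = gpd.GeoDataFrame(
    geometry=[box(0, 0, 2, 1), box(1, 0, 3, 1)],
    crs="EPSG:4326",
)

# Inner join: keep only points that intersect at least one rectangle
joined = gpd.sjoin(points, zones, how="inner", predicate="intersects")

# "b" lies in both rectangles, so it contributes two rows; "c" has no match
matched_names = sorted(joined["name"].tolist())
print(matched_names)
```

The repeated rows for one point are the reason a de-duplication step is needed after the join.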


Drop tanks with the same latitude and longitude, which are therefore duplicates:

In [9]:
df_intersect = df_intersect.drop_duplicates(subset = ['lat_t_4326', 'lon_t_4326'])
df_intersect.head()

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry,index_right
124,Kansas,external_floating_roof_tank,24.6,39.151851,-94.633246,-10534520.0,4743446.0,20209,POINT (-94.63325 39.15185),79463
163,Wyoming,narrow_closed_roof_tank,4.8,42.856477,-106.228409,-11825290.0,5290152.0,56025,POINT (-106.22841 42.85648),82843
174,Texas,sedimentation_tank,51.0,33.925536,-98.472102,-10961860.0,4018808.0,48485,POINT (-98.47210 33.92554),149567
393,Indiana,sedimentation_tank,27.0,41.67866,-86.001444,-9573637.0,5112965.0,18039,POINT (-86.00144 41.67866),3493
402,Missouri,narrow_closed_roof_tank,4.2,39.029826,-94.526448,-10522640.0,4725945.0,29095,POINT (-94.52645 39.02983),22071
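
An alternative sketch, assuming each row of ```df_tanks``` is a distinct tank with a unique index: because ```.sjoin()``` carries the left index into the result, de-duplicating on the index removes repeated matches without relying on coordinates. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical join result: tank 124 matched two buffered floodplains
joined = pd.DataFrame(
    {"state": ["Kansas", "Kansas", "Wyoming"], "index_right": [79463, 79470, 82843]},
    index=[124, 124, 163],
)

# Keep the first match for each tank, identified by its original index label
one_row_per_tank = joined[~joined.index.duplicated(keep="first")]
print(one_row_per_tank)
```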


#### Using ```df_intersect``` to classify the tanks in the original dataset

In [10]:
idx = list(df_intersect.index.values)

In [11]:
%%time
df_tanks['on_floodplain'] = 0

# idx holds index labels, so assign with a single label-based .loc call
# instead of chained indexing, which may write to a temporary copy
df_tanks.loc[idx, 'on_floodplain'] = 1

df_tanks

CPU times: user 5.42 ms, sys: 0 ns, total: 5.42 ms
Wall time: 5.42 ms


Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry,on_floodplain
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-1.015103e+07,3.568241e+06,22033,POINT (-91.18830 30.50199),0
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-1.006282e+07,3.502289e+06,22089,POINT (-90.39588 29.99019),0
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9.326761e+06,4.058617e+06,13139,POINT (-83.78372 34.22175),0
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.926250,-9.787905e+06,4.566158e+06,18129,POINT (-87.92625 37.90602),0
4,New Mexico,closed_roof_tank,16.2,35.045340,-106.648430,-1.187205e+07,4.170044e+06,35001,POINT (-106.64843 35.04534),0
...,...,...,...,...,...,...,...,...,...,...
977,Iowa,closed_roof_tank,19.2,42.411899,-90.732966,-1.010035e+07,5.222881e+06,19061,POINT (-90.73297 42.41190),0
978,Wyoming,sedimentation_tank,24.0,42.862335,-106.293070,-1.183249e+07,5.291041e+06,56025,POINT (-106.29307 42.86233),0
979,Missouri,closed_roof_tank,8.4,36.608666,-89.573830,-9.971313e+06,4.384699e+06,29143,POINT (-89.57383 36.60867),0
980,Rhode Island,closed_roof_tank,43.8,41.831766,-71.371080,-7.944992e+06,5.135812e+06,44007,POINT (-71.37108 41.83177),0


Now our original tanks dataframe, ```df_tanks```, has a column indicating whether or not each tank is near a floodplain.
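
The same flag can also be computed in one vectorized step. This is a minimal sketch on made-up stand-ins for ```df_tanks``` and the index of ```df_intersect```; it assumes the join result keeps the original tank index labels.

```python
import pandas as pd

# Hypothetical stand-ins for df_tanks and the index of df_intersect
tanks = pd.DataFrame({"state": ["Louisiana", "Kansas", "Wyoming", "Georgia"]})
intersect_index = pd.Index([1, 2])

# 1 where the tank's index label appears in the join result, 0 otherwise
tanks["on_floodplain"] = tanks.index.isin(intersect_index).astype(int)
print(tanks)
```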

### Reading National Risk Index Data, taken from FEMA
We want to classify each tank by its risk from a variety of natural hazards. To do this, we read in NRI data, filter for only the columns we want, as stipulated by our researcher, and rename them for standardization purposes. The NRI data was taken from the Federal Emergency Management Agency, and is available [here](https://hazards.fema.gov/nri/data-resources).

In [18]:
df_nri = gpd.read_file(DATA_DIR + '/source_files/nat_hazard_files/nri_files/NRI_GDB_Counties.gdb')
df_nri.head()

Unnamed: 0,NRI_ID,STATE,STATEABBRV,STATEFIPS,COUNTY,COUNTYTYPE,COUNTYFIPS,STCOFIPS,POPULATION,BUILDVALUE,...,WNTW_EALA,WNTW_EALT,WNTW_EALS,WNTW_EALR,WNTW_RISKS,WNTW_RISKR,NRI_VER,Shape_Length,Shape_Area,geometry
0,C21115,Kentucky,KY,21,Johnson,County,115,21115,23356,1924008000.0,...,4.235939,47363.199731,19.448529,Relatively Moderate,14.131237,Relatively Low,November 2021,190441.334565,1098944000.0,"MULTIPOLYGON (((-9196369.959 4562386.043, -919..."
1,C21117,Kentucky,KY,21,Kenton,County,117,21117,159720,18773380000.0,...,44.606252,64259.532691,21.530408,Relatively Moderate,12.47004,Relatively Low,November 2021,140730.907028,704249200.0,"MULTIPOLYGON (((-9407183.321 4735315.123, -940..."
2,C21119,Kentucky,KY,21,Knott,County,119,21119,16346,1170376000.0,...,0.023091,30809.75462,16.851393,Relatively Low,14.46627,Relatively Low,November 2021,211206.226178,1448900000.0,"MULTIPOLYGON (((-9233790.126 4509476.801, -923..."
3,C21121,Kentucky,KY,21,Knox,County,121,21121,31883,2135773000.0,...,0.082573,61427.308851,21.209328,Relatively Moderate,19.585915,Relatively Moderate,November 2021,237214.255701,1572984000.0,"MULTIPOLYGON (((-9305143.376 4432946.710, -930..."
4,C21123,Kentucky,KY,21,Larue,County,123,21123,14193,1221343000.0,...,246.668438,12870.385216,12.597091,Relatively Low,7.715952,Very Low,November 2021,226736.66586,1088060000.0,"MULTIPOLYGON (((-9520186.985 4516660.323, -952..."


In [19]:
df_nri = df_nri[['STCOFIPS', 'ERQK_RISKS', 'SWND_RISKS', 'HRCN_RISKS', 'TRND_RISKS', 'CFLD_RISKS', 'RFLD_RISKS']]
# Rename into a new dataframe rather than in place on a column subset
df_nri = df_nri.rename(columns = {'STCOFIPS': 'county', 'ERQK_RISKS': 'erqk_risks', 'SWND_RISKS': 'swnd_risks', 'HRCN_RISKS': 'hrcn_risks', 
                                  'TRND_RISKS': 'trnd_risks', 'CFLD_RISKS': 'cfld_risks', 'RFLD_RISKS': 'rfld_risks'})
df_nri

Unnamed: 0,county,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks
0,21115,1.953248,10.756017,1.668058,9.136885,,14.575572
1,21117,3.346640,19.688303,1.875497,24.280149,,8.279166
2,21119,2.281739,12.431024,1.129109,10.174559,,8.755275
3,21121,4.385020,17.589118,1.962140,19.273345,,14.443835
4,21123,2.042402,11.899304,2.473315,9.216597,,4.055177
...,...,...,...,...,...,...,...
3137,56037,2.070342,2.848189,,2.191509,,3.318171
3138,56039,4.292420,3.143585,,6.133900,,2.734316
3139,56041,3.206560,4.959357,,4.118598,,3.201339
3140,56043,3.156933,6.009518,,8.577072,,4.954794


Then, we merge this ```df_nri``` dataframe with our ```df_tanks``` dataframe, based on the ```county``` column. Each tank is therefore associated with the risk from each natural hazard in the county where it is located.

### Merging AST and NRI data using pandas' ```.merge()```

In [28]:
df_tank_risks = df_tanks.merge(df_nri, on = 'county', how = 'left')
df_tank_risks

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry,on_floodplain,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-1.015103e+07,3.568241e+06,22033,POINT (-91.18830 30.50199),0,4.149297,9.661013,14.415955,43.776313,9.471153,39.822684
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-1.006282e+07,3.502289e+06,22089,POINT (-90.39588 29.99019),0,1.208395,6.264728,13.189863,13.190995,17.685820,12.877608
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9.326761e+06,4.058617e+06,13139,POINT (-83.78372 34.22175),0,5.628088,12.104342,5.312985,31.912282,,7.696209
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.926250,-9.787905e+06,4.566158e+06,18129,POINT (-87.92625 37.90602),0,4.926164,10.959311,2.206652,12.846449,,8.284501
4,New Mexico,closed_roof_tank,16.2,35.045340,-106.648430,-1.187205e+07,4.170044e+06,35001,POINT (-106.64843 35.04534),0,18.185426,9.373074,,15.079099,,14.347347
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
977,Iowa,closed_roof_tank,19.2,42.411899,-90.732966,-1.010035e+07,5.222881e+06,19061,POINT (-90.73297 42.41190),0,1.575536,17.648163,4.544047,21.537919,,12.580429
978,Wyoming,sedimentation_tank,24.0,42.862335,-106.293070,-1.183249e+07,5.291041e+06,56025,POINT (-106.29307 42.86233),0,3.312025,2.867939,,10.280441,,6.010181
979,Missouri,closed_roof_tank,8.4,36.608666,-89.573830,-9.971313e+06,4.384699e+06,29143,POINT (-89.57383 36.60867),0,17.807754,23.810359,8.253384,24.042775,,18.432187
980,Rhode Island,closed_roof_tank,43.8,41.831766,-71.371080,-7.944992e+06,5.135812e+06,44007,POINT (-71.37108 41.83177),0,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062
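
A left merge keeps every tank even when its county has no match, so it can be useful to check the join. The sketch below uses made-up miniature tables (the county ```99999``` is invented) and assumes the ```county``` keys have the same dtype in both dataframes, which a merge on that column requires.

```python
import pandas as pd

# Hypothetical miniature versions of the tank and NRI tables
tanks = pd.DataFrame({"county": ["22033", "13139", "99999"], "diameter": [4.8, 20.4, 10.0]})
nri = pd.DataFrame({"county": ["22033", "13139"], "erqk_risks": [4.149297, 5.628088]})

# validate="m:1" raises an error if a county appears more than once in nri;
# indicator=True adds a _merge column showing which tanks found a match
merged = tanks.merge(nri, on="county", how="left", validate="m:1", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

Rows listed in ```unmatched``` would have NaN for every risk column, which is a different situation from a county that exists in the NRI data but lacks a value for one hazard.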


### Calculating average risk and handling NaN values
We noticed that the NRI data had a significant number of NaN values, indicating that there is no information for that cell. Dropping all the rows with NaN values would eliminate two-thirds of our data. However, NaN values generally appeared for counties with little to no risk from that specific natural hazard: counties in the center of the US had NaN values for the coastal flooding risk, for example. Therefore, after discussion with our researcher, we decided to calculate the average risk using ```0``` in place of every NaN value, and then to fill the NaN values with ```-1```, an implausible number, to indicate in our visualizations that these values were not recorded in the NRI data.

To do this, we made a copy of the original ```df_tank_risks```, which had all the tank information, with natural hazard risks associated with each tank. Then, we filled the NaN values of that copy, ```df_tank_risks_calc```, with the value ```0```. This is the dataframe we used to calculate the average risk for each tank, by adding all the risk indices and dividing the sum by the number of natural hazards (6). We also calculated ```adj_risk```, which is the average risk for the tank, adjusted for whether or not that tank lies near a floodplain. For this column, we added five points to the ```avg_risk``` if the tank was near a floodplain, using the ```on_floodplain``` column.

We also dropped all columns other than ```avg_risk``` and ```adj_risk```, because we then merge this dataframe with the original ```df_tank_risks``` dataframe so that, for each tank, we have the risk index for each individual natural hazard along with these computed risks.

In [36]:
df_tank_risks_calc = df_tank_risks.copy()
df_tank_risks_calc = df_tank_risks_calc.fillna(0)
df_tank_risks_calc['avg_risk'] = (df_tank_risks_calc['erqk_risks'] + df_tank_risks_calc['swnd_risks'] + 
                               df_tank_risks_calc['hrcn_risks'] + df_tank_risks_calc['trnd_risks'] + 
                               df_tank_risks_calc['cfld_risks'] + df_tank_risks_calc['rfld_risks']) / 6
df_tank_risks_calc['adj_risk'] = df_tank_risks_calc['avg_risk'] + (5 * df_tank_risks_calc['on_floodplain'])
df_tank_risks_calc = df_tank_risks_calc[['avg_risk', 'adj_risk']]
df_tank_risks_calc

Unnamed: 0,avg_risk,adj_risk
0,20.216069,20.216069
1,10.736235,10.736235
2,10.442318,10.442318
3,6.537180,6.537180
4,9.497491,9.497491
...,...,...
977,9.647682,9.647682
978,3.745098,3.745098
979,15.391077,15.391077
980,12.418334,12.418334
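
As a worked check of the calculation, the sketch below reproduces the values for the Georgia tank (row 2 above), whose coastal flooding risk is missing. It also shows, for comparison, what pandas' default NaN-skipping mean would give; that comparison is an illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

risk_cols = ["erqk_risks", "swnd_risks", "hrcn_risks", "trnd_risks", "cfld_risks", "rfld_risks"]

# Risk indices for the Georgia tank (row 2 above); cfld_risks is missing
row = pd.DataFrame(
    [[5.628088, 12.104342, 5.312985, 31.912282, np.nan, 7.696209]],
    columns=risk_cols,
)
row["on_floodplain"] = 0

# Missing hazards count as 0, so the sum is always divided by six
avg_risk = row[risk_cols].fillna(0).sum(axis=1) / len(risk_cols)
# Five points are added for tanks near a floodplain
adj_risk = avg_risk + 5 * row["on_floodplain"]
# For comparison: pandas' mean skips NaN and would divide by five here
mean_skipping_nan = row[risk_cols].mean(axis=1)

print(round(avg_risk.iloc[0], 6), round(adj_risk.iloc[0], 6), round(mean_skipping_nan.iloc[0], 6))
```

Treating NaN as 0 gives an average of about 10.44, matching the table above, whereas skipping the missing hazard would give about 12.53.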


In [38]:
df_tank_risks_merged = pd.merge(df_tank_risks, df_tank_risks_calc, left_index = True, right_index = True)
df_tank_risks_merged

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry,on_floodplain,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,adj_risk
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-1.015103e+07,3.568241e+06,22033,POINT (-91.18830 30.50199),0,4.149297,9.661013,14.415955,43.776313,9.471153,39.822684,20.216069,20.216069
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-1.006282e+07,3.502289e+06,22089,POINT (-90.39588 29.99019),0,1.208395,6.264728,13.189863,13.190995,17.685820,12.877608,10.736235,10.736235
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9.326761e+06,4.058617e+06,13139,POINT (-83.78372 34.22175),0,5.628088,12.104342,5.312985,31.912282,,7.696209,10.442318,10.442318
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.926250,-9.787905e+06,4.566158e+06,18129,POINT (-87.92625 37.90602),0,4.926164,10.959311,2.206652,12.846449,,8.284501,6.537180,6.537180
4,New Mexico,closed_roof_tank,16.2,35.045340,-106.648430,-1.187205e+07,4.170044e+06,35001,POINT (-106.64843 35.04534),0,18.185426,9.373074,,15.079099,,14.347347,9.497491,9.497491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
977,Iowa,closed_roof_tank,19.2,42.411899,-90.732966,-1.010035e+07,5.222881e+06,19061,POINT (-90.73297 42.41190),0,1.575536,17.648163,4.544047,21.537919,,12.580429,9.647682,9.647682
978,Wyoming,sedimentation_tank,24.0,42.862335,-106.293070,-1.183249e+07,5.291041e+06,56025,POINT (-106.29307 42.86233),0,3.312025,2.867939,,10.280441,,6.010181,3.745098,3.745098
979,Missouri,closed_roof_tank,8.4,36.608666,-89.573830,-9.971313e+06,4.384699e+06,29143,POINT (-89.57383 36.60867),0,17.807754,23.810359,8.253384,24.042775,,18.432187,15.391077,15.391077
980,Rhode Island,closed_roof_tank,43.8,41.831766,-71.371080,-7.944992e+06,5.135812e+06,44007,POINT (-71.37108 41.83177),0,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062,12.418334,12.418334


Finally, we fill in the NaN values for the risk data with -1, as discussed previously, and save this dataframe as a shapefile.

In [39]:
values = {"erqk_risks": -1, "swnd_risks": -1, "hrcn_risks": -1, "trnd_risks": -1, "cfld_risks": -1, "rfld_risks": -1}
df_tank_risks_merged = df_tank_risks_merged.fillna(value=values)
df_tank_risks_merged

Unnamed: 0,state,tank_type,diameter,lat_t_4326,lon_t_4326,lat_t_3857,lon_t_3857,county,geometry,on_floodplain,erqk_risks,swnd_risks,hrcn_risks,trnd_risks,cfld_risks,rfld_risks,avg_risk,adj_risk
0,Louisiana,closed_roof_tank,4.8,30.501991,-91.188296,-1.015103e+07,3.568241e+06,22033,POINT (-91.18830 30.50199),0,4.149297,9.661013,14.415955,43.776313,9.471153,39.822684,20.216069,20.216069
1,Louisiana,closed_roof_tank,30.0,29.990189,-90.395876,-1.006282e+07,3.502289e+06,22089,POINT (-90.39588 29.99019),0,1.208395,6.264728,13.189863,13.190995,17.685820,12.877608,10.736235,10.736235
2,Georgia,closed_roof_tank,20.4,34.221754,-83.783722,-9.326761e+06,4.058617e+06,13139,POINT (-83.78372 34.22175),0,5.628088,12.104342,5.312985,31.912282,-1.000000,7.696209,10.442318,10.442318
3,Indiana,narrow_closed_roof_tank,4.8,37.906023,-87.926250,-9.787905e+06,4.566158e+06,18129,POINT (-87.92625 37.90602),0,4.926164,10.959311,2.206652,12.846449,-1.000000,8.284501,6.537180,6.537180
4,New Mexico,closed_roof_tank,16.2,35.045340,-106.648430,-1.187205e+07,4.170044e+06,35001,POINT (-106.64843 35.04534),0,18.185426,9.373074,-1.000000,15.079099,-1.000000,14.347347,9.497491,9.497491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
977,Iowa,closed_roof_tank,19.2,42.411899,-90.732966,-1.010035e+07,5.222881e+06,19061,POINT (-90.73297 42.41190),0,1.575536,17.648163,4.544047,21.537919,-1.000000,12.580429,9.647682,9.647682
978,Wyoming,sedimentation_tank,24.0,42.862335,-106.293070,-1.183249e+07,5.291041e+06,56025,POINT (-106.29307 42.86233),0,3.312025,2.867939,-1.000000,10.280441,-1.000000,6.010181,3.745098,3.745098
979,Missouri,closed_roof_tank,8.4,36.608666,-89.573830,-9.971313e+06,4.384699e+06,29143,POINT (-89.57383 36.60867),0,17.807754,23.810359,8.253384,24.042775,-1.000000,18.432187,15.391077,15.391077
980,Rhode Island,closed_roof_tank,43.8,41.831766,-71.371080,-7.944992e+06,5.135812e+06,44007,POINT (-71.37108 41.83177),0,9.400549,11.049468,5.819224,19.608082,7.130619,21.502062,12.418334,12.418334


### Saving this as a shapefile

In [40]:
df_tank_risks_merged.to_file(DATA_DIR + '/tanks_risk_score.shp')
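
The shapefile format limits attribute field names to 10 characters, so longer column names are shortened when the file is written. A quick sketch of a check, using the column names of the final dataframe as listed in the tables above:

```python
# Column names of the final dataframe, excluding the geometry column
columns = [
    "state", "tank_type", "diameter", "lat_t_4326", "lon_t_4326",
    "lat_t_3857", "lon_t_3857", "county", "on_floodplain",
    "erqk_risks", "swnd_risks", "hrcn_risks", "trnd_risks",
    "cfld_risks", "rfld_risks", "avg_risk", "adj_risk",
]

MAX_FIELD_NAME = 10  # field-name limit of the shapefile (dBASE) format
too_long = [name for name in columns if len(name) > MAX_FIELD_NAME]
print(too_long)
```

Only ```on_floodplain``` exceeds the limit here, so it would come back under a shortened name when the shapefile is read again; renaming it before saving is one way to keep control over the result.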

