# This notebook demonstrates the process of:
- Accessing heat risk data from your own AWS S3 bucket
- Reading it into a pandas DataFrame
- Merging it with geospatial auxiliary data on your local machine
- Manipulating the data
- Visualizing the data on a map

## There are two tasks for this example:
1. What is the heat risk distribution in California at the tract level?
2. What is the heat risk distribution in the United States at the state level?

## Prerequisites
This notebook assumes the following:
- Understanding of the general idea of the model and methodology behind the datasets: https://firststreet.org/research-library/heat-model-methodology
    - The FSF-EHM utilizes several existing methods from the heat science community combined with scalable computational techniques and satellite imagery to produce new high-resolution heat hazards across the contiguous United States (CONUS). U.S. Federal Open Data sources support the production of a high resolution extreme heat product that allows individuals, communities, businesses, and governments to better understand and prepare for their heat risks both today and 30 years into the future.


- Have an established AWS account that allows for the creation of an access key and has access to the First Street data.  This process is described in the ```How to access data on AWS.docx``` document.  Follow the directions for processing the data via the Amazon S3 Buckets.

Note: For the purposes of this notebook, you may copy and paste the values from the provided `credentials.json` file instead of using your own AWS account credentials.

- Obtain the geospatial auxiliary data (https://catalog.data.gov/dataset/tiger-line-shapefile-2022-state-california-ca-census-tract) and download it to your local machine. Task 2 additionally reads the state boundary shapefile `cb_2018_us_state_500k.shp` from the same directory.
    The data should be in the directory `..\\Climate Risk data\\Auxiliary Data\\`. 
        If this is not the location of your data, modify the `auxilliaryDataDirectory` variable below to reflect the appropriate directory.
    
- Packages installed:
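    * pandas
    * s3fs (used by pandas to read `s3://` URIs)
    * getpass (part of the Python standard library, no installation needed)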
    * geopandas
    * matplotlib
    * folium
    * mapclassify

---
## Task 1 - What is the heat risk distribution in California at the tract level?

In [None]:
# Prompt the user for their secret key, access key, and URI for the S3 bucket
# Use getpass package to hide credentials
import getpass

tract_S3_URI = getpass.getpass('Enter tract S3 URI:') # this can be from the credentials file (tract_S3_URI)
key = getpass.getpass('Enter your aws_access_key_id:') # this can be from the credentials file
secret = getpass.getpass('Enter your aws_secret_access_key:') # this can be from the credentials file
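
As an alternative to pasting each value at the prompt, the values could be loaded from the provided `credentials.json` file. The sketch below is a minimal illustration and not part of the original workflow: the helper name `load_credentials` and the key names `aws_access_key_id` and `aws_secret_access_key` are assumptions about the file's layout, so adjust them to match your copy.

```python
import json

# Key names are assumptions about credentials.json; adjust to match your file.
REQUIRED_KEYS = ("tract_S3_URI", "aws_access_key_id", "aws_secret_access_key")

def load_credentials(path):
    """Load a JSON credentials file and check that the expected keys exist."""
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in creds]
    if missing:
        raise KeyError(f"credentials file is missing keys: {missing}")
    return creds

# Hypothetical usage (the path is an assumption):
# creds = load_credentials("credentials.json")
# tract_S3_URI = creds["tract_S3_URI"]
# key = creds["aws_access_key_id"]
# secret = creds["aws_secret_access_key"]
```

If you take this route, keep the credentials file out of version control, for the same reason the cell above hides the values with `getpass`.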

In [None]:
# Read heat risk tract data from the S3 bucket into a pandas DataFrame
import pandas as pd

heatRiskTractData = pd.read_csv(
    tract_S3_URI, # S3 object URI
    storage_options={
        "key": key, # aws_access_key_id
        "secret": secret # aws_secret_access_key
    })

heatRiskTractData.head()

In [None]:
# Define the location of our auxiliary data
auxilliaryDataDirectory = "..\\Climate Risk data\\Auxiliary Data\\"

In [None]:
# Read the tract geospatial data for California and rename the column
import geopandas as gpd

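auxilliaryData_CA = gpd.read_file(f"{auxilliaryDataDirectory}tl_2022_06_tract.shp")
auxilliaryData_CA = auxilliaryData_CA.to_crs("EPSG:4326") # reproject to EPSG:4326, as done for the state data in Task 2
auxilliaryData_CA = auxilliaryData_CA.rename(columns = {'GEOID':'fips'}) # match the 'fips' column name in the heat risk data for merging later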
auxilliaryData_CA.head()

In [None]:
# Zero-pad the FIPS codes to 11 digits, then merge the auxiliary and heat risk datasets.
# Use the FIPS code as the common key
heatRiskTractData['fips'] = heatRiskTractData['fips'].astype(str).str.zfill(11)
mergedHeatRiskTractData = pd.merge(heatRiskTractData, auxilliaryData_CA, on = 'fips')
mergedHeatRiskTractData.head()
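
The zero-padding step matters because FIPS codes that are parsed as integers lose their leading zeros. California's state code is `06`, so an 11-digit tract code becomes a 10-digit number and would not match the string GEOIDs in the shapefile. The standalone sketch below illustrates the idea with plain Python; the sample value is for illustration only and is not taken from the heat risk dataset.

```python
# An 11-digit tract GEOID is state (2) + county (3) + tract (6) digits.
# Illustrative value only: a California tract code whose leading zero was
# lost when the code was parsed as an integer.
fips_as_int = 6001400100

padded = str(fips_as_int).zfill(11)  # restore the leading zero
print(padded)  # 06001400100

# Split the padded code into its components
state, county, tract = padded[:2], padded[2:5], padded[5:]
print(state, county, tract)  # 06 001 400100
```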

The original dataset quantifies climate risk by assigning each property within a geographic unit a heat factor from 1 to 10, with 10 being the most severe. To summarize each unit with a single number, we derive a weighted average risk: multiply each factor by the number of properties assigned to it, sum the products, and divide by the total number of properties.
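
As a quick check of the arithmetic, the weighted average for a single geographic unit can be worked out by hand. The counts below are made-up numbers for illustration, and the total property count is assumed to equal the sum of the ten factor counts.

```python
# Made-up property counts per heat factor (1..10) for one hypothetical tract
counts = {1: 0, 2: 10, 3: 30, 4: 40, 5: 20, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0}

# Total number of properties in the tract: 100
count_property = sum(counts.values())

# Sum of factor * count: 2*10 + 3*30 + 4*40 + 5*20 = 370
weighted_sum = sum(factor * n for factor, n in counts.items())

average_risk = weighted_sum / count_property
print(average_risk)  # 3.7
```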

In [None]:
# Derive the weighted average risk and store it in the dataframe
mergedHeatRiskTractData['average_risk'] = 0
for i in range(1, 11):
    mergedHeatRiskTractData['average_risk'] += mergedHeatRiskTractData[f'count_heatfactor{i}'] * i
mergedHeatRiskTractData['average_risk'] /= mergedHeatRiskTractData['count_property']

# Subset with the attributes needed
final_data = mergedHeatRiskTractData[['fips', 'average_risk', 'geometry']]

# Display the final data
final_data

In [None]:
# Prepare a georeferenced dataframe
crs = "EPSG:4326" # EPSG:4326 (WGS 84) is a widely used geographic coordinate system; the {'init': ...} dict form is deprecated
georeferencedData = gpd.GeoDataFrame(final_data, crs = crs, geometry = final_data.geometry)

In [None]:
# Plot the results
georeferencedData.plot(column = 'average_risk', cmap = 'OrRd',
                       legend = True, legend_kwds={'shrink': 0.5, 'label':'Risk'},
                       markersize = 10)

---
## Task 2 - What is the heat risk distribution in the United States at the state level?

In [None]:
# Prompt the user for the URI for the state's S3 data bucket
# Note: This can be from the credentials file (state_S3_URI)
state_S3_URI = getpass.getpass('Enter state_S3_URI:')

In [None]:
# Read heat risk state data from S3 bucket into pandas dataframe
USHeatRiskData = pd.read_csv(
    state_S3_URI, #S3 object URI
    storage_options={
        "key": key, # aws_access_key_id
        "secret": secret # aws_secret_access_key
    })

# Display the first few rows of the state data
USHeatRiskData.head()

In [None]:
# Import the geopandas library to deal with GIS shape data
import geopandas as gpd

# Read the shapefile that holds the state boundaries for the United States
auxilliaryData_US = gpd.read_file(f"{auxilliaryDataDirectory}cb_2018_us_state_500k.shp")
auxilliaryData_US = auxilliaryData_US.to_crs("EPSG:4326")
auxilliaryData_US = auxilliaryData_US.rename(columns = {'GEOID':'fips'})

# Display the first few rows of the geospatial data
auxilliaryData_US.head()

In [None]:
# Calculate the weighted average risk for each state
USHeatRiskData['average_risk'] = 0
for i in range(1, 11):
    USHeatRiskData['average_risk'] += USHeatRiskData[f'count_heatfactor{i}'] * i
USHeatRiskData['average_risk'] /= USHeatRiskData['count_property']

In [None]:
# Merge both datasets with the same column "fips" and extract the columns we are interested in
USHeatRiskData['fips'] = USHeatRiskData['fips'].astype(str).str.zfill(2) # zero-pad state FIPS codes to 2 digits
result = pd.merge(USHeatRiskData, auxilliaryData_US, on = 'fips')
final_data = result[['fips', 'name', 'average_risk', 'geometry']]

# Display the first few rows of the final data
final_data.head()

In [None]:
# Prepare the geodataframe
crs = "EPSG:4326" # the {'init': ...} dict form is deprecated
georeferencedData = gpd.GeoDataFrame(final_data, crs = crs, geometry = final_data.geometry)

In [None]:
# Create an interactive map to explore the data
georeferencedData.explore(
    column = "average_risk",  # make choropleth based on "average_risk" column
    tooltip = "name",  # show "name" value in tooltip (on hover)
    popup = True,  # show all values in popup (on click)
    tiles = "CartoDB positron",  # use the "CartoDB positron" style tiles
    cmap = "OrRd",  # use "OrRd" matplotlib colormap
    style_kwds = dict(color = "black"),  # use black outline
    legend_kwds = dict(caption = "Heat Risk") # rename legend
)

## This concludes this example