# Example of student assignment for Climate Risk Analysis
## Exposing students to climate risk data and guiding them through analysis to extract insight from it.
- The original dataset utilizes a sophisticated model to quantify the risk of four different hazards (flood, fire, heat, wind) across the United States, broken down into national, state, county, sub-county, and tract layers. 
- This assignment involves the practice of data access, data manipulation, data enrichment, data visualization, and fundamental geospatial analysis.
- The data supports analyses across a wide range of complexity. Instructors are welcome to challenge students to think of new research questions and perform the analysis to find solutions.
## There are two tasks illustrated in this example:
1. What are the top five states that have the highest flood risk?
2. What is the geospatial distribution of flood risk across the sub-counties in West Virginia?
# Prerequisites
This notebook assumes the following:
- A general understanding of the model and methodology behind the datasets.  See https://firststreet.org/methodology/

    - The core of the First Street Foundation Flood Model (FSF-FM) is built on a set of hydraulic and hydrologic models. Earth and climate projection data are used to account for the causes and effects of inland and coastal flooding. Probabilistic flooding scenarios from climate projection analysis are established and ingested into the FSF-FM to produce realistic flood hazard layers for the current and future climate. The FSF-FM consists of four major components: 
        * inland (e.g., pluvial and fluvial) flood modeling
        * coastal flood modeling
        * computing (flood model execution)
        * post-processing.
- Subscribe to the dataset on AWS to get access: https://aws.amazon.com/marketplace/seller-profile?id=b777a8d0-ad41-4190-b94a-27e18e87e17f.  This process is explained in the ```How to access data on AWS.docx``` document.
    - The flood data has been downloaded to your local machine in the directory `..\Climate Risk data\Flood-Risk-Data`.
        Note: If this is not the location of your data, modify the `dataDirectory` variable below.
- Obtain auxiliary data source for geospatial information: https://catalog.data.gov/dataset/tiger-line-shapefile-2019-state-west-virginia-current-county-subdivision-state-based
    - Retrieve the "TIGER/Line Shapefile, 2019, state, West Virginia, Current County Subdivision State-based" dataset. The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB).
    - Download this dataset as a zip file and extract it into the directory `..\Climate Risk data\Auxiliary Data`
        Note: If this is not the location of your data, modify the `auxilliaryDataDirectory` variable below.

- The following packages need to be installed into your python environment:
    * pandas
    * geopandas
    * matplotlib
    * seaborn
    * contextily
    * folium
    * mapclassify
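
Before running the notebook, it can help to confirm that these packages are visible to the active Python environment. The following is a minimal sketch that uses only the standard library; the helper name `find_missing` is hypothetical and not part of the original notebook.

```python
from importlib.util import find_spec

# Import names of the packages listed above (for these packages the
# install name and the import name are the same).
REQUIRED_PACKAGES = [
    "pandas", "geopandas", "matplotlib", "seaborn",
    "contextily", "folium", "mapclassify",
]


def find_missing(packages):
    """Return the package names that cannot be located in this environment."""
    return [name for name in packages if find_spec(name) is None]


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Any name reported as missing needs to be installed with your environment's package manager before continuing.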


---
## Task 1
### Read data from data source on the local machine

In [None]:
# Define the locations of the data files
dataDirectory = "..\\Climate Risk data\\Flood-Risk-Data\\"
auxilliaryDataDirectory = "..\\Climate Risk data\\Auxiliary Data\\"
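
The paths above use Windows-style separators. If the notebook is run on macOS or Linux, one option is to build the same locations with `pathlib`, as in the sketch below. The variable names `base_dir`, `data_dir`, `aux_dir`, and `state_csv` are hypothetical; the folder layout is assumed to match the cell above.

```python
from pathlib import Path

# Hypothetical names; same folder layout as the notebook's own variables.
base_dir = Path("..") / "Climate Risk data"
data_dir = base_dir / "Flood-Risk-Data"
aux_dir = base_dir / "Auxiliary Data"

state_csv = data_dir / "fsf_flood_state_summary.csv"

# Report whether each location exists before trying to read from it.
for path in (data_dir, aux_dir, state_csv):
    status = "found" if path.exists() else "NOT found"
    print(f"{path}: {status}")
```

`pd.read_csv` accepts a `Path` object directly, so `pd.read_csv(state_csv)` would work in place of the f-string used in the next cell.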

In [None]:
# Import the pandas library and read the data
import pandas as pd

stateFloodSummary = pd.read_csv(f"{dataDirectory}fsf_flood_state_summary.csv")

### Quick exploratory data analysis
Explore the data to gain some familiarity with its structure and format

In [None]:
stateFloodSummary.head()

In [None]:
stateFloodSummary.describe()

In [None]:
stateFloodSummary.dtypes

In [None]:
stateFloodSummary.shape

In [None]:
# This shows the number of missing values in each column
stateFloodSummary.isnull().sum()

### Data manipulation and preprocessing

The original dataset quantifies climate risk by assigning each property within a geographical unit a risk factor from 1 to 10, with 10 the most severe. To summarize these counts in a single number, we create a new variable, the weighted average risk: multiply the number of properties at each factor by that factor's index (1 through 10), add up the ten products, and divide by the total number of properties.
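
As a worked illustration of the weighted average, the sketch below applies it to two made-up areas. The counts are invented for the example and are not taken from the dataset; only the column naming (`count_floodfactor1` to `count_floodfactor10`, `count_property`) follows the notebook.

```python
import numpy as np
import pandas as pd

factor_cols = [f"count_floodfactor{i}" for i in range(1, 11)]

# Made-up property counts per flood factor (factor 1 first, factor 10 last).
area_a = [60, 0, 0, 0, 30, 0, 0, 0, 0, 10]
area_b = [50, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy = pd.DataFrame([area_a, area_b], index=["Area A", "Area B"], columns=factor_cols)
toy["count_property"] = toy[factor_cols].sum(axis=1)

# Weighted sum: 1 * count_1 + 2 * count_2 + ... + 10 * count_10 for each row.
weights = np.arange(1, 11)
weighted_sum = toy[factor_cols].to_numpy() @ weights

# Divide by the total number of properties to get the average risk.
toy["average_risk"] = weighted_sum / toy["count_property"].to_numpy()
print(toy[["count_property", "average_risk"]])
```

For Area A the calculation is (60 × 1 + 30 × 5 + 10 × 10) / 100 = 3.1, and for Area B it is 50 / 50 = 1.0. The matrix product gives the same result as the loop in the next cell; it is shown here only as a compact alternative.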

In [None]:
# Derive the weighted average risk for each state; each count is weighted by its flood factor index (1 to 10)
stateFloodSummary['average_risk'] = 0
for i in range(1, 11):
    stateFloodSummary['average_risk'] += stateFloodSummary[f'count_floodfactor{i}'] * i
stateFloodSummary['average_risk'] /= stateFloodSummary['count_property']

# Sort the data by average risk from the highest to the lowest
sorted_data = stateFloodSummary.sort_values(by='average_risk', ascending = False)

# Display the top 10 states with the highest average risk
sorted_data.head(10)

### Data Visualization

Use the visualization packages ```matplotlib``` and ```seaborn``` to draw a bar chart of the top five states with the highest flood risk, based on the sorted data we just created. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a bar chart showing the top five states with the highest flood risk
plt.figure(figsize=(8,5))
palette = sns.color_palette("Blues", as_cmap = True)
ax = sns.barplot(data = sorted_data.head(), 
                 x = 'name', 
                 y = 'average_risk', 
                 hue="average_risk", 
                 palette = palette, 
                 legend = False)
plt.title('Top five States with highest flood risk')
ax.set(xlabel=None)
plt.tight_layout()

# This will save the plot as a .png file
# plt.savefig("Top Five States with highest flood risk.png")

---
## Task 2

### Read the subcounty data

In [None]:
floodSummary = pd.read_csv(f"{dataDirectory}fsf_flood_cousub_summary.csv")

### Quick exploratory data analysis
Explore the data to gain some familiarity with its structure and format

In [None]:
floodSummary.head()

In [None]:
floodSummary.describe()

In [None]:
floodSummary.dtypes

In [None]:
floodSummary.shape

In [None]:
floodSummary.isnull().sum()

In [None]:
# Drop rows that contain a null value in any column
floodSummary = floodSummary.dropna()
floodSummary.isnull().sum()
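
Calling `dropna()` with no arguments removes a row if any column is null, including columns the analysis never uses. The sketch below, on a made-up table, contrasts that with restricting the check to specific columns through the `subset` parameter. Whether this distinction matters for the real file depends on which columns contain nulls, which is an assumption to verify with the `isnull().sum()` output above.

```python
import pandas as pd

# Toy table (made-up values): one row lacks a name, another lacks a count.
toy = pd.DataFrame({
    "fips": [1, 2, 3],
    "name": ["Alpha", None, "Gamma"],
    "count_property": [100.0, 80.0, None],
})

# Drops every row with a null in any column (rows 2 and 3 here).
dropped_any = toy.dropna()

# Drops only rows where a column needed for the calculation is null (row 3 here).
dropped_needed = toy.dropna(subset=["count_property"])

print(len(toy), len(dropped_any), len(dropped_needed))
```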

### Data manipulation and enrichment

Data enrichment means adding new, supplemental information to an existing dataset, and it is often what makes an analysis possible. In this notebook, we want to visualize the risk distribution on a map, so we need geospatial information (a "geometry" column) for each sub-county area. We therefore downloaded the "tl_2019_54_cousub.shp" file, read it into geopandas, and merged it with the risk data on the shared FIPS column. 
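
When two tables are joined on a key, it is worth checking how many rows actually matched. The sketch below uses small made-up tables as stand-ins for the risk data and the shapefile attributes (the FIPS values and geometry strings are invented) and relies on the `indicator` and `validate` parameters of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins with made-up values; real geometries would be polygon objects.
risk = pd.DataFrame({
    "fips": ["5400190076", "5400190380", "5400391520"],
    "average_risk": [1.8, 2.4, 1.2],
})
shapes = pd.DataFrame({
    "fips": ["5400190076", "5400190380", "5409999999"],
    "geometry": ["polygon A", "polygon B", "polygon C"],
})

# An outer join keeps unmatched rows; `_merge` records where each row came from.
# validate="one_to_one" raises an error if a key is duplicated on either side.
checked = pd.merge(risk, shapes, on="fips", how="outer",
                   indicator=True, validate="one_to_one")
print(checked["_merge"].value_counts())

# Rows present in only one of the two tables.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

A default `pd.merge` call performs an inner join, which silently drops unmatched rows, so a check like this can reveal areas that would otherwise disappear from the map.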

In [None]:
# Import the geopandas library to deal with GIS shape data
import geopandas as gpd

# Read data that has the geospatial information for sub-county areas in West Virginia
geospatialData = gpd.read_file(f"{auxilliaryDataDirectory}tl_2019_54_cousub.shp")
geospatialData = geospatialData.to_crs("EPSG:4326")
geospatialData = geospatialData.rename(columns = {'GEOID':'fips'})

# Display the first few rows of the geospatial data
geospatialData.head()

In [None]:
# Extract the data from our main dataset that is for West Virginia
floodSummary['fips'] = floodSummary['fips'].astype(str)
# 54 is the FIPS code for West Virginia; .copy() avoids a SettingWithCopyWarning when columns are added below
floodSummary = floodSummary[floodSummary['fips'].str.startswith('54')].copy()

# Calculate the weighted average risk for each sub-county area in WV
floodSummary['average_risk'] = 0
for i in range (1,11):
    floodSummary['average_risk'] += floodSummary[f'count_floodfactor{i}'] * i
floodSummary['average_risk'] /= floodSummary['count_property']
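
If the `fips` column is read from the CSV as an integer, identifiers for states whose code begins with 0 lose their leading zero when converted to strings. That does not affect West Virginia (54), but it matters when the same steps are reused for other states. The sketch below pads made-up identifiers back to 10 characters, assuming the county subdivision layout of 2 state digits, 3 county digits, and 5 subdivision digits.

```python
import pandas as pd

# Made-up identifiers: the second one belongs to a state whose code starts with 0.
toy = pd.DataFrame({"fips": [5400190076, 100190171]})

# Assumption: county subdivision identifiers are 10 characters long
# (2 state + 3 county + 5 subdivision digits).
toy["fips"] = toy["fips"].astype(str).str.zfill(10)

# Filter on the two-digit state prefix, then copy so new columns can be added safely.
west_virginia = toy[toy["fips"].str.startswith("54")].copy()
print(toy["fips"].tolist())
print(west_virginia)
```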

In [None]:
# Merge both datasets with the same column "fips"
result = pd.merge(floodSummary, geospatialData, on = 'fips')

# Extract the columns we need to display the final data
final_data = result[['fips', 'name', 'average_risk', 'geometry']]
final_data

### Data Visualization
Knowing the coordinate reference system (CRS) of the data you plan to use is important, because different mapping services operate on different CRSs. EPSG:4326 is a widely used geographic CRS based on the WGS84 datum; its coordinates are longitude and latitude in degrees.
   - More information about the CRS: https://8thlight.com/insights/geographic-coordinate-systems-101
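
To give a feel for how much two CRSs can differ, the sketch below converts a longitude/latitude pair (EPSG:4326, degrees) into Web Mercator (EPSG:3857, metres), the CRS used by many web basemaps, with the standard spherical Mercator formulas. The function name `lonlat_to_web_mercator` and the sample point are illustrative only; in the notebook itself, geopandas' `to_crs` performs the conversion.

```python
import math

# WGS84 semi-major axis in metres, used as the sphere radius in EPSG:3857.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert EPSG:4326 degrees to EPSG:3857 metres (spherical Mercator)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Approximate point inside West Virginia, chosen only for illustration.
print(lonlat_to_web_mercator(-81.6, 38.35))
```

The same location is described by numbers of a few tens of degrees in one system and millions of metres in the other, which is why layers must share a CRS before they are drawn together.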

In [None]:
# GeoDataFrame preparation
# Pass the CRS as an authority string; the {'init': ...} dictionary form is deprecated
crs = "EPSG:4326"
geo_df = gpd.GeoDataFrame(final_data, crs = crs, geometry = final_data.geometry)

In [None]:
# Show an interactive map
geo_df.explore()

In [None]:
# Plot the flood risks
fig, ax = plt.subplots(figsize = (10,10))
geo_df.plot(column = 'average_risk', ax = ax, cmap = 'Blues',
            legend = True, legend_kwds={'shrink': 0.5, 'label':'Risk'}, 
            markersize = 10)
ax.set_title('West Virginia subcounties Flood Risk')

# Uncomment to save the plot as a .png file (call savefig before plt.show(), otherwise the saved image may be blank)
# plt.savefig('WV_cousub_floodrisk.png')
plt.show()

In [None]:
# Plot the data on top of a map
fig, ax = plt.subplots(figsize = (10,10))
df_wm = geo_df.to_crs(epsg=3857)
if df_wm is not None:
    df_wm.plot(column = 'average_risk',
               ax = ax, 
               cmap = 'Blues',
               legend = True, 
               legend_kwds={'shrink': 0.5, 'label':'Risk'}, 
               markersize = 10)
    ax.set_title('West Virginia subcounties Flood Risk')

# Import contextily to be able to add an external basemap
import contextily as ctx

# Add a background map to the plot
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
plt.show()

## There are many other potential analyses for students to explore and discover!
......