# Modeling

In this notebook, we model the urban heat island effect from demographic and environmental information. We evaluate the model's performance, including how it varies across demographic groups. We then interpret the model to derive insights about disparities in urban heat islands.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.model_selection import train_test_split
import shap
# import requisite tree library
# import hyperparameter tuning library, if necessary

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Data Pre-processing

SUHII is the surface urban heat island intensity. BSA is black-sky albedo, which tends to increase with SUHII. NDBI is the normalized difference built-up index. NDVI is the normalized difference vegetation index. Spatial lag is the average SUHII across a given urban area, excluding the point of interest.
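One way to read the spatial lag definition is as a leave-one-out mean within each urban area. The sketch below illustrates that reading on toy numbers; the `Urban Area` column and all values are hypothetical, and the dataset's actual spatial lag may have been computed differently.

```python
import pandas as pd

# Hypothetical toy data: 'Urban Area' is an assumed grouping column,
# not a column known to exist in data.csv.
toy = pd.DataFrame({
    'Urban Area': ['A', 'A', 'A', 'B', 'B'],
    'UHI': [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = toy.groupby('Urban Area')['UHI']

# Leave-one-out mean: (area total - own value) / (area count - 1).
toy['Spatial Lag'] = (grouped.transform('sum') - toy['UHI']) / (grouped.transform('count') - 1)

print(toy)
```

An area with a single point would divide by zero under this formula, so such areas would need separate handling.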

In [None]:
features = ['Black', 'Hispanic', 'White', 'Below Poverty', 
            'Population Density', 'BSA', 
            'NDBI', 'NDVI', 'Spatial Lag']
label = 'UHI'

In [None]:
data = pd.read_csv('data.csv', usecols=features+[label])

In [None]:
pd.DataFrame({'Total Missing': data.isna().sum(), 
              'Percent Missing': (data.isna().sum() / len(data)) * 100 })

In [None]:
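# We can either drop rows with missing values (data.dropna()) or interpolate.
# method='nearest' requires SciPy and fills each gap from the closest row by index.
data = data.interpolate(method='nearest')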
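
The two options for missing values can be compared on a small example. This is a minimal sketch with made-up values for a single column, intended only to show how each option changes the data.

```python
import numpy as np
import pandas as pd

# Toy column with an interior gap (illustrative values, not from data.csv).
toy = pd.DataFrame({'NDVI': [0.10, np.nan, np.nan, 0.40, 0.50]})

# Option 1: discard every row that has a missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the value from the nearest row by index
# (pandas delegates this method to SciPy).
filled = toy.interpolate(method='nearest')

print(len(toy), len(dropped), int(filled['NDVI'].isna().sum()))
```

Dropping rows shrinks the sample, while nearest-row filling keeps every row but is only meaningful if adjacent rows are related (for example, if the file is ordered spatially), which is an assumption about the data.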

## Model Training

Train your model below.
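
As a starting point, a train/test split followed by a tree ensemble might look like the sketch below. The notebook leaves the tree library open, so scikit-learn's `RandomForestRegressor` is an assumption here, as are the 80/20 split and the hyperparameters. The data frame is a synthetic stand-in so the snippet runs on its own; in the notebook, use the `data` loaded above instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = ['Black', 'Hispanic', 'White', 'Below Poverty',
            'Population Density', 'BSA',
            'NDBI', 'NDVI', 'Spatial Lag']
label = 'UHI'

# Synthetic stand-in for data.csv (random values, not real observations).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((300, len(features))), columns=features)
data[label] = 2 * data['Spatial Lag'] - data['NDVI'] + rng.normal(0, 0.1, len(data))

# Hold out 20% of the rows for evaluation (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data[label], test_size=0.2, random_state=0)

# Assumed model choice: a random forest with default-like settings.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
```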

## Model Performance

Compute the RMSE and $R^2$ of your model. Compute the residuals of your model. Create a scatter plot of the residuals by race and poverty status.
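
The metrics and residual plots could be produced roughly as follows. This sketch reuses a synthetic stand-in data set and an assumed `RandomForestRegressor` so that it runs on its own; the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

features = ['Black', 'Hispanic', 'White', 'Below Poverty',
            'Population Density', 'BSA', 'NDBI', 'NDVI', 'Spatial Lag']
label = 'UHI'

# Synthetic stand-in for data.csv and an assumed model (see lead-in).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((300, len(features))), columns=features)
data[label] = 2 * data['Spatial Lag'] - data['NDVI'] + rng.normal(0, 0.1, len(data))
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data[label], test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# RMSE is the square root of the mean squared error, in the label's units.
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f'RMSE: {rmse:.3f}, R^2: {r2:.3f}')

# Residuals: observed minus predicted, one per test row.
residuals = y_test - y_pred

# One panel per demographic feature; a flat band around zero suggests
# errors do not vary systematically with that feature.
groups = ['Black', 'Hispanic', 'White', 'Below Poverty']
fig, axes = plt.subplots(1, len(groups), figsize=(16, 4), sharey=True)
for ax, col in zip(axes, groups):
    ax.scatter(X_test[col], residuals, s=10, alpha=0.6)
    ax.axhline(0, color='black', linewidth=1)
    ax.set_xlabel(col)
axes[0].set_ylabel('Residual (observed - predicted)')
fig.tight_layout()
```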

### Discussion: Model Performance

Based on the RMSE, how did the model perform? Based on the $R^2$, how well did the model fit the data? How does model performance impact your communication of the model to your team? How did model performance vary across groups? Discuss your findings.

**Answer:**

## Model Interpretation

Here, we concentrate on the SHAP values. Feel free to visualize feature importance, permutation importance, and partial dependence plots (PDPs) as well. PDPs may be produced through the traditional mechanism or via SHAP values.

### SHAP Values

Compute the SHAP values. Create a SHAP bar plot and a beeswarm plot. Create SHAP scatter plots by race and poverty, colored by an environmental feature (as a heatmap) to account for environmental factors.

### Discussion: SHAP Values

According to the SHAP values, which features are the most predictive of SUHII? If you computed permutation importance on the test set, compare it with the SHAP results. According to the SHAP scatter plots, how do race and poverty vary with SUHII with respect to environmental factors? What did you expect, and do these results differ from your expectations?

**Answer:**