# Ames Housing Step-by-step - Exercise 7

Pieter Overdevest  
2024-02-09

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

### How to work with this Jupyter Notebook yourself?

- Get a copy of the repository ('repo') [machine-learning-with-python-explainers](https://github.com/EAISI/machine-learning-with-python-explainers) from EAISI's GitHub site. This can be done by either cloning the repo or simply downloading the zip file. Both options are explained in this YouTube video by [Coderama](https://www.youtube.com/watch?v=EhxPBMQFCaI).

- Copy the folders 'ames_housing_pieter\' and 'utils_pieter\' to your own project folder.

### Import packages

In [None]:
# Load packages and assign to a shorter alias.
import pandas as pd
import numpy as np

# Pieter's utils package.
import utils_pieter as up

## Exercise 7 - Interpretable Machine Learning - Using SHAP values to explain contribution of variables to prediction of Sale Price

Some suggestions to enter in ChatGPT:
* How to import a random forest regression model in Python?
* What hyperparameters do I have access to in this model?
* Can you suggest a dictionary with some common ranges to use with these hyperparameters?
* How to use this model in conjunction with gridsearchcv?

Of course, these questions are already quite specific. When you start out, your questions might be more generic, like "How do I check the performance over a range of hyperparameters?". The examples above show that, as you develop your knowledge, ChatGPT remains a very good source for suggesting snippets of code.

### Exercise A - Construct a data frame holding the imputed numerical data of the Ames Housing data set. Optionally, as a challenge, also include the one-hot encoded neighborhoods. Perform a train/test split on the data frame you constructed. Since SHAP calculations are computationally intensive, take the first 500 observations from the resulting training set (call it `df_X_fraction`) and the first 500 elements of the outcome variable from the resulting training set (call it `ps_y_fraction`).

In [None]:
# One way to obtain `df_X_train` and `ps_y_log_train` is to run scenario B in Exercise 5.3b.

# We take a fraction of the data to speed up the exercise; SHAP analyses are computationally intensive.
df_X_fraction = df_X_train.iloc[0:500]
ps_y_fraction = ps_y_log_train.iloc[0:500]
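
If you do not have the objects from Exercise 5.3b at hand, the steps of Exercise A can be sketched as follows. This is a minimal sketch on synthetic data: the column values, the 70/30 split, and the formula for the outcome are assumptions made for illustration and are not taken from the Ames Housing data set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imputed numerical data (hypothetical values).
rng = np.random.default_rng(42)
n_rows = 2000
df_X = pd.DataFrame({
    "Overall Qual": rng.integers(1, 11, size=n_rows),
    "Gr Liv Area": rng.normal(1500, 400, size=n_rows),
    "Year Built": rng.integers(1900, 2011, size=n_rows),
})

# Synthetic stand-in for the log of the sale price (hypothetical formula).
ps_y_log = pd.Series(
    10.5
    + 0.15 * df_X["Overall Qual"]
    + 0.0003 * df_X["Gr Liv Area"]
    + rng.normal(0, 0.1, size=n_rows),
    name="log_saleprice",
)

# Train/test split; the 70/30 ratio is an assumption.
df_X_train, df_X_test, ps_y_log_train, ps_y_log_test = train_test_split(
    df_X, ps_y_log, test_size=0.3, random_state=42
)

# Position-based slicing takes the same 500 rows from X and y.
df_X_fraction = df_X_train.iloc[:500]
ps_y_fraction = ps_y_log_train.iloc[:500]

print(df_X_fraction.shape, ps_y_fraction.shape)
```

Because `train_test_split` shuffles the rows, the index labels of the training set are no longer 0, 1, 2, and so on. Using `.iloc` on both objects makes explicit that the slice is by position, and the two fractions keep matching index labels.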

### Exercise B - Copy/paste the content from the cell below to your notebook.

In [None]:
# Import modules.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


# Hyperparameter grid:
dc_hyperparameter_ranges = {

    'bootstrap': [True,False], # Do we bootstrap samples, or not.
    'max_depth': [20],         # Maximum depth of each tree
    'min_samples_leaf': [4],   # Minimum number of samples required to be at a leaf node
    'min_samples_split': [4],  # Minimum number of samples required to split an internal node
    'n_estimators': [1000],    # Number of trees
    'max_features': [1.0],     # Fraction of variables to consider at each split (1.0 = all; 'auto' was removed in scikit-learn 1.3)
    'random_state': [42]       # Random state for reproducibility
}

# Perform a gridsearch on the random forest model:
gridsearch = GridSearchCV(
    estimator  = RandomForestRegressor(),
    param_grid = dc_hyperparameter_ranges,
    scoring    = 'neg_mean_squared_error',
    cv         = 5
)

# Use subset of training data to do a gridsearch on the random forest model:
gridsearch.fit(df_X_fraction, ps_y_fraction)
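
To get a feel for how much work the grid search does, you can count the hyperparameter combinations before fitting. The sketch below uses scikit-learn's `ParameterGrid` on a grid with the same shape as the one above (one hyperparameter with two values, all others with a single value). The name `demo_grid` is introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: only 'bootstrap' has more than one candidate value.
demo_grid = {
    'bootstrap': [True, False],
    'max_depth': [20],
    'min_samples_leaf': [4],
    'min_samples_split': [4],
    'n_estimators': [1000],
    'max_features': [1.0],
    'random_state': [42],
}

# Number of unique hyperparameter combinations in the grid.
n_candidates = len(ParameterGrid(demo_grid))

# Each candidate is fitted once per cross-validation fold.
n_folds = 5

# With the default refit=True, the best candidate is fitted once more on all data.
n_fits = n_candidates * n_folds + 1

print(n_candidates, n_fits)
```

With 1000 trees per forest, each of these fits is fairly expensive, which is one more reason to work with a fraction of the training data.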

### Exercise C - Show the value of the `best_params_` attribute of the `gridsearch` object. What do the attributes `best_params_` and `best_estimator_` refer to? Assign the value of the `best_estimator_` attribute to a new object called `best_model`.

In [None]:
# Best parameters from the set of hyperparameters.
gridsearch.best_params_

In [None]:
# This is the best model. So, we do not need to rerun the model with the best parameters.
best_model = gridsearch.best_estimator_
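
The relation between the two attributes can be checked on a small example. The sketch below uses synthetic data and a deliberately tiny grid so that it runs quickly; the names `demo_search` and `demo_best_model` and all data values are assumptions for illustration. It relies on the default `refit=True` of `GridSearchCV`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

demo_search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={"max_depth": [2, 8]},
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_search.fit(X, y)

# best_params_: the combination with the best mean cross-validation score.
best_params = demo_search.best_params_

# best_estimator_: a model with those parameters, refitted on all data passed to fit().
demo_best_model = demo_search.best_estimator_

# Predicting through the search object delegates to the best estimator.
same_predictions = np.allclose(demo_search.predict(X), demo_best_model.predict(X))

print(best_params, same_predictions)
```

This is why the notebook calls `gridsearch.predict(...)` and `best_model` interchangeably: both refer to the same refitted model.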

### The two cells below are extra (no exercise).

In [None]:
up.f_evaluate_results(
    ps_y_true = ps_y_fraction,
    ps_y_pred = gridsearch.predict(df_X_fraction)
)

In [None]:
# Model evaluation.
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))

# Note: predictions are made on the same 500 training observations the model was fitted on.
plt.scatter(ps_y_fraction, gridsearch.predict(df_X_fraction))
plt.plot([9, 14], [9, 14], color='r', linestyle='-', linewidth=2)
plt.xlabel('Actual', size=20)
plt.ylabel('Predicted', size=20);

### Exercise D - Copy/paste the content from the cell below to your notebook.

In [None]:
import shap
shap.initjs()

# Create SHAP object.
explainer = shap.Explainer(best_model)

# Create SHAP values.
shap_values = explainer.shap_values(df_X_fraction)

### Exercise E - Waterfall Plot

We can use the waterfall function to visualise the SHAP values of one observation. Copy/paste the content from the cell below to your notebook. The questions refer to this code cell and to the resulting plot.

1 - Define an object to which you assign the index of the data point you want to explain the prediction for. Assign the value to your object such that you can explain the prediction for the first observation in the data.

2 - Replace '...' by the object name you defined above.

3 - What is the value for explainer.expected_value[0]?

4 - Run the cell.

5 - What do you conclude from the resulting figure? Use: (1) the answer from question 3 and (2) the prediction for the first observation in the data.

For reference see also [API Reference of SHAP module](https://shap.readthedocs.io/en/latest/generated/shap.plots.waterfall.html).

In [None]:
# Set index (position) of the data point you want to explain the prediction for.
# Positions are zero-based: 0 is the first row of df_X_fraction, 1 is the second.
index = 1

# Plot waterfall
shap.plots.waterfall(
    
    shap.Explanation(
        base_values   = explainer.expected_value[0], # Mean prediction for the entire training data.
        values        = shap_values[index],          # Subset of shap values.
        data          = df_X_fraction.iloc[index],   # Subset of training data.
        feature_names = df_X_fraction.columns        # variable names.
))

In [None]:
# ANSWERS

# 1 - index = 1

# 2 - index

# 3 - It is the mean prediction for the entire training data. The answer should be a number; in my case it is 12.01.
# Though in individual cases the number may differ because we sampled the data differently.

# 5 - There will be a unique waterfall plot for every observation in our dataset. They can all be interpreted in
# the same way as above. In each case, the SHAP values tell us how the variables have contributed to the prediction
# when compared to the mean prediction. Large positive/negative values indicate that the variable had a significant
# impact on the model’s prediction. The prediction of 12.27 (in my case) can be constructed from the base value of
# 12.01 (mean prediction) and the contributions of the individual variables in the data, e.g., the value of 0.6416
# for Overall Qual (in my case) adds almost 0.26 to the prediction for the log of the sale price.
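
The statement that a prediction can be constructed from the base value plus the contributions of the individual variables can be verified on a toy example. The sketch below computes exact Shapley values by brute force for a hypothetical two-variable model and a hypothetical background data set. It is not how the `shap` tree explainer computes its values internally; it only illustrates the additivity property that the waterfall plot visualises.

```python
from itertools import combinations
from math import factorial


def model(row):
    """Hypothetical toy model, standing in for the random forest."""
    qual, area = row
    return 10.0 + 0.2 * qual + 0.001 * area + 0.0001 * qual * area


# Hypothetical background data (quality, living area) and the observation to explain.
background = [(5, 1000), (6, 1500), (7, 2000), (8, 1500)]
x = (8, 2000)
n = len(x)


def value(subset):
    """Mean model output with variables in `subset` fixed to x, others taken from background."""
    total = 0.0
    for b in background:
        mixed = tuple(x[j] if j in subset else b[j] for j in range(n))
        total += model(mixed)
    return total / len(background)


def shapley(i):
    """Exact Shapley value of variable i: weighted average of its marginal contributions."""
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for size in range(len(others) + 1):
        for s in combinations(others, size):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi += weight * (value(set(s) | {i}) - value(set(s)))
    return phi


base_value = value(set())               # Mean prediction over the background data.
shap_vals = [shapley(i) for i in range(n)]
prediction = model(x)

# Additivity: base value + sum of contributions equals the prediction.
print(base_value + sum(shap_vals), prediction)
```

The two printed numbers agree up to floating-point rounding. In the waterfall plot, the same bookkeeping starts at the base value at the bottom and ends at the prediction at the top.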

### Exercise F - Force Plot

Another way to visualise the SHAP values of one observation is a force plot. You can think of this as a condensed waterfall plot. Copy/paste the content from the cell below to your notebook. The questions refer to this code cell and to the resulting plot.

1 - The force plot is a different representation of the waterfall plot. Apply the questions from Exercise E to the cell below.

In [None]:
# Set index of data point you want to explain the prediction for. Write the answer to question 1 of Exercise E below.
index = 1

# Plot Force Plot.
shap.plots.force(

    base_value    = explainer.expected_value[0], # Mean prediction for the entire training data.
    shap_values   = shap_values[index],          # SHAP values.
    features      = df_X_fraction.iloc[index],   # Training data.
    feature_names = df_X_fraction.columns        # variable names.
)

In [None]:
# ANSWERS

# See Exercise E.

### Exercise G - Stacked Force Plot

Waterfall and force plots are great for interpreting individual predictions. To understand how our model makes predictions in general, we need to aggregate the SHAP values. One way to do this is by using a stacked force plot, which combines multiple force plots. Here we pass all SHAP values to the force plot function, though we could also pass a subset. Each individual force plot is now vertical, and the plots are stacked side by side. Copy/paste the content from the cell below to your notebook. The questions refer to this code cell and to the resulting plot.

1 - Run the cell and point out the value for explainer.expected_value[0].

2 - Set both dropdowns (the one at the top of the figure and the one to the left of the figure) to 'Overall Qual'. What do you conclude from the resulting curves?

In [None]:
# Plot stacked force plot.
shap.plots.force(
    
        base_value    = explainer.expected_value[0], # Mean prediction for the entire training data.
        shap_values   = shap_values,                 # SHAP values.
        features      = df_X_fraction,               # Training data.
        feature_names = df_X_fraction.columns        # variable names.
)

In [None]:
# ANSWERS

# 1 - The mean prediction (in my case 12.01) can be seen where the blue polygon hits the y-axis.

# 2 - We observe that as 'Overall Qual' increases, the SHAP value for Overall Qual increases.
# In other words, houses with higher 'Overall Qual' tend to have higher predicted sale prices.

### Exercise H - Mean SHAP

This next plot will tell us which variables are most important. For each variable, we calculate the mean SHAP value across all observations. Specifically, we take the mean of the absolute values as we do not want positive and negative values to offset each other. There is one bar for each variable. Copy/paste the content from the cell below to your notebook. The questions refer to this code cell and to the resulting plot.

1 - Run the cell. What do you conclude from the resulting chart?

In [None]:
# Plot Mean SHAP.
shap.plots.bar(

    shap_values = shap.Explanation(
        base_values   = explainer.expected_value[0], # Mean prediction for the entire training data.
        values        = shap_values,                 # Subset of shap values.
        data          = df_X_fraction,               # Subset of training data.
        feature_names = df_X_fraction.columns        # variable names.
))



In [None]:
# ANSWERS

# 1 - Variables that have made large positive/negative contributions will have a large mean SHAP value.
# In other words, these are the variables that have had a significant impact on the model’s predictions.
# In this sense, this plot can be used in the same way as a variable importance plot.
# Overall Qual is the biggest explainer of the sale price, by far.
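
The quantity shown in the bar chart can also be computed by hand. The sketch below uses a small matrix of made-up SHAP values (4 observations, 3 variables) purely for illustration; in the notebook you would apply the same two lines to the `shap_values` array and `df_X_fraction.columns`.

```python
import numpy as np

feature_names = ["Overall Qual", "Gr Liv Area", "Year Built"]

# Made-up SHAP values for illustration: rows are observations, columns are variables.
demo_shap_values = np.array([
    [ 0.30, -0.10,  0.02],
    [-0.20,  0.05, -0.01],
    [ 0.10,  0.15,  0.03],
    [-0.40, -0.10, -0.02],
])

# Plain mean: positive and negative contributions offset each other.
mean_signed = demo_shap_values.mean(axis=0)

# Mean of absolute values: the quantity shown per variable in the bar chart.
mean_abs = np.abs(demo_shap_values).mean(axis=0)

# Variables ordered from most to least important.
ranking = [feature_names[i] for i in np.argsort(mean_abs)[::-1]]

print(mean_signed, mean_abs, ranking)
```

For the first variable the plain mean is close to zero even though its individual contributions are the largest, which is why the absolute values are averaged instead.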

### Exercise I - Beeswarm Plot

Next, we have the single most useful plot. The beeswarm visualises all of the SHAP values. Copy/paste the content from the cell below to your notebook. The questions refer to this code cell and to the resulting plot.

1 - Run the cell. What do you conclude from the resulting chart?

In [None]:
# Plot beeswarm plot.
shap.plots.beeswarm(
    
    shap.Explanation(
        base_values   = explainer.expected_value[0], # Mean prediction for the entire training data.
        values        = shap_values,                 # SHAP values.
        data          = df_X_fraction,               # Training data.
        feature_names = df_X_fraction.columns        # Variable names.
))

In [None]:
# ANSWERS

# 1 - On the y-axis, the values are grouped by variable. For each group, the colour of the points is
# determined by the variable value (i.e. higher variable values are redder).
# We can also start to understand the nature of these relationships. For 'Overall Qual', notice how as
# the variable value increases the SHAP values increase. We saw a similar relationship in the stacked
# force plot. It tells us that larger values for 'Overall Qual' will lead to a higher predicted Sale Price.
