## SHAP Practical

In this practical we look at SHAP explanations both globally and locally.

First we'll need a few imports:

In [None]:
import pandas as pd
import numpy as np
from lightgbm.sklearn import LGBMClassifier
from sklearn.model_selection import train_test_split

import shap
shap.initjs() # need to load JS vis in the notebook for shap to display

## Introduction to the Practical

### The Dataset

For this practical we will use the Adult dataset (more information [here](https://archive.ics.uci.edu/ml/datasets/adult)). This is a binary classification task: predicting whether somebody's income exceeds $50,000 per year based on census data (the label is `True` or `False`). This dataset is built into the `shap` package; it has already been imported for you and split into `train`, `test` and `display` sets, which are loaded below.

In [None]:
# Load train set
df_train = pd.read_csv('data/train.csv')

# Split features and Target variable
X_train = df_train.drop(columns=['Target'])
y_train = df_train['Target']

# Load test set
df_test = pd.read_csv('data/test.csv')

# Split features and Target variable
X_test = df_test.drop(columns=['Target'])
y_test = df_test['Target']

# Display features hold human-readable values and are used in the SHAP plots to present the features nicely
df_display = pd.read_csv('data/display.csv')

# Split features and Target variable
X_display = df_display.drop(columns=['Target'])
y_display = df_display['Target']

### The Model

We consider a LightGBM (tree-based) model in this practical. Run the following code snippet to train the model and test its performance.

In [None]:
# train a LightGBM model
model = LGBMClassifier(n_estimators=500)
model.fit(X_train, y_train)

# compute test accuracy
print("LightGBM test accuracy: %0.3f" % model.score(X_test, y_test))

## Part 1: Setting up our SHAP Explainer

SHAP has a generic explainer that works for any model, called `KernelExplainer`. It needs only a prediction function and some background data (typically a sample of the training set) to calibrate itself. This fully model-agnostic method is slow and only computes an approximation of the SHAP values, so `shap` also provides specialised explainers that are optimised for specific classes of models.
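
To see why a fully model-agnostic approach is expensive, here is a minimal sketch of exact Shapley values computed by brute force over every feature coalition. It is not how `KernelExplainer` is implemented: `exact_shapley`, `toy_model` and `background` are hypothetical names, and "missing" features are simply replaced by a single background point, which is a simplifying assumption.

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, x, background):
    """Brute-force Shapley values for one sample x.

    Features outside a coalition take their value from a single background
    point (a simplifying assumption made for illustration only).
    """
    n = len(x)

    def value(coalition):
        # Use x for features in the coalition and the background elsewhere
        mixed = [x[j] if j in coalition else background[j] for j in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley weight of a coalition of this size
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical toy model: linear terms plus an interaction between features 0 and 1
def toy_model(v):
    return 2.0 * v[0] + v[1] + v[0] * v[1] + 0.5 * v[2]

x = [1.0, 2.0, 3.0]
background = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, background)
base = toy_model(background)

print("Shapley values:", phi)
print("base + sum(phi):", base + sum(phi), "model output:", toy_model(x))
```

Each feature requires a pass over all `2 ** (n - 1)` coalitions of the other features, so the cost doubles with every extra feature. This is why a generic explainer has to sample coalitions and approximate, while model-specific explainers can exploit the model's structure.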

The `TreeExplainer` is optimised for tree-based models, and is what we will use here with LightGBM. `TreeExplainer` also has the advantage that it can "calibrate" itself using properties of the trees directly, instead of estimating values from an external dataset. This means that we only need to provide a model and `shap` does the rest.

**Exercise 1:** create a SHAP tree explainer object from the model by running the command `explainer = shap.TreeExplainer(model)`.

Now that we have an explainer object, we can start using it to compute SHAP values on our data.

**Exercise 2:** compute the SHAP values on the test set by running the command `shap_values = explainer.shap_values(X_test)`.

`shap_values` is an array; we can check its shape by running the code snippet below.

In [None]:
shap_values.shape

We see that `shap_values` is an array containing the SHAP value associated with each of the 8141 samples and 12 features.
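
If the shape you see is different, the installed `shap` version may be the cause: some versions are known to return a list with one array per class for binary classifiers, and others an array with a trailing class axis. The helper below is a hedged sketch for reducing any of these to a 2-D array for class 1. `positive_class_shap` is a hypothetical name, the assumed layouts are stated in the comments, and the demo arrays are stand-ins rather than real SHAP output.

```python
import numpy as np

def positive_class_shap(values):
    """Return a 2-D (n_samples, n_features) array for the positive class."""
    if isinstance(values, list):
        # Assumed layout: one array per class, with class 1 last
        return np.asarray(values[-1])
    values = np.asarray(values)
    if values.ndim == 3:
        # Assumed layout: (n_samples, n_features, n_classes)
        return values[:, :, -1]
    # Already 2-D: nothing to do
    return values

# Stand-in arrays used only to exercise the three cases
demo = np.arange(6, dtype=float).reshape(3, 2)
as_list = [-demo, demo]
as_3d = np.stack([-demo, demo], axis=-1)

print(positive_class_shap(as_list).shape)
print(positive_class_shap(as_3d).shape)
print(positive_class_shap(demo).shape)
```

In the practical this would be used as `shap_values = positive_class_shap(shap_values)` before indexing rows with `shap_values[i, :]`.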

If you use the model-agnostic `KernelExplainer` with the model's `predict_proba` function, those values will be expressed as probabilities. With the `TreeExplainer`, though, the unit is specific to the library/model you are using. For LightGBM's classifier that is `log_odds`, and we can check this with the following command:

In [None]:
explainer.model.tree_output

Now let's look at the `expected_value`: it corresponds to the average output of our model over the training data. In our case it is expressed in log-odds of the predicted probability of class 1:

In [None]:
# manually computing the average log odds on the training set
y_pred = model.predict_proba(X_train) # predict_proba outputs two values (probability for each class)
average_log_odds = np.mean(np.log(y_pred[:, 1] / (1 - y_pred[:, 1])))
print("Average log odds: ", average_log_odds)

# confirming this is equal to the expected_value generated by the shap explainer
base_value = explainer.expected_value
print("SHAP base value: ", base_value)
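
Since the base value and the SHAP values are in log-odds, it helps to be able to move between log-odds and probabilities. The sketch below uses hypothetical probabilities (not outputs of the Adult model) to show the two conversions, and to show that averaging in log-odds space is not the same as averaging probabilities.

```python
import numpy as np

def log_odds(p):
    """Convert a probability (or array of probabilities) to log-odds."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical predicted probabilities
probs = np.array([0.1, 0.5, 0.9])
z = log_odds(probs)
print("log-odds:", z)
print("recovered probabilities:", sigmoid(z))

# Averaging log-odds and then converting differs from averaging probabilities
skewed = np.array([0.01, 0.02, 0.9])
print("sigmoid(mean log-odds):", sigmoid(log_odds(skewed).mean()))
print("mean probability:", skewed.mean())
```

This is why the base value, once passed through the sigmoid, will generally not match the mean predicted probability on the training set.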

## Part 2: Local Explanations with SHAP

Now let's choose a sample to generate a local explanation for!

**Exercise 3:** change the value of `i` in the code snippet below to view different datapoints in the test set, along with their target labels and the probability of class 1 (`True`) that the model predicted. Find a point you want to know the local SHAP explanation for.

In [None]:
i = 0
sample = X_test.iloc[i]

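print(f"Sample true label: {y_test.iloc[i]}")  # positional lookup, matching X_test.iloc[i]
print(f"Predicted probability of class 1 (True) for sample: {model.predict_proba(X_test.iloc[[i]])[0, 1]}")  # one-row DataFrame keeps the feature names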
print('---------Sample---------')
print(sample)

**Exercise 4:** generate a plot displaying the local explanation of the chosen test point by running the command `shap.force_plot(base_value, shap_values[i,:], X_test.iloc[i,:])`. Re-run this command for different values of `i` to generate explanations for different points. Notice the base value (the average model output, in log-odds) and observe how different features push the output towards or away from the `True` class (the person earning more than $50,000).
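
The force plot relies on the additivity of SHAP values: the base value plus the sum of one sample's SHAP values gives the model's log-odds output for that sample. Below is a minimal sketch of that arithmetic; `explained_probability` is a hypothetical helper and the numbers are made up rather than taken from the Adult model.

```python
import numpy as np

def explained_probability(base_value, shap_row):
    """Probability implied by a log-odds base value plus one row of SHAP values."""
    total_log_odds = base_value + np.sum(shap_row)
    return 1.0 / (1.0 + np.exp(-total_log_odds))

# Hypothetical numbers: the contributions sum to 2.0 and cancel the base value
demo_base = -2.0
demo_row = np.array([1.5, -0.5, 1.0])
demo_prob = explained_probability(demo_base, demo_row)
print(demo_prob)  # 0.5, because the total log-odds is 0
```

Assuming `shap_values` is a 2-D array for class 1 and `base_value` is a scalar, `explained_probability(base_value, shap_values[i])` should be close to the predicted probability of class 1 printed in Exercise 3.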

## Part 3: Global Explanations with SHAP

Now we look at the SHAP explanations for the model overall.

**Exercise 5:** generate a scatter plot of the test set SHAP values by running the command `shap.summary_plot(shap_values, X_test)`. How would you interpret the result for Age? How does the result for Age look different to the result for Capital Loss? 

The result for Age shows a clear colour distinction: blue on the left (younger), meaning that being younger decreases the probability of the model outputting `True`, and red on the right (older), meaning that being older increases it. This is a different pattern to Capital Loss, where there is a blue block around 0 and red spread out in both directions elsewhere. This suggests that low values of Capital Loss (blue) have little impact on the model output, whereas high values (red) can affect the outcome either positively or negatively.

**Exercise 6:** run the command `shap.summary_plot(shap_values, X_display, plot_type="bar")` to generate a bar chart of the mean absolute SHAP value of each feature.
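
The quantity behind each bar is simple to compute by hand: take the absolute SHAP values and average them over the samples. The sketch below does this on a small made-up array; `mean_abs_importance` is a hypothetical helper and the numbers are illustrative, not results from the Adult model.

```python
import numpy as np

def mean_abs_importance(values, feature_names):
    """Rank features by their mean absolute SHAP value, largest first."""
    importance = np.abs(values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[j], float(importance[j])) for j in order]

# Made-up SHAP values: 3 samples, 2 features
demo_values = np.array([[ 0.5, -0.1],
                        [-0.7,  0.1],
                        [ 0.6, -0.1]])
demo_names = ["Age", "Capital Loss"]

ranking = mean_abs_importance(demo_values, demo_names)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking absolute values first matters: a feature whose contributions are large but of mixed sign would otherwise average out to roughly zero.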

## Part 4: Dependence Plots

SHAP dependence plots show the SHAP values for one particular feature for all samples in the dataset. The vertical dispersion of SHAP values at a single value of the feature is driven by interaction effects between that feature and the other features in the model. A second feature (represented by the colour) is chosen to highlight possible interactions between the two features.
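
The vertical spread can be reproduced on a toy model. The sketch below computes the exact Shapley value of one feature in a hypothetical two-feature model with an interaction term, holding that feature fixed while the other one varies. The names `shap_feature0`, `toy` and `baseline` are illustrative, and using a single baseline point for "missing" features is a simplifying assumption.

```python
def shap_feature0(f, x, b):
    """Exact Shapley value of feature 0 in a two-feature model, relative to baseline b."""
    alone = f(x[0], b[1]) - f(b[0], b[1])    # feature 0 added to the empty coalition
    joined = f(x[0], x[1]) - f(b[0], x[1])   # feature 0 added after feature 1
    return 0.5 * (alone + joined)

# Hypothetical model: a main effect of x0 plus an x0 * x1 interaction
def toy(x0, x1):
    return x0 + x0 * x1

baseline = (0.0, 0.0)

# Feature 0 is fixed at 1.0 while feature 1 varies
spread = [shap_feature0(toy, (1.0, x1), baseline) for x1 in (0.0, 1.0, 2.0)]
print(spread)  # [1.0, 1.5, 2.0]
```

All three points would sit at the same horizontal position in a dependence plot for feature 0, yet at different heights, and colouring by feature 1 would explain the difference.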

Recall that a positive SHAP value denotes an increased probability of class 1 (earning more than $50,000 per year).

Interpreting the dependence plot for Age, we see that, for lower ages, the SHAP values tend to be low. The graph curves upwards and eventually levels off. This clear trend (with the points clustered around a line) explains why Age has large importance in the model. The Relationship feature shows potential interactions, as we can see that the red points tend to be lower than the blue points.

In contrast, there is no clear trend to be seen in the Capital Loss dependence plot, explaining why it received a lower importance overall.

Run the following code snippet to generate the SHAP dependence plots for our model.

In [None]:
for name in X_test.columns:
    shap.dependence_plot(name, shap_values, X_test, display_features=X_display)