# 05: Shapley Values 

See sections in Molnar's book on [Shapley Values](https://christophm.github.io/interpretable-ml-book/shapley.html) and [SHAP](https://christophm.github.io/interpretable-ml-book/shap.html) for background information. For actual use, see the [shap package](https://github.com/slundberg/shap).

## Imports

In [1]:
from dataclasses import dataclass, field
from itertools import product
import random

import altair as alt
import numpy as np
import pandas as pd
import pmlb

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

In [2]:
# If you're running this code locally, then you can uncomment this to automatically
# save the chart data in files, rather than including the data in the spec. 

!mkdir -p data
alt.data_transformers.enable('json', prefix='data/altair-data')

DataTransformerRegistry.enable('json')

## Data Preparation and Modeling

For this lab, we'll be using a bike rental dataset. This is a regression dataset where the goal is to predict the number of bikes that were rented at a particular day and time. This dataset is from the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). The data processing was guided by [Molnar's IML book](https://christophm.github.io/interpretable-ml-book/bike-data.html).

In [3]:
df = pd.read_csv('https://gist.githubusercontent.com/DanielKerrigan/f324b392dc9a58d8bd8f8d79e1101a12/raw/c3b4760c9facfac26bcab2cd7465c4cab88ef304/bike-hour.csv')

To reduce computation times, we'll drop some of the columns.

In [4]:
df.drop(columns=['yr', 'mnth', 'atemp'], inplace=True)

In [5]:
df.head()

Unnamed: 0,days_since_2011,season,hr,holiday,weekday,workingday,weathersit,temp,hum,windspeed,cnt
0,0,1,0,0,6,0,1,0.24,0.81,0.0,16
1,0,1,1,0,6,0,1,0.22,0.8,0.0,40
2,0,1,2,0,6,0,1,0.22,0.8,0.0,32
3,0,1,3,0,6,0,1,0.24,0.75,0.0,13
4,0,1,4,0,6,0,1,0.24,0.75,0.0,1


We'll use the data from 2011 for training.

In [6]:
df_train = df[df['days_since_2011'] < 365].copy()

In [7]:
X_train = df_train.drop(columns=['cnt'])
y_train = df_train['cnt'].values

Next we'll train a random forest model on this dataset. We'll do a grid search with cross-validation to find reasonable hyperparameters.

In [8]:
param_grid = {
    'n_estimators': [10],
    'bootstrap': [True],
    'max_features': ['sqrt', 1.0],
    'max_depth': [6, 12],
    'min_samples_split': [2, 8],
}

cv = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, scoring='neg_mean_squared_error', n_jobs=-1)

cv.fit(X_train, y_train)

In [9]:
cv.best_params_

{'bootstrap': True,
 'max_depth': 12,
 'max_features': 1.0,
 'min_samples_split': 2,
 'n_estimators': 10}

In [10]:
cv.best_score_

-3143.4512307274085

In [11]:
model = cv.best_estimator_

## Shapley Implementation

**Exercise 1:**

First, we will write a function to approximately calculate a feature's Shapley value for a given instance. Our algorithm will be similar to the one that Molnar details in [Section 9.5.3.3](https://christophm.github.io/interpretable-ml-book/shapley.html#estimating-the-shapley-value).

*1a)* Select a random instance from the dataframe `df`. [df.sample()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html#pandas-dataframe-sample) is useful for this.

*1b)* Select a random set of features, not including the feature that we are calculating the shapley value for (`feature`). [random.randrange()](https://docs.python.org/3/library/random.html#random.randrange) and [random.sample()](https://docs.python.org/3/library/random.html#random.sample) are useful for this.

*1c)* Make a copy of the instance `x`. For the features randomly selected in 1b, replace the value in `x` with the value in the random instance from 1a.

*1d)* Make a copy of the instance from 1c. Replace the value of `feature` with the value from the random instance from 1a.

*1e)* Get the predicted values of the instances from 1c and 1d. Calculate the difference between the predictions.

In [12]:
'''
df - dataframe containing the entire dataset
x - dataframe containing a single instance
model - trained sklearn model
feature - the name of the feature that we are computing the Shapley value for
iterations - number of iterations to run for
'''
def calculate_shapley_value(df, x, model, feature, iterations):
    # keep track of the total from the summation
    value = 0
    
    # list of features besides the one we are computing the shapley value for
    other_features = [f for f in df.columns if f != feature]

    for _ in range(iterations):
        # 1a: get a random instance from the df
        random_instance = df.sample()
        
        # 1b: select a random set of features
        num_features_to_change = random.randrange(len(other_features))
        features_to_change = random.sample(other_features, num_features_to_change)
        
        # 1c: make a copy of the instance x for the randomly selected features,
        # replace the value of that feature in x with the value in random_instance
        z_original = x.copy()
        
        for f in features_to_change:
            z_original[f] = random_instance[f].values
            
        # 1d: make a copy of z_original. replace the value
        # of feature with the value in random_instance
        z_different = z_original.copy()
        z_different[feature] = random_instance[feature].values
        
        
        # 1e: get the predicted values for z_original and z_different.
        # calculate the difference between them
        pred_original = model.predict(z_original)[0]
        pred_different = model.predict(z_different)[0]
        difference = pred_original - pred_different
        
        value += difference
        
    # take the mean
    return value / iterations

In [13]:
calculate_shapley_value(X_train, X_train.iloc[[0]], model, 'hr', 50)

-50.68144163285584

The below `shapley_values` function calculates the shapley value of every feature for every instance in `df`. It returns a dataframe containing the shapley values.

In [14]:
def shapley_values(df, model, iterations):
    rows = []
    
    for i in range(df.shape[0]):
        x = df.iloc[[i]]
        
        row = {}
        
        for feature in df.columns:
            row[feature] = calculate_shapley_value(df, x, model, feature, iterations)
            
        rows.append(row)
        
    return pd.DataFrame(rows)

In [15]:
subset = X_train.sample(100).reset_index(drop=True)

In [16]:
shapley = shapley_values(subset, model, 50)

In [17]:
shapley

Unnamed: 0,days_since_2011,season,hr,holiday,weekday,workingday,weathersit,temp,hum,windspeed
0,-2.815161,4.560136,-26.083873,-0.014000,1.951663,-10.283737,-0.223514,-13.120838,-56.602621,-7.636265
1,-1.688119,-3.013772,-90.001023,0.000000,2.594263,-3.996067,-8.109947,-21.813887,-3.176943,1.391765
2,-9.996812,-5.480538,-112.950096,0.200667,-1.049944,-7.387078,4.890793,-5.842176,3.385598,1.322995
3,-10.889281,-5.738438,-72.152826,0.022667,4.253667,-2.826117,7.235033,-6.833868,0.279517,-1.518159
4,11.028830,9.194761,-11.366552,-1.750558,-8.041960,7.815798,8.693378,46.460001,5.711070,-1.176574
...,...,...,...,...,...,...,...,...,...,...
95,3.578285,3.943863,-25.104553,0.000000,2.531870,-4.565778,1.186392,28.098604,-4.642077,-2.322825
96,0.799643,5.934853,-155.542659,0.000000,-0.147764,-0.151071,1.477086,6.559958,-5.303210,0.197507
97,-16.872682,-5.892395,-92.019163,0.000000,-1.718797,-1.322607,4.056109,-24.621600,0.226264,-0.431661
98,7.126300,-8.595986,61.774908,0.148000,1.255995,1.413424,2.810180,-23.307894,5.295075,0.276501


## Visualizations

### Feature Importance Bar Chart

**Exercise 2:** Create a bar chart that shows the feature importance of each feature based on the shapley values.

*2a)* Calculate the mean absolute values for each feature in `shapley`. The [mean](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) and [abs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.abs.html) functions will be useful. We will want a dataframe that has two columns: one for the feature and one for the value.

In [18]:
feature_importance = pd.DataFrame(shapley.abs().max()).reset_index()
feature_importance.columns = ['feature', 'value']

feature_importance

Unnamed: 0,feature,value
0,days_since_2011,55.663289
1,season,44.969222
2,hr,277.460272
3,holiday,4.808652
4,weekday,26.24438
5,workingday,88.818758
6,weathersit,87.774509
7,temp,101.67717
8,hum,56.602621
9,windspeed,12.778257


*2b)* Plot the feature importances in a bar chart.

In [19]:
alt.Chart(feature_importance).mark_bar().encode(
    y=alt.Y('feature', sort='-x'),
    x=alt.X('value', title='mean absolute shapley value')
)

  for col_name, dtype in df.dtypes.iteritems():


### Dependence Scatter Plot

**Exercise 3:** For a given feature, we can create a scatterplot that shows the relationship between an instance's value for that feature (x-axis) and its shapley value for that feature (y-axis). This works as an alternative to PDPs. Complete the function below to create a dependence plot for the given feature.

*3a)* Create a dataframe containing the feature values and shapley values for the given `feature`. This dataframe should have two columns: feature_value and shapley_value. Each row represents an instance.

*3b)* Return a scatterplot of the dataframe.

In [20]:
def plot_dependence(instances, shapley, feature):
    # 3a: create a dataframe containing the feature values and shapley values 
    dependence = pd.DataFrame({
        'feature_value': instances[feature],
        'shapley_value': shapley[feature]
    })
    
    # 3b: plot the values in a scatterplot
    return alt.Chart(dependence).mark_point().encode(
        x=alt.X('feature_value', title=feature),
        y='shapley_value'
    )

In [21]:
plot_dependence(subset, shapley, 'hr')

In [22]:
plot_dependence(subset, shapley, 'temp')

### Summary Strip Plot

We can create a strip plot that shows every individual shapley value. There will be one row for each feature. In each row, there will be one dot for each instance. The x position of each dot will encode the instance's shapley value. The color of each dot will encode the instance's feature value. We will jitter the dots in the y direction to reduce overlap.

First, we need to transform our data to get it into a dataframe that looks like the table below. In this dataframe, there will be one row for every feature in every instance.

| feature         | shapley_value | feature_value |
|-----------------|---------------|---------------|
| days_since_2011 | 135.0         | 6.453387      |
| days_since_2011 | 198.0         | 2.502707      |
| days_since_2011 | 248.0         | 16.331289     |

We can use the [melt](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html) function in pandas to help with this.

In [23]:
subset.head()

Unnamed: 0,days_since_2011,season,hr,holiday,weekday,workingday,weathersit,temp,hum,windspeed
0,251,3,10,0,5,1,2,0.62,1.0,0.0
1,97,2,2,0,5,1,3,0.34,0.76,0.1642
2,96,2,1,0,4,1,1,0.42,0.47,0.1343
3,76,1,0,0,5,1,1,0.42,0.58,0.2836
4,226,3,21,0,1,1,1,0.66,0.65,0.194


In [24]:
values = subset.melt()
values.columns = ['feature', 'feature_value']
values.head()

Unnamed: 0,feature,feature_value
0,days_since_2011,251.0
1,days_since_2011,97.0
2,days_since_2011,96.0
3,days_since_2011,76.0
4,days_since_2011,226.0


We can do the same thing for the dataframe that contains shapley values.

In [25]:
shapley_values = shapley.melt()
shapley_values.head()

Unnamed: 0,variable,value
0,days_since_2011,-2.815161
1,days_since_2011,-1.688119
2,days_since_2011,-9.996812
3,days_since_2011,-10.889281
4,days_since_2011,11.02883


Then we can combine our newly created dataframes.

In [26]:
values['shapley_value'] = shapley_values['value']
values.head()

Unnamed: 0,feature,feature_value,shapley_value
0,days_since_2011,251.0,-2.815161
1,days_since_2011,97.0,-1.688119
2,days_since_2011,96.0,-9.996812
3,days_since_2011,76.0,-10.889281
4,days_since_2011,226.0,11.02883


**Exercise 4:** Using the `values` dataframe, create the strip plot. See the end of 01_altair_questions.ipynb for an example of a strip plot with jittering.

In [27]:
sorted_features = feature_importance.sort_values(by='value', ascending=False)['feature'].values

In [28]:
alt.Chart(values).mark_circle().encode(
    x='shapley_value',
    row=alt.Row('feature', sort=sorted_features, spacing=0, header=alt.Header(labelAngle=0, labelAlign='left')),
    y=alt.Y('jitter:Q', axis=None),
    color=alt.Color('feature_value', scale=alt.Scale(scheme='viridis'), title=None)
).properties(
    height=50,
    width=700
).transform_calculate(
    jitter='random()'
).resolve_scale(
    color='independent'
).configure_legend(
    gradientLength=50
)

  for col_name, dtype in df.dtypes.iteritems():
