[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Explainability - SHAP

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

Remarks:

 - This notebook takes a long time to compute on Google Colab.

To Do List:
 - 


## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 29/10/2022 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/IntroductionMachineLearningSystemEngineers/ExplainabilityShap.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor #<! Similar to XGBoost
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Misc
import datetime
import math
import os
from platform import python_version
import random
import warnings
import yaml

# Typing
from typing import Tuple

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

In [None]:
# Configuration
%matplotlib inline

warnings.filterwarnings("ignore")

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme
sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
if runInGoogleColab:
    !pip install git+https://github.com/8080labs/ppscore.git
    !pip install --upgrade shap
    !pip install --upgrade xgboost
    !pip install --upgrade lightgbm

from lightgbm import LGBMClassifier, LGBMRegressor
from xgboost import XGBClassifier, XGBRegressor

# import ppscore as pps #<! See https://github.com/8080labs/ppscore -> pip install git+https://github.com/8080labs/ppscore.git
import shap
shap.initjs()

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Fixel Algorithms Packages


In [None]:
# Parameters

numSplits = 5
numShapSamples = 100

# Data
csvFilePath = r'../DataSets/winequality-red.csv'
csvFileUrl  = r'https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/DataSets/winequality-red.csv'

In [None]:
# Auxiliary Functions

# From https://stackoverflow.com/questions/36728287
def PolynomialFeaturesLabels(input_feature_names, power, include_bias: bool = True):
    '''A wrapper around `PolynomialFeatures` of scikit-learn which generates labels for the output features.
    Given a labeled data frame, `PolynomialFeatures` outputs unlabeled data with potentially many unlabeled columns.

    Input:
    input_feature_names = The names of the input features (The x's not raised to any power).
    power = The order of the polynomial (The same value given to `PolynomialFeatures(power)`).
    include_bias = Whether to include the label of the constant term.

    Output:
    A list of labels of the output features, built using the `powers_` matrix of the fitted `PolynomialFeatures` object.
    '''
    poly = PolynomialFeatures(power)
    poly.fit(np.random.rand(1, len(input_feature_names)))
    powers_nparray = poly.powers_

    target_feature_names = []
    if include_bias:
        target_feature_names.append("Constant Term")
    for feature_distillation in powers_nparray[1:]:
        intermediary_label = ""
        final_label = ""
        for i in range(len(input_feature_names)):
            if feature_distillation[i] == 0:
                continue
            else:
                variable = input_feature_names[i]
                power = feature_distillation[i]
                intermediary_label = "%s^%d" % (variable,power)
                if final_label == "":         #If the final label isn't yet specified
                    final_label = intermediary_label
                else:
                    final_label = final_label + " x " + intermediary_label
        target_feature_names.append(final_label)
    return target_feature_names



## Case I - Linear Regression

$$ y = 7 + {x}_{1} - {2x}_{2} + {3x}_{3} - {3x}_{4} + {5.5x}_{5} + \epsilon $$

In [None]:
numSamples  = 1000
modelOrder  = 5
noiseStd    = 0.1
mX = 2 * (np.random.rand(numSamples, modelOrder) - 0.5) #<! Zero mean data

vY = 7 + 1 * mX[:, 0] - 2 * mX[:, 1] + 3 * mX[:, 2] - 3 * mX[:, 3] + 5.5 * mX[:, 4] + (noiseStd * np.random.randn(numSamples))


In [None]:
# Linear Regressor
oLS = LinearRegression().fit(mX, vY)
modelScore = oLS.score(mX, vY)

print(f'Model Score (Training): {modelScore}')


### SHAP Analysis

Since the model is linear and SHAP builds an additive model, we expect the results to be similar to the linear coefficients.

The SHAP values are usually calculated using a sub sample of the data, or a clustered version of it, as the background data.
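
For a linear model with independent features, the SHAP value of feature $i$ has the closed form ${\phi}_{i} = {\beta}_{i} \left( {x}_{i} - \mathbb{E} \left[ {x}_{i} \right] \right)$. The following is a minimal sketch of this relation using NumPy only. The names `LinearShap`, `vBeta` and `mBackground` are illustrative and are not part of the notebook or of the `shap` package:

```python
import numpy as np

def LinearShap(vBeta: np.ndarray, vX: np.ndarray, mBackground: np.ndarray) -> np.ndarray:
    # SHAP values of a linear model, assuming independent features:
    # phi_i = beta_i * (x_i - mean of feature i over the background data)
    return vBeta * (vX - np.mean(mBackground, axis = 0))

# Coefficients of Case I (The intercept only shifts the expected value)
vBeta       = np.array([1.0, -2.0, 3.0, -3.0, 5.5])
intercept   = 7.0
oRng        = np.random.default_rng(512)
mBackground = 2 * (oRng.random((1000, 5)) - 0.5) #<! Zero mean data, as in Case I
vX          = np.array([1.0, 1.0, 5.0, 1.0, 1.0])

vPhi      = LinearShap(vBeta, vX, mBackground)
baseValue = intercept + np.mean(mBackground @ vBeta) #<! E[f(x)]
modelOut  = intercept + vX @ vBeta                   #<! f(x)

print(f'Base value + sum of SHAP values: {baseValue + np.sum(vPhi):0.5f}')
print(f'Model output                   : {modelOut:0.5f}')
```

Since the data is (approximately) zero mean, each value is close to ${\beta}_{i} {x}_{i}$, namely the coefficient scaled by the value of the feature.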

In [None]:
# SHAP Model
# Building SHAP model without explicitly saying the model is linear

# oSHAP = shap.KernelExplainer(oLS.predict, mX) #<! All data, slowest, yet most accurate
# oSHAP = shap.KernelExplainer(oLS.predict, shap.kmeans(mX, 50)) #<! Clustering for smaller representation
oSHAP = shap.KernelExplainer(oLS.predict, shap.sample(mX, 100)) #<! Sub Sampling for a random choice from data

### SHAP Values

We'll analyze the SHAP values for the sample (Local Interpretability):

$$
\boldsymbol{x}^{\star} = \begin{bmatrix}1\\
1\\
5\\
1\\
1
\end{bmatrix}
$$

In [None]:
# Generate the Sample
vX = np.array([1, 1, 5, 1, 1])

In [None]:
# The prediction of the model for this sample:
print(f'The model prediction for `vX`: {oLS.predict(vX[:, np.newaxis].T)}')

In [None]:
# Compute Shapley values for vX:
vShapleValues = oSHAP.shap_values(vX)

# Display values
for ii in range(len(vX) + 1):
    if ii == 0:
        φ = oSHAP.expected_value
    else:
        φ = vShapleValues[ii - 1]
    print(f'φ_{ii} = {φ: 5.5f}')

Compare the values to the linear coefficients.  
Think of the SHAP values as the shift from the expected value, either increasing or decreasing the output.

In [None]:
shap.force_plot(oSHAP.expected_value, vShapleValues, feature_names = ['x_1', 'x_2', 'x_3', 'x_4', 'x_5'])

> One of the fundamental properties of Shapley values is that they always sum up to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models this means that SHAP values of all the input features will always sum up to the difference between baseline (expected) model output and the current model output for the prediction being explained. The easiest way to see this is through a waterfall plot that starts at our background prior expectation for a home price $\mathbb{E} \left[ f \left( \boldsymbol{x} \right) \right]$, and then adds features one at a time until we reach the current model output $f \left( x \right)$:

**Remark**: Read the waterfall plot from bottom up.
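
The additivity property can be verified by computing exact Shapley values with brute force enumeration of all coalitions. The following is a sketch under simplifying assumptions: the function `ExactShapley` and the toy `Model` are hypothetical, and features outside a coalition are replaced by a single baseline vector instead of being averaged over background data as `shap` does:

```python
from itertools import combinations
from math import factorial

def ExactShapley(hF, lX, lBase):
    # Exact Shapley values by enumerating all coalitions.
    # Features outside the coalition take the value of a single baseline
    # (A simplification of averaging over background data).
    numFeatures = len(lX)

    def Value(sS):
        return hF([lX[jj] if jj in sS else lBase[jj] for jj in range(numFeatures)])

    lPhi = []
    for ii in range(numFeatures):
        lOthers = [jj for jj in range(numFeatures) if jj != ii]
        phi = 0.0
        for kk in range(numFeatures): #<! Size of the coalition without feature ii
            weight = factorial(kk) * factorial(numFeatures - kk - 1) / factorial(numFeatures)
            for tS in combinations(lOthers, kk):
                phi += weight * (Value(set(tS) | {ii}) - Value(set(tS)))
        lPhi.append(phi)
    return lPhi

# Hypothetical model with an interaction term
def Model(lX):
    return 7 + lX[0] - 2 * lX[1] + 3 * lX[0] * lX[2]

lX    = [1.0, 1.0, 5.0]
lBase = [0.0, 0.0, 0.0]
lPhi  = ExactShapley(Model, lX, lBase)

print(f'Sum of Shapley values: {sum(lPhi)}')
print(f'f(x) - f(baseline)   : {Model(lX) - Model(lBase)}')
```

The enumeration visits $2^{n - 1}$ coalitions per feature, which is why practical explainers rely on sampling (`KernelExplainer`) or on the model structure (`TreeExplainer`).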

In [None]:
# The waterfall plot
shap.waterfall_plot(shap.Explanation(vShapleValues, oSHAP.expected_value, data = vX, feature_names = ['x_1', 'x_2', 'x_3', 'x_4', 'x_5']))

## Case II - Ensemble Tree Model (Regression)

In [None]:
# Generate / Load Data 

if os.path.isfile(csvFilePath):
    dfData = pd.read_csv(csvFilePath)
else:
    dfData = pd.read_csv(csvFileUrl)

In [None]:
dsY = dfData['quality']
dfX = dfData[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']]
dfX


In [None]:
dfX.info()

In [None]:
dfX.describe()

In [None]:
hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)
sns.histplot(x = dsY, stat = 'count', discrete = True, ax = hA)
hA.set_title('Wine Quality Histogram');

### Define the Pipeline

In [None]:
# The pipeline steps
pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesRegressor', GradientBoostingRegressor())])
# pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesRegressor', XGBRegressor())])
# pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesRegressor', LGBMRegressor())]) #<! Fastest
# pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesRegressor', RandomForestRegressor())])


In [None]:
# Grid of parameters
# Pay attention that the computational complexity is exponential!

# For GradientBoostingRegressor
dParamsGrid = {
    'PolyFeats__degree': [1, 2, 3], #<! Degree 1 keeps the original features ('passthrough' is not a valid degree)
    'EnsTreesRegressor__loss': ['squared_error', 'absolute_error'],
    'EnsTreesRegressor__n_estimators': [50, 100, 150],
    'EnsTreesRegressor__max_depth': [3, 5, 7],
}

# For XGBRegressor / LGBMRegressor
# dParamsGrid = {
#     'PolyFeats__degree': [1, 2, 3],
#     'EnsTreesRegressor__n_estimators': [50, 100, 150],
#     'EnsTreesRegressor__max_depth': [3, 5, 7],
# }

# For RandomForestRegressor
# dParamsGrid = {
#     'PolyFeats__degree': [1, 2, 3],
#     'EnsTreesRegressor__criterion': ['squared_error', 'absolute_error'],
#     'EnsTreesRegressor__n_estimators': [50, 100],
#     'EnsTreesRegressor__max_depth': [3, 5],
# }

In [None]:
# Split Data for Optimization of Parameters

iDataBatch = StratifiedKFold(n_splits = numSplits, shuffle = True, random_state = seedNum)

In [None]:
# Define the Grid Search (In practice, for many variables there are much better approaches: Random, Bayesian, etc...)
oGridSearch = GridSearchCV(pBoostTress, dParamsGrid, n_jobs = -1, cv = iDataBatch.split(dfX, dsY))

In [None]:
# Optimization of Hyper Parameters
oGridSearch.fit(dfX, dsY)
print(f'Best parameters (CV Score = {oGridSearch.best_score_})')
print(oGridSearch.best_params_)

In [None]:
optimalEst = oGridSearch.best_estimator_ #<! Basically the whole pipeline
optimalEst

In [None]:
# Show Prediction Results
vYEst = optimalEst.predict(dfX)

plt.plot(dsY, vYEst, '.r');

In [None]:
# Process Data by the Pipeline
# See https://stackoverflow.com/questions/62180278
dfXProcessed = optimalEst[:-1].transform(dfX)

In [None]:
# Extract the Estimator
ensTreeEst = optimalEst[-1]
ensTreeEst.get_params()

### Explain Results by SHAP

In [None]:
# oSHAP = shap.KernelExplainer(optimalEst.predict, shap.sample(dfX, numShapSamples)) #<! Slow, match any model
oSHAP = shap.TreeExplainer(ensTreeEst) #<! Optimized for trees, very fast!

In [None]:
vShapleValues = oSHAP.shap_values(dfXProcessed)

In [None]:
polyDeg = oGridSearch.best_params_['PolyFeats__degree']
lFeaturesName = PolynomialFeaturesLabels(dfX.columns.to_list(), polyDeg, include_bias = False)

**Variable Importance Plot — Global Interpretability**

A variable importance plot lists the most significant variables in descending order. The top variables contribute more to the model than the bottom ones and thus have high predictive power. 
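
The bar plot ranks the features by their mean absolute SHAP value over the samples. A minimal sketch of this ranking follows. The function `GlobalImportance` is illustrative and the SHAP matrix is made up for the example:

```python
import numpy as np

def GlobalImportance(mShap: np.ndarray, lNames: list) -> list:
    # Mean absolute SHAP value per feature, sorted in descending order
    vImportance = np.mean(np.abs(mShap), axis = 0)
    vIdx        = np.argsort(vImportance)[::-1]
    return [(lNames[ii], float(vImportance[ii])) for ii in vIdx]

# Toy SHAP matrix (Samples x Features), the values are made up for illustration
mShap = np.array([[ 0.5, -0.1,  0.0],
                  [-0.7,  0.2,  0.1],
                  [ 0.6, -0.3, -0.1]])
lRank = GlobalImportance(mShap, ['alcohol', 'sulphates', 'pH'])

for featName, featImportance in lRank:
    print(f'{featName:10s}: {featImportance:0.3f}')
```

The absolute value matters: a feature with large positive and negative contributions is important even if its contributions average to zero.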

In [None]:
shap.summary_plot(vShapleValues, dfXProcessed, plot_type = "bar", feature_names = lFeaturesName)

**Variable Importance Plot — Global Interpretability**

 * Feature importance: Variables are ranked in descending order.
 * Impact: The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.
 * Original value: Color shows whether that variable is high (in red) or low (in blue) for that observation.
 * Correlation: A high level of the “alcohol” content has a high and positive impact on the quality rating. The “high” comes from the red color, and the “positive” impact is shown on the X-axis. Similarly, we will say the “volatile acidity” is negatively correlated with the target variable.
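
The direction read from the colors of the plot can be approximated numerically by the correlation between the values of a feature and its SHAP values. This is a rough sketch with synthetic data. The function `ShapDirection` is hypothetical, and a single correlation coefficient hides any non monotonic behavior:

```python
import numpy as np

def ShapDirection(mX: np.ndarray, mShap: np.ndarray) -> np.ndarray:
    # Pearson correlation between the values of each feature and its SHAP values
    numFeatures = mX.shape[1]
    return np.array([np.corrcoef(mX[:, ii], mShap[:, ii])[0, 1] for ii in range(numFeatures)])

oRng = np.random.default_rng(0)
mX   = oRng.standard_normal((200, 2))
# Synthetic SHAP values: Feature 0 pushes the prediction up, feature 1 pushes it down
mShap = np.column_stack((0.8 * mX[:, 0], -0.5 * mX[:, 1]))
vDir  = ShapDirection(mX, mShap)

print(vDir)
```

A positive value matches red dots on the right side of the plot (like "alcohol"), a negative value matches red dots on the left side (like "volatile acidity").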

In [None]:
shap.summary_plot(vShapleValues, dfXProcessed, feature_names = lFeaturesName)

**SHAP Dependence Plot — Global Interpretability**

A partial dependence plot shows the marginal effect of one or two features on the predicted outcome of a machine learning model.  
It tells whether the relationship between the target and a feature is linear, monotonic or more complex.
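
The marginal effect described above can be sketched by fixing a feature to a grid value for all samples and averaging the predictions of the model. The function `PartialDependence` and the toy `Model` are hypothetical. The SHAP dependence plot is related, yet it scatters the SHAP value of a feature against the value of the feature per sample:

```python
import numpy as np

def PartialDependence(hF, mX: np.ndarray, featIdx: int, vGrid: np.ndarray) -> np.ndarray:
    # For each grid value: Fix the feature for all samples and average the predictions
    vPd = np.zeros(len(vGrid))
    for ii, gridVal in enumerate(vGrid):
        mXMod             = mX.copy()
        mXMod[:, featIdx] = gridVal
        vPd[ii]           = np.mean(hF(mXMod))
    return vPd

# Hypothetical model: Quadratic in feature 0, linear in feature 1
def Model(mX: np.ndarray) -> np.ndarray:
    return mX[:, 0] ** 2 + 2 * mX[:, 1]

oRng  = np.random.default_rng(0)
mX    = oRng.standard_normal((500, 2))
vGrid = np.linspace(-2, 2, 5)
vPd   = PartialDependence(Model, mX, 0, vGrid)

print(vPd - vPd[2]) #<! Relative to the value at 0, recovers the quadratic shape
```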


In [None]:
shap.dependence_plot(lFeaturesName[0], vShapleValues, dfXProcessed, feature_names = lFeaturesName)

**Individual SHAP Value Plot — Local Interpretability**

The explainability for any individual observation is the most critical step to convince your audience to adopt your model.
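
A local explanation can also be given as text by listing the largest contributions for a single observation. A minimal sketch follows. The function `TopContributions`, the SHAP values and the base value are made up for illustration:

```python
import numpy as np

def TopContributions(vShap: np.ndarray, lNames: list, baseValue: float, numTop: int = 3):
    # Sort the features of a single observation by the magnitude of their SHAP values
    vIdx = np.argsort(np.abs(vShap))[::-1][:numTop]
    lTop = [(lNames[ii], float(vShap[ii])) for ii in vIdx]
    # The prediction is the base value plus the sum of all SHAP values
    return lTop, float(baseValue + np.sum(vShap))

vShap  = np.array([-0.40, 0.15, 0.05, -0.02])
lNames = ['alcohol', 'sulphates', 'pH', 'density']
lTop, predVal = TopContributions(vShap, lNames, baseValue = 5.6)

for featName, featShap in lTop:
    print(f'{featName:10s}: {featShap:+0.2f}')
print(f'Prediction: {predVal:0.2f}')
```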

In [None]:
sampleIdx = 5


shap.force_plot(oSHAP.expected_value, vShapleValues[sampleIdx], feature_names = lFeaturesName)

# Why does the alcohol drive this to the left? Think about the mean value...


### Summary of SHAP

It is helpful to remember the following points:
 * Each feature has a SHAP value contributing to the prediction.
 * The final prediction = the average prediction + the sum of the SHAP values of all features.
 * The SHAP value of a feature can be positive or negative.
 * If a feature is positively correlated to the target, a value higher than its own average will contribute positively to the prediction.
 * If a feature is negatively correlated to the target, a value higher than its own average will contribute negatively to the prediction.


 **Remark**: The SHAP values do not identify causality, which is better identified by experimental design or similar approaches.

## Case III - Ensemble Tree Model (Classification)

In [None]:
# The pipeline steps
# pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesClassifier', GradientBoostingClassifier())])
# pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesClassifier', XGBClassifier())])
pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesClassifier', LGBMClassifier())]) #<! Fastest
# pBoostTress = Pipeline(steps = [('Scaler', StandardScaler()), ('PolyFeats', PolynomialFeatures(include_bias = False)), ('EnsTreesClassifier', RandomForestClassifier())])

In [None]:
# Grid of parameters
# Pay attention that the computational complexity is exponential!

# For GradientBoostingClassifier
# dParamsGrid = {
#     'PolyFeats__degree': [1, 2, 3],
#     'EnsTreesClassifier__loss': ['log_loss', 'deviance'],
#     'EnsTreesClassifier__n_estimators': [50, 100, 150],
#     'EnsTreesClassifier__max_depth': [3, 5, 7],
# }

# For LGBMClassifier
dParamsGrid = {
    'PolyFeats__degree': [1, 2, 3], #<! Degree 1 keeps the original features ('passthrough' is not a valid degree)
    'EnsTreesClassifier__n_estimators': [50, 100, 150],
    'EnsTreesClassifier__max_depth': [3, 5, 7],
}

# For RandomForestClassifier
# dParamsGrid = {
#     'PolyFeats__degree': [1, 2, 3],
#     'EnsTreesClassifier__criterion': ['gini', 'entropy'],
#     'EnsTreesClassifier__n_estimators': [50, 100],
#     'EnsTreesClassifier__max_depth': [3, 5],
# }

In [None]:
# Define the Grid Search (In practice, for many variables there are much better approaches: Random, Bayesian, etc...)
oGridSearch = GridSearchCV(pBoostTress, dParamsGrid, n_jobs = -1, cv = iDataBatch.split(dfX, dsY), scoring = 'r2')
# oGridSearch = GridSearchCV(pBoostTress, dParamsGrid, n_jobs = -1, cv = iDataBatch.split(dfX, dsY))

In [None]:
# Optimization of Hyper Parameters
oGridSearch.fit(dfX, dsY)


In [None]:
print(f'Best parameters (CV Score = {oGridSearch.best_score_})')
print(oGridSearch.best_params_)

In [None]:
optimalEst = oGridSearch.best_estimator_ #<! Basically the whole pipeline
optimalEst

In [None]:
# Show Prediction Results
vYEst = optimalEst.predict(dfX)

confMatrix = confusion_matrix(dsY, vYEst, labels = optimalEst[-1].classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = confMatrix, display_labels = optimalEst[-1].classes_)
disp.plot()

In [None]:
plt.plot(dsY, vYEst, '.r')

In [None]:
# Process Data by the Pipeline
# See https://stackoverflow.com/questions/62180278
dfXProcessed = optimalEst[:-1].transform(dfX)

In [None]:
# Extract the Estimator
ensTreeEst = optimalEst[-1]
ensTreeEst.get_params()

In [None]:
# oSHAP = shap.KernelExplainer(optimalEst.predict, shap.sample(dfX, numShapSamples)) #<! Slow, match any model
oSHAP = shap.TreeExplainer(ensTreeEst) #<! Optimized for trees, very fast! (Currently SHAP doesn't support this)

In [None]:
vShapleValues = oSHAP.shap_values(dfXProcessed)

In [None]:
polyDeg = oGridSearch.best_params_['PolyFeats__degree']
lFeaturesName = PolynomialFeaturesLabels(dfX.columns.to_list(), polyDeg, include_bias = False)

In [None]:
shap.summary_plot(vShapleValues, dfXProcessed, plot_type = "bar", feature_names = lFeaturesName, class_names = optimalEst[-1].classes_)

**?**: How can we improve the regressor in this case?  
**!**: Think about the output values.