# Introduction to SHAP for explaining regression models
## CHAPTER 06 - *Introduction to model interpretability using SHAP*

From **Applied Machine Learning Explainability Techniques** by [**Aditya Bhattacharya**](https://www.linkedin.com/in/aditya-bhattacharya-b59155b6/), published by **Packt**

### Objective

In this notebook, let us get familiar with the SHAP (SHapley Additive exPlanations) framework for explaining regression models, based on the concepts discussed in Chapter 6 - Introduction to model interpretability using SHAP.
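
To make the "Shapley" and "additive" parts of the name concrete, here is a minimal sketch of exact Shapley values computed by brute-force subset enumeration. It is not how the `shap` library computes its estimates; `exact_shapley`, `toy_model`, and the zero baseline are hypothetical names and choices made only for illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating all feature subsets.

    Features outside a coalition are replaced by their baseline value.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)

    def value(subset):
        # Start from the baseline and switch on the features in the coalition
        z = np.array(baseline, dtype=float)
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Hypothetical toy regression function with one interaction term
def toy_model(z):
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


x = np.array([1.0, 2.0, 3.0])
baseline = np.array([0.0, 0.0, 0.0])
phi = exact_shapley(toy_model, x, baseline)

# Additivity: the contributions sum to f(x) - f(baseline)
print(phi, phi.sum(), toy_model(x) - toy_model(baseline))
```

The enumeration grows exponentially with the number of features, so it is only practical for toy examples like this one.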

### Installing the modules

Install the following libraries in Google Colab or your local environment, if not already installed.

In [None]:
!pip install --upgrade pandas numpy matplotlib seaborn scikit-learn shap xgboost lime

### Loading the modules

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns

import shap
import lime.lime_tabular
from lime import submodular_pick
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder # For transforming categories to integer labels

### About the data

**Red Wine Quality Dataset - Kaggle**

- Original Source - https://archive.ics.uci.edu/ml/datasets/wine+quality
- Kaggle Source - https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009 


### Loading the data

In [2]:
# We will read the training data
data = pd.read_csv('dataset/winequality-red.csv')
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [3]:
data.shape

(1599, 12)

In [4]:
data.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


### Training the model

In [6]:
# Dropping all irrelevant columns
data.drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'], inplace = True)
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [7]:
# Filling missing values with 0 (no rows are dropped)
data.fillna(0,inplace=True)
data.shape

(891, 8)
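
Note that `fillna(0)` replaces missing entries instead of removing rows, whereas `dropna()` would remove them. A small sketch of the difference, using a hypothetical toy frame rather than the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 71.28, np.nan]})

filled = toy.fillna(0)   # keeps every row; each NaN becomes 0
dropped = toy.dropna()   # keeps only the rows without any NaN

print(filled.shape, dropped.shape)
```

Filling with 0 keeps the sample size intact, but it also introduces an artificial value (for example, an age of 0) that the model will treat as real.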

In [8]:
# Label Encoding features 
categorical_feat = ['Sex']

# Using label encoder to transform string categories to integer labels
le = LabelEncoder()
for feat in categorical_feat:
    data[feat] = le.fit_transform(data[feat]).astype('int')
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,S
1,1,1,0,38.0,1,0,71.2833,C
2,1,3,0,26.0,0,0,7.925,S
3,1,1,0,35.0,1,0,53.1,S
4,0,3,1,35.0,0,0,8.05,S
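
To see which integer stands for which category, the fitted encoder can be inspected through its `classes_` attribute. A minimal sketch on a hypothetical toy list (`le_demo` and `mapping` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["male", "female", "female", "male"])

# classes_ holds the original categories in sorted order;
# the position of each category is its integer code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original labels
print(le_demo.inverse_transform([0, 1]))
```

This matches the table above, where 'male' is encoded as 1 and 'female' as 0.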


In [9]:
# One-Hot Encoding Categorical features
data = pd.get_dummies(data, columns=['Embarked'])
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_0,Embarked_C,Embarked_Q,Embarked_S
0,0,3,1,22.0,1,0,7.25,0,0,0,1
1,1,1,0,38.0,1,0,71.2833,0,1,0,0
2,1,3,0,26.0,0,0,7.925,0,0,0,1
3,1,1,0,35.0,1,0,53.1,0,0,0,1
4,0,3,1,35.0,0,0,8.05,0,0,0,1
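
The `Embarked_0` column in the output above is most likely a side effect of the earlier `fillna(0)` step: the filled-in 0 is treated as one more category. A small sketch on a hypothetical toy column reproduces the effect:

```python
import pandas as pd

# Hypothetical toy column; dtype=object mirrors a string column with a missing entry
embarked = pd.Series(["S", "C", None, "Q"], dtype=object)
toy = pd.DataFrame({"Embarked": embarked}).fillna(0)

# The filled-in 0 becomes its own category and therefore its own dummy column
dummies = pd.get_dummies(toy, columns=["Embarked"])
print(sorted(dummies.columns))
```

If that extra column is not wanted, one option is to fill missing ports with an explicit label (or the most frequent port) before encoding.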


In [10]:
features = data.drop(columns=['Survived'])
labels = data['Survived']
# Dividing the data into training-test set with 80:20 split ratio
x_train,x_test,y_train,y_test = train_test_split(features,labels,test_size=0.2, random_state=123)

In [11]:
model = XGBClassifier(n_estimators = 500)
model.fit(x_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=500, n_jobs=12,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [12]:
model.score(x_test, y_test)

0.8435754189944135
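
For scikit-learn-style classifiers, `score` returns the mean accuracy on the given data. The sketch below checks this on synthetic data with a `DecisionTreeClassifier` as a stand-in (an assumption made to keep the example light; it is not the notebook's XGBoost model):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the notebook's dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

clf = DecisionTreeClassifier(random_state=123).fit(X_tr, y_tr)

acc_from_score = clf.score(X_te, y_te)                  # built-in shortcut
acc_manual = accuracy_score(y_te, clf.predict(X_te))    # fraction of correct predictions
print(acc_from_score, acc_manual)
```

Accuracy alone can be misleading on imbalanced classes, so it is worth looking at other metrics before trusting a model.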

Judging by the score, we have a decent ML model. Let's now define the prediction probability function (*f*), which will be used by the LIME framework.

In [13]:
predict_fn = lambda x: model.predict_proba(x)
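
The function passed to LIME for classification is expected to return one row of class probabilities per input row. A minimal sketch of that contract, using a `LogisticRegression` on synthetic data as a stand-in model (`demo_predict_fn` is a hypothetical name):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model, not the notebook's XGBoost classifier
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

demo_predict_fn = lambda rows: clf.predict_proba(rows)

proba = demo_predict_fn(X[:3])
print(proba.shape)        # one row per instance, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```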

### Model Explainability using LIME

Now that we have a model trained with the XGBoost algorithm, we can treat it as a black box to be explained by LIME. XGBoost is not an inherently interpretable model, and its complexity grows with the number of estimators. Let us see how we can use LIME to explain the outcome of the trained model.
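
Before turning to the library, the core idea behind LIME can be sketched in a few lines: perturb the instance, query the black box, weight the perturbed points by proximity, and fit a weighted linear surrogate. This is a simplified illustration under stated assumptions (Gaussian perturbations, an exponential kernel, a `Ridge` surrogate) and not the `lime` package's actual implementation; `local_surrogate` and `black_box` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import Ridge


def local_surrogate(predict_fn, instance, scale, num_samples=2000,
                    kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around a single instance."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance with Gaussian noise scaled per feature
    noise = rng.normal(size=(num_samples, instance.size))
    samples = instance + noise * scale
    # 2. Query the black box on the perturbed points
    targets = predict_fn(samples)
    # 3. Weight each sample by its proximity to the instance
    dist = np.sqrt((noise ** 2).sum(axis=1))
    width = kernel_width * np.sqrt(instance.size)
    weights = np.exp(-(dist ** 2) / width ** 2)
    # 4. Fit the weighted linear surrogate; its coefficients are the explanation
    surrogate = Ridge(alpha=1.0).fit(samples, targets, sample_weight=weights)
    return surrogate.coef_


# Hypothetical black box: probability driven by features 0 and 1, ignoring feature 2
def black_box(X):
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 1.0 * X[:, 1])))


coefs = local_surrogate(black_box, instance=np.zeros(3), scale=np.ones(3))
print(coefs)
```

The sign of each coefficient indicates the direction in which a feature pushes the prediction near the chosen instance, which is the same reading applied to the LIME weights later in this notebook.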

In [14]:
np.random.seed(123)

In [33]:
# Defining the LIME explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(data[features.columns].astype(int).values,
                                                   mode='classification',
                                                   class_names=['Did not Survive', 'Survived'],
                                                   training_labels=data['Survived'],
                                                   feature_names=features.columns)

In [34]:
# using LIME to get the explanations
i = 1
exp = explainer.explain_instance(data.loc[i,features.columns].astype(int).values, predict_fn, num_features=5)

In [36]:
exp.as_list()

[('Sex <= 0.00', 0.38963239499417573),
 ('Age > 35.00', -0.2537278215441616),
 ('Pclass <= 2.00', 0.2149891469420057),
 ('Fare > 31.00', 0.11918631279740177),
 ('Embarked_C > 0.00', 0.086287033464478)]

As we can see, getting the LIME explanation using the Python framework was easy and required only a few lines of code. Now, let's try to understand what the visualization is telling us:
- The left-most bar plot shows the prediction probabilities, which can be treated as the model's confidence level in making the prediction. In this case, the model is 100% confident that the particular passenger would 'survive'.
- The second visualization is probably the most important one, as it provides the most explainability. It tells us that the most influential feature is 'Sex', with a weight of about 0.39, followed by 'Age', with a weight of about -0.25. As illustrated in orange, for the selected data instance the features 'Sex', passenger class ('Pclass'), 'Fare' and port of embarkation as Cherbourg ('Embarked_C') contribute towards the model outcome of 'survival', each with a threshold learnt from the entire dataset. The feature 'Age', in contrast, pushes the prediction towards 'Did not Survive': this passenger was 38, and passengers older than 35 usually had lower chances of surviving the disaster. The feature thresholds learnt by LIME are also in alignment with our own *common sense* and *a priori knowledge*. In the actual sinking of the Titanic, which happened over 100 years ago, women and children were given first preference to escape the sinking ship in the lifeboats. Similarly, first-class passengers, who had paid higher ticket fares, were given preference for the lifeboats and thus had higher chances of survival. So, the model explanation provided is *human-friendly* and consistent with our prior belief.
- The right-most visualization shows the top 5 features and their respective values; features highlighted in orange contribute towards class 1, while features highlighted in blue contribute towards class 0.


So, the explanations are based not just on the features but on feature-value pairs. Although the explanations are local, they also offer a glimpse of the model's global behaviour.
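
The same reading can be done without the plot by tabulating the `exp.as_list()` output. The sketch below copies the weights shown earlier into a DataFrame and labels the direction of each rule; the column names `rule`, `weight`, and `pushes_towards` are illustrative choices.

```python
import pandas as pd

# Rule/weight pairs copied from the exp.as_list() output shown earlier
explanation = [
    ('Sex <= 0.00', 0.38963239499417573),
    ('Age > 35.00', -0.2537278215441616),
    ('Pclass <= 2.00', 0.2149891469420057),
    ('Fare > 31.00', 0.11918631279740177),
    ('Embarked_C > 0.00', 0.086287033464478),
]

summary = pd.DataFrame(explanation, columns=['rule', 'weight'])

# A positive weight pushes towards class 1, a negative weight towards class 0
summary['pushes_towards'] = summary['weight'].apply(
    lambda w: 'Survived' if w > 0 else 'Did not Survive'
)

# Order the rules by the magnitude of their weight
order = summary['weight'].abs().sort_values(ascending=False).index
summary = summary.reindex(order)
print(summary)
```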

In [38]:
# Let's use SP-LIME to return explanations on a sample data set 
# and obtain a non-redundant global decision perspective of the black-box model
sp_exp = submodular_pick.SubmodularPick(explainer, 
                                        data[features.columns].values,
                                        predict_fn, 
                                        num_features=5,
                                        num_exps_desired=10)

The explanation visualizations above are obtained using the SP-LIME method. They are similar to those of the LIME algorithm, but SP-LIME tries to pick diverse samples from the entire training set to provide a global understanding of the model. The time complexity of the SP-LIME algorithm is high and depends on the size of the dataset, but all the examples picked are quite consistent with logical thinking, and hence these explanations are human-friendly!
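
The "pick diverse samples" step can be sketched as a greedy coverage problem: given a matrix of explanation weights (one row per instance, one column per feature), repeatedly choose the instance whose explanation covers the most globally important features that are not yet covered. This is a simplified sketch of the idea and not the `lime.submodular_pick` code; `submodular_pick_sketch` and the toy matrix `W` are hypothetical.

```python
import numpy as np


def submodular_pick_sketch(W, budget):
    """Greedily pick instances whose explanations cover important features."""
    W = np.abs(np.asarray(W, dtype=float))
    # Global importance of each feature across all explanations
    importance = np.sqrt(W.sum(axis=0))
    chosen = []
    covered = np.zeros(W.shape[1], dtype=bool)
    for _ in range(min(budget, W.shape[0])):
        best, best_gain = None, -1.0
        for i in range(W.shape[0]):
            if i in chosen:
                continue
            # Gain = importance of the features this instance would newly cover
            gain = importance[(W[i] > 0) & ~covered].sum()
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
        covered |= W[best] > 0
    return chosen


# Hypothetical explanation weights: rows 0 and 1 explain the same two features
W = [
    [0.4, 0.3, 0.0, 0.0],
    [0.5, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.1],
    [0.1, 0.0, 0.0, 0.0],
]
picked = submodular_pick_sketch(W, budget=2)
print(picked)
```

Row 1 is skipped in favour of row 2 because it would only repeat features that row 0 already covers, which is the non-redundancy that SP-LIME aims for.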

## Final Thoughts

In this notebook, we saw just a glimpse of the LIME framework. The main objective of this tutorial was to give readers of the chapter some practical exposure to the LIME framework using Python. Throughout part 2 of the book, we will explore many XAI frameworks, but personally I found LIME to be an easy-to-use framework that needs very few lines of code. The best part is its simple visualizations and intuitive, human-friendly explanations, which anyone can understand. It is not perfect, though, and has some limitations, both in its algorithms and in how it conveys explanations through visualizations. But since it provides model-agnostic local explainability along with a global perspective, it is a very useful algorithm for any business problem. In the next chapter of the book, we will explore more practical use cases for using LIME on different types of datasets.

## Reference

1. Red Wine Quality Dataset - Kaggle - https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
2. Some of the utility functions and code are taken from the GitHub Repository of the author - Aditya Bhattacharya https://github.com/adib0073