# Explainable AI

Building models is rewarding, but in business settings people often also want to know why a certain prediction was made. Explaining why a model makes its predictions is the field of Explainable AI. It can be as important as, and in some cases even more important than, making the most accurate prediction.

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model, with the aim of increasing transparency and interpretability. Consider a cooperative game with the same number of players as the number of features. SHAP discloses the individual contribution of each player (or feature) to the output of the model, for each example or observation.
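
To make the cooperative-game idea concrete, here is a minimal sketch of how Shapley values can be computed exactly for a tiny game. The three players (`a`, `b`, `c`) and the `PAYOFFS` table are hypothetical and chosen only for illustration; this brute-force enumeration is not how the SHAP library works internally, since it only scales to a handful of players.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every possible order in which the players can join the coalition."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # marginal contribution of p when joining the current coalition
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical payoff for every coalition of three "features".
# Player "c" never changes the payoff, so it should receive a value of 0.
PAYOFFS = {
    frozenset(): 0.0,
    frozenset("a"): 10.0,
    frozenset("b"): 20.0,
    frozenset("c"): 0.0,
    frozenset("ab"): 40.0,
    frozenset("ac"): 10.0,
    frozenset("bc"): 20.0,
    frozenset("abc"): 40.0,
}

phi = shapley_values(["a", "b", "c"], PAYOFFS.__getitem__)
print(phi)
```

The values always add up to the payoff of the full coalition minus the payoff of the empty one. SHAP relies on the same property, with the model output playing the role of the payoff.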

*Important: while SHAP shows the contribution or the importance of each feature on the prediction of the model, it does not evaluate the quality of the prediction itself.*

SHAP can be applied to all kinds of models, and it uses different algorithms for different kinds of models. In this notebook we first go through SHAP for tabular data. We will train an XGBoost model, which is a tree-based model, on the breast_cancer dataset. This dataset has 30 variables and 1 binary target that indicates whether a tumor is malignant (0) or benign (1). SHAP will help us understand which of these 30 variables made the largest difference in a single prediction. If we average the absolute SHAP values over all samples, we can say which of the variables are most important overall.

In [None]:
from sklearn import datasets
import xgboost
import pandas as pd
import numpy as np

In [None]:
#load the dataset
dataset = datasets.load_breast_cancer()

In [None]:
#show as pd DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df

In [None]:
#define X and y
X = pd.DataFrame(dataset.data, columns = dataset.feature_names)
y = pd.Series(dataset.target)

In [None]:
from sklearn.model_selection import train_test_split

#define train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

#prepare data for xgboost
d_train = xgboost.DMatrix(X_train, label=y_train)
d_test = xgboost.DMatrix(X_test, label=y_test)

In [None]:
#set parameters
params = {
    "eta": 0.01,
    "objective": "binary:logistic",
    "subsample": 0.5,
    "base_score": np.mean(y_train),
    "eval_metric": "logloss"
}

#train model
model = xgboost.train(params, d_train, 5000, evals = [(d_test, "test")], verbose_eval=100, early_stopping_rounds=20)

#predictions = model.predict(d_test)

We have a model! Now we can start using SHAP values to analyze it.

First we have to define the explainer. Because XGBoost is a tree model, we will use TreeExplainer.

In [None]:
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

### Visualize a single prediction

We can visualize a single prediction. 

For this we can use the force plot, which shows the effect of each feature on the prediction for a given observation. In this plot, the features with positive SHAP values are displayed on the left side and those with negative SHAP values on the right side, as if competing against each other. The highlighted value is the model output for that observation. With TreeExplainer's default settings, this output is in log-odds (the raw margin of the model) rather than a probability.
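
The arithmetic behind a force plot can be sketched in a few lines. The base value and the three contributions below are hypothetical numbers, not results from the model in this notebook. The sketch assumes the explainer works in log-odds units, so a sigmoid is needed to turn the total into a probability.

```python
import math


def sigmoid(z):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical expected model output (the role of explainer.expected_value), in log-odds
base_value = 0.5

# Hypothetical SHAP values for one observation
contributions = {
    "feature_1": 1.2,   # pushes the output up
    "feature_2": -0.4,  # pushes the output down
    "feature_3": 0.3,   # pushes the output up
}

# Additivity: base value plus all contributions gives the model output
model_output = base_value + sum(contributions.values())
probability = sigmoid(model_output)

print(f"log-odds: {model_output:.2f}, probability: {probability:.3f}")
```

In the force plot, `base_value` is the starting point, each contribution is one colored segment, and `model_output` is the highlighted value.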



In [None]:
#init javascript in order to display the visuals
shap.initjs()

#pick the observation we want to explain
observation = 1
print(y.iloc[observation]) #true label of this observation
shap.force_plot(explainer.expected_value, shap_values[observation,:], X.iloc[observation,:])


We can also show all predictions in a single plot by passing the full arrays instead of slicing out one observation, as was done above.

In [None]:
shap.force_plot(explainer.expected_value, shap_values, X)

### Bar chart of mean importance

This takes the average of the SHAP value magnitudes across the dataset and plots it as a simple bar chart.
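
As a rough sketch of what the bar chart summarizes, the mean absolute SHAP value per feature can be computed by hand. The matrix `toy` below is hypothetical (rows are observations, columns are features), and plain Python lists are used so the example runs without dependencies. With the notebook's `shap_values` array, the equivalent would be `np.abs(shap_values).mean(axis=0)`.

```python
def mean_abs_by_feature(shap_matrix):
    """Mean of the absolute SHAP values in each column (one column per feature)."""
    n_rows = len(shap_matrix)
    return [sum(abs(v) for v in column) / n_rows for column in zip(*shap_matrix)]


# Hypothetical SHAP values: 3 observations x 3 features
toy = [
    [0.5, -0.1, 0.0],
    [-0.3, 0.2, 0.0],
    [0.1, -0.3, 0.0],
]

importance = mean_abs_by_feature(toy)

# Feature indices sorted from most to least important
ranking = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)

print(importance, ranking)
```

Taking the absolute value matters: positive and negative contributions of the same feature would otherwise cancel each other out and understate its importance.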

In [None]:
shap.summary_plot(shap_values, X, plot_type="bar")

### SHAP Summary Plot

Rather than use a typical feature importance bar chart, we can use a density scatter plot of SHAP values for each feature to see how much impact each feature has on the model output for the individual samples in the dataset. Features are sorted by the sum of the SHAP value magnitudes across all samples. This plot can reveal patterns that a bar chart hides: one feature may have a high total impact because it affects all predictions by a small amount, while another feature affects only a few predictions, but by a large amount.

Note that when the scatter points don’t fit on a line they pile up to show density, and the color of each point represents the feature value of that individual.

In [None]:
shap.summary_plot(shap_values, X)


So that's it for the tabular data. We can also use SHAP for images. See the next notebook for SHAP on image data.