In [None]:
%%HTML

<style type="text/css">

div.h2 {
    background-color: #665191; 
    color: white; 
    padding: 5px; 
    padding-right: 300px; 
    font-size: 25px;  
    margin-top: 2px;
    margin-bottom: 10px;
}

div.h3 {
    background-color: white; 
    color: #fe0000; 
    padding: 5px; 
    padding-right: 300px; 
    font-size: 20px; 
    margin-top: 2px;
    margin-bottom: 10px;
}
</style>

In [None]:
# !pip uninstall matplotlib -y
# !pip install matplotlib==3.1.3

import matplotlib
print(matplotlib.__version__)

!pip install mplcyberpunk
import mplcyberpunk

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib 
from matplotlib import gridspec
import seaborn as sns

from sklearn.model_selection import KFold, cross_val_score, train_test_split
import shap
import xgboost as xgb
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve

import joblib
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
plt.style.use("cyberpunk")

# runtime configuration of matplotlib
plt.rc("figure", 
    autolayout=False, 
    figsize=(20, 10),
)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=20,
    titlepad=10,
)

---

### <div class="h2">Introduction</div>

This competition is all about imputing missing values; there is no separate target. The task is to predict the missing values. There are different approaches to handling missing values. I will treat this as a regression problem: every column with missing values becomes the target in turn, and a model is trained on the rows where that column is observed in order to predict the rows where it is missing. [SRK](https://www.kaggle.com/competitions/tabular-playground-series-jun-2022/discussion/328369) has a discussion post about this approach.
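The idea of treating a column with gaps as a regression target can be sketched on toy data. This is only an illustration under stated assumptions: the data is synthetic, the helper `impute_column` is hypothetical, and a plain least-squares fit stands in for XGBoost. It also assumes the predictor columns have no gaps of their own, which is not true for the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: column 2 depends linearly on columns 0 and 1 (an assumption made
# only so that the example has a signal to recover).
X = rng.normal(size=(200, 3))
X[:, 2] = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
true_values = X[:, 2].copy()

# Knock out 20 values in the target column to simulate missing data.
missing_rows = rng.choice(200, size=20, replace=False)
X[missing_rows, 2] = np.nan


def impute_column(data, target_col):
    """Fill NaNs in one column with a linear model fitted on the observed rows."""
    data = data.copy()
    is_missing = np.isnan(data[:, target_col])
    predictors = np.delete(data, target_col, axis=1)
    # Add an intercept column, then fit only on rows where the target is observed.
    design = np.column_stack([np.ones(len(data)), predictors])
    coef, *_ = np.linalg.lstsq(
        design[~is_missing], data[~is_missing, target_col], rcond=None
    )
    # Predict the target for the rows where it was missing.
    data[is_missing, target_col] = design[is_missing] @ coef
    return data


imputed = impute_column(X, target_col=2)
rmse = np.sqrt(np.mean((imputed[missing_rows, 2] - true_values[missing_rows]) ** 2))
print(f"RMSE on the imputed cells: {rmse:.3f}")
```

The same loop structure carries over when the linear fit is swapped for a tree model: train on observed rows, predict the missing rows, move on to the next column.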

My main aim in this notebook is to use techniques like *SHAP* and *PDP* to gain insight into how an ML model like XGBoost works. A model is like a detective that searches for interactions and patterns in the data and uses these findings to make predictions. Since most of these models are not interpretable by design, it is wise to use the above-mentioned approaches to check whether the model has learned the right patterns. 

Since the data for this competition is anonymous, it is hard to check whether the model has learned the right things. Nonetheless, the technique itself can be useful for analysing the feature space for feature engineering.


   <a id="toc"></a>
   
1. [Data Overview](#1)
2. [Modelling approach](#2)
3. [Model inspection with SHAP](#3)
4. [Model inspection with PDP](#4)
5. [References](#5)

I will keep adding content; it's a work in progress.

---
<a id="1"></a>
### <div class="h2">1. Data Overview</div>

In [None]:
data_raw = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv', index_col="row_id")

In [None]:
corr = data_raw.loc[:, :].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24

fig = plt.figure(figsize=(35, 20), facecolor='#002637', edgecolor='r')
ax = fig.add_subplot()

sns.heatmap(corr, mask=mask, cmap='coolwarm', annot=True, fmt='.2f', cbar=False, ax=ax)
ax.tick_params(axis='x', colors='w', labelsize=12, rotation=90)
ax.tick_params(axis='y', colors='w', labelsize=12)

plt.suptitle("Pearson correlation coefficient", color="white", fontsize=35, y=0.92)
plt.yticks(rotation=0) 
plt.show()

In [None]:
# Names of the 15 F_4 features: F_4_0 ... F_4_14
F_4 = [f"F_4_{i}" for i in range(15)]

corr = data_raw.loc[:, F_4].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24

fig = plt.figure(tight_layout=True, figsize=(20,10))
spec = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)


ax0 = fig.add_subplot(spec[0, 0])
sns.heatmap(corr, mask=mask, cmap='coolwarm', annot=True, fmt='.2f', cbar=False, ax=ax0)
ax0.tick_params(axis='x', colors='w', labelsize=12, rotation=90)
ax0.tick_params(axis='y', colors='w', labelsize=12, rotation=0)
ax0.set_title("Pearson correlation coefficient", fontsize=14, fontweight ='bold', y=1.01)


ax0 = fig.add_subplot(spec[0, 1])

null = pd.DataFrame(data_raw.loc[:, F_4].isnull().sum())
dtypes = pd.DataFrame(data_raw.loc[:, F_4].dtypes)
data_info = dtypes.merge(null, left_index=True, right_index=True)
data_info.columns = ["type", "nulls"]
data_info.sort_values(by=["nulls"], ascending=False, inplace=True)
data_info["% missing"] = 100*data_info["nulls"]/data_raw.shape[0]

sns.scatterplot(x=data_info.index, y="% missing", data=data_info, ax=ax0)
ax0.set_title("Percentage missing values", fontsize=14, fontweight ='bold', y=1.01)
ax0.tick_params(axis='x', colors='w', labelsize=12, rotation=90)
    
plt.suptitle("F_4", ha="center", y=1.03, fontweight ='bold', fontsize=24) 
plt.show()


💡 **INSIGHTS**
- From the first plot one can conclude that the `F_2` features correlate with each other, as do the `F_4` features. There is no notable correlation among the other features.
- In the second plot I have zoomed in on the `F_4` feature space. Here we can see that `F_4_11` has the highest correlation with `F_4_8`. Overall, there is some degree of correlation between the `F_4` features. 
- From the third plot one can conclude that all `F_4` features have missing values, with `F_4_2` having the most.
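Reading the strongest pair off a heatmap can also be done programmatically. The sketch below uses synthetic data and hypothetical feature names, and relies on `np.corrcoef`, which, unlike `DataFrame.corr`, does not skip missing values, so real data would need the NaNs handled first.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature names and synthetic data, for illustration only.
names = ["a", "b", "c", "d"]
data = rng.normal(size=(500, 4))
data[:, 3] = 0.8 * data[:, 1] + 0.2 * rng.normal(size=500)  # make b and d correlate

# Absolute Pearson correlation between columns.
corr = np.abs(np.corrcoef(data, rowvar=False))

# Keep only the strictly lower triangle so that each pair is counted once and
# the diagonal (always 1) is ignored. This mirrors the masked heatmaps above.
lower = np.tril(corr, k=-1)
i, j = np.unravel_index(np.argmax(lower), lower.shape)
print(f"Most correlated pair: {names[i]} and {names[j]} (|r| = {corr[i, j]:.2f})")
```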

---
<a id="2"></a>
### <div class="h2">2. Modelling approach</div>

To use the SHAP package we first need to train a model, since SHAP is a model-agnostic approach designed to explain any given black-box model. If you are applying SHAP to a real-world problem, you should follow best practices. Specifically, you should ensure your model performs well on both the training and validation sets. The better your model, the more reliable your results will be. 

For the sake of demonstration I have chosen `F_4_8` as the target and the rest of the `F_4` features as predictors, and trained an XGBoost model. As a quick check on this model, I have calculated the `R-square` on the validation set, which is about `90%`. The model should be fine for demonstrating the SHAP package. The code below can easily be modified to train a different model, or more than one model.
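For reference, the two quality checks used in this section can be written out by hand. `XGBRegressor.score` returns the coefficient of determination (R-square) through the scikit-learn regressor interface; the helper functions and the four toy values below are hypothetical and only illustrate the formulas.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Toy values, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

print(f"R-square: {r_squared(y_true, y_pred):.3f}")
print(f"RMSE:     {rmse(y_true, y_pred):.3f}")
```

An R-square of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the target.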


In [None]:
F_4 = []
for i in range(15):
    string = "F_4_" + str(i)
    F_4.append(string)

# Substitute ["F_4_8"] with F_4 if you want to train a model for all F_4 features.
for col in ["F_4_8"]:
    target = col
    # all other F_4 columns are predictors (use `c` to avoid shadowing the loop variable)
    features = [c for c in F_4 if c != target]
    
    train = data_raw[~data_raw[target].isnull()]
    test = data_raw[data_raw[target].isnull()]

    X = train.loc[:, features]
    y = train.loc[:, target]

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1307)

    model = xgb.XGBRegressor(learning_rate=0.2, max_depth=8, eval_metric="rmse", n_jobs=-1, n_estimators=100)
    model.fit(X_train, y_train)

    # save the model to disk
    filename = col
    joblib.dump(model, filename)

    # evaluate on the validation set; f-strings work whether score() returns a Python or a NumPy float
    result = model.score(X_val, y_val)
    print(f'R-square value for column: {col} is {result}')

    predicted = model.predict(X_val)
    rmse = np.sqrt(((predicted - y_val) ** 2).mean())
    print(f'RMSE value for column: {col} is {rmse}\n')

    # Use a random sample of the validation set to keep the SHAP computation time manageable
    X_sampled = X_val.sample(min(20000, len(X_val)), random_state=1307)
    joblib.dump(X_sampled, filename + "_X_val_sampled")

    # explain the model's predictions using SHAP values
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_sampled)
    joblib.dump(shap_values, filename + "_shap_values")

    #Get SHAP interaction values
    shap_interaction = explainer.shap_interaction_values(X_sampled)
    joblib.dump(shap_interaction, filename + "_shap_interaction")

---
<a id="3"></a>
### <div class="h2">3. Model inspection with SHAP</div>

The focus will be on applying the SHAP package and interpreting the results. `SHAP values` are used to explain individual predictions made by a model. They do this by giving the contribution of each feature to the final prediction. `SHAP interaction` values extend this by breaking down the contributions into their main and interaction effects. We can use these to highlight and visualise interactions in the data. They can also be a useful tool for understanding how your model makes predictions.

Specifically, we start with a plot of the features with the most impact on the predictions. These features are the most important for the model when making predictions. What is missing in this plot is how each feature influences the predictions. This is where the second plot comes in handy. The second plot combines feature importance with the direction of each feature's effect on the predictions.

The third plot is the mean absolute value of the SHAP interaction effects. The diagonal values are the main effects of the features and the off-diagonal values are the interaction effects. The off-diagonal values can be used to check whether there are hidden interactions in the dataset.
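To make "contribution of each feature" concrete, Shapley values can be computed by brute force for a tiny model. This is a sketch under stated assumptions: the `model` function and the all-zero baseline are hypothetical, and "removing" a feature is done by resetting it to the baseline value. `shap.TreeExplainer` uses a much faster tree-specific algorithm, so this only illustrates the definition, not how the library works internally.

```python
import itertools
import math

import numpy as np


def model(x):
    # Hypothetical model: main effects plus an interaction between x0 and x1.
    return 3.0 * x[0] + 2.0 * x[1] + 1.0 * x[0] * x[1] - x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real value, the rest take the baseline.
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return f(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (
                math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            )
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(model, x, baseline)

print("Shapley values:", phi)
print("Sum of values: ", phi.sum())
print("f(x) - f(base):", model(x) - model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the interaction term `x0 * x1` is split equally between the two features involved.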

In [None]:
X_sampled = joblib.load('../input/files-f-4-8/F_4_8_X_val_sampled')
shap_values = joblib.load('../input/files-f-4-8/F_4_8_shap_values')
shap_interaction = joblib.load('../input/files-f-4-8/F_4_8_shap_interaction')

def plot_shap(X_sampled, shap_values, shap_interaction, i: str):

    fig = plt.figure(tight_layout=True, figsize=(15,50))
    spec = gridspec.GridSpec(ncols=2, nrows=2, figure=fig)


    ax0 = fig.add_subplot(spec[0, 0])
    shap.summary_plot(shap_values, features=X_sampled, feature_names=X_sampled.columns, max_display=10, plot_type="bar", 
    plot_size=[15,10], axis_color="white", show=False, cmap='green', color_bar=False)
    ax0.set_title(f'Top Influencers\n\nFeatures with most impact on the model output', fontsize=12, y=1.03, fontweight ='bold')
    ax0.tick_params(axis='both', colors='white', labelsize=10)    
    ax0.set_xlabel('mean(|SHAP value|)', fontsize=10)

    ax1 = fig.add_subplot(spec[0, 1])
    shap.summary_plot(shap_values, X_sampled, axis_color="white", max_display=10, plot_size=[15,10], show=False)
    ax1.set_title(f'Directionality\n\nShows how values of a feature influences the models output', fontsize=12, y=1.03, fontweight ='bold')
    ax1.tick_params(axis='both', colors='white', labelsize=10)  
    ax1.set_xlabel('SHAP value', fontsize=10)


    # Get absolute mean of matrices
    mean_shap = np.abs(shap_interaction).mean(0)
    df = pd.DataFrame(mean_shap, index=X_sampled.columns, columns=X_sampled.columns)

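    # Double the off-diagonal cells: each pairwise interaction is split equally between (i, j) and (j, i).
    # Select cells by position; comparing values against the diagonal can match off-diagonal cells by coincidence.
    off_diagonal = ~np.eye(len(df), dtype=bool)
    df = df.mask(off_diagonal, df * 2)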

    #fig = plt.figure(figsize=(35, 20), facecolor='#002637', edgecolor='r')
    ax2 = fig.add_subplot(spec[1, :])
    sns.heatmap(df.round(decimals=3), cmap='coolwarm', annot=True, fmt='.6g', cbar=False, ax=ax2)
    ax2.tick_params(axis='x', colors='w', labelsize=9, rotation=90)
    ax2.tick_params(axis='y', colors='w', labelsize=9, rotation=0)
    ax2.set_title("SHAP interaction values", color="white", fontsize=12, y=1.01, fontweight ='bold')

    
    plt.suptitle(i, ha="center", y=1.0, fontweight ='bold', fontsize=14)
    plt.show()  
  
plot_shap(X_sampled, shap_values, shap_interaction, "F_4_8")

💡 **INSIGHTS**

- From the first plot one can conclude that `F_4_11` is the most influential feature of the model and `F_4_7` is the least influential.
- The second plot suggests that the higher the value of `F_4_11`, the more negative its impact on the predictions. The opposite is true for `F_4_4`.
- From the third plot one can observe that the main effect is large for features `F_4_11` (`0.497`) and `F_4_13` (`0.256`). This tells us that these features tend to have large positive or negative main effects. In other words, these features tend to have a significant impact on the model’s predictions. 
- Similarly, we can see that the interaction effect for the pair (`F_4_11`, `F_4_13`) is the highest, at `0.114`. So it is interesting to check what this interaction looks like at the local level. This is demonstrated in the figure below.
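The numbers in the interaction heatmap come from a three-dimensional array with one feature-by-feature matrix per sample. In each matrix the row sums equal the SHAP values of that sample, and a pairwise interaction is split equally between cell (i, j) and cell (j, i), which is why the off-diagonal cells are doubled before plotting. The sketch below uses made-up interaction values for two samples and three features to show the aggregation.

```python
import numpy as np

# Hypothetical interaction values shaped (n_samples, n_features, n_features),
# like the output of explainer.shap_interaction_values.
interaction = np.array([
    [[3.0, 1.0, 0.0],
     [1.0, 4.0, 0.0],
     [0.0, 0.0, -3.0]],
    [[-1.0, -0.5, 0.0],
     [-0.5, 2.0, 0.25],
     [0.0, 0.25, 1.0]],
])

# Summing each row gives one SHAP value per sample and feature.
per_feature = interaction.sum(axis=2)

# Mean absolute value across samples: one matrix for the whole sample.
mean_abs = np.abs(interaction).mean(axis=0)

# Double only the off-diagonal cells, selected by position rather than by value.
off_diagonal = ~np.eye(mean_abs.shape[0], dtype=bool)
total = np.where(off_diagonal, 2.0 * mean_abs, mean_abs)

print(per_feature)
print(total)
```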

In [None]:
# F_4 = []
# for i in range(0, 15):
#     string = "F_4_" + str(i)
#     F_4.append(string)
    
# i=8
# model = joblib.load("F_4_"+str(i))
# X_sampled = joblib.load("F_4_"+str(i)+"_X_val_sampled")
# shap_values = joblib.load("F_4_"+str(i)+"_shap_values")
# shap_interaction = joblib.load("F_4_"+str(i)+"_shap_interaction")
# F_4.remove("F_4_"+str(i))

fig = plt.figure(tight_layout=True, figsize=(15,10))
spec = gridspec.GridSpec(ncols=2, nrows=2, figure=fig)


ax0 = fig.add_subplot(spec[0, 0])
shap.dependence_plot("F_4_11", shap_values, X_sampled, axis_color='w', show=False, ax=ax0)
ax0.set_title(f'SHAP effect', fontsize=10)


ax0 = fig.add_subplot(spec[0, 1])
f1="F_4_11"
f2="F_4_13"
shap.dependence_plot((f1, f2), shap_interaction, X_sampled, axis_color="w", show=False, ax=ax0)
ax0.yaxis.label.set_color('white')          
ax0.xaxis.label.set_color('white')          
ax0.tick_params(axis='both', colors='white')    
ax0.set_title(f'SHAP Interaction effect', fontsize=10)

ax2 = fig.add_subplot(spec[1, :])
points = ax2.scatter(data_raw.F_4_11, data_raw.F_4_13, c=data_raw.F_4_8, cmap="jet", lw=0)
plt.colorbar(points)
ax2.tick_params(axis='both', colors='white')    #setting up X-axis tick color to white
ax2.set_xlabel("F_4_11", fontsize=12, fontweight="bold", color="white")
ax2.set_ylabel("F_4_13", fontsize=12, fontweight="bold", color="white")
ax2.set_title(f'Scatterplot', fontsize=10)
ax2.text(-5, 0, "1", fontsize=18, fontweight="bold", verticalalignment='top', rotation="horizontal", color="k")
ax2.text(15, 0, "2", fontsize=18, fontweight="bold", verticalalignment='top', rotation="horizontal", color="k")
ax2.text(41, 1, "Target", fontsize=18, verticalalignment='top', rotation="horizontal", color="w")

plt.show()

💡 **INSIGHTS**
- From the first plot one can conclude that as the value of `F_4_11` increases, it tends to have a negative impact on the predictions.
- The plot of the SHAP interaction values between `F_4_11` and `F_4_13` shows that high values of `F_4_11` reduce the contribution that high values of `F_4_13` make to the predictions.
- The scatter plot implies that `F_4_11` and `F_4_13` have a weak negative correlation and that these two features also interact with respect to the target: region `1` has higher target values than region `2`. 

---
<a id="4"></a>
### <div class="h2">4. Model inspection with PDP</div>


Coming soon

---
<a id="5"></a>
### <div class="h2">5. References</div>


Domain Knowledge References

1. https://github.com/slundberg/shap/blob/master/notebooks/tabular_examples/tree_based_models/Census%20income%20classification%20with%20LightGBM.ipynb
2. https://towardsdatascience.com/analysing-interactions-with-shap-8c4a2bc11c2a#:~:text=SHAP%20values%20are%20used%20to,their%20main%20and%20interaction%20effects
