# Understanding feature importance with SHAP

The goal of this notebook is to clean up the data a little (i.e. get it into a usable format), make some predictions with a fairly vanilla CatBoost model, and explore that model with SHAP.

In [None]:
# Libraries
import os.path

import numpy as np
import pandas as pd

from datetime import timedelta 
from datetime import datetime

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, MinMaxScaler
from category_encoders import TargetEncoder

import shap

from catboost import CatBoostClassifier, Pool

from hyperopt import fmin, hp, tpe

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.options.display.max_columns = None
pd.options.display.max_rows = None

In [None]:
# import data
train = pd.read_csv("../input/cat-in-the-dat-ii/train.csv", index_col='id')
test = pd.read_csv("../input/cat-in-the-dat-ii/test.csv", index_col='id')
sample = pd.read_csv("../input/cat-in-the-dat-ii/sample_submission.csv")

# 1. Feature Preparation

In [None]:
## I'm not going to spend much time on preparing the data for now
## I'll reuse some code from other excellent notebooks to save time

In [None]:
# Source: https://www.kaggle.com/vikassingh1996/don-t-underestimate-the-power-of-a-logistic-reg

def description(df):
    """Summarise each column: dtype, missing values, unique count and first three values."""
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    # Fraction (not percentage) of rows that are missing in each column
    summary['PercMissing'] = df.isnull().mean().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.iloc[0].values
    summary['Second Value'] = df.iloc[1].values
    summary['Third Value'] = df.iloc[2].values
    return summary

print('**Variable Description of train Data:**')
description(train)

### Some initial Comments

There are lots of interesting features here...

1. id, target
    * Row identifier and target variable. Not much to be said here. 
    * Might be interesting to know what the target variable distribution is

2. binary variables
    * All of them are missing roughly 3% of their data
    * Some are integers and some are text
    * As CatBoost can deal with categorical text variables, I just need to figure out how best to deal with the missing data
        * Replace nulls with the mode, maybe? Or make them their own separate feature?

3. nominal variables
    * 10 features
        * nom_0 to nom_4 have very low cardinality - One Hot Encoding might be the best approach to dealing with these variables
        * nom_5 to nom_9 have very high cardinality (loads of categories) - I'm not sure yet what the best approach to dealing with them is.

4. ordinal features
    * 5 features
        * 4 of them look ok. One has pretty high cardinality so will have to be processed a little differently.
        * Need to check if CatBoost deals with ordinal data in any specific way or if I should process these before running the model.

5. time features
    * day and month

# 2. Dealing with Nulls

In [None]:
# Source: https://www.kaggle.com/vikassingh1996/don-t-underestimate-the-power-of-a-logistic-reg

## To start, let's just replace all null values with the mode of that column
def replace_nan(data):
    for column in data.columns:
        if data[column].isna().sum() > 0:
            data[column] = data[column].fillna(data[column].mode()[0])


replace_nan(train)
replace_nan(test)
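
The other option raised earlier, giving missing values their own category rather than filling with the mode, could look something like the sketch below. The toy frame, the helper name `add_missing_category` and the `"missing"` placeholder are all assumptions made for illustration; none of them are used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for the competition data (values are made up)
df = pd.DataFrame({
    "bin_3": ["T", "F", None, "T"],
    "nom_0": ["Red", None, "Blue", "Red"],
})

def add_missing_category(data, placeholder="missing"):
    """Return a copy where nulls become their own category (intended for text columns)."""
    out = data.copy()
    for column in out.columns:
        if out[column].isna().any():
            out[column] = out[column].fillna(placeholder)
    return out

filled = add_missing_category(df)
print(filled)
```

Unlike mode imputation, this keeps the information that a value was missing, which the model can then use if it happens to be predictive.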

In [None]:
target = train.pop('target')
target.shape

# 3. Feature Encoding

While the CatBoost algorithm doesn't necessarily need any encoding (it has built-in tools to deal with categories), to get the best out of the SHAP package we need to make sure each feature is numerical.

So, on that note:

In [None]:
## Source: https://www.kaggle.com/carlodnt/catboost-shap-fastai

# bin_3
train['bin_3'] = train['bin_3'].apply(lambda x: 0 if x == 'F' else 1)
test['bin_3'] = test['bin_3'].apply(lambda x: 0 if x == 'F' else 1)

# bin_4
train['bin_4'] = train['bin_4'].apply(lambda x: 0 if x == 'N' else 1)
test['bin_4'] = test['bin_4'].apply(lambda x: 0 if x == 'N' else 1)

# ord_1 to ord_4: replace each level with its rank (position in the ordered list)
ordinal_levels = {
    'ord_1': ['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster'],
    'ord_2': ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot'],
    'ord_3': list('abcdefghijklmno'),
    'ord_4': list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
}
for col, levels in ordinal_levels.items():
    mapping = {level: rank for rank, level in enumerate(levels)}
    # Assign the result back rather than calling replace(..., inplace=True) on a
    # column, which is not guaranteed to update the parent DataFrame in recent pandas
    train[col] = train[col].replace(mapping).astype(int)
    test[col] = test[col].replace(mapping).astype(int)

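# All nominal columns (low and high cardinality alike) plus ord_5 are hashed into 5000 buckets
high_card = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9', 'ord_5']
for col in high_card:
    # Note: Python's built-in hash() of a string is salted per process unless
    # PYTHONHASHSEED is set, so the codes can change after a kernel restart
    train[col] = train[col].apply(lambda x: hash(str(x)) % 5000)
    test[col] = test[col].apply(lambda x: hash(str(x)) % 5000)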
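
If reproducible codes matter (for example, to compare SHAP plots between sessions), a deterministic hash is one option. The sketch below uses `zlib.crc32` from the standard library; the helper name `stable_bucket` is hypothetical and this is a swapped-in technique, not what the source notebook did.

```python
import zlib

def stable_bucket(value, n_buckets=5000):
    """Map a value to a bucket in [0, n_buckets) using CRC32, which is stable across runs."""
    return zlib.crc32(str(value).encode("utf-8")) % n_buckets

bucket = stable_bucket("Red")
print(bucket)
```

It could be applied in the same way as the built-in hash, e.g. `train[col] = train[col].apply(stable_bucket)`. As with any bucketed hash, two different categories can land in the same bucket.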

In [None]:
train.shape, test.shape

# 4. CatBoost Algorithm

In [None]:
# create a training and validation set
X_train, X_validation, y_train, y_validation = train_test_split(train, target, train_size=0.8, random_state=42)

X_test = test.copy()

In [None]:
X_train.shape, X_validation.shape, y_train.shape, y_validation.shape, X_test.shape

In [None]:
# np.float was removed in NumPy 1.24; the built-in float is equivalent here
categorical_features_indices = np.where(train.dtypes != float)[0]
categorical_features_indices

In [None]:
## Source: https://www.kaggle.com/lucamassaron/catboost-in-action-with-dnn

# Initializing a CatBoostClassifier with best parameters
best_params = {'bagging_temperature': 0.8,
               'depth': 5,
               'iterations': 500,
               'l2_leaf_reg': 30,
               'learning_rate': 0.05,
               'random_strength': 0.8}

In [None]:
model = CatBoostClassifier(
    **best_params,
    loss_function='Logloss',
    eval_metric='AUC',
    # task_type="GPU",
    nan_mode='Min',
    verbose=False
)

In [None]:
model.fit(
        X_train, y_train,
        verbose_eval=100, 
        early_stopping_rounds=50,
        cat_features=categorical_features_indices,
        eval_set=(X_validation, y_validation),
        use_best_model=False,
        plot=True
);

# 5. Model explanation with SHAP

In [None]:
shap.initjs()

In [None]:
explainer = shap.TreeExplainer(model)

In [None]:
shap_values = explainer.shap_values(Pool(X_train, y_train, cat_features=categorical_features_indices))

First, let's take a couple of rows and look at the Shapley contribution each feature made to each prediction.

In [None]:
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])

**Row 0**: the output is lower than the model baseline, so the features collectively push this row towards class `0`. What's contributing to this? `nom_7` and `nom_8` appear to have the largest negative impact on the output, while `ord_3` has the largest positive impact. Outside of `ord_3`, no other feature has any meaningful positive impact on the prediction.
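
To make the "baseline plus contributions" reading of a force plot concrete, here is a minimal sketch. It assumes the explainer is working in raw log-odds units (which I believe is the default for a Logloss CatBoost model, but treat that as an assumption), and every number below is invented for illustration rather than read off the plot.

```python
import math

# Hypothetical numbers for one row, in log-odds units (not taken from the model above)
expected_value = -1.50  # baseline: the average raw model output
row_shap = {"ord_3": 0.40, "nom_7": -0.35, "nom_8": -0.30, "bin_1": 0.01}

# SHAP values are additive: baseline + contributions = raw output for the row
raw_output = expected_value + sum(row_shap.values())

# A sigmoid turns the log-odds into a probability of class 1
probability = 1.0 / (1.0 + math.exp(-raw_output))

print(f"raw output: {raw_output:.2f}, probability of class 1: {probability:.3f}")
```

Note that being below the baseline only says the row scores lower than average. Whether it is actually classified as `0` at the usual 0.5 threshold depends on the resulting probability.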

To get a deeper insight into what's actually going on here, decoding the features back into their original states might shed some more light.
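
For the ordinal features, decoding is straightforward because the encoded value is just the position in the ordered list of levels. A small sketch, where the helper name `decode_ordinal` is hypothetical:

```python
# Ordered levels for ord_2, as used in the encoding step
ord_2_levels = ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']

def decode_ordinal(code, levels):
    """Return the original label for an integer (or integer-valued float) code."""
    return levels[int(code)]

label = decode_ordinal(3, ord_2_levels)
print(label)
```

The hashed columns are a different story: a hash bucket cannot be turned back into a category on its own, so decoding those would require saving a lookup from original value to code before the columns are overwritten.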

In [None]:
shap.force_plot(explainer.expected_value, shap_values[4,:], X_train.iloc[4,:])

**Row 4**: the output is higher than the baseline. So, what's contributing to this? `ord_3` is once again having a significant positive impact on the output. Interestingly, it is very close in value to the same feature in row 0. Unlike in row 0, `nom_7` and `nom_8` have a far lower impact on the model's prediction. Again, decoding the features back into their original states might shed some more light on what's going on here.

Running through this exercise for many rows would be a long, long process. So to speed up the insights, let's stack a number of them horizontally and see what we get.

In [None]:
# visualize the training set predictions
shap.force_plot(explainer.expected_value, shap_values[0:50,:], X_train.iloc[0:50,:])

This is a really nice plot! It gives a great visual representation of how different values affect predictions. It also highlights certain rows to look into deeper using the single row view above.

### Feature Importance

In [None]:
# feature importance plot
shap.summary_plot(shap_values, X_train, plot_type="bar")

Ok, so the most important features in this model are `ord_3`, `ord_5` and `ord_2`, while the least important features are `bin_1` and `bin_4`. This is useful information, but let's use SHAP to go deeper.
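
As far as I understand it, the bar plot ranks features by the mean absolute SHAP value across rows. The sketch below reproduces that calculation on a small made-up matrix; `toy_shap` and its values are purely illustrative.

```python
import numpy as np

# Toy SHAP matrix: 4 rows x 3 features (values are made up)
feature_names = ["ord_3", "nom_7", "bin_1"]
toy_shap = np.array([
    [0.40, -0.30, 0.01],
    [-0.35, 0.10, -0.02],
    [0.25, -0.20, 0.01],
    [-0.30, 0.05, -0.01],
])

# Global importance = average magnitude of each feature's contributions
importance = np.abs(toy_shap).mean(axis=0)

# Sort features from most to least important
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]

for name, value in zip(feature_names, importance):
    print(f"{name}: {value:.4f}")
print(ranking)
```

Taking the absolute value matters: `ord_3` pushes some rows up and others down, so a plain mean would largely cancel out and understate its influence.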

In [None]:
# summarize the effects of all the features
shap.summary_plot(shap_values, X_train)

### So, what's going on here?
First, a quick recap as to what's going on in this plot.

The plot is an accumulation of many dots, one per row per feature. Each dot has three characteristics:
* Vertical location shows which feature it depicts
* Color shows whether that feature's value was high or low for that row of the dataset
* Horizontal location shows whether that value pushed the prediction higher or lower.


### Let's look at some features in particular.

#### ord_3

**Context:** This feature has 15 categories, the letters `a` to `o`. For the purposes of the model, they were ordinally encoded, with `a` becoming 0 up to `o` becoming 14.

This feature has the largest impact on the model and, interestingly, the higher the value, the more positive the contribution the feature made to the model prediction. So rows with the letters `m`, `n`, or `o` are more likely to be associated with target variable 1, while `a`, `b`, or `c` are more likely to be associated with target variable 0.
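
One way to check a trend like this numerically, rather than by eye, is to average the SHAP values at each encoded level. The sketch below uses invented numbers; the column name `shap_ord_3` is hypothetical.

```python
import pandas as pd

# Toy data: encoded ord_3 values and their SHAP values (made up)
toy = pd.DataFrame({
    "ord_3": [0, 0, 1, 7, 7, 13, 14, 14],
    "shap_ord_3": [-0.42, -0.38, -0.35, 0.02, -0.01, 0.33, 0.41, 0.39],
})

# Average contribution per encoded level, ordered by level
mean_by_level = toy.groupby("ord_3")["shap_ord_3"].mean().sort_index()
print(mean_by_level)
```

With the real objects, something along the lines of `pd.Series(shap_values[:, X_train.columns.get_loc('ord_3')]).groupby(X_train['ord_3'].values).mean()` should give the equivalent table, assuming `shap_values` is the 2-D array computed above.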

#### bin_1

**Context:** This is a binary feature of `0`s and `1`s.

Unlike `ord_3`, this feature has very little impact on the model's predictions. When this feature is 1, it has a slightly negative impact on predictions, i.e. they are more likely to be 0. Conversely, when this feature is 0, it has a slightly positive impact on predictions.

# Conclusion

I found this to be a very useful exercise in understanding the data better and I have a number of new avenues lined up to explore as a result.

Any and all comments are welcome!