# Interpret LightGBM Classifier with SHAP: Spaceship Titanic Dataset

## Introduction

This notebook shows how to use SHAP (SHapley Additive exPlanations) to interpret a gradient-boosted model, specifically a LightGBM classifier. The goal is to make the model's predictions easier to explain, both to yourself and to stakeholders. SHAP provides a unified framework for attributing a model's output to its input features, so you can see how much each feature contributed to a given prediction.

## Table of Contents

1. [Data Preparation](#data-preparation)
   - [Import Data and Modules](#import-data-and-modules)
   - [Feature Engineering](#feature-engineering)
   - [Data Cleaning](#data-cleaning)
2. [Modeling](#modeling)
   - [Data Preprocessing for Modeling](#data-preprocessing-for-modeling)
   - [Model Training](#model-training)
   - [Model Evaluation](#model-evaluation)
   - [Submit Predictions](#submit-predictions)
3. [Interpretability](#interpretability)
   - [SHAP Analysis](#shap-analysis)

# 1) Data Preparation -------------------------
<a id="data-preparation"></a>

## Data Description

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

> * **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.<br>
> * **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.<br>
> * **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.<br>
> * **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.<br>
> * **Destination** - The planet the passenger will be debarking to.<br>
> * **Age** - The age of the passenger.<br>
> * **VIP** - Whether the passenger has paid for special VIP service during the voyage.<br>
> * **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.<br>
> * **Name** - The first and last names of the passenger.<br>
> * **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.<br>

## Import Data and Modules
<a id="import-data-and-modules"></a>

In [1]:
# base packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# modeling and evaluation
import lightgbm as lgb
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import shap
import os

In [2]:
train = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
train['set'] = "train"
test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
test['set'] = "test"
# ignore_index=True gives the combined frame a unique row index (train and test both start at 0)
df = pd.concat([train, test], ignore_index=True)

## Feature Engineering
<a id="feature-engineering"></a>

### 1. Create Group Size Feature Using Passenger ID Groupings

In [3]:
# The first four characters of PassengerId (gggg) identify the travel group
df['PassengerGroup'] = df['PassengerId'].str[:4]

# Number of passengers sharing each group ID, broadcast back to every row
df['GroupSize'] = df.groupby('PassengerGroup')['PassengerId'].transform('count')

In [4]:
df['GroupSize'].value_counts()

GroupSize
1    7145
2    2590
3    1506
4     616
5     380
7     329
6     252
8     152
Name: count, dtype: int64
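
To make the `transform('count')` step concrete, here is a minimal sketch on a hypothetical six-passenger frame (the IDs are made up for illustration). It also shows that `value_counts()` on `GroupSize` counts passengers rather than groups, so the 2590 in the output above means 2590 passengers travelling in pairs, not 2590 pairs.

```python
import pandas as pd

# Hypothetical IDs: one solo traveller, one pair, one group of three
toy = pd.DataFrame({"PassengerId": ["0001_01", "0003_01", "0003_02",
                                    "0006_01", "0006_02", "0006_03"]})

toy["PassengerGroup"] = toy["PassengerId"].str[:4]
toy["GroupSize"] = toy.groupby("PassengerGroup")["PassengerId"].transform("count")

# Rows (passengers) per group size, as in the notebook's value_counts() call
passengers_per_size = toy["GroupSize"].value_counts().sort_index()

# Distinct groups per group size: keep one row per group before counting
groups_per_size = (toy.drop_duplicates("PassengerGroup")["GroupSize"]
                      .value_counts().sort_index())

print(toy)
print(passengers_per_size.to_dict())
print(groups_per_size.to_dict())
```

Dividing each passenger count by its group size gives the number of groups of that size.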

### 2. Feature Engineer Cabin Data

#### 2.1) Break Out Cabin Into Deck / Number / Side

In [5]:
def split_cabin(cabin):
    if pd.isna(cabin):
        return pd.Series([None, None, None])
    parts = cabin.split('/')
    deck = parts[0]
    number = parts[1]
    side = 'Port' if parts[2] == 'P' else 'Starboard'
    return pd.Series([deck, number, side])

df[['cabin_deck', 'cabin_number', 'cabin_side']] = df['Cabin'].apply(split_cabin)
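
A vectorized alternative to the row-wise `apply` could look like the sketch below. It is not the notebook's method, and the three cabin strings are hypothetical. One behavioural difference is worth noting: the `if/else` in `split_cabin` labels anything other than `P` as Starboard, whereas `.map` leaves unexpected side codes as missing.

```python
import pandas as pd

# Hypothetical cabins, including one missing value
cabins = pd.Series(["B/0/P", None, "F/1/S"], name="Cabin")

# expand=True returns one column per '/'-separated part; missing cabins stay missing
parts = cabins.str.split("/", expand=True)
parts.columns = ["cabin_deck", "cabin_number", "cabin_side"]

# Translate the side code; codes outside the mapping become NaN
parts["cabin_side"] = parts["cabin_side"].map({"P": "Port", "S": "Starboard"})

print(parts)
```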

#### 2.2) Bin Cabin Number

In [6]:
# Convert the cabin_number column to numeric
df['cabin_number'] = pd.to_numeric(df['cabin_number'], errors='coerce')

# Convert the maximum value to an integer
max_cabin_number = int(df['cabin_number'].max())

# Bin the cabin_number column into groups of 300 starting at 0
bins = range(0, max_cabin_number + 300, 300)
labels = range(1, len(bins))  # Numeric labels starting from 1
df['cabin_region'] = pd.cut(df['cabin_number'], bins=bins, labels=labels, right=False)
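
The binning rule can be checked on a few hypothetical cabin numbers. With `right=False` each bin is closed on the left, so cabin 300 falls in region 2, not region 1. The sketch also shows an edge case of the same formula: if the largest cabin number is an exact multiple of 300, the last bin edge equals that number and the cabin ends up unbinned. Whether this affects the real data depends on the actual maximum, which the notebook does not print.

```python
import pandas as pd

# Hypothetical cabin numbers, including a missing one
nums = pd.Series([0, 299, 300, 1894, None], dtype="float")

max_n = int(nums.max())
bins = list(range(0, max_n + 300, 300))    # 0, 300, ..., 2100
labels = list(range(1, len(bins)))         # 1 .. 7
region = pd.cut(nums, bins=bins, labels=labels, right=False)
print(region.tolist())

# Edge case: a maximum of exactly 600 yields edges [0, 300, 600],
# and 600 is outside the half-open interval [300, 600)
edge_bins = list(range(0, 600 + 300, 300))
edge = pd.cut(pd.Series([600.0]), bins=edge_bins,
              labels=list(range(1, len(edge_bins))), right=False)
print(edge.tolist())
```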

#### 2.3) Deck Level

In [7]:
# Mapping deck levels to their respective categories
deck_level_mapping = {
    'A': 'level_low',
    'B': 'level_low',
    'C': 'level_low',
    'D': 'level_mid',
    'E': 'level_mid',
    'F': 'level_mid',
    'G': 'level_top',
    'T': 'level_top'
}

# Apply the mapping to create the cabin_deck_level feature
df['cabin_deck_level'] = df['cabin_deck'].map(deck_level_mapping)

In [8]:
df[['cabin_deck_level', 'cabin_side', 'cabin_region']].head()

|   | cabin_deck_level | cabin_side | cabin_region |
|---|---|---|---|
| 0 | level_low | Port | 1 |
| 1 | level_mid | Starboard | 1 |
| 2 | level_low | Starboard | 1 |
| 3 | level_low | Starboard | 1 |
| 4 | level_mid | Starboard | 1 |


### 3. Convert Age to Adult Feature

In [9]:
# True for passengers older than 18; missing ages compare as False
df['adult'] = df['Age'] > 18

In [10]:
# Count adults vs. non-adults
df['adult'].value_counts()

adult
True     9940
False    3030
Name: count, dtype: int64
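
Two details of the `> 18` rule are easy to miss: an 18-year-old is not flagged as an adult, and a missing age silently becomes `False`. The sketch below demonstrates both on hypothetical ages. The nullable variant at the end is an assumption about what one might prefer, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value
ages = pd.Series([17.0, 18.0, 19.0, np.nan], name="Age")

# Same rule as the notebook: strictly older than 18
adult = ages > 18

# Optional variant (assumption): keep missing ages missing instead of False
adult_nullable = (ages > 18).astype("boolean").mask(ages.isna())

print(adult.tolist())
print(adult_nullable.tolist())
```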

## Data Cleaning
<a id="data-cleaning"></a>

### 1. Drop Columns

In [11]:
drop_cols = ['Cabin', 'cabin_number','cabin_deck', 'Age','Name', 'PassengerGroup']

df.drop(columns=drop_cols, inplace=True)

In [12]:
df.head()

|   | PassengerId | HomePlanet | CryoSleep | Destination | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | set | GroupSize | cabin_side | cabin_region | cabin_deck_level | adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | TRAPPIST-1e | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | train | 1 | Port | 1 | level_low | True |
| 1 | 0002_01 | Earth | False | TRAPPIST-1e | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | train | 1 | Starboard | 1 | level_mid | True |
| 2 | 0003_01 | Europa | False | TRAPPIST-1e | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | train | 2 | Starboard | 1 | level_low | True |
| 3 | 0003_02 | Europa | False | TRAPPIST-1e | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | train | 2 | Starboard | 1 | level_low | True |
| 4 | 0004_01 | Earth | False | TRAPPIST-1e | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | train | 1 | Starboard | 1 | level_mid | False |


### 2. Deal with Missing Values

#### 2.1) Identify Missing Values

In [13]:
missing_values = df.drop(columns=['Transported']).isnull().sum()

# Filter columns with missing values
missing_values = missing_values[missing_values > 0]

# Display the columns with missing values and their counts
print("Columns with missing values (excluding 'Transported'):")
print(missing_values)

Columns with missing values (excluding 'Transported'):
HomePlanet          288
CryoSleep           310
Destination         274
VIP                 296
RoomService         263
FoodCourt           289
ShoppingMall        306
Spa                 284
VRDeck              268
cabin_side          299
cabin_region        299
cabin_deck_level    299
dtype: int64


#### 2.2) Impute Continuous Values

In [14]:
# Continuous columns
continuous_columns = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Impute missing values with the median for continuous columns
for column in continuous_columns:
    median_value = df[column].median()
    df[column] = df[column].fillna(median_value)

#### 2.3) Impute Categorical Values

In [15]:
# Define the list of categorical columns
categorical_columns = ['HomePlanet', 'CryoSleep', 'Destination', 
                       'VIP', 'cabin_deck_level', 'cabin_side', 'cabin_region', 'adult']

# Impute missing values with the mode for categorical columns
for column in categorical_columns:
    mode_value = df[column].mode()[0]
    df[column] = df[column].fillna(mode_value).infer_objects(copy=False)



# 2) Modeling -------------------------
<a id="modeling"></a>

## Data Preprocessing for Modeling
<a id="data-preprocessing-for-modeling"></a>

### 1. Encode Categorical Features

In [16]:
label_encoders = {}
categorical_features = ['HomePlanet', 'CryoSleep', 'Destination', 
                       'VIP', 'cabin_deck_level', 'cabin_side', 'cabin_region', 'adult']

for col in categorical_features:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
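
Because the encoded integers are what will later appear on SHAP plots, it helps to be able to translate them back. The sketch below uses a hypothetical `HomePlanet` column to show how a fitted `LabelEncoder` exposes the mapping: `classes_` is sorted, and the position of each label is its integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values standing in for one categorical column
planets = pd.Series(["Europa", "Earth", "Mars", "Earth"], name="HomePlanet")

le = LabelEncoder()
codes = le.fit_transform(planets)

# Label -> integer code, read off the sorted classes_ array
mapping = {label: code for code, label in enumerate(le.classes_)}

# Integer codes -> original labels
decoded = le.inverse_transform(codes)

print(mapping)
print(list(decoded))
```

In the notebook, the same lookup would go through the stored encoders, for example `label_encoders['HomePlanet'].classes_`.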

### 2. Split the Data into Training, Validation, and Test Sets

In [17]:
# Separate the labelled training rows from the unlabelled test rows
train_df = df[df['set'] == 'train']
test_df = df[df['set'] == 'test']

X = train_df.drop(columns=['Transported', 'set', 'PassengerId'])
y = train_df['Transported'].astype(int)

In [18]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
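
The row counts reported further down (6954 rows when LightGBM fits the final model, 1739 rows of support in the classification report, 4636 rows per cross-validation fit) all follow from this split. The short check below reproduces them with plain arithmetic, using only numbers that appear in the notebook's output.

```python
import math

n_fit = 6954   # rows LightGBM reports when fitting the final model
n_val = 1739   # total support in the classification report
n_total = n_fit + n_val

# With a float test_size, train_test_split rounds the validation share up
expected_val = math.ceil(0.2 * n_total)

# With cv=3, each cross-validation fit trains on two thirds of the 6954 rows
rows_per_cv_fit = n_fit - n_fit // 3

print(n_total, expected_val, rows_per_cv_fit)
```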

## Model Training
<a id="model-training"></a>

In [19]:
# Define the parameter grid
param_grid = {
    'num_leaves': [31, 50],
    'min_data_in_leaf': [20, 50],
    'max_depth': [5, 10],
    'learning_rate': [0.01, 0.05],
    'n_estimators': [50, 100],
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'metric': ['binary_logloss'],
}
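
Before launching the search, it can be useful to count how many models the grid implies. The sketch below repeats the grid and multiplies out the option counts; the fold count of 3 matches the `cv=3` used in the next cell. It also illustrates one interaction between two of the parameters: a tree limited to depth 5 can have at most 2^5 = 32 leaves, so at that depth `num_leaves=50` can never be reached and behaves almost like `num_leaves=31`.

```python
import math

param_grid = {
    'num_leaves': [31, 50],
    'min_data_in_leaf': [20, 50],
    'max_depth': [5, 10],
    'learning_rate': [0.01, 0.05],
    'n_estimators': [50, 100],
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'metric': ['binary_logloss'],
}

# Number of distinct parameter combinations in the grid
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold (cv=3 in the notebook)
n_cv_fits = n_candidates * 3

# Upper bound on leaves for each depth limit in the grid
max_leaves_at_depth = {depth: 2 ** depth for depth in param_grid['max_depth']}

print(n_candidates, n_cv_fits, max_leaves_at_depth)
```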

In [20]:
# Initialize LightGBM classifier
lgb_model = lgb.LGBMClassifier()

In [21]:
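# Perform grid search with 3-fold cross-validation
# (GridSearchCV expects a non-negative `verbose`; -1 is a LightGBM convention, so 0 is used here)
grid_search = GridSearchCV(estimator=lgb_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=0)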

# Fit grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters found: {best_params}")
print(f"Best cross-validation score: {best_score}")

[LightGBM] [Info] Number of positive: 2333, number of negative: 2303
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.012931 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1308
[LightGBM] [Info] Number of data points in the train set: 4636, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503236 -> initscore=0.012942
[LightGBM] [Info] Start training from score 0.012942

## Model Evaluation
<a id="model-evaluation"></a>

In [22]:
# Train the final model with the best parameters
best_lgb_model = lgb.LGBMClassifier(**best_params)
best_lgb_model.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 3500, number of negative: 3454
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001632 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1308
[LightGBM] [Info] Number of data points in the train set: 6954, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503307 -> initscore=0.013230
[LightGBM] [Info] Start training from score 0.013230


In [23]:
# Predictions
y_pred = best_lgb_model.predict(X_val)

# Evaluation metrics
accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.7975848188614146
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.76      0.79       861
           1       0.78      0.84      0.81       878

    accuracy                           0.80      1739
   macro avg       0.80      0.80      0.80      1739
weighted avg       0.80      0.80      0.80      1739
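
The reported accuracy can be turned back into counts to make it more tangible. Using only the numbers printed above, the check below recovers how many validation passengers were classified correctly and compares the model against always predicting the larger class.

```python
accuracy = 0.7975848188614146          # as printed above
support = {0: 861, 1: 878}             # per-class support from the report

n_val = sum(support.values())
n_correct = round(accuracy * n_val)    # correctly classified passengers
n_wrong = n_val - n_correct

# Accuracy of always predicting the majority class
majority_baseline = max(support.values()) / n_val

print(n_val, n_correct, n_wrong, round(majority_baseline, 3))
```

The classes are almost balanced, so the majority-class baseline sits near 0.5 and the model's accuracy of roughly 0.80 is a substantial improvement over it.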



## Submit Predictions
<a id="submit-predictions"></a>
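
A submission step could be sketched as follows, assuming the competition expects a two-column CSV with `PassengerId` and a boolean `Transported`. The helper name `build_submission`, the stub model, and the toy rows are hypothetical and exist only to keep the example self-contained; in the notebook, the model would be `best_lgb_model` and the frame would be `test_df`.

```python
import numpy as np
import pandas as pd

def build_submission(model, test_df, feature_cols):
    """Predict on the test rows and return a PassengerId/Transported frame."""
    preds = model.predict(test_df[feature_cols])
    return pd.DataFrame({
        "PassengerId": test_df["PassengerId"].to_numpy(),
        "Transported": np.asarray(preds).astype(bool),  # 0/1 -> False/True
    })

class StubModel:
    """Hypothetical stand-in for a fitted classifier."""
    def predict(self, X):
        return (X["CryoSleep"] == 1).astype(int).to_numpy()

# Toy test rows with the same bookkeeping columns as the notebook's df
toy_test = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01"],
    "CryoSleep": [1, 0],
    "Transported": [np.nan, np.nan],
    "set": ["test", "test"],
})

# Same exclusions as when X was built for training
feature_cols = [c for c in toy_test.columns
                if c not in ("Transported", "set", "PassengerId")]

submission = build_submission(StubModel(), toy_test, feature_cols)
print(submission)
# submission.to_csv("submission.csv", index=False)  # file name is an assumption
```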

# 3) Interpretability -------------------------
<a id="interpretability"></a>

## SHAP Analysis
<a id="shap-analysis"></a>