# Interpret LightGBM Classifier with SHAP: Spaceship Titanic Dataset

## Table of Contents

1. [Data Preparation](#data-preparation)
   - [Import Data and Modules](#import-data-and-modules)
   - [Basic Exploratory Data Analysis (EDA)](#basic-exploratory-data-analysis)
   - [Data Cleaning](#data-cleaning)
   - [Feature Engineering](#feature-engineering)
   - [Advanced Exploratory Data Analysis (EDA)](#advanced-exploratory-data-analysis)
2. [Modeling](#modeling)
   - [Data Preprocessing for Modeling](#data-preprocessing-for-modeling)
   - [Model Training](#model-training)
   - [Model Evaluation](#model-evaluation)
3. [Interpretability](#interpretability)
   - [SHAP Analysis](#shap-analysis)

# 1) Data Preparation
<a id="data-preparation"></a>

## Import Data and Modules
<a id="import-data-and-modules"></a>

In [1]:
# base packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# modeling and evaluation
import lightgbm as lgb
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import shap
import os

In [2]:
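train = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
# test.csv has no Transported column, so the analysis below works on the
# labeled training rows only (8693 rows, 14 columns); test is kept separate.
df = train.copy()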

## Basic Exploratory Data Analysis (EDA)
<a id="basic-exploratory-data-analysis"></a>

#### 1. Data Description

> * **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.<br>
> * **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.<br>
> * **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.<br>
> * **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.<br>
> * **Destination** - The planet the passenger will be debarking to.<br>
> * **Age** - The age of the passenger.<br>
> * **VIP** - Whether the passenger has paid for special VIP service during the voyage.<br>
> * **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.<br>
> * **Name** - The first and last names of the passenger.<br>
> * **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.<br>

#### 2. View the Dataframe

In [3]:
df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


#### 3. View Data Structure

In [4]:
df.shape

(8693, 14)

#### 4. Check for Missing Values
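
A per-column count of missing values is a common first check. The sketch below is one way to do it; the helper name `missing_summary` and the toy frame are illustrative assumptions, and on the real data the call would be `missing_summary(df)`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for df (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "HomePlanet": ["Europa", None, "Earth", "Earth"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "VIP": [False, False, True, False],
})
missing_report = missing_summary(toy)
print(missing_report)
```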

## Data Cleaning
<a id="data-cleaning"></a>

In [5]:
df.shape

(8693, 14)
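
One simple way to handle missing values is sketched below on a toy frame. The strategy is an assumption rather than the notebook's confirmed method: zero for unbilled amenities, the median for `Age`, and the most frequent value for categorical columns. The helper name `fill_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
CATEGORICAL_COLS = ["HomePlanet", "Destination"]


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with simple imputations (an illustrative strategy)."""
    out = frame.copy()
    for col in SPEND_COLS:
        if col in out.columns:
            # Assumption: a missing bill means the passenger spent nothing
            out[col] = out[col].fillna(0.0)
    if "Age" in out.columns:
        out["Age"] = out["Age"].fillna(out["Age"].median())
    for col in CATEGORICAL_COLS:
        if col in out.columns:
            # Fill with the most frequent observed value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy frame standing in for df (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Earth", "Europa"],
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Spa": [np.nan, 5.0, 0.0, 1.0],
})
cleaned = fill_missing(toy)
print(cleaned)
```

Median and mode imputation are easy to reason about, but LightGBM can also handle missing numeric values natively, so imputing is a design choice rather than a requirement.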

## Feature Engineering
<a id="feature-engineering"></a>
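
The data description gives `PassengerId` the form gggg_pp and `Cabin` the form deck/num/side, so both can be split into separate columns. A minimal sketch follows, using the five rows shown by `df.head()`. The helper name `add_id_and_cabin_features` and the new column names (`Group`, `GroupSize`, `Deck`, `CabinNum`, `Side`) are illustrative choices, not part of the dataset.

```python
import pandas as pd


def add_id_and_cabin_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with group and cabin components as separate columns."""
    out = frame.copy()
    # PassengerId has the form gggg_pp: group id, then number within the group
    out["Group"] = out["PassengerId"].str.split("_").str[0]
    out["GroupSize"] = out.groupby("Group")["PassengerId"].transform("count")
    # Cabin has the form deck/num/side
    cabin_parts = out["Cabin"].str.split("/", expand=True)
    out["Deck"] = cabin_parts[0]
    out["CabinNum"] = pd.to_numeric(cabin_parts[1])
    out["Side"] = cabin_parts[2]
    return out


# The first five rows of the dataset, as displayed by df.head()
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S", "F/1/S"],
})
features = add_id_and_cabin_features(sample)
print(features)
```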

## Advanced Exploratory Data Analysis (EDA)
<a id="advanced-exploratory-data-analysis"></a>
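
A useful view at this stage is the share of transported passengers within each level of a categorical feature. The sketch below is one possible helper (the name `transported_rate_by` is hypothetical), applied to the five rows shown by `df.head()`. Five rows say nothing about the full dataset; they only demonstrate the call.

```python
import pandas as pd


def transported_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Share of transported passengers and row count for each level of `column`."""
    return (
        frame.groupby(column)["Transported"]
        .agg(rate="mean", n="count")
        .sort_values("rate", ascending=False)
    )


# The first five rows of the dataset, as displayed by df.head()
sample = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Europa", "Europa", "Earth"],
    "Transported": [False, True, False, False, True],
})
rates = transported_rate_by(sample, "HomePlanet")
print(rates)
```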

# 2) Modeling
<a id="modeling"></a>

## Data Preprocessing for Modeling
<a id="data-preprocessing-for-modeling"></a>
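
The imports at the top of the notebook suggest that categorical columns are label-encoded and the data is split with `train_test_split`. The sketch below shows that pattern on synthetic data. The column list, the 80/20 split, and the `random_state` are assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the cleaned dataframe (illustrative values only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], size=n),
    "Age": rng.integers(0, 80, size=n).astype(float),
    "Spa": rng.exponential(100.0, size=n),
})
toy["Transported"] = rng.random(n) < 0.5

X = toy.drop(columns="Transported")
y = toy["Transported"].astype(int)

# Encode each categorical column as integer codes, keeping the fitted
# encoder so the same mapping can be reused or inverted later
CATEGORICAL_COLS = ["HomePlanet"]
encoders = {}
for col in CATEGORICAL_COLS:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

# Stratify so both splits keep the same class balance
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_valid.shape)
```

`LabelEncoder` imposes an arbitrary integer order on the categories. Tree-based models such as LightGBM usually tolerate this, although converting the columns to the pandas `category` dtype is an alternative.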

## Model Training
<a id="model-training"></a>

## Model Evaluation
<a id="model-evaluation"></a>
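
The notebook imports `accuracy_score` and `classification_report`, which compare held-out labels with predictions. The sketch below uses small hand-written label lists in place of real model output; in practice `y_pred` would come from the fitted classifier's `predict` on the validation features.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hand-written labels standing in for validation targets and predictions
y_valid = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Fraction of predictions that match the true label
accuracy = accuracy_score(y_valid, y_pred)

# Per-class precision, recall and F1 as a formatted text table
report = classification_report(
    y_valid, y_pred, target_names=["Not transported", "Transported"]
)
print(f"Accuracy: {accuracy:.2f}")
print(report)
```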

# 3) Interpretability
<a id="interpretability"></a>

## SHAP Analysis
<a id="shap-analysis"></a>