<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h1 style="color:rgb(228, 12, 33); text-align:center;"> 🛸 Among the Elite: Top 200 Spaceship Titanic Survival Analysis 📈 </h1>

<p style="text-align:center;"><img src="https://free4kwallpapers.com/uploads/originals/2021/07/23/spaceship-wallpaper.jpg" height="300" /></p>

<p style="color:rgb(68, 68, 76);">Hello Kaggle community!</p>

<p style="color:rgb(68, 68, 76);">Welcome to my notebook, where I set out to predict the fate of the passengers aboard the Spaceship Titanic. This competition challenges us to build a machine learning model that predicts which passengers were transported during the spaceship disaster, as recorded in the <code>Transported</code> column.</p>

<p style="color:rgb(68, 68, 76);">Our journey begins with an exploratory data analysis (EDA) to uncover patterns and correlations in the data. During this stage, we also identify and handle outliers that might skew our predictions. A crucial part of this process is dealing with missing data: every column with gaps is imputed so that the dataset is complete and ready for modeling.</p>

<p style="color:rgb(68, 68, 76);">In the pursuit of creating a powerful model, we will venture into the realm of feature engineering. Each feature will be meticulously analyzed, and we'll create new ones based on our insights to enhance the predictive power of our model.</p>

<p style="color:rgb(68, 68, 76);">At its core, this is a binary classification problem: predicting <code>Transported</code> as '1' or '0'. To tackle it, we will employ a trio of gradient-boosting models: CatBoost, XGBoost, and LightGBM. These models have shown strong results in many binary classification tasks, and we're optimistic about their performance in this competition.</p>

<p style="color:rgb(172, 28, 44);">This notebook is a testament to the power of meticulous data analysis, innovative feature engineering, and the combined strength of CatBoost, XGBoost, and LightGBM. It showcases the journey that led us to rank within the <b>Top 200 on the leaderboard</b>.</p>

<p style="color:rgb(228, 12, 33);">I invite you to join me on this journey of exploration, analysis, and prediction. Let's dive in!</p>

</div>


<a id="ToC"></a>
# Table of Contents
- [1. Imports](#1)
- [2. EDA](#2)
- [3. Feature Engineering](#3)
    - [Group size - PassengerId](#2.1)
    - [HomePlanet](#2.2)
    - [CryoSleep](#2.3)
    - [Cabin](#2.4)
        - [Deck-Cabin](#2.5)
        - [Num-Cabin](#2.6)
        - [Side-Cabin](#2.7)
    - [Destination](#2.8)
    - [Age](#2.9)
    - [VIP](#2.10)
    - [RoomService, FoodCourt, ShoppingMall, Spa, VRDeck](#2.11)
    - [Name](#2.12)
    - [Categorical Features](#2.13)
- [4. Model](#4)    
- [5. Evaluation](#5)
- [6. Hyper Parameter Tuning](#6)
- [7. Submission](#7)

<a id="1"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span> Imports </span></center></div>**

In [None]:
import warnings
warnings.filterwarnings("ignore")

import os 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set_style('white')

from tqdm import tqdm, tqdm_notebook

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     RepeatedStratifiedKFold, GridSearchCV)
from sklearn.metrics import (accuracy_score, classification_report, roc_curve, auc,
                             precision_recall_fscore_support, confusion_matrix,
                             ConfusionMatrixDisplay)

# HyperParameter 
import optuna

# model 
import lightgbm as lgb
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 100)
pd.options.display.float_format = '{:.2f}'.format

- Colors for my notebook

In [None]:
custom_colors = [
    (100/255, 108/255, 116/255),   # nevada
    (228/255, 12/255, 33/255),     # red-ribbon
    (68/255, 68/255, 76/255),      # abbey
    (172/255, 28/255, 44/255),     # roof-terracotta 
]
custom_palette = sns.color_palette(custom_colors)

In [None]:
custom_palette

In [None]:
FILE_PATH = "/kaggle/input/spaceship-titanic"
train_df = pd.read_csv(FILE_PATH+'/train.csv')
train_df['Transported'] = train_df['Transported'].astype(int)
train_df

In [None]:
test_df = pd.read_csv(FILE_PATH+'/test.csv')
test_df

<a id="2"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span>EDA</span></center></div>**

In [None]:
df = pd.concat([train_df, test_df], axis=0)
df.describe()

In [None]:
def summary(df):
    print(f"Dataset has {df.shape[1]} features and {df.shape[0]} examples.")
    summary = pd.DataFrame(index=df.columns)
    summary["Unique"] = df.nunique().values
    summary["Missing"] = df.isnull().sum().values
    summary["Duplicated"] = df.duplicated().sum()
    summary["Types"] = df.dtypes
    return summary

summary(df)

In [None]:
plt.figure(figsize=(6,6))

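counts = df['Transported'].value_counts()
# derive the labels from the counted values so each slice gets the right name
labels = ['True' if v == 1 else 'False' for v in counts.index]
plt.pie(counts.values, labels=labels, colors=custom_palette, autopct='%.0f%%')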
plt.show()

<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<p style="color:rgb(68, 68, 76);">Transported exhibits a balanced distribution of true and false classes.</p>
</div>

<a id="3"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span>Feature Engineering</span></center></div>**

<a id="2.1"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">Group size - PassengerId</h2>

<p style="color:rgb(68, 68, 76);">A unique Id for each passenger. Each Id takes the form <code>gggg_pp</code> where <code>gggg</code> indicates a group the passenger is travelling with and <code>pp</code> is their number within the group. People in a group are often family members, but not always.</p>

<p style="color:rgb(172, 28, 44);">From this PassengerId, we are creating a new feature called 'Group Size'. This feature will reflect the size of the group that each passenger is traveling with, providing us with additional insights for our predictive models.</p>

</div>


In [None]:
group = df['PassengerId'].apply(lambda x: x.split('_')[0]).value_counts().to_dict()

In [None]:
df['Group_size'] = df['PassengerId'].apply(lambda x: group[x.split('_')[0]])

In [None]:
df.set_index('PassengerId', inplace=True)
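
The group-size step above can also be written with vectorized string methods and a `groupby`. The following is a minimal sketch on a made-up toy frame (the IDs and the `toy` name are illustrative, not taken from the competition data):

```python
import pandas as pd

# Toy frame with the same gggg_pp ID layout (values are made up for illustration).
toy = pd.DataFrame(
    {"PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01", "0003_02", "0003_03"]}
)

# The group key is everything before the underscore.
toy["Group"] = toy["PassengerId"].str.split("_").str[0]

# Count the members of each group and broadcast the count back to every row.
toy["Group_size"] = toy.groupby("Group")["PassengerId"].transform("count")

print(toy)
```

This should give the same counts as the dictionary lookup used above, without a Python-level `apply`.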

<a id="2.2"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">HomePlanet</h2>

<p style="color:rgb(68, 68, 76);">The planet the passenger departed from, typically their planet of permanent residence.</p>

<p style="color:rgb(172, 28, 44);">Our dataset contains 288 missing values for 'HomePlanet'. Instead of filling these missing values with a new token like 'Unknown', which would make up a small proportion of the data and could potentially skew our model's predictions, we will opt for a different strategy. We will fill these missing values in a way that preserves the existing distribution of 'HomePlanet' values. This method ensures that our data remains representative and our model robust.</p>

</div>


In [None]:
tmp = df['HomePlanet'].value_counts()
tmp 

In [None]:
# creating probability distribution for each planet
v = tmp.index # ['Earth', 'Europa', 'Mars']

p = tmp.values 
p = p/sum(p)
p

In [None]:
df.loc[df['HomePlanet'].isna(), 'HomePlanet'] = np.random.choice(v, df['HomePlanet'].isna().sum(), p=p)
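
As a hedged sketch of the same idea, the helper below fills missing categories by sampling from the observed frequencies, with a seeded generator so the fill is reproducible across runs. The function name `fill_by_observed_distribution`, the seed, and the toy values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def fill_by_observed_distribution(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill NaNs by sampling from the column's observed category frequencies."""
    rng = np.random.default_rng(seed)             # seeded so reruns give the same fill
    freqs = series.value_counts(normalize=True)   # NaNs are excluded by default
    missing = series.isna()
    filled = series.copy()
    filled[missing] = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy()
    )
    return filled


# Toy column (made-up values) with two gaps.
toy_planets = pd.Series(["Earth", "Earth", "Europa", None, "Mars", None, "Earth"])
filled_planets = fill_by_observed_distribution(toy_planets)
print(filled_planets.tolist())
```

Without a seed, `np.random.choice` gives a different fill on every run, which makes scores harder to compare between runs.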

<a id="2.3"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">CryoSleep</h2>

<p style="color:rgb(68, 68, 76);">Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.</p>

<p style="color:rgb(172, 28, 44);">In our dataset, we have 310 missing values for 'CryoSleep'. To handle these missing data, we will be assigning a value of 'False'. This is based on the assumption that if a passenger's CryoSleep status is not recorded, they were likely not in cryosleep.</p>

<p style="color:rgb(172, 28, 44);">Furthermore, to make this feature more compatible with our machine learning models, we will be converting the boolean values {True, False} to integers {1, 0}. This conversion will allow our model to process this information more effectively.</p>

</div>


In [None]:
df['CryoSleep'].value_counts()

In [None]:
# missing CryoSleep is treated as False (not in cryosleep), then cast to {1, 0}
df['CryoSleep'] = df['CryoSleep'].fillna(False).astype(int)

<a id="2.4"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">Cabin</h2>

<p style="color:rgb(68, 68, 76);">The cabin number where the passenger is staying. Takes the form <code>deck/num/side</code>, where <code>side</code> can be either <code>P</code> for Port or <code>S</code> for Starboard.</p>

<p style="color:rgb(172, 28, 44);">In our dataset, 'Cabin' has 299 missing values. Rather than just filling these values, we've decided to take an extra step to extract more information from the existing data.</p>

<p style="color:rgb(172, 28, 44);">We will split the 'Cabin' feature into three new features: 'Deck', 'Cabin Number', and 'Cabin Side'. This approach allows us to preserve and utilize as much information as possible from the 'Cabin' feature, thereby enriching our dataset and enhancing our model's predictive power.</p>

</div>


In [None]:
df[['Cabin']].sample(5)

In [None]:
tmp = df['Cabin'].apply(lambda x: x.split('/') if type(x) != float else ['-1', '-1', '-1']).to_list()
tmp = np.array(tmp)

In [None]:
df['Cabin_deck'] = tmp[:, 0]
df['Cabin_num'] = tmp[:, 1]
df['Cabin_side'] = tmp[:, 2]
df.drop(columns='Cabin', inplace=True)
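
For comparison, here is a minimal sketch of the same split using pandas string methods. It is an alternative to the `apply`/NumPy approach above, shown on made-up cabin strings, and it keeps missing cabins as `NaN` rather than a `'-1'` sentinel:

```python
import pandas as pd

# Toy cabins in the deck/num/side layout (values are made up for illustration).
toy_cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", None, "G/734/S"]})

# expand=True returns one column per piece; a missing Cabin stays missing in all three.
parts = toy_cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["Cabin_deck", "Cabin_num", "Cabin_side"]

# Cabin_num arrives as text, so convert it to a number; missing entries become NaN.
parts["Cabin_num"] = pd.to_numeric(parts["Cabin_num"])

toy_cabins = pd.concat([toy_cabins.drop(columns="Cabin"), parts], axis=1)
print(toy_cabins)
```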

<a id="2.5"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h3 style="color:rgb(228, 12, 33);">Deck-Cabin</h3>

<p style="color:rgb(68, 68, 76);">For the missing values in the 'Deck' feature, instead of filling them with a common token, we fill them at random with one of the two most common decks, 'F' and 'G', each with probability 0.5. This keeps the imputed rows within the most frequent decks without assigning all of them to a single deck.</p>

</div>


In [None]:
df.loc[df['Cabin_deck']=='-1', 'Cabin_deck'] = np.random.choice(['F', 'G'], sum(df['Cabin_deck']=='-1'), 
                                                              p=[0.5, 0.5])

In [None]:
df['Cabin_deck'].value_counts()

<a id="2.6"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h3 style="color:rgb(228, 12, 33);">Num-Cabin</h3>

<p style="color:rgb(68, 68, 76);">For the missing values in the 'Num-Cabin' feature, we will be using a mean imputation strategy. This involves replacing missing values with the mean value of the available data. This is a reliable method when the data is normally distributed or when the amount of missing data is relatively small.</p>

<p style="color:rgb(68, 68, 76);">It's interesting to note that we have '1895' unique values in 'Num-Cabin', which offers a rich variety of data for our model to learn from.</p>

</div>


In [None]:
df['Cabin_num'].nunique()

In [None]:
df['Cabin_num'] = df['Cabin_num'].astype(int)

In [None]:
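# Cabin_num is already an integer column, so the sentinel is -1 (not the string '-1');
# the mean is taken over the known cabin numbers only
missing_num = df['Cabin_num'] == -1
df.loc[missing_num, 'Cabin_num'] = int(df.loc[~missing_num, 'Cabin_num'].mean())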

In [None]:
sns.histplot(df['Cabin_num'].astype(int), bins=30, kde=False, color=custom_colors[1])
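
A small sketch of mean imputation with a sentinel value, on made-up numbers: the point is that the sentinel rows have to be excluded when computing the mean, otherwise the `-1` entries pull it down. The variable names are illustrative.

```python
import pandas as pd

# Toy cabin numbers; -1 marks a missing value (made-up data).
toy_num = pd.Series([0, 12, -1, 734, -1, 100])

is_missing = toy_num == -1

# Average only the real cabin numbers so the sentinel does not distort the mean.
mean_known = int(toy_num[~is_missing].mean())

# Replace the sentinel rows with the truncated mean.
toy_num_filled = toy_num.mask(is_missing, mean_known)
print(toy_num_filled.tolist())
```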

<a id="2.7"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h3 style="color:rgb(228, 12, 33);">Side - Cabin</h3>

<p style="color:rgb(68, 68, 76);">For the missing values in the 'Side' feature, our strategy is to fill them by randomly assigning the values 'S' (Starboard) and 'P' (Port). This method is chosen to avoid any bias that could arise from filling missing values with a single side.</p>

<p style="color:rgb(68, 68, 76);">Furthermore, we will convert the string values in the 'Side' feature to integers {0, 1}. This conversion is necessary to provide a numerical format which is more suitable for our machine learning models.</p>

</div>


In [None]:
df.loc[df['Cabin_side']=='-1', 'Cabin_side'] = np.random.choice(['S', 'P'], sum(df['Cabin_side']=='-1'), 
                                                              p=[0.5, 0.5])

In [None]:
df['Cabin_side'] = df['Cabin_side'].map({'S':0, 'P':1})

In [None]:
df['Cabin_side'].value_counts()

<a id="2.8"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">Destination</h2>

<p style="color:rgb(68, 68, 76);">The planet the passenger will be debarking to.</p>

<p style="color:rgb(172, 28, 44);">In our dataset, 'Destination' has 274 missing values. We handle them by randomly assigning one of the three destinations with fixed probabilities of 0.5, 0.3, and 0.2, weighted toward the more common destinations. This avoids pushing every missing row onto a single planet.</p>

</div>


In [None]:
df['Destination'].value_counts()

In [None]:
df.loc[df['Destination'].isna(), 'Destination'] = np.random.choice(['TRAPPIST-1e', '55 Cancri e', 'PSO J318.5-22'], 
                                                                  sum(df['Destination'].isna()), 
                                                                  p=[0.5, 0.3, 0.2])

<a id="2.9"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">Age</h2>

<p style="color:rgb(68, 68, 76);">The age of the passenger.</p>

<p style="color:rgb(172, 28, 44);">In our dataset, 'Age' has 270 missing values. To handle these missing values, we have decided to fill them with a uniform distribution within one standard deviation of the mean age. This strategy helps maintain the overall distribution of the 'Age' feature and prevents the model from being biased towards a specific age range.</p>

</div>


In [None]:
sns.histplot(df['Age'], bins=30, kde=False, color=custom_colors[1])

In [None]:
mean_age = df["Age"].mean()
std_age = df["Age"].std()
is_null = df["Age"].isnull().sum()
rand_sample = np.random.uniform(mean_age - std_age, mean_age + std_age, size = is_null)
df.loc[df['Age'].isna(), 'Age'] = rand_sample
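
The age fill above can be sketched on a toy column as follows. The seed and the sample ages are assumptions chosen for illustration; the technique is the same uniform draw within one standard deviation of the mean:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # the seed value is arbitrary
toy_age = pd.Series([22.0, 35.0, np.nan, 41.0, 18.0, np.nan, 60.0])

# Mean and standard deviation are computed on the observed ages only (NaNs are skipped).
mean_age, std_age = toy_age.mean(), toy_age.std()
missing = toy_age.isna()

# Draw one value per missing entry from [mean - std, mean + std].
toy_age_filled = toy_age.copy()
toy_age_filled[missing] = rng.uniform(
    mean_age - std_age, mean_age + std_age, size=int(missing.sum())
)
print(toy_age_filled.round(1).tolist())
```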

<a id="2.10"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">VIP</h2>

<p style="color:rgb(68, 68, 76);">Whether the passenger has paid for special VIP service during the voyage.</p>

<p style="color:rgb(172, 28, 44);">In our dataset, 'VIP' has 296 missing values. To handle these missing values, we have decided to fill them with 'False'. This is based on the assumption that if a passenger's VIP status is not recorded, they were likely not a VIP.</p>

<p style="color:rgb(172, 28, 44);">Furthermore, to make this feature more compatible with our machine learning models, we will be converting the boolean values {True, False} to integers {1, 0}. This conversion will allow our model to process this information more effectively.</p>

</div>


In [None]:
df['VIP'].value_counts()

In [None]:
# missing VIP is treated as False, then cast to {1, 0}
df['VIP'] = df['VIP'].fillna(False).astype(int)

<a id="2.11"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">RoomService, FoodCourt, ShoppingMall, Spa, VRDeck</h2>

<p style="color:rgb(68, 68, 76);">These represent the amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.</p>

<p style="color:rgb(172, 28, 44);">We have decided to create a new feature, 'total_spending', which combines all the spending across these amenities. This gives us a holistic view of a passenger's total expenditure on board.</p>

<p style="color:rgb(172, 28, 44);">As the data in these features is highly skewed, we will fill the missing values with the median values for each column. This strategy is more robust to outliers and skewed data than filling with the mean.</p>

<p style="color:rgb(172, 28, 44);">Speaking of outliers, we noticed that these features contain several extreme values. To deal with them, we will transform these features to a log scale. However, since the logarithm of zero is undefined, we will replace zero values with 0.367 before taking the logarithm, so that they map to roughly -1 on the log scale. This way, we preserve the essence of the original distribution while reducing the impact of outliers.</p>

</div>


In [None]:
cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
for col in cols:
    df[col] = df[col].fillna(df[col].median())

In [None]:
df['total_spending'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] +\
df['Spa'] + df['VRDeck']

In [None]:
cols.append('total_spending')

In [None]:
fig, axes = plt.subplots(len(cols),2, figsize=(12,14))
for i, col in enumerate(cols):
    sns.histplot(data=df, x=col, ax=axes[i, 0], bins=20, color=custom_colors[0])
    sns.histplot(data=np.log(df[[col]]), x=col, ax=axes[i, 1], color=custom_colors[1])
    axes[i, 0].set_title('Original Scale')
    axes[i, 1].set_title('Log Scale')
plt.tight_layout()

In [None]:
for col in cols:
    df.loc[df[col]==0, col] = 0.367
    df[col] = np.log(df[col])
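
To make the effect of the zero replacement concrete, the sketch below applies the same transform to a few made-up spending values, using the 0.367 constant from the cell above. It also shows `np.log1p` next to it, a common alternative that this notebook does not use, purely for comparison:

```python
import numpy as np
import pandas as pd

# Toy spending values (made up); zeros are very common in the real columns.
toy_spend = pd.Series([0.0, 0.0, 25.0, 300.0, 12000.0])

# Approach used above: replace zeros with a small constant, then take the natural log.
log_with_floor = np.log(toy_spend.replace(0, 0.367))

# Common alternative (not what this notebook uses): log(1 + x), which maps 0 to 0.
log1p_version = np.log1p(toy_spend)

print(pd.DataFrame({"raw": toy_spend, "log_floor": log_with_floor, "log1p": log1p_version}))
```

With the constant, zero spenders land at about -1, clearly separated from passengers who spent even one unit, who land at 0.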

<a id="2.12"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">Name</h2>

<p style="color:rgb(68, 68, 76);">The first and last names of the passenger.</p>

<p style="color:rgb(172, 28, 44);">To make the 'Name' feature more suitable for our machine learning models, we will convert the names, which are in string format, into integers. This conversion will be done using a label encoder, which assigns a unique integer to each unique string. This transformation maintains the uniqueness of each name while presenting the data in a format that our models can effectively process.</p>

</div>


In [None]:
df['Name'] = df['Name'].fillna('Unk Unk')

In [None]:
tmp = np.array(df['Name'].apply(lambda x: x if type(x)==float else x.split(' ')).to_list())

df['Name_first'] = tmp[:, 0]
df['Name_last'] = tmp[:, 1]

In [None]:
label_encoder = LabelEncoder()
df["Name_first"] = label_encoder.fit_transform(df.loc[:, "Name_first"])

label_encoder = LabelEncoder()
df["Name_last"] = label_encoder.fit_transform(df.loc[:, "Name_last"])

In [None]:
df.drop(columns="Name", inplace=True)
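
As a rough illustration of what label encoding does to the name parts, the sketch below uses pandas category codes as a stand-in for `LabelEncoder`; both assign integers following the sorted order of the unique strings. The names are invented for the example:

```python
import pandas as pd

# Made-up names in "First Last" form.
toy_names = pd.Series(["Ada Quill", "Bo Quill", "Cy Marsh"])

# Split once on the first space into first and last name.
name_parts = toy_names.str.split(" ", n=1, expand=True)
first_name, last_name = name_parts[0], name_parts[1]

# Category codes follow the sorted unique values, so 'Marsh' -> 0 and 'Quill' -> 1.
last_codes = last_name.astype("category").cat.codes
first_codes = first_name.astype("category").cat.codes

print(pd.DataFrame({"last": last_name, "last_code": last_codes}))
```

Passengers sharing a last name receive the same code, which lets tree models group likely family members. The integer order itself carries no meaning.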

In [None]:
summary(df)

<a id="2.13"></a>
<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">

<h2 style="color:rgb(228, 12, 33);">Categorical Features</h2>

<p style="color:rgb(68, 68, 76);">For the categorical features in our dataset, we will be using one-hot encoding. One-hot encoding is a technique that converts categorical variables into a binary representation, where each category becomes a separate binary feature column. This transformation allows our models to interpret and utilize categorical data effectively.</p>

</div>


In [None]:
categorical_features = ['HomePlanet', 'Destination', 'Cabin_deck']
df = pd.concat([df, pd.get_dummies(df[categorical_features])], axis=1)
df.drop(columns=categorical_features, inplace=True)
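
A minimal sketch of what one-hot encoding produces, on a toy column with made-up rows. `dtype=int` is passed here only so the output looks the same across pandas versions:

```python
import pandas as pd

# Toy categorical column (made-up rows).
toy_cat = pd.DataFrame({"HomePlanet": ["Earth", "Mars", "Earth", "Europa"]})

# Each category becomes its own 0/1 column, prefixed with the source column name.
encoded = pd.get_dummies(toy_cat[["HomePlanet"]], dtype=int)

print(encoded)
```

Every row has exactly one `1` across the new columns, so no artificial ordering between the planets is introduced.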

<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<p style="color:rgb(68, 68, 76);">Splitting the training and test data</p>
</div>

In [None]:
test_df  = df[train_df.shape[0]:]
train_df = df[:train_df.shape[0]]

In [None]:
corr = train_df.corr()
corr.style.background_gradient(cmap='coolwarm')

<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<p style="color:rgb(68, 68, 76);">Splitting the training data into train and validation sets</p>
</div>

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train_df.drop('Transported', axis=1), train_df[['Transported']], 
                                                    test_size=0.15, random_state=42)

In [None]:
X_train

<a id="4"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span>Model</span></center></div>**

<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<p style="color:rgb(68, 68, 76);">Evaluation with CatBoost, XGBoost, and LightGBM models:</p>
<p style="color:rgb(68, 68, 76);">We perform cross-validation (CV) to obtain an average CV score for each model. This provides us with an estimate of their performance on unseen data.</p>
<p style="color:rgb(68, 68, 76);">Additionally, we evaluate each model's performance on a separate validation set that has not been seen during training. This allows us to assess how well each model generalizes to new, unseen data.</p>
<p style="color:rgb(68, 68, 76);">Finally, based on the CV scores and the performance on the validation set, we carefully analyze and compare the results of CatBoost, XGBoost, and LightGBM to select the best-performing model for our task.</p>
 </div>


In [None]:
clf_xgb = XGBClassifier(objective='binary:logistic',
    colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=10, 
                             min_child_weight=1.7817, n_estimators=1500,
                             reg_alpha=4.5, reg_lambda=8.5,
                             subsample=0.5213,
                             random_state=42)
clf_lgb = lgb.LGBMClassifier()
clf_cat = CatBoostClassifier(verbose=False)

x = X_train
y = y_train

cv_xgb = cross_val_score(clf_xgb, x, y, cv=5, scoring='accuracy').mean()
cv_lgbm = cross_val_score(clf_lgb, x, y, cv=5, scoring='accuracy').mean()
cv_cat = cross_val_score(clf_cat, x, y, cv=5, scoring='accuracy').mean()


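# note: early_stopping_rounds is passed to fit() here; recent XGBoost and LightGBM releases
# expect it in the constructor or as a callback instead, so adjust these calls if fit() rejects it
clf_xgb.fit(x, y,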
        eval_set=[(X_valid, y_valid)],
        early_stopping_rounds=25,
        verbose=False)

clf_lgb.fit(x, y,
        eval_set=[(X_valid, y_valid)],
        early_stopping_rounds=25,
        verbose=False)

clf_cat.fit(x, y,
        eval_set=[(X_valid, y_valid)],
        early_stopping_rounds=25,
        verbose=False)

y_pred_xgb = clf_xgb.predict(X_valid)
y_pred_lgbm = clf_lgb.predict(X_valid)
y_pred_cat = clf_cat.predict(X_valid)
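
The cross-validation comparison described above can be sketched in isolation as follows. This is an illustrative setup only: the data is synthetic, and two scikit-learn `GradientBoostingClassifier` configurations stand in for the XGBoost, LightGBM, and CatBoost models so the example runs without those libraries. The names `candidates` and `cv_results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; in the notebook this would be X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical stand-in models; swap in XGBClassifier, LGBMClassifier, CatBoostClassifier.
candidates = {
    "gb_shallow": GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0),
    "gb_deeper": GradientBoostingClassifier(n_estimators=50, max_depth=4, random_state=0),
}

# Reusing one seeded splitter gives every model exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_results = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_results.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation next to the mean helps judge whether a gap between two models is larger than the fold-to-fold noise.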

<a id="5"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span> Evaluation </span></center></div>**

<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<p style="color:rgb(68, 68, 76);">The CV scores are intriguing because they do not line up with the scores on the validation set. This suggests that the validation split is not fully representative of the training data, so we rely on the CV scores when ranking the models.</p>
</div>


In [None]:
print("XGBoost CV Accuracy: ", cv_xgb)
print("LightGBM CV Accuracy: ", cv_lgbm)
print("CatBoost CV Accuracy: ", cv_cat)

In [None]:
# accuracy_score expects (y_true, y_pred)
print('validation xgb score:', accuracy_score(y_valid.values.ravel(), y_pred_xgb))
print('validation lgbm score:', accuracy_score(y_valid.values.ravel(), y_pred_lgbm))
print('validation CatBoost score:', accuracy_score(y_valid.values.ravel(), y_pred_cat))

In [None]:
cm = confusion_matrix(y_valid, y_pred_xgb)

fig, axs = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt='g', ax=axs, linewidths=0.5, cmap=custom_palette, cbar=False)
axs.set_xlabel('Predicted labels')
axs.set_ylabel('True labels'); 
axs.xaxis.set_ticklabels(['Not Transported', 'Transported'])
axs.yaxis.set_ticklabels(['Not Transported', 'Transported']);
plt.tight_layout()
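
To show how the numbers in a confusion matrix relate to accuracy, precision, and recall, here is a small NumPy-only sketch on made-up labels. Rows are true classes and columns are predicted classes, the same layout the heatmap above uses:

```python
import numpy as np

# Made-up true labels and predictions (1 = Transported, 0 = Not Transported).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Build the 2x2 matrix: rows are true classes, columns are predicted classes.
cm_demo = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    cm_demo[t, p] += 1

tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)             # of predicted positives, how many are right
recall = tp / (tp + fn)                # of actual positives, how many were found

print(cm_demo)
print(accuracy, precision, recall)
```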

<a id="6"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span> Hyper Parameter Tuning </span></center></div>**

<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<p style="color:rgb(68, 68, 76);">Next, we will conduct hyperparameter tuning for the CatBoost model using Optuna. This powerful technique allows us to identify the optimal set of hyperparameters, ensuring we obtain the best model performance. Once we have determined the best hyperparameter configuration, we will proceed to train the model on the entire dataset, leveraging the power of this optimized model. By doing so, we aim to achieve superior performance and maximize the model's predictive capabilities.</p>
</div>


In [None]:
X = train_df.drop('Transported', axis=1).copy()
y = train_df[['Transported']].copy()

X.reset_index(drop=True, inplace=True)
y.reset_index(drop=True, inplace=True)

X = X.values
y = y.values

In [None]:
Seed = 7
kfold = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=Seed)
def objective(trial):
    params = {
        'iterations': 1000,
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'depth': trial.suggest_int('depth', 4, 16),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-8, 10.0, log=True),
        'border_count': trial.suggest_int('border_count', 32, 255),
        'thread_count': -1,
        'verbose': False
    }
    scores = []
    for train_idx, valid_idx in kfold.split(X, y):
        X_train, y_train = X[train_idx], y[train_idx]
        X_valid, y_valid = X[valid_idx], y[valid_idx]
        model = CatBoostClassifier(**params)
        model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=20, verbose=False)
        y_pred = model.predict(X_valid)  # class labels, not probabilities
        score = accuracy_score(y_valid, y_pred)
        scores.append(score)
    
    # Compute the mean validation score across all cross-validation folds
    mean_score = np.mean(scores)
    return mean_score
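
One practical consequence of the splitter used in the objective: with `n_splits=10` and `n_repeats=2`, every Optuna trial fits 20 models. The sketch below checks this on synthetic, perfectly balanced labels (the data is made up; only the splitter settings come from the cell above):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic data: 100 rows, half in each class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 50)

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=7)

# Total number of train/validation splits, i.e. model fits per trial.
n_fits = rskf.get_n_splits()

# Stratification keeps the class ratio of each validation fold close to the full data.
valid_pos_rates = [y_demo[valid_idx].mean() for _, valid_idx in rskf.split(X_demo, y_demo)]

print(n_fits, valid_pos_rates[:3])
```

With 20 trials this adds up to 400 CatBoost fits, which is why the tuning cell is commented out and the best parameters are hard-coded.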

In [None]:
# # Run the optimization using Optuna
# study = optuna.create_study(direction='maximize')
# study.optimize(objective, n_trials=20)

# # Print the best hyperparameters and the mean CV accuracy
# print('Best trial:')
# trial = study.best_trial
# print('  Score: {}'.format(trial.value))
# print('  Params: ')
# for key, value in trial.params.items():
#     print('    {}: {}'.format(key, value))

# # Train a final model using the best hyperparameters found by Optuna
# best_params = study.best_params
# final_model = CatBoostClassifier(**best_params)
# final_model.fit(X, y)

# directly putting the best hyper-parameters to save time 
best_params = {'learning_rate': 0.018049356549743555,
 'depth': 6,
 'l2_leaf_reg': 7.838880563296214,
 'border_count': 182,
 'verbose' : False}

final_model = CatBoostClassifier(**best_params)
final_model.fit(X, y)

<a id="7"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span> Submission </span></center></div>**

In [None]:
pred = final_model.predict(test_df.drop(columns='Transported'))

In [None]:
sub = pd.read_csv(FILE_PATH+'/sample_submission.csv')

In [None]:
sub['Transported'] = pred
sub['Transported'] = sub['Transported'].astype(bool)

In [None]:
sum(pred)

In [None]:
sub

In [None]:
sub.to_csv('./submission_hp_cat.csv', index=False)

In [None]:
pred_cat = clf_cat.predict(test_df.drop(columns='Transported'))
sub['Transported'] = pred_cat
sub['Transported'] = sub['Transported'].astype(bool)
sub.to_csv('./submission.csv', index=False)

<a id="8"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:#e40c21;font-size:120%;font-family:Verdana;"><center><span> Summary </span></center></div>**

<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<p style="color:rgb(68, 68, 76);">In this Kaggle notebook, we embarked on a machine learning task with the aim of achieving optimal performance. Our approach involved experimenting with three different models: XGBoost, LightGBM, and CatBoost.</p>

<p style="color:rgb(68, 68, 76);">After evaluating the models using cross-validation, we observed that CatBoost had the highest CV score, indicating promising potential. Consequently, we proceeded to enhance its performance through hyperparameter tuning using Optuna. By leveraging Optuna's optimization capabilities, we aimed to identify the best combination of hyperparameters for the CatBoost model.</p>

<p style="color:rgb(172, 28, 44);">However, the tuning did not pay off as expected. Surprisingly, the CatBoost model trained with default parameters outperformed the model obtained after hyperparameter tuning. This outcome prompts us to investigate the reasons behind it.</p>
</div>


<div style="background-color:rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 15px;">
<h2 style="color:rgb(228, 12, 33);">Things to Do Next</h2>

<h3>1. Understanding the Impact of Hyperparameter Tuning in CatBoost</h3>

<p style="color:rgb(68, 68, 76);">
    Although we tuned CatBoost with Optuna, the model trained with default parameters outperformed the tuned one. To understand why, we will examine the specific hyperparameters that were tuned, their search ranges, and their impact on the model's performance. The goal is to uncover any issues that kept the tuning from improving the CatBoost model's results.
</p>

<h3>2. Error Analysis and Pattern Identification</h3>

<p style="color:rgb(68, 68, 76);">
    To gain a deeper understanding of the CatBoost model's misclassifications, we will perform an error analysis. Specifically, we will focus on instances where the model made incorrect predictions and explore if any patterns or trends emerge from these misclassified cases. By closely examining these instances, we hope to uncover valuable insights into the underlying factors contributing to the model's errors. This analysis will provide us with important information to guide us in refining the model's performance and addressing any potential weaknesses or areas for improvement.
</p>

<p style="color:rgb(68, 68, 76);">
    By addressing these two tasks, we aim to further enhance the performance and effectiveness of our CatBoost model. Through a better understanding of the hyperparameter tuning process and an in-depth error analysis, we will be well-equipped to refine our approach, optimize model performance, and achieve better results in our machine learning task.
</p>
</div>


