# <b>1 <span style='color:lightseagreen'>|</span> Introduction</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.1 | Goal</b></p>
</div>

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system. Help save them and change history!

In [None]:
from IPython.display import clear_output
import os
import warnings
from pathlib import Path

# Basic libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling as pp
import seaborn as sns

# Clustering
from sklearn.cluster import KMeans

# Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

#Mutual Information
from sklearn.feature_selection import mutual_info_regression

# Cross Validation
from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold, learning_curve

# Encoding
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from category_encoders import MEstimateEncoder

# Algorithms
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Optuna - Bayesian Optimization 
import optuna
from optuna.samplers import TPESampler

# Plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go

warnings.filterwarnings('ignore')

def load_data():
    data_dir = Path("../input/spaceship-titanic")
    df_train = pd.read_csv(data_dir / "train.csv")
    df_test = pd.read_csv(data_dir / "test.csv")
    # Merge the splits so we can process them together
    df = pd.concat([df_train, df_test])
    return df

def plot_feature_importance(importance,names,model_type):
    
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    
    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)
    
    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    
    #Define size of bar plot
    plt.figure(figsize=(20,10))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

df_data = load_data()
pp.ProfileReport(df_data)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.7 | Reducing Memory Usage</b></p>
</div>

To avoid **<span style='color:lightseagreen'>issues with memory</span>** in the kernel, we reduce the DataFrame's memory usage with the following function, which downcasts each numeric column to a smaller integer or float type where its value range allows. The output below shows that this worked: we achieve a **<span style='color:lightseagreen'>reduction of 20%</span>**.

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
 
    return df

df_data = reduce_mem_usage(df_data)
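
To make the effect of downcasting concrete, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not the competition data. It uses `pd.to_numeric` with `downcast`, which is an alternative to the hand-written range checks above, not the function this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns (not the competition data)
toy = pd.DataFrame({
    "small_int": np.arange(1000, dtype=np.int64) % 100,  # values 0..99 fit in int8
    "amount": np.arange(1000) * 0.5,                     # float64 by default
})

before = toy.memory_usage(deep=True).sum()

# Downcast each column to the smallest dtype that can hold its values
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["amount"] = pd.to_numeric(toy["amount"], downcast="float")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"{before} -> {after} bytes")
```

Keep in mind that converting to `float32` can lose precision for values that need more than about seven significant digits, so it is worth checking the affected columns before relying on the smaller type.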

# <b>2 <span style='color:lightseagreen'>|</span> Missing Values</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.1 | Categorical Features</b></p>
</div>

The initial profiling report shows that many features have missing values. We fill them in this section, starting with the features of object dtype. Let's take a quick look at those features.

In [None]:
df_data.select_dtypes(['object']).head()

### 2.1.1 | HomePlanet

> **<span style='color:gray'>HomePlanet description: the planet the passenger departed from, typically their planet of permanent residence.</span>**

Let's focus first on HomePlanet. As the report shows, this feature is categorical, with three different values: Mars, Earth and Europa. We compute the mode and fill missing values with it.

In [None]:
def filling_HomePlanet(df):
    mode = df['HomePlanet'].value_counts().index[0]
    df['HomePlanet'] = df['HomePlanet'].fillna(mode)
    return df
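
The mode-based fill can be checked on a small example. The sketch below uses a toy column with hypothetical values; it shows that `value_counts().index[0]` picks the most frequent category (missing values are ignored) and that `Series.mode()` gives the same answer here.

```python
import numpy as np
import pandas as pd

# Toy column with hypothetical values; NaN marks a missing entry
planets = pd.Series(["Earth", "Mars", np.nan, "Earth", "Europa", np.nan])

# value_counts() ignores NaN and sorts by frequency, so index[0] is the mode
most_common = planets.value_counts().index[0]

# Series.mode() gives the same answer for this column
assert most_common == planets.mode()[0]

filled = planets.fillna(most_common)
print(filled.tolist())
```

The same pattern applies to any categorical column that has a clearly dominant value.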

### 2.1.2 | CryoSleep

> **<span style='color:gray'>CryoSleep description: indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.</span>**

Let's focus now on CryoSleep. As the report shows, this feature is boolean. A passenger who elected to be put into suspended animation would rarely be missing this record, so we replace missing values with False.

In [None]:
def filling_CryoSleep(df):
    df['CryoSleep'] = df['CryoSleep'].fillna(False)
    return df

### 2.1.3 | Cabin

> **<span style='color:gray'>Cabin description: the cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.</span>**

Let's focus now on Cabin. As the report shows, this feature is categorical. Because it is almost impossible to estimate a passenger's cabin in the given format, we split the cabin into three separate features: deck, number and side. Feature Engineering thus starts here (it continues in detail later). Next, we replace missing deck values with F, the most frequent deck. We then fill missing side values with the most frequent side among deck F cabins. Finally, we fill missing cabin numbers with half of the maximum cabin number (as cabins on a given deck could have a higher transport rate if they are among the first or last cabins).

In [None]:
def split_Cabin(df):
    df['Deck'] = df['Cabin'].str.split("/", n=2, expand=True)[0]
    df['Number'] = df['Cabin'].str.split("/", n=2, expand=True)[1]
    df['Side'] = df['Cabin'].str.split("/", n=2, expand=True)[2]
    df.pop('Cabin')
    return df

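def filling_Cabin(df):
    # Missing decks get the most frequent deck, F
    df['Deck'] = df['Deck'].fillna('F')
    # Fill only the missing sides, using the most frequent side among deck F cabins
    mode = df[df.Deck == 'F']['Side'].value_counts().index[0]
    df['Side'] = df['Side'].fillna(mode)
    df['Number'] = df['Number'].astype(float)
    # Half of the maximum cabin number, taken here as 1796
    df['Number'] = df['Number'].fillna(1796 / 2)
    return df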
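
As a quick illustration of the split-then-fill idea, here is a minimal sketch on a few hypothetical cabin strings. It simplifies two things compared with the text above, and both are assumptions of the sketch: the side is filled with the overall most frequent side (not the one within deck F), and the number is filled with half of the toy maximum.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin strings in deck/num/side form; NaN stays NaN in every part
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", np.nan, "G/734/S"]})

# Split once and name the three resulting columns
parts = cabins["Cabin"].str.split("/", n=2, expand=True)
parts.columns = ["Deck", "Number", "Side"]
parts["Number"] = parts["Number"].astype(float)

# Fill only the missing entries; values that are present are left untouched
parts["Deck"] = parts["Deck"].fillna("F")
parts["Side"] = parts["Side"].fillna(parts["Side"].mode()[0])
parts["Number"] = parts["Number"].fillna(parts["Number"].max() / 2)
print(parts)
```

Splitting once with `expand=True` and naming the columns avoids repeating the same `str.split` call for each part.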

### 2.1.4 | Destination

> **<span style='color:gray'>Destination description: the planet the passenger will be debarking to.</span>**

Let's focus now on Destination. As the report shows, this feature is categorical. We fill missing values with the most frequent value.

In [None]:
def filling_Destination(df):
    mode = df['Destination'].value_counts().index[0]
    df['Destination'] = df['Destination'].fillna(mode)
    return df

### 2.1.5 | VIP

> **<span style='color:gray'>VIP description: whether the passenger has paid for special VIP service during the voyage.</span>**

Let's focus now on VIP. It would be strange for a customer who paid for VIP service to be missing from the records, so I replace missing values with False.

In [None]:
def filling_VIP(df):
    df['VIP'] = df['VIP'].fillna(False)
    return df

### 2.1.6 | Name

> **<span style='color:gray'>Name description: the first and last names of the passenger.</span>**

Lastly, let's focus on the Name feature. Since a person's first and last name cannot be guessed, we replace missing values with the string 'None'.

In [None]:
def filling_Name(df):
    df['Name'] = df['Name'].fillna('None')
    return df

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.2 | Numerical Features</b></p>
</div>

The initial profiling report also shows quantitative features with missing values. We fill them in this section.

### 2.2.1 | Age
> **<span style='color:gray'>Age description: the age of the passenger.</span>**

We are going to start with Age Feature. We are going to replace missing values with median age. 

In [None]:
def filling_Age(df):
    # Series.median() skips missing values by default
    median = df['Age'].median()
    df['Age'] = df['Age'].fillna(median)
    return df
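
A small sketch with hypothetical ages shows the median fill in isolation and confirms that the `50%` row of `describe()` is the same statistic as `Series.median()`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 35.0, np.nan, 58.0, 27.0, np.nan])  # hypothetical ages

# Series.median() skips NaN by default
median_age = ages.median()

# The "50%" row of describe() is the same statistic
assert median_age == ages.describe()["50%"]

ages_filled = ages.fillna(median_age)
print(ages_filled.tolist())
```

The median is less sensitive to a few very old or very young passengers than the mean, which is one common reason to prefer it for imputation.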

### 2.2.2 | Luxury Features

> **<span style='color:gray'>Luxury Features description: amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.</span>**

Now we turn to the luxury features, which record the amount of money a passenger has spent at each amenity. As it would be quite unusual for a passenger's payment not to be recorded, we assume that missing values belong to passengers who have not spent anything on those luxuries.

In [None]:
def filling_luxury_features(df):
    luxury_features = ['RoomService','FoodCourt', 'ShoppingMall', 'Spa','VRDeck']
    df[luxury_features] = df[luxury_features].fillna(0.0)
    return df

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.3 | Missing Values Filling Function</b></p>
</div>

In [None]:
def filling_numerical(df):
    df = filling_Age(df)
    df = filling_luxury_features(df)
    return df

def filling_categorical(df):
    df = filling_HomePlanet(df)
    df = filling_CryoSleep(df)
    df = split_Cabin(df)
    df = filling_Cabin(df)
    df = filling_Destination(df)
    df = filling_VIP(df)
    df = filling_Name(df)
    return df

def filling_missing(df):
    df = filling_categorical(df)
    df = filling_numerical(df)
    return df

df_data = filling_missing(df_data)

# <b>3 <span style='color:lightseagreen'>|</span> Exploratory Data Analysis</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.1 | General Analysis</b></p>
</div>

In [None]:
fig = make_subplots(rows=2, cols=3, specs=[[{'type':'bar'},{'type':'histogram'}, {'type':'pie'}], [{'type':'pie'}, {'colspan':2},None]], 
                   subplot_titles=('CryoSleep','Age Distribution','HomePlanet','Destination','Deck'),
                   column_widths=[0.33, 0.33, 0.34], vertical_spacing=0.15, horizontal_spacing=0.05)

# Left Upper Chart
# Use the value_counts() Series directly: counts are its values, categories its index
cryosleep = df_data['CryoSleep'].value_counts()
fig.add_trace(go.Bar(y=cryosleep.values, x=cryosleep.index, marker = dict(color=px.colors.sequential.Sunsetdark[3]), 
                     name = 'CryoSleep'),row=1, col=1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

# Middle Upper Chart
fig.add_trace(go.Histogram(x=df_data.Age, name='Age Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[0])), row = 1, col = 2)

fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=1, col=2)
fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=1, col=2)

# Right Upper Chart
homeplanet = df_data['HomePlanet'].value_counts()
fig.add_trace(go.Pie(values=homeplanet.values, labels=homeplanet.index, name='Home Planet',
                     hole=0, pull=[0, 0, 0],
                      marker = dict(colors = (px.colors.sequential.Sunsetdark[1],px.colors.sequential.Sunsetdark[3], px.colors.sequential.Sunsetdark[0])),
                     #marker_colors=px.colors.sequential.Sunsetdark,
                     hoverinfo='label+percent+value', textinfo='label'), row=1, col=3)

fig.update_traces(
marker=dict(
        line=dict(color='#303330',
                  width=2)
        ), 
    row = 1, col=3
)

# Left Bottom Chart
destination = df_data['Destination'].value_counts()
fig.add_trace(go.Pie(values=destination.values, labels=destination.index, name='Destination',
                     hole=0, pull=[0, 0, 0],
                      marker = dict(colors = (px.colors.sequential.Sunsetdark[1],px.colors.sequential.Sunsetdark[3], px.colors.sequential.Sunsetdark[0])),
                     #marker_colors=px.colors.sequential.Sunsetdark,
                     hoverinfo='label+percent+value', textinfo='label'), row=2, col=1)

fig.update_traces(
marker=dict(
        line=dict(color='#303330',
                  width=2)
        ), 
    row = 2, col=1
)


# Right Bottom Chart
deck = df_data.Deck.value_counts()
fig.add_trace(go.Bar(x=deck.values, y=deck.index,  marker_color=px.colors.sequential.Sunsetdark,
                     name='Deck', orientation='h'),row=2, col=2)

fig.update_xaxes(visible = False, showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=2)
fig.update_yaxes(showgrid = True, gridcolor='gray', gridwidth=0.5, linecolor='gray', linewidth=2, zeroline = False, row=2, col=2)

# General Styling
fig.update_layout(height=800, bargap=0.2,
                  margin=dict(b=50,r=30,l=100), xaxis=dict(tickmode='linear'),
                  title_text="General Analysis",
                  #template="plotly_dark",
                  paper_bgcolor="#303330",
                  plot_bgcolor = "#303330",
                  title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)

**📌 Interpret:** As we can see, the counts of both values of the side feature are almost equal. Regarding deck types, F is the most common, followed by G and E. The rarest decks are A and especially T, which holds just 11 passengers. This is why missing deck values were filled with F, missing side values with the most frequent side among deck F cabins, and missing cabin numbers with half of the maximum cabin number (as cabins on a given deck could have a higher transport rate if they are among the first or last cabins).

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.2 | Earth Passengers Analysis</b></p>
</div>

In [None]:
def planet_analysis(planet):
    earth = df_data[df_data.HomePlanet == planet]
    fig = make_subplots(rows=3, cols=3, specs=[[{'type':'bar'},{'type':'histogram'}, {'type':'pie'}], [{'colspan':2},None, {'type':'histogram'}], 
                                              [{'type':'histogram'},{'type':'histogram'},{'type':'histogram'}]], 
                       subplot_titles=('CryoSleep','Age Distribution','Destination','Deck','Room Service Distribution','Food Court Distribution',
                                      'Shopping Mall Distribution','Spa Distribution'),
                       column_widths=[0.33, 0.33, 0.34], vertical_spacing=0.1, horizontal_spacing=0.05)

    # Left Upper Chart
    # Use the value_counts() Series directly: counts are its values, categories its index
    cryosleep = earth['CryoSleep'].value_counts()
    fig.add_trace(go.Bar(y=cryosleep.values, x=cryosleep.index, marker = dict(color=px.colors.sequential.Sunsetdark[3]), 
                         name = 'CryoSleep'),row=1, col=1)

    fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
    fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

    # Middle Upper Chart
    fig.add_trace(go.Histogram(x=earth.Age, name='Age Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[0])), row = 1, col = 2)

    fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=1, col=2)
    fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=1, col=2)

    # Right Upper Chart
    destination = earth['Destination'].value_counts()
    fig.add_trace(go.Pie(values=destination.values, labels=destination.index, name='Destination',
                         hole=0, pull=[0, 0, 0],
                          marker = dict(colors = (px.colors.sequential.Sunsetdark[1],px.colors.sequential.Sunsetdark[3], px.colors.sequential.Sunsetdark[0])),
                         #marker_colors=px.colors.sequential.Sunsetdark,
                         hoverinfo='label+percent+value', textinfo='label'), row=1, col=3)

    fig.update_traces(
    marker=dict(
            line=dict(color='#303330',
                      width=2)
            ), 
        row = 1, col=3
    )

    # Left Medium Chart
    deck = earth.Deck.value_counts()
    fig.add_trace(go.Bar(x=deck.values, y=deck.index,  marker_color=px.colors.sequential.Sunsetdark,
                         name='Deck', orientation='h'),row=2, col=1)

    fig.update_xaxes(visible = False, showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=1)
    fig.update_yaxes(showgrid = True, gridcolor='gray', gridwidth=0.5, linecolor='gray', linewidth=2, zeroline = False, row=2, col=1)

    # Right Medium Chart
    luxury = earth[(earth.RoomService > 100)]
    fig.add_trace(go.Histogram(x=luxury.RoomService, name='Room Service Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[3])), row = 2, col = 3)

    fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=2, col=3)
    fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=2, col=3)

    # Left Bottom Chart
    luxury = earth[(earth.FoodCourt > 100)]
    fig.add_trace(go.Histogram(x=luxury.FoodCourt, name='Food Court Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[4])), row = 3, col = 1)

    fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=1)
    fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=1)

    # Middle Bottom Chart
    luxury = earth[(earth.ShoppingMall > 100)]
    fig.add_trace(go.Histogram(x=luxury.ShoppingMall, name='Shopping Mall Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[5])), row = 3, col = 2)

    fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=2)
    fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=2)

    # Right Bottom Chart
    luxury = earth[(earth.Spa > 100)]
    fig.add_trace(go.Histogram(x=luxury.Spa, name='Spa Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[6])), row = 3, col = 3)

    fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=3)
    fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=3)

    # General Styling
    fig.update_layout(height=1250, bargap=0.2,
                      margin=dict(b=50,r=30,l=100), xaxis=dict(tickmode='linear'),
                      title_text=planet + " Passengers Analysis",
                      #template="plotly_dark",
                      paper_bgcolor="#303330",
                      plot_bgcolor = "#303330",
                      title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                      font=dict(color='floralwhite'), 
                      hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"),
                      showlegend=False)
    return fig
    
fig = planet_analysis('Earth')
fig.show()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.3 | Mars Passengers Analysis</b></p>
</div>

In [None]:
fig = planet_analysis('Mars')
fig.show()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.4 | Europa Passengers Analysis</b></p>
</div>

In [None]:
fig = planet_analysis('Europa')
fig.show()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.5 | Transported Passengers Analysis</b></p>
</div>

In [None]:
transported = df_data[df_data.Transported == True]
fig = make_subplots(rows=3, cols=3, specs=[[{'type':'bar'},{'type':'histogram'}, {'type':'pie'}], [{'colspan':2},None, {'type':'histogram'}], 
                                          [{'type':'histogram'},{'type':'histogram'},{'type':'histogram'}]], 
                   subplot_titles=('CryoSleep','Age Distribution','Destination','Deck','Room Service Distribution','Food Court Distribution',
                                  'Shopping Mall Distribution','Spa Distribution'),
                   column_widths=[0.33, 0.33, 0.34], vertical_spacing=0.1, horizontal_spacing=0.05)

# Left Upper Chart
# Use the value_counts() Series directly: counts are its values, categories its index
cryosleep = transported['CryoSleep'].value_counts()
fig.add_trace(go.Bar(y=cryosleep.values, x=cryosleep.index, marker = dict(color=px.colors.sequential.Sunsetdark[3]), 
                     name = 'CryoSleep'),row=1, col=1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

# Middle Upper Chart
fig.add_trace(go.Histogram(x=transported.Age, name='Age Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[0])), row = 1, col = 2)

fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=1, col=2)
fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=1, col=2)

# Right Upper Chart
destination = transported['Destination'].value_counts()
fig.add_trace(go.Pie(values=destination.values, labels=destination.index, name='Destination',
                     hole=0, pull=[0, 0, 0],
                      marker = dict(colors = (px.colors.sequential.Sunsetdark[1],px.colors.sequential.Sunsetdark[3], px.colors.sequential.Sunsetdark[0])),
                     #marker_colors=px.colors.sequential.Sunsetdark,
                     hoverinfo='label+percent+value', textinfo='label'), row=1, col=3)

fig.update_traces(
marker=dict(
        line=dict(color='#303330',
                  width=2)
        ), 
    row = 1, col=3
)

# Left Medium Chart
deck = transported.Deck.value_counts()
fig.add_trace(go.Bar(x=deck.values, y=deck.index,  marker_color=px.colors.sequential.Sunsetdark,
                     name='Deck', orientation='h'),row=2, col=1)

fig.update_xaxes(visible = False, showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=1)
fig.update_yaxes(showgrid = True, gridcolor='gray', gridwidth=0.5, linecolor='gray', linewidth=2, zeroline = False, row=2, col=1)

# Right Medium Chart
luxury = transported[(transported.RoomService > 100)]
fig.add_trace(go.Histogram(x=luxury.RoomService, name='Room Service Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[3])), row = 2, col = 3)

fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=2, col=3)
fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=2, col=3)

# Left Bottom Chart
luxury = transported[(transported.FoodCourt > 100)]
fig.add_trace(go.Histogram(x=luxury.FoodCourt, name='Food Court Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[4])), row = 3, col = 1)

fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=1)
fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=1)

# Middle Bottom Chart
luxury = transported[(transported.ShoppingMall > 100)]
fig.add_trace(go.Histogram(x=luxury.ShoppingMall, name='Shopping Mall Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[5])), row = 3, col = 2)

fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=2)
fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=2)

# Right Bottom Chart
luxury = transported[(transported.Spa > 100)]
fig.add_trace(go.Histogram(x=luxury.Spa, name='Spa Distribution', marker = dict(color = px.colors.sequential.Sunsetdark[6])), row = 3, col = 3)

fig.update_xaxes(showgrid = False, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=3)
fig.update_yaxes(showgrid = True, gridcolor = 'gray', gridwidth = 0.5, showline = True, linecolor = 'gray', linewidth = 2, row=3, col=3)

# General Styling
fig.update_layout(height=1250, bargap=0.2,
                  margin=dict(b=50,r=30,l=100), xaxis=dict(tickmode='linear'),
                  title_text="Transported Passengers Analysis",
                  #template="plotly_dark",
                  paper_bgcolor="#303330",
                  plot_bgcolor = "#303330",
                  title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)

# <b>4 <span style='color:lightseagreen'>|</span> Feature Engineering</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.1 | Local CV Scoring Dataset Function</b></p>
</div>

The first step after EDA is to build a reliable **<span style='color:lightseagreen'>local validation strategy</span>**. By reliable I mean a local CV score that **<span style='color:lightseagreen'>correlates</span>** with the LB (leaderboard) score, because then we can use our local CV score to evaluate experiments and to tune (hyper)parameters. There are **<span style='color:lightseagreen'>two questions</span>** that I usually try to answer.

- **<span style='color:lightseagreen'>How to split</span>** the data into train and validation (there are many different strategies)?
- Once a strategy is chosen, does the LB score move in the same direction as the local CV score? If yes, then the relationship between your local folds probably mirrors the relationship between Kaggle's train and test sets. If not, try another CV strategy; if you cannot find a reliable local CV, it is probably time to stop taking part in the competition, because you might be highly disappointed after the final shake-up.
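
One common answer to the first question for a binary target is a stratified split, where every fold keeps roughly the same class ratio as the full training set. In practice scikit-learn's `StratifiedKFold` (imported at the top of this notebook) does this; the sketch below only spells out the idea with plain NumPy and pandas. The helper name `stratified_fold_ids` and the toy labels are hypothetical.

```python
import numpy as np
import pandas as pd

def stratified_fold_ids(y, n_splits=5, seed=0):
    """Hypothetical helper: assign each row a fold id, balancing classes."""
    y = pd.Series(y).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for _, idx in y.groupby(y).groups.items():
        idx = rng.permutation(np.asarray(idx))
        # Deal the shuffled rows of this class round-robin across folds
        folds[idx] = np.arange(len(idx)) % n_splits
    return folds

# Toy target: 60 transported, 40 not transported
y_toy = np.array([True] * 60 + [False] * 40)
folds = stratified_fold_ids(y_toy)

for k in range(5):
    val_mask = folds == k
    # Each validation fold keeps the overall share of positives
    print(k, val_mask.sum(), y_toy[val_mask].mean())
```

If the class share in each validation fold matches the full data, differences between fold scores are less likely to come from an unlucky split.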

In [None]:
def score_dataset(X, y, model=XGBClassifier()):
    # Label encoding works well for tree-based models such as XGBoost and
    # RandomForest; one-hot encoding would suit linear models better.
    for colname in X.select_dtypes(["object","bool"]).columns:
        X[colname] = LabelEncoder().fit_transform(X[colname])
    y['Transported'] = LabelEncoder().fit_transform(y['Transported'])
    # The metric for the Spaceship Titanic competition is classification accuracy
    score_xgb = cross_val_score(
        model, X, y['Transported'], cv=5, scoring="accuracy", n_jobs=-1
    )
    
    score = score_xgb.mean()
    return score

X = df_data[df_data.Transported.isnull() == False].copy()
y = pd.DataFrame(X.pop('Transported'))
baseline_score = score_dataset(X, y)
print(f"Baseline score: {baseline_score:.5f} Accuracy")

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.2 | Creating New Features</b></p>
</div>

### 4.2.1 | Age
Here we split passengers into age groups. Binning `Age` into ten quantile-based groups (deciles) can make it easier for the classifier to pick up age-related patterns during training and prediction.
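
To see what quantile binning does, here is a small sketch on made-up ages (not the competition data). It uses four bins instead of ten to keep the output short; `ages` and `age_bins` are illustrative names.

```python
import pandas as pd

# Made-up ages, only for illustration.
ages = pd.Series([1, 5, 12, 18, 21, 24, 27, 30, 35, 41, 48, 60])

# qcut places the bin edges at quantiles, so each bin holds
# roughly the same number of rows.
age_bins = pd.qcut(ages, 4)
print(age_bins.value_counts().sort_index())
```

Because the edges follow the data, the bins have different widths but similar sizes, unlike `pd.cut`, which uses equal-width bins.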

In [None]:
df_data['Age'] = pd.qcut(df_data['Age'], 10)
df_data.head().style.set_properties(subset=['Age'], **{'background-color': 'lightseagreen'})

In [None]:
agebox = df_data[df_data.Transported.isnull() == False].copy()
agebox['Transported'] = agebox['Transported'].astype(int)
agebox = agebox.groupby('Age').agg({'Transported':'mean'}).reset_index()
agebox['Age'] = agebox['Age'].astype(str)

fig = px.bar(agebox, x="Age", y="Transported", color_continuous_scale='Viridis', color="Age")
fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False)
fig.update_yaxes(showgrid = True, gridcolor='gray',gridwidth=0.5, linecolor='gray',linewidth=2, zeroline = False)
fig.update_layout(margin=dict(b=50,t = 90, r=30,l=100), title_text="Transported Probability per Age Group",paper_bgcolor="#303330", plot_bgcolor = "#303330", title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"))


### 4.2.2 | Family Features

Next we focus on **<span style='color:lightseagreen'>PassengerId</span>** and **<span style='color:lightseagreen'>Name</span>**. First, we take a brief look at both features to study how they relate. As we can see below, PassengerId is composed of **<span style='color:lightseagreen'>two parts</span>**:
- The first part identifies the family (**<span style='color:lightseagreen'>FamilyId</span>**).
- The second part identifies each member within the family.

For example, Altark Susent and Solam Susent belong to the same family, so they share the same FamilyId in PassengerId, namely 0003. To **<span style='color:lightseagreen'>distinguish</span>** them within the family group, the second parts of their PassengerId are 01 and 02 respectively. We therefore create some new features:
- One for **<span style='color:lightseagreen'>FamilyId</span>**
- One for **<span style='color:lightseagreen'>Family Name</span>**
- One for **<span style='color:lightseagreen'>Family Size</span>**

In [None]:
df_data[['Name','PassengerId']].head().style.set_properties(subset=['PassengerId'], **{'background-color': 'lightseagreen'})

In [None]:
df_data['FamilyId'] = df_data['PassengerId'].str.split("_", n=2, expand=True)[0]
df_data['Family Name'] = df_data['Name'].str.split(' ', n=2, expand=True)[1]
df_data = df_data.set_index(['FamilyId','Family Name'])
df_data['Family Size'] = 1
for i in range(df_data.shape[0]):
    fam_size = df_data.loc[df_data.index[i],:].shape[0]
    df_data.loc[df_data.index[i],'Family Size'] = fam_size
    
df_data = df_data.reset_index()
df_data[['FamilyId','PassengerId','Family Name','Name','Family Size']].head().style.set_properties(subset=['FamilyId', 'Family Name','Family Size'], **{'background-color': 'lightseagreen'})
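
The row-by-row loop above can be slow on larger tables. As a hedged alternative, the sketch below computes the same kind of group size with `groupby(...).transform("size")` on a toy frame. Only the two Susent names come from the data shown above; the other rows are made up, and `toy` is an illustrative name.

```python
import pandas as pd

# Toy rows mimicking the PassengerId ("gggg_pp") and Name layout.
toy = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01", "0006_01", "0006_02"],
    "Name": ["Altark Susent", "Solam Susent", "Ana Example", "Bo Sample", "Cy Sample"],
})

toy["FamilyId"] = toy["PassengerId"].str.split("_", expand=True)[0]
toy["Family Name"] = toy["Name"].str.split(" ", expand=True)[1]

# Count the rows sharing the same (FamilyId, Family Name) pair and
# broadcast that count back to every row of the group.
toy["Family Size"] = (
    toy.groupby(["FamilyId", "Family Name"])["PassengerId"].transform("size")
)
print(toy[["PassengerId", "Family Name", "Family Size"]])
```

This assumes every passenger has a name; rows with a missing `Family Name` would need to be imputed or handled separately before grouping.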

In [None]:
X = df_data[df_data.Transported.isnull() == False].copy()
y = pd.DataFrame(X.pop('Transported'))
baseline_score = score_dataset(X, y)
print(f"Baseline score: {baseline_score:.5f} Accuracy")

### 4.2.3 | Luxury Features

Hereafter, we are going to focus our attention on **<span style='color:lightseagreen'>luxury features</span>**. Those features are: 

- Spa
- VRDeck
- Food Court
- Room Service
- Shopping Mall

Each of them records the amount of money a passenger has spent on that amenity. We can add them up into a single feature holding the total amount a passenger spent on all these luxuries. Let's call it **<span style='color:lightseagreen'>Luxury Spending</span>**. Since it is a continuous numerical feature, for the plot below we split it into quantile-based groups (six are requested, and bins with duplicate edges are dropped).
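
One detail worth checking when adding the columns: with `+`, a missing value in any single column makes the whole total missing, whereas `sum(axis=1)` skips missing values by default. The sketch below uses made-up amounts to show the difference; if the missing values were already imputed earlier, both approaches give the same totals.

```python
import numpy as np
import pandas as pd

luxury_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Made-up spending amounts; the last passenger has one missing entry.
spend = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0, 0.0],
     [100.0, 10.0, 20.0, 500.0, 40.0],
     [np.nan, 3000.0, 0.0, 6000.0, 50.0]],
    columns=luxury_cols,
)

# Column-by-column addition propagates NaN.
total_plus = (spend["RoomService"] + spend["FoodCourt"] + spend["ShoppingMall"]
              + spend["Spa"] + spend["VRDeck"])

# Row-wise sum ignores NaN by default (skipna=True).
total_sum = spend[luxury_cols].sum(axis=1)

print(pd.DataFrame({"plus": total_plus, "sum": total_sum}))
```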

In [None]:
df_data['Luxury Spending'] = df_data['VRDeck'] + df_data['ShoppingMall'] + df_data['Spa'] + df_data['FoodCourt'] + df_data['RoomService']
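luxury = df_data[df_data.Transported.isnull() == False].copy()
luxury['Transported'] = luxury['Transported'].astype(int)  # numeric target so the group mean is well defined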
luxury['Luxury Spending'] = pd.qcut(luxury['Luxury Spending'], 6, duplicates = 'drop')
luxury = luxury.groupby('Luxury Spending').agg({'Transported':'mean'}).reset_index()
luxury['Luxury Spending'] = LabelEncoder().fit_transform(luxury['Luxury Spending'])
fig = px.bar(luxury, x="Luxury Spending", y="Transported", color='Luxury Spending', color_continuous_scale='Sunsetdark')
fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False)
fig.update_yaxes(showgrid = True, gridcolor='gray',gridwidth=0.5, linecolor='gray',linewidth=2, zeroline = False)
fig.update_layout(margin=dict(b=50,t = 90, r=30,l=100), title_text="Transported Probability per Luxury Spending Group",paper_bgcolor="#303330", plot_bgcolor = "#303330", title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"))

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.3 | Feature Transformation</b></p>
</div>

### 4.3.1 | Boolean Features - Target

In [None]:
boolean_col = df_data.select_dtypes(['bool']).columns
for colname in boolean_col:
    df_data[colname] = df_data[colname].astype(int)
# Transported is missing for the test rows, so map the labels and keep NaN as-is.
df_data['Transported'] = df_data['Transported'].map({False: 0, True: 1})

### 4.3.2 | Categorical Features - Label Encoding

In [None]:
for colname in df_data.drop('PassengerId',axis=1).select_dtypes(["object","category"]).columns:
    df_data[colname] = LabelEncoder().fit_transform(df_data[colname])

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.4 | Feature Selection</b></p>
</div>

### 4.4.1 | Heatmap

In [None]:
df_train = df_data[df_data.Transported.isnull() == False].copy()
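# PassengerId is still a string column, so leave it out of the correlation matrix.
corr = df_train.drop(['Side', 'PassengerId'], axis=1).corr()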

fig = px.imshow(corr, color_continuous_scale='RdBu_r', origin='lower', text_auto=True, aspect='auto')
fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False)
fig.update_yaxes(showgrid = True, gridcolor='gray',gridwidth=0.5, linecolor='gray',linewidth=2, zeroline = False)
fig.update_layout(margin=dict(b=50,t = 90, r=30,l=100), title_text="Heatmap",paper_bgcolor="#303330", plot_bgcolor = "#303330", title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"))
fig.show()

### 4.4.2 | Mutual Information

Mutual information describes **<span style='color:lightseagreen'>relationships</span>** in terms of **<span style='color:lightseagreen'>uncertainty</span>**. The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more confident would you be about the target? Scikit-learn has two mutual information **<span style='color:lightseagreen'>metrics</span>** in its feature_selection module: one for **<span style='color:lightseagreen'>real-valued targets</span>** (mutual_info_regression) and one for **<span style='color:lightseagreen'>categorical targets</span>** (mutual_info_classif). Our target, Transported, is binary, so mutual_info_classif is the appropriate choice. The next cell computes the MI scores for our features and wraps them up in a dataframe, which helps us judge which features are likely to bring an **<span style='color:lightseagreen'>improvement</span>**.
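
To make the idea concrete, here is a small sketch on synthetic data (not the competition data): one feature agrees with a binary target about 90% of the time, the other is pure noise. All names (`signal`, `noise`, `target`, `mi_demo`) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

signal = rng.integers(0, 2, size=n)   # discrete feature tied to the target
noise = rng.normal(size=n)            # continuous feature unrelated to the target
# The target copies `signal` about 90% of the time and flips it otherwise.
target = np.where(rng.random(n) < 0.9, signal, 1 - signal)

X_mi = pd.DataFrame({"signal": signal, "noise": noise})
mi = mutual_info_classif(X_mi, target, discrete_features=[True, False], random_state=0)
mi_demo = pd.Series(mi, index=X_mi.columns, name="MI Scores").sort_values(ascending=False)
print(mi_demo)
```

The informative feature should get a clearly higher score than the noise feature, whose score should stay near zero.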

In [None]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

def make_mi_scores(X, y):
    X = X.copy()
    # Factorize text columns (e.g. PassengerId) into integer codes.
    for colname in X.select_dtypes(["object"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    #discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    # Transported is a binary target, so use the classification estimator.
    mi_scores = mutual_info_classif(X, y, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

y = df_data[df_data['Transported'].isnull() == False]['Transported']
x = df_data[df_data['Transported'].isnull() == False].drop('Transported', axis=1)
mi_scores = make_mi_scores(x, y)
mi_scores = pd.DataFrame(mi_scores).reset_index().rename(columns={'index':'Feature'})

fig = px.bar(mi_scores, x='MI Scores', y='Feature', color="MI Scores",
             color_continuous_scale='Sunsetdark')

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False)
fig.update_yaxes(showgrid = True, gridcolor='gray',gridwidth=0.5, linecolor='gray',linewidth=2, zeroline = False)
fig.update_layout(height = 750, title_text="Mutual Information Scores",paper_bgcolor="#303330", plot_bgcolor = "#303330", font = dict(color='floralwhite'),
                  title_font=dict(size=29, family="Lato, sans-serif", color='floralwhite'), xaxis={'categoryorder':'category ascending'}, margin=dict(t=80),
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"))
fig.show()

# <b>5 <span style='color:lightseagreen'>|</span> Modeling</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.1 | Algorithm Comparison - Cross Validation</b></p>
</div>

For the modeling part we will compare **<span style='color:lightseagreen'>11 known algorithms</span>** and evaluate them with several different metrics. Those metrics are the following: 

- Average accuracy 
- Balanced accuracy
- F1 score
- ROC AUC 

We are going to evaluate the algorithms with a **<span style='color:lightseagreen'>stratified k-fold cross-validation</span>** procedure. The algorithms are listed below: 

- Support Vector Classifier (SVC)
- Decision Tree
- AdaBoost
- Random Forest
- Extra Trees
- Gradient Boosting
- K-Nearest Neighbours (KNN)
- Linear Discriminant Analysis
- Extreme Gradient Boosting (XGBoost)
- Catboost Classifier
- LGBM Classifier

To begin with, we create a stratified k-fold splitter. Then we test each of the algorithms mentioned above with it.
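
As a compact illustration of this procedure, the sketch below scores two of the listed model families on synthetic data with `cross_validate`, which computes several metrics in a single pass per model. The data and the names `X_cmp`, `y_cmp` and `summary` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X_cmp, y_cmp = make_classification(n_samples=400, n_features=10, random_state=2)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
metrics = ["accuracy", "balanced_accuracy", "f1", "roc_auc"]
models = {
    "DecisionTree": DecisionTreeClassifier(random_state=2),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=2),
}

rows = []
for name, clf in models.items():
    # One call fits each fold once and evaluates every metric on it.
    res = cross_validate(clf, X_cmp, y_cmp, cv=folds, scoring=metrics)
    rows.append({"Algorithm": name, **{m: res[f"test_{m}"].mean() for m in metrics}})

summary = pd.DataFrame(rows).set_index("Algorithm")
print(summary.round(3))
```

Compared with calling `cross_val_score` once per metric, this fits each model only once per fold, which matters for the slower boosting models.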

In [None]:
X_train = df_data[df_data.Transported.isnull() == False].drop(['Transported','PassengerId'],axis=1)
y = df_data[df_data.Transported.isnull() == False].Transported
X_test = df_data[df_data.Transported.isnull() == True].drop(['Transported','PassengerId'],axis=1).copy()

kfold = StratifiedKFold(n_splits=10)
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LinearDiscriminantAnalysis())
classifiers.append(XGBClassifier(random_state = random_state))
classifiers.append(CatBoostClassifier(random_state = random_state))
classifiers.append(LGBMClassifier(random_state = random_state))

cv_accuracy = []
cv_balanced_accuracy = []
cv_f1 = []
cv_roc = []
for classifier in classifiers :
    cv_accuracy.append(cross_val_score(classifier, X_train, y = y, scoring = "accuracy", cv = kfold, n_jobs=4))
    cv_balanced_accuracy.append(cross_val_score(classifier, X_train, y = y, scoring = "balanced_accuracy", cv = kfold, n_jobs=4))
    cv_f1.append(cross_val_score(classifier, X_train, y = y, scoring = "f1", cv = kfold, n_jobs=4))
    cv_roc.append(cross_val_score(classifier, X_train, y = y, scoring = "roc_auc", cv = kfold, n_jobs=4))
    
accuracy = []
balanced_accuracy = []
f1 = []
roc = []
for cv_result in cv_accuracy:
    accuracy.append(cv_result.mean())
for cv_result in cv_balanced_accuracy:
    balanced_accuracy.append(cv_result.mean())
for cv_result in cv_f1:
    f1.append(cv_result.mean())
for cv_result in cv_roc:
    roc.append(cv_result.mean())
    
comparison = pd.DataFrame({"Accuracy":accuracy,"Balanced Accuracy": balanced_accuracy,"ROC AUC":roc,
                           "F1 Score": f1,"Algorithm":["SVC","DecisionTree","AdaBoost","RandomForest","ExtraTrees",
                            "GradientBoosting","KNeighboors","LinearDiscriminantAnalysis",'XGBClassifier','CatBoostClassifier','LGBMClassifier']})
clear_output()

In [None]:
fig = make_subplots(rows=2, cols=2, specs=[[{'type':'bar'},{'type':'bar'}], [{'type':'bar'}, {'type':'bar'}]], 
                   subplot_titles=('Accuracy','Balanced Accuracy','F1 Score','ROC AUC'),
                   column_widths=[0.5, 0.5], vertical_spacing=0.15, horizontal_spacing=0.05)

# Accuracy Chart
acc_sorted = comparison.sort_values('Accuracy')  # sort whole rows so labels stay aligned with values
fig.add_trace(go.Bar(y=acc_sorted['Accuracy'], x=acc_sorted['Algorithm'], marker = dict(color=px.colors.sequential.Sunsetdark[0]), 
                     name = 'Accuracy'),row=1, col=1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

# Balanced Accuracy Chart
bal_sorted = comparison.sort_values('Balanced Accuracy')  # sort whole rows so labels stay aligned with values
fig.add_trace(go.Bar(y=bal_sorted['Balanced Accuracy'], x=bal_sorted['Algorithm'], marker = dict(color=px.colors.sequential.Sunsetdark[1]), 
                     name = 'Balanced Accuracy'),row=1, col=2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# F1 Score Chart
f1_sorted = comparison.sort_values('F1 Score')  # sort whole rows so labels stay aligned with values
fig.add_trace(go.Bar(y=f1_sorted['F1 Score'], x=f1_sorted['Algorithm'], marker = dict(color=px.colors.sequential.Sunsetdark[2]), 
                     name = 'F1 Score'),row=2, col=1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=2, col=1)

# ROC AUC Chart
roc_sorted = comparison.sort_values('ROC AUC')  # sort whole rows so labels stay aligned with values
fig.add_trace(go.Bar(y=roc_sorted['ROC AUC'], x=roc_sorted['Algorithm'], marker = dict(color=px.colors.sequential.Sunsetdark[3]), 
                     name = 'ROC AUC'),row=2, col=2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=2, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=2, col=2)

# General Styling
fig.update_layout(height=800, bargap=0.2,
                  margin=dict(b=50,r=30,l=100), xaxis=dict(tickmode='linear'),
                  title_text="Model Comparison Analysis",
                  #template="plotly_dark",
                  paper_bgcolor="#303330",
                  plot_bgcolor = "#303330",
                  title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.2 | Hyperparameter Tuning - Optuna</b></p>
</div>

Here we tune only CatBoost, using **<span style='color:lightseagreen'>Optuna</span>**. The code for the hyperparameter search is included below. However, to avoid wasting CPU time, since I have already run it once, I simply create the model with the hyperparameter values found in that run. Whether the search runs or not is controlled with **<span style='color:lightseagreen'>allow_optimize</span>**.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def objective(trial):
    params = {
        "random_state":trial.suggest_categorical("random_state", [2022]),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
        'learning_rate' : trial.suggest_float('learning_rate', 0.0001, 0.3, log=True),
        'bagging_temperature' :trial.suggest_float('bagging_temperature', 0.01, 100.00, log=True),
        "n_estimators": 1000,
        "max_depth":trial.suggest_int("max_depth", 4, 16),
        'random_strength' :trial.suggest_int('random_strength', 0, 100),
        "l2_leaf_reg":trial.suggest_float("l2_leaf_reg",1e-8,3e-5),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "max_bin": trial.suggest_int("max_bin", 200, 500),
        #'task_type': trial.suggest_categorical('task_type', ['GPU']),
        'eval_metric': trial.suggest_categorical('eval_metric', ['Accuracy'])
    }

    model = CatBoostClassifier(**params)
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X_train, y, test_size=0.3, random_state=42)
    model.fit(
        X_train_tmp, y_train_tmp,
        eval_set=[(X_valid_tmp, y_valid_tmp)],
        early_stopping_rounds=35, verbose=0
    )
        
    # Accuracy on the held-out split is the value Optuna maximizes.
    y_valid_pred = model.predict(X_valid_tmp)
    valid_accuracy = accuracy_score(y_valid_tmp, y_valid_pred)
    
    return valid_accuracy

allow_optimize = 0

In [None]:
TRIALS = 100
TIMEOUT = 3600

if allow_optimize:
    sampler = TPESampler(seed=42)

    study = optuna.create_study(
        study_name = 'cat_parameter_opt',
        direction = 'maximize',
        sampler = sampler,
    )
    study.optimize(objective, n_trials=TRIALS, timeout=TIMEOUT)
    print("Best Score:",study.best_value)
    print("Best trial",study.best_trial.params)
    
    best_params = study.best_params
    
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X_train, y, test_size=0.3, random_state=42)
    model_tmp = CatBoostClassifier(**best_params, n_estimators=30000, verbose=1000).fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)
    clear_output()

In [None]:
if allow_optimize:
    # get_best_iteration() is zero-based, so add 1 to get the number of trees.
    model = CatBoostClassifier(**best_params, n_estimators=model_tmp.get_best_iteration() + 1, verbose=1000).fit(X_train, y)
else:
    model = CatBoostClassifier(
        early_stopping_rounds=10,
        silent = True,
        random_state = 2022, learning_rate = 0.10851035034096206, bagging_temperature = 1.6587780950847202, max_depth = 4, 
        random_strength = 57, l2_leaf_reg = 1.0527606077755484e-05, min_child_samples = 60, max_bin = 203, eval_metric = 'Accuracy'
    ).fit(X_train, y)    

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.3 | Permutation Importance</b></p>
</div>

One of the most basic questions we might ask of a model is: **<span style='color:lightseagreen'>What features have the biggest impact on predictions?</span>** This concept is called feature importance. There are multiple ways to measure feature importance. Some approaches answer subtly different versions of the question above. Other approaches have documented shortcomings. In this section, we'll focus on permutation importance. Compared to most other approaches, permutation importance is:

- Fast to calculate,
- Widely used and understood, and
- Consistent with properties we would want a feature importance measure to have.


In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model, random_state=1).fit(X_train, y)
pred = model.predict(X_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

**📌 Interpret:** the values towards the top are the **<span style='color:lightseagreen'>most important features</span>**, and those towards the bottom matter least. The first number in each row shows how much **<span style='color:lightseagreen'>model performance decreased</span>** with a random shuffling (in this case, using "accuracy" as the performance metric). Like most things in data science, there is some **<span style='color:lightseagreen'>randomness</span>** in the exact performance change from shuffling a column. We measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The number after the ± measures how **<span style='color:lightseagreen'>performance varied from one reshuffling to the next</span>**. You'll occasionally see negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than those on the real data. This happens when the feature didn't matter (it should have had an importance close to 0), but random chance caused the predictions on shuffled data to be more accurate. This is more common with small datasets, because there is more room for luck/chance.
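
The same idea can be sketched with scikit-learn's own `permutation_importance`, used here as a swapped-in alternative to eli5 and evaluated on a held-out split rather than on the training rows. The data is synthetic, and the column names (`info_*`, `noise_*`) are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the informative columns come first, the rest is noise.
cols = ["info_0", "info_1", "noise_0", "noise_1", "noise_2"]
X_pi, y_pi = make_classification(
    n_samples=600, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=1,
)
X_pi = pd.DataFrame(X_pi, columns=cols)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_pi, y_pi, test_size=0.3, random_state=1, stratify=y_pi
)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Shuffle each column 10 times on the validation split and record the accuracy drop.
result = permutation_importance(
    rf, X_va, y_va, scoring="accuracy", n_repeats=10, random_state=1
)
importances = pd.DataFrame(
    {"mean": result.importances_mean, "std": result.importances_std}, index=cols
).sort_values("mean", ascending=False)
print(importances)
```

The `mean` column corresponds to the first number in each eli5 row and `std` to the spread across reshuffles. Noise columns should land near zero and may come out slightly negative.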

In [None]:
submit = pd.DataFrame({'PassengerId': df_data[df_data.Transported.isnull() == True]['PassengerId'].astype(object), 'Transported':pred}).set_index('PassengerId')
# The competition expects True/False labels, so convert the 0/1 predictions back.
submit['Transported'] = submit['Transported'].astype(bool)
submit.to_csv('./submission.csv')