# <b>1 <span style='color:lightseagreen'>|</span> Introduction</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.1 | Goal</b></p>
</div>

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system. Help save them and change history!

In [None]:
from IPython.display import clear_output
import os
import warnings
from pathlib import Path

# Basic libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling as pp
import seaborn as sns

# Clustering
from sklearn.cluster import KMeans

# Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

#Mutual Information
from sklearn.feature_selection import mutual_info_regression

# Cross Validation
from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold, learning_curve, train_test_split

# Encoding
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from category_encoders import MEstimateEncoder

# Algorithms
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Optuna - Bayesian Optimization 
import optuna
from optuna.samplers import TPESampler

# Plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go

# Spaceship Titanic Metric
from sklearn.metrics import accuracy_score

warnings.filterwarnings('ignore')

def load_data():
    data_dir = Path("../input/spaceship-titanic")
    df_train = pd.read_csv(data_dir / "train.csv")
    df_test = pd.read_csv(data_dir / "test.csv")
    # Merge the splits so we can process them together
    df = pd.concat([df_train, df_test])
    return df

def plot_feature_importance(importance,names,model_type):
    
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    
    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)
    
    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_df = fi_df[fi_df.feature_importance > 0]
    fig = px.bar(fi_df, x='feature_names', y='feature_importance', color="feature_importance",
             color_continuous_scale='Blugrn')
    # General Styling
    fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100,t=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Feature Importance Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)
    fig.show()

df_data = load_data()
clear_output()
pp.ProfileReport(df_data)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.7 | Reducing Memory Usage</b></p>
</div>

To avoid **<span style='color:lightseagreen'>memory issues</span>** in the kernel, we reduce the DataFrame's memory usage with the following function. The output below shows that the reduction was successful, with a **<span style='color:lightseagreen'>reduction of 20%</span>**.

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            # Downcast to the smallest dtype whose range still covers the column.
            # Bounds are inclusive so values equal to a dtype's limit still qualify.
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                # float32 halves the storage of float64 at the cost of some precision
                if c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
 
    return df

df_data = reduce_mem_usage(df_data)
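
As a rough illustration of where the saving comes from, the sketch below builds a small synthetic frame (the column names, sizes and value ranges are made up, not taken from the competition data) and compares its memory footprint before and after downcasting.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "spend": rng.uniform(0, 1000, size=10_000),            # float64
    "age": rng.integers(0, 80, size=10_000, dtype=np.int64),  # int64
})

before = demo.memory_usage(index=False).sum()

# Downcast each column to a smaller dtype that still holds its values
demo["spend"] = demo["spend"].astype(np.float32)
demo["age"] = demo["age"].astype(np.int8)

after = demo.memory_usage(index=False).sum()
reduction = 100 * (before - after) / before
print(f"{before} bytes -> {after} bytes ({reduction:.1f}% reduction)")
```

The saving on a real frame depends on how many columns are numeric; object columns such as names are not touched by this kind of downcasting.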

# <b>3 <span style='color:lightseagreen'>|</span> Missing Values</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.1 | Categorical Features</b></p>
</div>

The initial profiling report shows that many features have missing values. This section focuses on filling them, starting with the features of object dtype. Let's take a quick look at them.

In [None]:
df_data.select_dtypes(['object']).head()

### 3.1.1 | HomePlanet

> **<span style='color:gray'>HomePlanet description: the planet the passenger departed from, typically their planet of permanent residence.</span>**

Let's start with HomePlanet. As the report shows, this feature is categorical with three values: Mars, Earth and Europa. We compute the mode and fill the missing values with it.

In [None]:
def filling_HomePlanet(df):
    mode = df['HomePlanet'].value_counts().index[0]
    df['HomePlanet'] = df['HomePlanet'].fillna(mode)
    return df
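
The mode-filling idea can be seen on a toy column. This is only a sketch with made-up values; it relies on `value_counts()` ignoring missing entries and sorting by frequency, so the first index label is the most frequent category.

```python
import numpy as np
import pandas as pd

# Toy column, not the competition data
planets = pd.Series(["Earth", "Mars", np.nan, "Earth", "Europa", np.nan])

# value_counts() skips NaN and sorts by frequency, so index[0] is the mode
mode = planets.value_counts().index[0]
filled = planets.fillna(mode)

print(mode)
print(filled.tolist())
```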

### 3.1.2 | CryoSleep

> **<span style='color:gray'>CryoSleep description: indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.</span>**

Next comes CryoSleep, which the report shows is a boolean feature. A passenger who elected to be put into suspended animation would rarely have that choice go unrecorded, so we replace missing values with False.

In [None]:
def filling_CryoSleep(df):
    df['CryoSleep'] = df['CryoSleep'].fillna(False)
    return df

### 3.1.3 | Cabin

> **<span style='color:gray'>Cabin description: the cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.</span>**

Next comes Cabin, which the report shows is categorical. Because it is almost impossible to estimate a passenger's cabin in the given format, we split it into three features describing deck, number and side. Feature engineering therefore starts here (and continues in detail later). We then fill missing deck values with F, the most frequent deck, and fill missing side values with the most frequent side among deck F cabins. Finally, we fill missing cabin numbers with half of the maximum cabin number, a neutral midpoint, since the transported rate within a deck could depend on whether a cabin is among the first or the last.

In [None]:
def split_Cabin(df):
    # Cabin has the form deck/num/side; split once and reuse the three parts
    parts = df['Cabin'].str.split("/", n=2, expand=True)
    df['Deck'] = parts[0]
    df['Number'] = parts[1]
    df['Side'] = parts[2]
    df.pop('Cabin')
    return df

def filling_Cabin(df):
    df['Deck'] = df['Deck'].fillna('F')
    # Most frequent side among deck F cabins
    mode = df[df.Deck == 'F']['Side'].value_counts().index[0]
    # Fill only the missing sides; assigning the mode directly would overwrite every row
    df['Side'] = df['Side'].fillna(mode)
    df['Number'] = df['Number'].astype(float)
    # 1796 is the value used here as the maximum cabin number
    df['Number'] = df['Number'].fillna(1796 / 2)
    return df
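
To make the deck/num/side split concrete, here is a minimal sketch on three toy cabin strings (the values are illustrative). It shows that a missing cabin produces missing values in all three derived columns, which is why each of them needs its own filling rule.

```python
import numpy as np
import pandas as pd

# Toy values in deck/num/side form, plus one missing cabin
cabins = pd.Series(["B/0/P", "F/12/S", np.nan])

# A single split with expand=True returns one column per component
parts = cabins.str.split("/", n=2, expand=True)
parts.columns = ["Deck", "Number", "Side"]

# The number arrives as text, so convert it before doing arithmetic on it
parts["Number"] = parts["Number"].astype(float)
print(parts)
```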

### 3.1.4 | Destination

> **<span style='color:gray'>Destination description: the planet the passenger will be debarking to.</span>**

Next comes Destination, which the report shows is categorical. We fill missing values with the most frequent value.

In [None]:
def filling_Destination(df):
    mode = df['Destination'].value_counts().index[0]
    df['Destination'] = df['Destination'].fillna(mode)
    return df

### 3.1.5 | VIP

> **<span style='color:gray'>VIP description: whether the passenger has paid for special VIP service during the voyage.</span>**

Next comes VIP. It would be strange for a passenger who paid for VIP service to have gone unrecorded during data collection, so I replace missing values with False.

In [None]:
def filling_VIP(df):
    df['VIP'] = df['VIP'].fillna(False)
    return df

### 3.1.6 | Name

> **<span style='color:gray'>Name description: the first and last names of the passenger.</span>**

Lastly, let's look at the Name feature. A passenger's first and last name cannot reasonably be guessed, so we replace missing values with the placeholder string 'None'.

In [None]:
def filling_Name(df):
    df['Name'] = df['Name'].fillna('None')
    return df

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.2 | Numerical Features</b></p>
</div>

The initial profiling report shows that some quantitative features also have missing values. This section focuses on filling them.

### 3.2.1 | Age
> **<span style='color:gray'>Age description: the age of the passenger.</span>**

We start with the Age feature, replacing missing values with the median age.

In [None]:
def filling_Age(df):
    # Compute the median directly instead of indexing describe() by position
    median = df['Age'].median()
    df['Age'] = df['Age'].fillna(median)
    return df

### 3.2.2 | Luxury Features

> **<span style='color:gray'>Luxury Features description: amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.</span>**

Now we turn to the luxury features, which record the amount of money a passenger has spent at each amenity. As it would be quite unusual for a paying passenger's payment data not to be recorded, we assume that missing values refer to passengers who have not spent anything on those luxuries.

In [None]:
def filling_luxury_features(df):
    luxury_features = ['RoomService','FoodCourt', 'ShoppingMall', 'Spa','VRDeck']
    df[luxury_features] = df[luxury_features].fillna(0.0)
    return df

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.3 | Missing Values Filling Function</b></p>
</div>

In [None]:
def filling_numerical(df):
    df = filling_Age(df)
    df = filling_luxury_features(df)
    return df

def filling_categorical(df):
    df = filling_HomePlanet(df)
    df = filling_CryoSleep(df)
    df = split_Cabin(df)
    df = filling_Cabin(df)
    df = filling_Destination(df)
    df = filling_VIP(df)
    df = filling_Name(df)
    return df

def filling_missing(df):
    df = filling_categorical(df)
    df = filling_numerical(df)
    return df

df_data = filling_missing(df_data)
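
A quick way to confirm that a filling step did what was intended is to count the missing values left in each column. The sketch below does this on a toy frame with made-up values; the same `isnull().sum()` call could be run on `df_data`, where `Transported` should be the only column still reporting gaps, because the test rows carry no label.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in a numeric and a categorical column (illustrative values)
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 58.0, 33.0],
    "Destination": ["TRAPPIST-1e", None, "55 Cancri e", "TRAPPIST-1e"],
})

toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Destination"] = toy["Destination"].fillna(toy["Destination"].value_counts().index[0])

# Count the missing values left in each column
remaining = toy.isnull().sum()
print(remaining)
```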

# <b>2 <span style='color:lightseagreen'>|</span> Exploratory Data Analysis</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.1 | Age Analysis</b></p>
</div>

### 2.1.1 | Age Distribution

In [None]:
age_count = pd.DataFrame(df_data[df_data.Transported == False].Age.value_counts().sort_values()).reset_index()
age_count.columns = ['age','count']

# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('Age Distribution over Training Set','Age Distribution over Test Set'))

# Not transported People
fig.add_trace(go.Bar(x=age_count['age'], y=age_count['count'], marker = dict(color=px.colors.sequential.Viridis[5]), name='Non Transported from Train'), row = 1, col = 1)

age_count = pd.DataFrame(df_data[df_data.Transported == True].Age.value_counts().sort_values()).reset_index()
age_count.columns = ['age','count']

fig.add_trace(go.Bar(x=age_count['age'], y=age_count['count'], marker = dict(color=px.colors.qualitative.Plotly[4]), name='Transported from Train'), row = 1, col = 1)


fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

age_count = pd.DataFrame(df_data[df_data.Transported.isnull() == True].Age.value_counts().sort_values()).reset_index()
age_count.columns = ['age','count']

fig.add_trace(go.Bar(x=age_count['age'], y=age_count['count'], marker = dict(color=px.colors.sequential.Viridis[5]), name='Age Distribution from Test'), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=500, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Age Distribution Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=True)

### 2.1.2 | Skewness and Kurtosis

[Skewness and Kurtosis Tutorial](https://www.universoformulas.com/estadistica/descriptiva/asimetria-curtosis/)

Let's now check **<span style='color:lightseagreen'>skewness</span>**. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right. In cases where one tail is long but the other tail is fat, skewness does not obey a simple rule. For example, a zero value means that the tails on both sides of the mean balance out overall; this is the case for a symmetric distribution, but can also be true for an asymmetric distribution where one tail is long and thin, and the other is short but fat. As a rule of thumb, a feature whose skewness exceeds 1 is generally considered skewed, so those are the ones to check first.

In [None]:
from scipy.stats import skew, kurtosis
skew(df_data[df_data.Age.isnull() == False]['Age'])
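
To connect the definition with the number returned by `scipy.stats.skew`, the following sketch computes the third standardized moment by hand on two synthetic samples and compares it with SciPy's default (biased) estimate. The samples and the helper name `sample_skewness` are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
symmetric = rng.normal(loc=30, scale=10, size=5_000)     # roughly symmetric
right_tailed = rng.exponential(scale=200, size=5_000)    # long right tail

def sample_skewness(x):
    """Third central moment divided by the second central moment to the power 3/2."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

print(sample_skewness(symmetric), skew(symmetric))
print(sample_skewness(right_tailed), skew(right_tailed))
```

The symmetric sample should land near zero, while the right-tailed one should be clearly positive.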

Let's now move into **<span style='color:lightseagreen'>kurtosis</span>**. In probability theory and statistics, kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Different measures of kurtosis may have different interpretations.

The standard measure of a distribution's kurtosis, originating with Karl Pearson, is a scaled version of the fourth moment of the distribution. This number is related to the tails of the distribution, not its peak; hence, the sometimes-seen characterization of kurtosis as "peakedness" is incorrect. For this measure, higher kurtosis corresponds to greater extremity of deviations (or outliers), and not the configuration of data near the mean.

In [None]:
kurtosis(df_data[df_data.Age.isnull() == False]['Age'])
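
One detail worth keeping in mind when reading the value above: `scipy.stats.kurtosis` uses the Fisher definition by default, which subtracts 3 from Pearson's fourth standardized moment so that a normal distribution scores about 0. A small sketch on a synthetic normal sample shows the two conventions side by side.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)  # synthetic, for illustration only

excess = kurtosis(sample)                 # Fisher definition (default)
pearson = kurtosis(sample, fisher=False)  # Pearson definition
print(excess, pearson)
```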

### 2.1.3 | Target vs Age

The following chart plots the mean transported rate for each age. **<span style='color:lightseagreen'>Be careful</span>**: as the histogram above showed, **<span style='color:lightseagreen'>the number of passengers varies from one age to another</span>**. Thus, the results are more reliable for ages with high counts.

In [None]:
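# Keep labelled rows only and work on a copy so df_data is left untouched
temp = df_data[df_data.Transported.isnull() == False][['Transported','Age']].copy()
temp = temp[temp.Age.isnull() == False]
temp['Transported'] = temp['Transported'].astype(int)
# Replace each label with the mean transported rate of its age group
temp['Transported'] = temp.groupby('Age')['Transported'].transform('mean')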

fig = px.scatter(temp, x='Age',y='Transported', color="Transported", color_continuous_scale='Blugrn', size='Transported')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "<span style='font-size:36px; font-family:Times New Roman'>Transported Probability per Age Analysis</span>",                  
              plot_bgcolor='rgb(242,242,242)',
              paper_bgcolor = 'rgb(242,242,242)',
              font=dict(family="Times New Roman", size= 14),
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()
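
Because the reliability of each point depends on how many passengers share that age, it can help to keep the group size next to the rate. The sketch below does this on a toy frame with made-up labels; the column names `rate` and `n` are chosen here for illustration.

```python
import pandas as pd

# Toy labels, for illustration only
toy = pd.DataFrame({
    "Age": [20, 20, 20, 20, 75, 75],
    "Transported": [1, 0, 1, 1, 1, 1],
})

# Keep the group size next to the rate so thinly populated ages are easy to spot
rate_by_age = (
    toy.groupby("Age")["Transported"]
       .agg(rate="mean", n="count")
       .reset_index()
)
print(rate_by_age)
```

A rate computed from two passengers deserves far less weight than one computed from several hundred, even if both appear as a single point in the scatter plot.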

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.2 | Room Service Analysis</b></p>
</div>

### 2.2.1 | Room Service Cumulative Distribution 

From the following chart we can conclude that most passengers do not spend money on room service. The cumulative distribution becomes an almost horizontal line at around $1,000; below that amount the curve is still rising. In short, the higher the amount spent, the fewer the passengers who spent it.

In [None]:
# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('RoomService Cumulative Distribution over Training Set','RoomService Cumulative Distribution over Test Set'))

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == False]['RoomService'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == True]['RoomService'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Room Service Cumulative Distribution Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)
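
The cumulative histograms show counts; dividing by the number of passengers gives the share who spent at most a given amount. The sketch below computes that share on synthetic spending data whose shape (many zeros, a few large amounts) is an assumption, and `share_at_or_below` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Synthetic spending: most passengers spend nothing, a few spend a lot (assumed shape)
rng = np.random.default_rng(1)
is_zero = rng.random(2_000) < 0.65
spend = np.where(is_zero, 0.0, rng.exponential(scale=400, size=2_000))

def share_at_or_below(values, threshold):
    """Empirical CDF: fraction of observations that are <= threshold."""
    return float(np.mean(np.asarray(values) <= threshold))

for threshold in (0, 100, 1000, 5000):
    print(threshold, round(share_at_or_below(spend, threshold), 3))
```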

### 2.2.2 | Skewness and Kurtosis

As you might expect, skewness and kurtosis are fairly predictable here. Since most passengers spend nothing and a few spend a lot, the distribution has a long right tail, so the skewness is positive and large. The respective values are printed below.

In [None]:
print('Skewness: ', skew(df_data[df_data.RoomService.isnull() == False]['RoomService']))
print('Kurtosis: ', kurtosis(df_data[df_data.RoomService.isnull() == False]['RoomService']))

### 2.2.3 | Target vs RoomService

In the next chart we plot the **<span style='color:lightseagreen'>mean transported rate for each RoomService spending group</span>**. As shown below, passengers who spend less are more likely to be transported, and the rate decreases as the amount spent on this service increases.

In [None]:
temp = df_data[df_data.Transported.isnull() == False][['Transported','RoomService']].copy()
temp = temp[temp.RoomService.isnull() == False]
temp['Transported'] = temp['Transported'].astype(int)
# Bin spending into up to 20 quantile groups; duplicate edges (many zeros) are merged
temp['RoomService'] = pd.qcut(temp['RoomService'], 20, duplicates='drop').astype(str)
# One row per spending group with its mean transported rate
temp = temp.groupby('RoomService', as_index=False)['Transported'].mean()
temp = temp.sort_values(by='Transported')

fig = px.bar(temp, x='RoomService',y='Transported', color="Transported", color_continuous_scale = 'Blugrn')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "<span style='font-size:36px; font-family:Times New Roman'>Transported Probability per RoomService Analysis</span>",                  
              plot_bgcolor='rgb(242,242,242)',
              paper_bgcolor = 'rgb(242,242,242)',
              font=dict(family="Times New Roman", size= 14),
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()
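
The charts in this part show far fewer than 20 bars even though 20 quantile groups are requested. The sketch below, on a made-up zero-heavy column, illustrates why: many quantile edges coincide at zero, and `duplicates='drop'` merges them into a single bin.

```python
import numpy as np
import pandas as pd

# 70 zeros and 30 positive amounts: a zero-heavy column (toy values)
spend = pd.Series(np.concatenate([np.zeros(70), np.arange(1, 31)]))

# Most of the 21 requested quantile edges fall on 0; duplicates='drop' merges them
binned = pd.qcut(spend, 20, duplicates="drop")
n_bins = binned.cat.categories.size

print(n_bins)
print(binned.value_counts().sort_index())
```

The first bin collects every passenger who spent nothing, so it is much larger than the others and its rate is the most stable one in the chart.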

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.2 | Food Court Analysis</b></p>
</div>

### 2.2.1 | Food Court Cumulative Distribution 

From the following chart we can conclude that most passengers do not spend money at the Food Court. The cumulative distribution becomes an almost horizontal line at around $1,000; below that amount the curve is still rising. In short, the higher the amount spent, the fewer the passengers who spent it.

In [None]:
# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('Food Court Cumulative Distribution over Training Set','Food Court Cumulative Distribution over Test Set'))

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == False]['FoodCourt'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == True]['FoodCourt'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Food Court Cumulative Distribution Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

### 2.2.2 | Skewness and Kurtosis

As you might expect, skewness and kurtosis are fairly predictable here. Since most passengers spend nothing and a few spend a lot, the distribution has a long right tail, so the skewness is positive and large. The respective values are printed below.

In [None]:
print('Skewness: ', skew(df_data[df_data.FoodCourt.isnull() == False]['FoodCourt']))
print('Kurtosis: ', kurtosis(df_data[df_data.FoodCourt.isnull() == False]['FoodCourt']))

### 2.2.3 | Target vs FoodCourt

In the next chart we plot the **<span style='color:lightseagreen'>mean transported rate for each FoodCourt spending group</span>**. As shown below, the result differs somewhat from the one obtained for RoomService. In this case, passengers who spent a small amount have the lowest transported rate. On the other hand, passengers who spent larger amounts and passengers who spent nothing are more likely to be transported.

In [None]:
temp = df_data[df_data.Transported.isnull() == False][['Transported','FoodCourt']].copy()
temp = temp[temp.FoodCourt.isnull() == False]
temp['Transported'] = temp['Transported'].astype(int)
# Bin spending into up to 20 quantile groups; duplicate edges (many zeros) are merged
temp['FoodCourt'] = pd.qcut(temp['FoodCourt'], 20, duplicates='drop').astype(str)
# One row per spending group with its mean transported rate
temp = temp.groupby('FoodCourt', as_index=False)['Transported'].mean()
temp = temp.sort_values(by='Transported')

fig = px.bar(temp, x='FoodCourt',y='Transported', color="Transported", color_continuous_scale = 'Blugrn')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "<span style='font-size:36px; font-family:Times New Roman'>Transported Probability per FoodCourt Analysis</span>",                  
              plot_bgcolor='rgb(242,242,242)',
              paper_bgcolor = 'rgb(242,242,242)',
              font=dict(family="Times New Roman", size= 14),
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.3 | Shopping Mall Analysis</b></p>
</div>

### 2.3.1 | Shopping Mall Cumulative Distribution 

From the following chart we can conclude that most passengers do not spend money at the Shopping Mall. The cumulative distribution becomes an almost horizontal line at around $1,000; below that amount the curve is still rising. In short, the higher the amount spent, the fewer the passengers who spent it.

In [None]:
# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('Shopping Mall Cumulative Distribution over Training Set','Shopping Mall Cumulative Distribution over Test Set'))

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == False]['ShoppingMall'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == True]['ShoppingMall'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Shopping Mall Cumulative Distribution Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

### 2.3.2 | Skewness and Kurtosis

As you might expect, skewness and kurtosis are fairly predictable here. Since most passengers spend nothing and a few spend a lot, the distribution has a long right tail, so the skewness is positive and large. The respective values are printed below.

In [None]:
print('Skewness: ', skew(df_data[df_data.ShoppingMall.isnull() == False]['ShoppingMall']))
print('Kurtosis: ', kurtosis(df_data[df_data.ShoppingMall.isnull() == False]['ShoppingMall']))

### 2.3.3 | Target vs ShoppingMall

In the next chart we plot the **<span style='color:lightseagreen'>mean transported rate for each ShoppingMall spending group</span>**.

In [None]:
temp = df_data[df_data.Transported.isnull() == False][['Transported','ShoppingMall']].copy()
temp = temp[temp.ShoppingMall.isnull() == False]
temp['Transported'] = temp['Transported'].astype(int)
# Bin spending into up to 20 quantile groups; duplicate edges (many zeros) are merged
temp['ShoppingMall'] = pd.qcut(temp['ShoppingMall'], 20, duplicates='drop').astype(str)
# One row per spending group with its mean transported rate
temp = temp.groupby('ShoppingMall', as_index=False)['Transported'].mean()
temp = temp.sort_values(by='Transported')

fig = px.bar(temp, x='ShoppingMall',y='Transported', color="Transported", color_continuous_scale = 'Blugrn')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "<span style='font-size:36px; font-family:Times New Roman'>Transported Probability per ShoppingMall Analysis</span>",                  
              plot_bgcolor='rgb(242,242,242)',
              paper_bgcolor = 'rgb(242,242,242)',
              font=dict(family="Times New Roman", size= 14),
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()
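
The binning step used for this chart can be sketched on a tiny, made-up table. The column names `Spend` and `SpendBin` and all values are assumptions for illustration only; the point is how `pd.qcut` with `duplicates='drop'` merges repeated bin edges when many values are zero.

```python
import pandas as pd

# Hypothetical toy data: half of the passengers spend nothing.
toy = pd.DataFrame({
    'Spend': [0, 0, 0, 0, 10, 50, 200, 900],
    'Transported': [True, True, False, True, False, True, False, False],
})

# Ask for 4 quantile bins; the repeated edge at 0 is dropped, so fewer bins remain.
toy['SpendBin'] = pd.qcut(toy['Spend'], 4, duplicates='drop').astype(str)

# Share of transported passengers per bin.
rate = (toy.assign(Transported=toy['Transported'].astype(int))
           .groupby('SpendBin', as_index=False)['Transported'].mean())
print(rate)
```

Because of the merged edges, the number of bars in the chart can be well below the number of bins requested.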

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.4 | Spa Analysis</b></p>
</div>

### 2.4.1 | Spa Cumulative Distribution 

From the following chart we can conclude that most people do not spend money on the Spa. The cumulative distribution rises quickly at low amounts and becomes almost a horizontal line at around $1000. In short, the larger the amount, the fewer the passengers who spent it.

In [None]:
# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('Spa Cumulative Distribution over Training Set','Spa Cumulative Distribution over Test Set'))

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == False]['Spa'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == True]['Spa'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Spa Cumulative Distribution Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)
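
The same reading can be done numerically with an empirical cumulative share. The sketch below uses a made-up array, and `share_at_or_below` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical spending values: many zeros and a few large amounts.
spend = np.array([0, 0, 0, 0, 0, 0, 25, 80, 300, 1200], dtype=float)

def share_at_or_below(values, threshold):
    """Fraction of observations that are <= threshold (empirical CDF)."""
    return float(np.mean(values <= threshold))

for t in (0, 100, 1000):
    print(t, share_at_or_below(spend, t))
```

A curve that is already high at 0 and nearly flat after some amount means that almost all passengers are at or below that amount.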

### 2.4.2 | Skewness and Kurtosis

Both statistics are fairly predictable here. Because most passengers spend nothing and a few spend a lot, the distribution has a long right tail, so we expect a large positive skewness and a high kurtosis. Their values are:

In [None]:
print('Skewness: ', skew(df_data[df_data.Spa.isnull() == False]['Spa']))
print('Kurtosis: ', kurtosis(df_data[df_data.Spa.isnull() == False]['Spa']))

### 2.4.3 | Target vs Spa

In the next chart we plot the **<span style='color:lightseagreen'>mean of Transported (the share of passengers transported) for each Spa spending group</span>**.

In [None]:
temp = df_data[df_data.Transported.notnull()][['Transported','Spa']].copy()
temp = temp[temp.Spa.notnull()]
temp['Transported'] = temp['Transported'].astype(int)
# Quantile bins; duplicated edges (many passengers spend 0) are merged
temp['Spa'] = pd.qcut(temp['Spa'], 20, duplicates='drop').astype(str)
# One row per bin with the mean of Transported. Grouping keeps each bin paired
# with its own mean, even if two bins happen to share the same mean value.
temp = temp.groupby('Spa', as_index=False)['Transported'].mean()
temp = temp.sort_values(by='Transported')

fig = px.bar(temp, x='Spa',y='Transported', color="Transported", color_continuous_scale = 'Blugrn')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "<span style='font-size:36px; font-family:Times New Roman'>Transported Probability per Spa Analysis</span>",                  
              plot_bgcolor='rgb(242,242,242)',
              paper_bgcolor = 'rgb(242,242,242)',
              font=dict(family="Times New Roman", size= 14),
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.5 | VRDeck Analysis</b></p>
</div>

### 2.5.1 | VRDeck Cumulative Distribution 

From the following chart we can conclude that most people do not spend money on the VRDeck. The cumulative distribution rises quickly at low amounts and becomes almost a horizontal line at around $1000. In short, the larger the amount, the fewer the passengers who spent it.

In [None]:
# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('VRDeck Cumulative Distribution over Training Set','VRDeck Cumulative Distribution over Test Set'))

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == False]['VRDeck'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

fig.add_trace(go.Histogram(x=df_data[df_data.Transported.isnull() == True]['VRDeck'], marker = dict(color=px.colors.sequential.Viridis[5]), cumulative_enabled=True), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>VRDeck Cumulative Distribution Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

### 2.5.2 | Skewness and Kurtosis

Both statistics are fairly predictable here. Because most passengers spend nothing and a few spend a lot, the distribution has a long right tail, so we expect a large positive skewness and a high kurtosis. Their values are:

In [None]:
print('Skewness: ', skew(df_data[df_data.VRDeck.isnull() == False]['VRDeck']))
print('Kurtosis: ', kurtosis(df_data[df_data.VRDeck.isnull() == False]['VRDeck']))

### 2.5.3 | Target vs VRDeck

In the next chart we plot the **<span style='color:lightseagreen'>mean of Transported (the share of passengers transported) for each VRDeck spending group</span>**.

In [None]:
temp = df_data[df_data.Transported.notnull()][['Transported','VRDeck']].copy()
temp = temp[temp.VRDeck.notnull()]
temp['Transported'] = temp['Transported'].astype(int)
# Quantile bins; duplicated edges (many passengers spend 0) are merged
temp['VRDeck'] = pd.qcut(temp['VRDeck'], 20, duplicates='drop').astype(str)
# One row per bin with the mean of Transported. Grouping keeps each bin paired
# with its own mean, even if two bins happen to share the same mean value.
temp = temp.groupby('VRDeck', as_index=False)['Transported'].mean()
temp = temp.sort_values(by='Transported')

fig = px.bar(temp, x='VRDeck',y='Transported', color="Transported", color_continuous_scale = 'Blugrn')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "<span style='font-size:36px; font-family:Times New Roman'>Transported Probability per VRDeck Analysis</span>",                  
              plot_bgcolor='rgb(242,242,242)',
              paper_bgcolor = 'rgb(242,242,242)',
              font=dict(family="Times New Roman", size= 14),
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.6 | VIP Analysis</b></p>
</div>

The chart below shows the value counts of the VIP feature (left) and the mean of Transported for each VIP value (right). VIP passengers are less likely to be transported, whereas a non-VIP passenger has roughly a 50% chance of being transported. The left panel also shows that most passengers are non-VIP.

In [None]:
temp = pd.DataFrame(df_data[df_data.VIP.isnull() == False]['VIP'].value_counts()).reset_index()
temp.columns = ['VIP', 'Count']

# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('VIP Count','Transported Mean vs VIP'))

fig.add_trace(go.Bar(x=temp['VIP'], y=temp['Count'], marker = dict(color=[px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

temp = df_data[(df_data.VIP.isnull() == False) & (df_data.Transported.isnull() == False)][['VIP','Transported']]
temp['Transported'] = temp.groupby('VIP')['Transported'].transform('mean')
temp = pd.DataFrame(temp['VIP'].unique(), temp['Transported'].unique()).reset_index()
temp.columns = ['Transported','VIP']

fig.add_trace(go.Bar(x=temp['VIP'], y=temp['Transported'], marker = dict(color=[px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>VIP Feature Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)
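
Both panels of this kind of chart can be summarised in one table. The following is only a sketch on made-up rows; the `'VIP'` / `'Non-VIP'` string labels are an assumption used to keep the example readable.

```python
import pandas as pd

# Hypothetical toy data with a categorical feature and a boolean target.
toy = pd.DataFrame({
    'VIP': ['Non-VIP', 'Non-VIP', 'Non-VIP', 'Non-VIP', 'VIP', 'VIP'],
    'Transported': [True, False, True, False, False, False],
})

# 'size' is the count per category, 'mean' is the share of transported passengers.
summary = toy.groupby('VIP')['Transported'].agg(['size', 'mean'])
print(summary)
```

Reading the count next to the mean helps to judge how much weight a category's rate deserves: a rate computed from a small group is noisier.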

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.7 | HomePlanet Feature Analysis</b></p>
</div>

In [None]:
temp = pd.DataFrame(df_data[df_data.HomePlanet.isnull() == False]['HomePlanet'].value_counts()).reset_index()
temp.columns = ['HomePlanet', 'Count']

# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('HomePlanet Count','Transported Mean vs HomePlanet'))

fig.add_trace(go.Bar(x=temp['HomePlanet'], y=temp['Count'], marker = dict(color=[px.colors.sequential.Blugrn[6], px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

temp = df_data[(df_data.HomePlanet.isnull() == False) & (df_data.Transported.isnull() == False)][['HomePlanet','Transported']]
temp['Transported'] = temp.groupby('HomePlanet')['Transported'].transform('mean')
temp = pd.DataFrame(temp['HomePlanet'].unique(), temp['Transported'].unique()).reset_index()
temp.columns = ['Transported','HomePlanet']

fig.add_trace(go.Bar(x=temp['HomePlanet'], y=temp['Transported'], marker = dict(color=[px.colors.sequential.Blugrn[6], px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>HomePlanet Feature Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.8 | Destination Feature Analysis</b></p>
</div>

In [None]:
temp = pd.DataFrame(df_data[df_data.Destination.isnull() == False]['Destination'].value_counts()).reset_index()
temp.columns = ['Destination', 'Count']

# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('Destination Count','Transported Mean vs Destination'))

fig.add_trace(go.Bar(x=temp['Destination'], y=temp['Count'], marker = dict(color=[px.colors.sequential.Blugrn[6], px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

temp = df_data[(df_data.Destination.isnull() == False) & (df_data.Transported.isnull() == False)][['Destination','Transported']]
temp['Transported'] = temp.groupby('Destination')['Transported'].transform('mean')
temp = pd.DataFrame(temp['Destination'].unique(), temp['Transported'].unique()).reset_index()
temp.columns = ['Transported','Destination']

fig.add_trace(go.Bar(x=temp['Destination'], y=temp['Transported'], marker = dict(color=[px.colors.sequential.Blugrn[6], px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Destination Feature Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.9 | CryoSleep Feature Analysis</b></p>
</div>

In [None]:
temp = pd.DataFrame(df_data[df_data.CryoSleep.isnull() == False]['CryoSleep'].value_counts()).reset_index()
temp.columns = ['CryoSleep', 'Count']

# chart
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=('CryoSleep Count','Transported Mean vs CryoSleep'))

fig.add_trace(go.Bar(x=temp['CryoSleep'], y=temp['Count'], marker = dict(color=[px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 1)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=1)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=1)

temp = df_data[(df_data.CryoSleep.isnull() == False) & (df_data.Transported.isnull() == False)][['CryoSleep','Transported']]
temp['Transported'] = temp.groupby('CryoSleep')['Transported'].transform('mean')
temp = pd.DataFrame(temp['CryoSleep'].unique(), temp['Transported'].unique()).reset_index()
temp.columns = ['Transported','CryoSleep']

fig.add_trace(go.Bar(x=temp['CryoSleep'], y=temp['Transported'], marker = dict(color=[px.colors.sequential.Viridis[5],px.colors.qualitative.Plotly[4]])), row = 1, col = 2)

fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=1, col=2)
fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=1, col=2)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>CryoSleep Feature Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.10 | Outliers</b></p>
</div>

Before starting this section, note that I have a longer outlier detection tutorial here: [Massive PCA + Outlier Detection Tutorial](https://www.kaggle.com/code/javigallego/massive-pca-outlier-detection-tutorial#3-|-PCA-for-Data-Science). Check it out if you want to learn more.

### 2.10.1 | Outliers Definition
An outlier is an observation that is numerically distant from the rest of the data; put simply, a value that falls outside the expected range. Outliers can be of two types: univariate and multivariate. Univariate outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space.

### 2.10.2 | Univariate Outliers Detection

 We'll start by detecting whether there are outliers in our dataset or not. 

#### 2.10.2.1 | Grubbs Test

$$
\begin{array}{l}{\text { Grubbs' test is defined for the hypotheses: }} \\ {\begin{array}{ll}{\mathrm{H}_{\mathrm{0}} :} & {\text { There are no outliers in the data set }} \\ {\mathrm{H}_{\mathrm{1}} :} & {\text { There is exactly one outlier in the data set }}\end{array}}\end{array}
$$
$$
\begin{array}{l}{\text {The Grubbs' test statistic is defined as: }} \\ {\qquad G_{calculated}=\frac{\max \left|X_{i}-\overline{X}\right|}{SD}} \\ {\text { with } \overline{X} \text { and } SD \text { denoting the sample mean and standard deviation, respectively. }} \end{array}
$$
$$
G_{critical}=\frac{(N-1)}{\sqrt{N}} \sqrt{\frac{\left(t_{\alpha /(2 N), N-2}\right)^{2}}{N-2+\left(t_{\alpha /(2 N), N-2}\right)^{2}}}
$$
$$
\begin{array}{l}{\text { If the calculated value is greater than critical, you can reject the null hypothesis and conclude that one of the values is an outlier }}\end{array}$$

In [None]:
import scipy.stats as stats

def grubbs_test(x, feature, alpha=0.05):
    # Drop missing values so they do not distort n, the mean, or the maximum
    x = x.dropna()
    n = len(x)
    mean_x = np.mean(x)
    sd_x = np.std(x, ddof=1)  # sample standard deviation, as in the formula above
    g_calculated = np.max(np.abs(x - mean_x)) / sd_x
    print("Feature:", feature)
    print("Grubbs Calculated Value:", g_calculated)
    # Two-sided critical value at significance level alpha
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) / np.sqrt(n)) * np.sqrt(np.square(t_value) / (n - 2 + np.square(t_value)))
    print("Grubbs Critical Value:", g_critical)
    if g_critical > g_calculated:
        print("The calculated value is less than the critical value: we fail to reject the null hypothesis and conclude that there are no outliers\n")
    else:
        print("The calculated value is greater than the critical value: we reject the null hypothesis and conclude that there is an outlier\n")
    
    print("==============================================================================================================================================")
    
for col in df_data.drop(['Age','Transported'],axis=1).select_dtypes(['float32']).columns:
    grubbs_test(df_data[df_data.Transported.isnull() == False][col], col)
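
As a sanity check of the formulas above, here is a small worked sketch on two made-up samples. `grubbs_statistics` is a hypothetical helper that returns the two numbers instead of printing them; it assumes the sample standard deviation (`ddof=1`) and a two-sided test at `alpha = 0.05`.

```python
import numpy as np
from scipy import stats

def grubbs_statistics(values, alpha=0.05):
    """Return (G_calculated, G_critical) for a two-sided Grubbs' test."""
    x = np.asarray(values, dtype=float)
    n = x.size
    # Largest absolute deviation from the mean, in units of the sample SD
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value built from the t distribution with n - 2 degrees of freedom
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit

with_outlier = [10, 11, 9, 10, 12, 10, 11, 50]   # 50 is far from the rest
without_outlier = [9, 10, 11, 10, 12, 10, 11, 9]

g_out, crit_out = grubbs_statistics(with_outlier)
g_ok, crit_ok = grubbs_statistics(without_outlier)
print(g_out > crit_out)  # outlier sample: G exceeds the critical value
print(g_ok > crit_ok)    # clean sample: G stays below the critical value
```

Note that the test assumes approximately normal data, so on heavily skewed spending columns a rejection should be read as a hint rather than proof.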

#### 2.10.2.3 | Z-score method

Using the Z-score method, we can find out how many standard deviations a value lies away from the mean.

![minipic](https://i.pinimg.com/originals/cd/14/73/cd1473c4c82980c6596ea9f535a7f41c.jpg)

The figure above shows the area under the normal curve and how much of it each number of standard deviations covers.
* 68% of the data points lie within ±1 standard deviation of the mean.
* 95% of the data points lie within ±2 standard deviations of the mean.
* 99.7% of the data points lie within ±3 standard deviations of the mean.

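A robust variant, less sensitive to the outliers themselves, replaces the mean and standard deviation with the median and the median absolute deviation (MAD):

$\begin{array}{l} {R.Z.score=\frac{0.6745 \, ( X_{i} - Median)}{MAD}}  \end{array}$

If the absolute z-score of a data point is greater than 3 (the range that covers 99.7% of the area), the value is quite different from the other values and is treated as an outlier. The function below uses the classic z-score, $(X_i - \overline{X}) / SD$.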

In [None]:
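def Zscore_outlier(df):
    # Local list, so repeated calls do not accumulate results from earlier calls
    out = []
    m = np.mean(df)
    sd = np.std(df)
    # `row` is the position of the value within the series
    for row, i in enumerate(df):
        z = (i - m) / sd
        if np.abs(z) > 3:
            out.append(row)
    return out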
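
To compare the classic z-score with the robust z-score formula given above, here is a minimal sketch on a made-up sample. Both helper names are hypothetical, and the threshold of 3 follows the text.

```python
import numpy as np

def classic_z_outliers(values, threshold=3.0):
    """Positions whose |z| = |x - mean| / SD exceeds the threshold."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.flatnonzero(np.abs(z) > threshold).tolist()

def robust_z_outliers(values, threshold=3.0):
    """Positions whose robust z-score 0.6745 * (x - median) / MAD exceeds the threshold."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # median absolute deviation
    rz = 0.6745 * (x - med) / mad
    return np.flatnonzero(np.abs(rz) > threshold).tolist()

sample = [10, 11, 9, 10, 12, 10, 11, 50]
print(classic_z_outliers(sample))  # the extreme value inflates the SD and hides itself
print(robust_z_outliers(sample))   # the median and MAD are barely affected by it
```

In a sample this small the classic score cannot reach 3 at all, which is why only the robust version flags the value 50. One caveat: the MAD is 0 when more than half of the values are identical, which is the case for spending columns dominated by zeros, so the robust score needs extra care there.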

### 2.10.3 | Multivariate Outliers Detection
#### 2.10.3.1 | DBSCAN

We are going to analyse outliers in two dimensions, where one of the two features is always the target. Let's start with the **<span style='color:lightseagreen'>ShoppingMall</span>** feature.

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def dbscan(feature1):
    # scale data first
    cols=['Transported',feature1]
    X = StandardScaler().fit_transform(df_data[df_data.Transported.isnull() == False][cols].copy().values)

    db = DBSCAN(eps=3.0, min_samples=10).fit(X)
    labels = db.labels_

    plt.figure(figsize=(8,6))

    unique_labels = sorted(set(labels))

    for label in unique_labels:
        # DBSCAN marks noise points (outlier candidates) with the label -1
        color = 'red' if label == -1 else 'blue'
        sample_mask = labels == label
        plt.plot(X[:,0][sample_mask], X[:, 1][sample_mask], 'o', color=color);
    plt.xlabel(cols[0]);
    plt.ylabel(cols[1]);

    print(pd.Series(labels).value_counts())
    
dbscan('ShoppingMall')
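
How DBSCAN flags outliers can be seen on a small deterministic example. This is only a sketch: the grid of points is made up, and `eps=0.5` and `min_samples=5` are chosen for this toy data, not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# A dense 7x7 grid of points plus one isolated point far away.
grid = np.array([[i * 0.1, j * 0.1] for i in range(7) for j in range(7)])
points = np.vstack([grid, [[10.0, 10.0]]])

# Scale first, as in the function above, then cluster.
X = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

# Points that belong to no dense region receive the label -1 (noise).
print(pd_counts := dict(zip(*np.unique(labels, return_counts=True))))
```

The label `-1` is what the `value_counts()` output of the function above reports for outlier candidates. The result depends heavily on `eps` and `min_samples`, so it is worth trying a few values before dropping rows.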

📌 **Interpret:** As we can observe, the five passengers who spent the most on this luxury service are detected as outliers. Thus, in the next cell we drop them from the dataset. We then repeat the outlier detection for the remaining luxury service features.

In [None]:
dropped_indexes = df_data[df_data.Transported.isnull() ==False][['ShoppingMall','Transported']].sort_values(by='ShoppingMall', ascending=False).head().index
aux = df_data[df_data.Transported.isnull() == True].copy()
df_data = df_data[df_data.Transported.isnull() == False].drop(dropped_indexes, axis=0)
df_data = pd.concat([df_data, aux], axis=0)
dbscan('RoomService')

In [None]:
dropped_indexes = df_data[df_data.Transported.isnull() ==False][['RoomService','Transported']].sort_values(by='RoomService', ascending=False).head(1).index
aux = df_data[df_data.Transported.isnull() == True].copy()
df_data = df_data[df_data.Transported.isnull() == False].drop(dropped_indexes, axis=0)
df_data = pd.concat([df_data, aux], axis=0)
dbscan('VRDeck')

In [None]:
dropped_indexes = df_data[df_data.Transported.isnull() ==False][['VRDeck','Transported']].sort_values(by='VRDeck', ascending=False).head(2).index
aux = df_data[df_data.Transported.isnull() == True].copy()
df_data = df_data[df_data.Transported.isnull() == False].drop(dropped_indexes, axis=0)
df_data = pd.concat([df_data, aux], axis=0)
dbscan('Spa')

In [None]:
dropped_indexes = df_data[df_data.Transported.isnull() ==False][['Spa','Transported']].sort_values(by='Spa', ascending=False).head(1).index
aux = df_data[df_data.Transported.isnull() == True].copy()
df_data = df_data[df_data.Transported.isnull() == False].drop(dropped_indexes, axis=0)
df_data = pd.concat([df_data, aux], axis=0)
dbscan('FoodCourt')

In [None]:
dropped_indexes = df_data[df_data.Transported.isnull() ==False][['FoodCourt','Transported']].sort_values(by='FoodCourt', ascending=False).head(4).index
aux = df_data[df_data.Transported.isnull() == True].copy()
df_data = df_data[df_data.Transported.isnull() == False].drop(dropped_indexes, axis=0)
df_data = pd.concat([df_data, aux], axis=0)
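
The drop pattern repeated in the cells above (remove the k largest values of a column) can be sketched more compactly with `nlargest`. The frame below is a made-up example with arbitrary index labels.

```python
import pandas as pd

# Hypothetical toy column with two extreme values.
toy = pd.DataFrame({'ShoppingMall': [0.0, 20.0, 5000.0, 15.0, 9000.0]},
                   index=[10, 11, 12, 13, 14])

# Index labels of the 2 largest values, then drop those rows.
top = toy['ShoppingMall'].nlargest(2).index
trimmed = toy.drop(index=top)
print(trimmed)
```

As in the notebook, such a drop should only touch training rows, since every test row needs a prediction.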

# <b>4 <span style='color:lightseagreen'>|</span> Feature Engineering</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.1 | Local CV Scoring Dataset Function</b></p>
</div>

The first step after EDA is to build a reliable **<span style='color:lightseagreen'>local validation strategy</span>**. By reliable I mean a local CV score that **<span style='color:lightseagreen'>correlates</span>** with the LB (leaderboard) score, because then we can use the local CV score to evaluate experiments or to tune (hyper)parameters. There are **<span style='color:lightseagreen'>two questions</span>** that I usually try to answer.

- **<span style='color:lightseagreen'>How to split</span>** the data into train and validation sets (there are a lot of different strategies)?
- Once a strategy is chosen, does the LB score move in the same direction as the local CV score? If the answer is yes, then the relationship between your local folds is probably the same as the relationship between Kaggle's train and test sets. If not, try another CV strategy. If you cannot find a reliable local CV, it is probably time to stop taking part in the competition, because you might be highly disappointed after the final shake-up.
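
One possible answer to the first question is stratified k-fold cross-validation, which keeps the class balance similar in every fold. The sketch below is an illustration on synthetic data; the model (`LogisticRegression`) and all settings are assumptions, not the choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Shuffled, stratified folds with a fixed seed so the score is reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         cv=cv, scoring='accuracy')

# The mean is the local CV score; the spread hints at how noisy it is.
print(scores.mean(), scores.std())
```

Fixing the folds matters when comparing experiments: if the splits change between runs, small differences in the score may come from the split rather than from the feature being tested.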

In [None]:
def score_dataset(X, y, model=XGBClassifier()):
    # Label encoding works well for tree-based models such as XGBoost and
    # RandomForest, but one-hot encoding would be better for models like
    # Lasso or Ridge.
    for colname in X.select_dtypes(["object","bool","category"]).columns:
        X[colname] = LabelEncoder().fit_transform(X[colname])
    y['Transported'] = LabelEncoder().fit_transform(y['Transported'])
    # The metric for the Spaceship Titanic competition is classification accuracy
    score_xgb = cross_val_score(
        model, X, y['Transported'], cv=5, scoring="accuracy"
    )
    
    score = score_xgb.mean()
    return score

X = df_data[df_data.Transported.isnull() == False].copy()
y = pd.DataFrame(X.pop('Transported'))
baseline_score = score_dataset(X, y)
clear_output()
print(f"Baseline score: {baseline_score:.5f} Accuracy")

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.2 | Creating New Features</b></p>
</div>

### 4.2.1 | Age
Here we split passengers into several age groups. Binning the ages is meant to make training and prediction easier for our classifier.

In [None]:
df_data['Age'] = pd.qcut(df_data['Age'], 10)
df_data.head().style.set_properties(subset=['Age'], **{'background-color': 'lightseagreen'})
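
How `pd.qcut` forms equal-sized groups can be checked on a toy series. This is a sketch with made-up ages; `labels=False` is an optional variation that returns integer bin codes instead of intervals.

```python
import pandas as pd

# Hypothetical ages 0..79, one passenger per age.
ages = pd.Series(range(80))

# Interval bins (as in the cell above) and integer codes for the same bins.
age_bins = pd.qcut(ages, 4)
age_codes = pd.qcut(ages, 4, labels=False)

# Each quantile bin holds the same number of passengers.
print(age_bins.value_counts().sort_index())
print(age_codes.value_counts().sort_index())
```

Unlike fixed-width bins from `pd.cut`, quantile bins have uneven widths but balanced counts, which keeps the per-group rates comparably stable.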

In [None]:
agebox = df_data[df_data.Transported.isnull() == False].copy()
agebox['Transported'] = agebox['Transported'].astype(int)
agebox = agebox.groupby('Age').agg({'Transported':'mean'}).reset_index()
agebox['Age'] = agebox['Age'].astype(str)

fig = px.bar(agebox, x="Age", y="Transported", color="Transported", color_continuous_scale = 'Blugrn')

fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100, t=80),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Age Groups Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

### 4.2.2 | Family Features

Next we focus on **<span style='color:lightseagreen'>PassengerId</span>** and **<span style='color:lightseagreen'>Name</span>**. First, we take a brief look at both features to study how they relate. As we can see below, the PassengerId feature is composed of **<span style='color:lightseagreen'>two parts</span>**:
- The first one is the **<span style='color:lightseagreen'>FamilyId</span>**
- The second one identifies each member of the family.

In other words, Altark Susent and Solam Susent are from the same family. Therefore, they share the same FamilyId in the PassengerId feature, namely 0003. To **<span style='color:lightseagreen'>distinguish</span>** them within the family group, the second parts of their PassengerId are 01 and 02 respectively. Thus, we are going to create some new features:
- One for **<span style='color:lightseagreen'>FamilyId</span>**
- One for **<span style='color:lightseagreen'>Family Name</span>**
- One for **<span style='color:lightseagreen'>Family Size</span>**

In [None]:
df_data[['Name','PassengerId']].head().style.set_properties(subset=['PassengerId'], **{'background-color': 'lightseagreen'})

In [None]:
df_data['FamilyId'] = df_data['PassengerId'].str.split("_", n=2, expand=True)[0]
df_data['Family Name'] = df_data['Name'].str.split(' ', n=2, expand=True)[1]
df_data = df_data.reset_index(drop=True)
# Family Size: number of passengers sharing the same (FamilyId, Family Name) pair.
# A vectorised groupby replaces the row-by-row loop; dropna=False keeps rows
# whose family name is missing instead of leaving their size undefined.
df_data['Family Size'] = df_data.groupby(['FamilyId','Family Name'], dropna=False)['PassengerId'].transform('size')
df_data[['FamilyId','PassengerId','Family Name','Name','Family Size']].head().style.set_properties(subset=['FamilyId', 'Family Name','Family Size'], **{'background-color': 'lightseagreen'})

### 4.2.3 | Luxury Features

Hereafter, we are going to focus our attention on **<span style='color:lightseagreen'>luxury features</span>**. Those features are: 

- Spa
- VRDeck
- Food Court
- Room Service
- Shopping Mall

Each of them records the amount of money a passenger has spent on that service. We can create a feature that gives the total amount a passenger has spent on all these luxuries. Let's call it **<span style='color:lightseagreen'>Luxury Spending</span>**. As it is a continuous numerical feature, we split it into quantile-based groups to make its relation to the target easier to read (the code below requests 6 groups with pd.qcut and drops duplicated bin edges).
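
To illustrate the binning step, here is a small sketch of `pd.qcut` on made-up spending values (the numbers and the choice of 4 requested groups are assumptions for the example). Because many passengers spend nothing, several quantile edges coincide, and `duplicates="drop"` merges them, so fewer groups than requested can come back.

```python
import pandas as pd

# Made-up spending values with many zeros, as is typical for this kind of feature.
spend = pd.Series([0, 0, 0, 0, 10, 50, 200, 800, 1500, 4000], name="Luxury Spending")

# Request 4 equal-frequency bins; identical edges (caused by the zeros) are dropped.
bins = pd.qcut(spend, 4, duplicates="drop")

# Integer code of the bin each value falls into, ordered from low to high spending.
codes = bins.cat.codes

print(bins.cat.categories)
print(codes.tolist())
```

The number of resulting groups depends on the data, so it is worth checking `bins.cat.categories` before interpreting the bar chart.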

In [None]:
df_data['Luxury Spending'] = df_data['VRDeck'] + df_data['ShoppingMall'] + df_data['Spa'] + df_data['FoodCourt'] + df_data['RoomService']
luxury = df_data[df_data.Transported.isnull() == False].copy()
luxury['Luxury Spending'] = pd.qcut(luxury['Luxury Spending'], 6, duplicates = 'drop')
luxury = luxury.groupby('Luxury Spending').agg({'Transported':'mean'}).reset_index()
luxury['Luxury Spending'] = LabelEncoder().fit_transform(luxury['Luxury Spending'])

fig = px.bar(luxury, x="Luxury Spending", y="Transported", color='Transported', color_continuous_scale='Blugrn')

fig.update_xaxes(dtick=1, showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

# General Styling
fig.update_layout(height=400, bargap=0.2,
                  margin=dict(b=50,r=30,l=100, t=80),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Transported Probability per Luxury Spending Group</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)

Let's check how our local CV score has changed. As we can observe, there has been a slight improvement. Let's keep working on feature engineering, though!

In [None]:
X = df_data[df_data.Transported.isnull() == False].copy()
y = pd.DataFrame(X.pop('Transported'))
baseline_score = score_dataset(X, y)
clear_output()
print(f"Baseline score: {baseline_score:.5f} Accuracy")

In [None]:
df_data.head()

# <b>5 <span style='color:lightseagreen'>|</span> Feature Selection</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.1 | Scaling Features</b></p>
</div>

Before moving on to the feature selection methods themselves, I'm going to apply a log transform (log1p) to those features whose skewness was previously found to be greater than 1.
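
As a rough illustration of why a log transform helps with skewed spending features, the sketch below compares the skewness of a made-up series before and after `np.log1p`. The values are invented for the example; `log1p` is used because it maps a spend of 0 to 0 instead of failing on `log(0)`.

```python
import numpy as np
import pandas as pd

# Made-up, heavily right-skewed spending values.
spend = pd.Series([0, 0, 0, 5, 20, 60, 300, 9000], dtype=float)

# log1p(x) = log(1 + x): defined at zero and compresses the long right tail.
logged = np.log1p(spend)

print(f"skew before: {spend.skew():.2f}")
print(f"skew after:  {logged.skew():.2f}")

# The transform is invertible with expm1, so no information is lost.
restored = np.expm1(logged)
```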

In [None]:
from sklearn.preprocessing import StandardScaler as ss
skew_col = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Luxury Spending']
for col in skew_col:
    df_data[col] = np.log1p(df_data[col])

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.2 | Mutual Information</b></p>
</div>

Mutual information describes **<span style='color:lightseagreen'>relationships in terms of uncertainty</span>**. The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more **<span style='color:lightseagreen'>confident</span>** would you be about the target? Scikit-learn has two mutual information metrics in its feature_selection module: one for real-valued targets (mutual_info_regression) and one for categorical targets (mutual_info_classif). The next cell computes the MI scores for our features and wraps them up in a nice dataframe.
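
Before applying it to the real data, here is a small sketch of `mutual_info_classif` on synthetic data. The feature names, the 10% label noise and the sample size are assumptions chosen for the example: one feature mostly determines the target and the other is unrelated to it.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

informative = rng.integers(0, 2, size=n)   # binary feature tied to the target
noise = rng.integers(0, 5, size=n)         # feature unrelated to the target

# The target copies the informative feature, with roughly 10% of labels flipped.
flip = rng.random(n) < 0.1
target = np.where(flip, 1 - informative, informative)

X_toy = pd.DataFrame({"informative": informative, "noise": noise})

# Both toy features are integer-coded categories, hence discrete_features=True.
mi = mutual_info_classif(X_toy, target, discrete_features=True, random_state=0)
scores = pd.Series(mi, index=X_toy.columns, name="MI Scores").sort_values(ascending=False)

print(scores)
```

The informative feature should receive a clearly higher score than the noise feature, whose score stays close to zero.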

In [None]:
from sklearn.feature_selection import mutual_info_classif
def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object","bool"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    #discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_classif(X, y, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

def uninformative_cols(df, mi_scores):
    return df.loc[:, mi_scores == 0.0].columns
    
x = df_data[df_data.Transported.isnull() == False].copy()
x['Transported'].replace([False, True], [0,1], inplace = True)
y = x.pop('Transported')

boolean_col = x.select_dtypes(['bool']).columns
for i in range(len(boolean_col)):
    x[boolean_col[i]].replace([False, True], [0,1], inplace = True)

for colname in x.drop('PassengerId',axis=1).select_dtypes(["object","category"]).columns:
    x[colname] = LabelEncoder().fit_transform(x[colname])

mi_scores = make_mi_scores(x, y)
mi_scores = pd.DataFrame(mi_scores).reset_index().rename(columns={'index':'Feature'})

fig = px.bar(mi_scores, x='MI Scores', y='Feature', color="MI Scores", color_continuous_scale='darkmint')

fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

fig.update_layout(height = 500, title_text="Mutual Information Scores", plot_bgcolor='rgb(242, 242, 242)', paper_bgcolor = 'rgb(242, 242, 242)',
                  title_font=dict(size=29, family="Lato, sans-serif"), xaxis={'categoryorder':'category ascending'}, margin=dict(t=80))

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.3 | Heatmap</b></p>
</div>

In [None]:
x['Transported'] = df_data[df_data.Transported.isnull() == False]['Transported'].copy()
x['Transported'].replace([False, True], [0,1], inplace = True)
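# PassengerId is a string identifier, so it is excluded from the correlation matrix
corr = x.drop(columns='PassengerId').corr()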

fig = px.imshow(corr, color_continuous_scale='RdBu_r', origin='lower', text_auto=True, aspect='auto', color_continuous_midpoint=0.0)
fig.update_layout(height = 500, title_text="Correlation Heatmap", plot_bgcolor='rgb(242, 242, 242)', paper_bgcolor = 'rgb(242, 242, 242)',
                  title_font=dict(size=29, family="Lato, sans-serif"), margin=dict(t=90))

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.4 | Feature Usefulness</b></p>
</div>

In this section we'll analyse the usefulness of the features that we've just created. We'll plot how the target depends on every feature, that is, a diagram of $P(y=1|x)$. To get a meaningful plot, we apply two transformations:

* The x axis is not the value of the feature, but its index (when sorted by feature value).
* The y axis is not the target value (which can be only 0 or 1), but a rolling mean over 1000 targets.

Features whose diagram is a horizontal line (the probability of the positive target is 0.5 regardless of the feature value) are considered bad ones, not useful. On the other hand, good features have a curve with a high $y_{max}-y_{min}$.
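
The diagram described above can be sketched on synthetic data as follows. The two features, the logistic relation to the target and the helper name `rolling_target_range` are assumptions made for this illustration; the helper sorts by a feature, takes a rolling mean of the target and returns $y_{max}-y_{min}$ of the resulting curve.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

useful = rng.normal(size=n)
useless = rng.normal(size=n)

# Assumed toy target: P(y=1) rises with `useful` and ignores `useless`.
p = 1 / (1 + np.exp(-2 * useful))
target = (rng.random(n) < p).astype(int)

toy = pd.DataFrame({"useful": useful, "useless": useless, "target": target})

def rolling_target_range(frame, col, window=1000):
    """Sort by `col`, smooth the target with a rolling mean, return max - min."""
    ordered = frame.sort_values(col).reset_index(drop=True)
    curve = ordered["target"].rolling(window).mean().dropna()
    return curve.max() - curve.min()

useful_range = rolling_target_range(toy, "useful")
useless_range = rolling_target_range(toy, "useless")

print(f"useful:  {useful_range:.3f}")
print(f"useless: {useless_range:.3f}")
```

A large range corresponds to a steep curve in the plot, while a range near zero corresponds to the flat line of an uninformative feature.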

In [None]:
fig = make_subplots(rows=6, cols=3, column_widths=[0.33, 0.34, 0.33], vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=x.drop(['Transported', 'PassengerId'],axis=1).columns)
    
#plt.subplots(6, 4, sharey=True, sharex=True, figsize=(20, 25))

row_counter = 1
col_counter = 0
for col in x.drop(['Transported', 'PassengerId'],axis=1).columns:
    temp = pd.DataFrame({col: x[col].values,
                        'Transported': y.values})
    temp = temp.sort_values(col)
    temp.reset_index(inplace=True)
    
    aux = pd.DataFrame(temp.Transported.rolling(1000).mean())
    aux.columns = ['Mean']
    aux = aux[aux.Mean.isnull() == False]
    
    col_counter = col_counter +1
    fig.add_trace(go.Scatter(x=aux.index, y=aux['Mean'], marker = dict(color = px.colors.sequential.Blugrn[6])), row = row_counter, col = col_counter)  
    
    fig.update_xaxes(showgrid = False, linecolor='gray', linewidth = 2, zeroline = False, row=row_counter, col=col_counter)
    fig.update_yaxes(showgrid = False, linecolor='gray',linewidth=2, zeroline = False, row=row_counter, col=col_counter)
    
    #plt.scatter(aux.index, aux.Mean, s=2)            
    #plt.xlabel(col)
    #plt.xticks([])
    if col_counter % 3 == 0:
        col_counter = 0
        row_counter += 1

# General Styling
fig.update_layout(height=900, bargap=0.2,
                  margin=dict(b=50,r=30,l=100, t=90),
                  title = "<span style='font-size:36px; font-family:Times New Roman'>Feature Usefulness Analysis</span>",                  
                  plot_bgcolor='rgb(242,242,242)',
                  paper_bgcolor = 'rgb(242,242,242)',
                  font=dict(family="Times New Roman", size= 14),
                  hoverlabel=dict(font_color="floralwhite"),
                  showlegend=False)        
fig.show()

📌 **Interpret:** after analysing the previous chart, we can conclude the following:

- The luxury service features and CryoSleep are the **<span style='color:lightseagreen'>best ones</span>**.
- Side, Name, Age and some of the family features seem to be **<span style='color:lightseagreen'>useless</span>**.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.5 | Permutation Importance</b></p>
</div>

Before the explanation, note that permutation importance is computed at the end of the Modeling section, in order to reduce training time. Now for the explanation: one of the most basic questions we might ask of a model is which features have the biggest impact on predictions. This concept is called feature importance. There are multiple ways to measure feature importance. Some approaches answer subtly different versions of the question above. Other approaches have documented shortcomings. In this section, we'll focus on permutation importance. Compared to most other approaches, permutation importance is:

* Fast to calculate,
* Widely used and understood, and
* Consistent with properties we would want a feature importance measure to have.

In [None]:
# Permutation Importance
import eli5
from eli5.sklearn import PermutationImportance

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>5.6 | Selecting and Encoding</b></p>
</div>

In [None]:
drop_columns = ['FamilyId','Family Name','Family Size','Name','Side','VIP','Age']
df_data = df_data.drop(drop_columns, axis=1)
for colname in df_data.select_dtypes(["object","bool","category"]).drop('Transported',axis=1):
        df_data[colname], _ = df_data[colname].factorize()
df_data['Transported'].replace([False, True], [0,1], inplace = True)

# <b>6 <span style='color:lightseagreen'>|</span> Modeling</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>6.1 | Catboost</b></p>
</div>

### 6.1.1 | Hyperparameter Tuning - Optuna

The code for hyperparameter tuning is included below. Since I have already run it once, the tuned values can be reused directly to avoid wasting compute time: the allow_optimize flag controls whether tuning runs (non-zero) or the model is simply built from the stored parameter values (0). Finally, note that tuning takes a long time, which is why I enabled the GPU.

In [None]:
X = df_data[df_data.Transported.isnull() == False].drop('PassengerId',axis=1)
y = X.pop('Transported')

def objective(trial):
    params = {
        "random_state":trial.suggest_categorical("random_state", [2022]),
        'learning_rate' : trial.suggest_loguniform('learning_rate', 0.0001, 0.3),
        'bagging_temperature' :trial.suggest_loguniform('bagging_temperature', 0.01, 100.00),
        "n_estimators": trial.suggest_int('n_estimators', 100, 500),
        "max_depth":trial.suggest_int("max_depth", 4, 16),
        'random_strength' :trial.suggest_int('random_strength', 0, 100),
        "l2_leaf_reg":trial.suggest_float("l2_leaf_reg",1e-8,3e-5),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "max_bin": trial.suggest_int("max_bin", 200, 500),
        'od_type': trial.suggest_categorical('od_type', ['IncToDec', 'Iter']),
        'task_type': trial.suggest_categorical('task_type', ['GPU']),
        'eval_metric': trial.suggest_categorical('eval_metric', ['Accuracy'])
    }

    model = CatBoostClassifier(**params)
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(
        X_train_tmp, y_train_tmp,
        eval_set=[(X_valid_tmp, y_valid_tmp)],
        early_stopping_rounds=35, verbose=0
    )
        
    y_train_pred = model.predict(X_train_tmp)
    y_valid_pred = model.predict(X_valid_tmp)
    train_ac_score = accuracy_score(y_train_tmp, y_train_pred)
    valid_ac_score = accuracy_score(y_valid_tmp, y_valid_pred)
    
    print(f'Accuracy Score of Train: {train_ac_score}')
    print(f'Accuracy Score of Validation: {valid_ac_score}')
    
    return valid_ac_score

allow_optimize = 1
TRIALS = 40
TIMEOUT = 3600

if allow_optimize:
    sampler = TPESampler(seed=42)

    study = optuna.create_study(
        study_name = 'cat_parameter_opt',
        direction = 'maximize',
        sampler = sampler,
    )
    study.optimize(objective, n_trials=TRIALS, timeout=TIMEOUT)
    print("Best Score:",study.best_value)
    print("Best trial",study.best_trial.params)
    
    best_params = study.best_params
    
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model_cat = CatBoostClassifier(**best_params, verbose=1000).fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)
else:
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model_cat = CatBoostClassifier(
        verbose=1000,
        early_stopping_rounds=10,
        #iterations=5000,
        random_state = 2022, learning_rate = 0.08665686887824392, bagging_temperature = 2.010272294890727, n_estimators = 806, max_depth = 7, 
        random_strength = 35, l2_leaf_reg = 1.2373460332766636e-05, min_child_samples = 69, max_bin = 317, od_type = 'IncToDec', 
        task_type = 'GPU', eval_metric = 'Accuracy'
    ).fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)
    
clear_output()

In [None]:
plot_feature_importance(model_cat.get_feature_importance(),X.columns,'CatBoost')

In [None]:
X_test = df_data[df_data.Transported.isnull() == True].drop(['PassengerId','Transported'],axis=1)
perm = PermutationImportance(model_cat, random_state=1).fit(X, y)
pred = model_cat.predict(X_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

📌 **Interpret:** the values towards the top are the most important features, and those towards the bottom matter least. The first number in each row shows how much model performance decreased after a random shuffling (in this case, using "accuracy" as the performance metric). Like most things in data science, there is some randomness in the exact performance change from shuffling a column. We measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The number after the ± measures how performance varied from one reshuffling to the next. You'll occasionally see negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than those on the real data. This happens when the feature didn't matter (it should have had an importance close to 0), but random chance caused the predictions on shuffled data to be more accurate. This is more common with small datasets, like the one in this example, because there is more room for luck or chance.
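
The shuffle-and-rescore procedure described above can be sketched with scikit-learn's `permutation_importance`, used here as a stand-in for the eli5 wrapper in the notebook. The synthetic data, the logistic regression model and `n_repeats=10` are assumptions for the example; the mean and standard deviation over the repeats play the role of the two numbers on each row of the eli5 table.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

signal = rng.normal(size=n)   # feature that determines the target
noise = rng.normal(size=n)    # feature unrelated to the target
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Each column is shuffled n_repeats times; the drop in accuracy is recorded each time.
result = permutation_importance(
    model, X_toy, y_toy, scoring="accuracy", n_repeats=10, random_state=0
)

for name, mean, std in zip(["signal", "noise"], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

The importance of the noise column fluctuates around zero and can come out slightly negative, which is the situation discussed in the interpretation above.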

In [None]:
data_dir = Path("../input/spaceship-titanic")
df_test = pd.read_csv(data_dir / "test.csv")
submit = pd.DataFrame({'PassengerId': df_test['PassengerId'], 'Transported':pred}).set_index('PassengerId')
submit['Transported'].replace([0,1], [False, True], inplace=True)
submit.to_csv('./submission.csv')

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>6.2 | LGBM</b></p>
</div>

In [None]:
def objective(trial):
    params = {
        "random_state":trial.suggest_categorical("random_state", [2022]),           
        'learning_rate' : trial.suggest_loguniform('learning_rate', 0.01, 1),   
        "n_estimators": trial.suggest_int('n_estimators',50,2000),                 
        "max_depth" : trial.suggest_int("max_depth", 1, 20),
        "min_child_weight" : trial.suggest_float("min_child_weight", 0.1, 10),
        "reg_lambda" : trial.suggest_float("reg_lambda", 0.01, 10),
        "reg_alpha" : trial.suggest_float('reg_alpha',0.01,10),
        "num_leaves" : trial.suggest_int("num_leaves", 50, 100),
        'subsample' : trial.suggest_float('subsample', 0.01, 1)
    }

    model = LGBMClassifier(**params, device='GPU')
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(
        X_train_tmp, y_train_tmp,
        eval_set=[(X_valid_tmp, y_valid_tmp)],
        early_stopping_rounds=35, verbose=0
    )
        
    y_train_pred = model.predict(X_train_tmp)
    y_valid_pred = model.predict(X_valid_tmp)
    train_ac_score = accuracy_score(y_train_tmp, y_train_pred)
    valid_ac_score = accuracy_score(y_valid_tmp, y_valid_pred)
    
    print(f'Accuracy Score of Train: {train_ac_score}')
    print(f'Accuracy Score of Validation: {valid_ac_score}')
    
    return valid_ac_score

allow_optimize = 1
TRIALS = 40
TIMEOUT = 3600

if allow_optimize:
    sampler = TPESampler(seed=42)

    study = optuna.create_study(
        study_name = 'lgbm_parameter_opt',
        direction = 'maximize',
        sampler = sampler,
    )
    study.optimize(objective, n_trials=TRIALS, timeout=TIMEOUT)
    print("Best Score:",study.best_value)
    print("Best trial",study.best_trial.params)
    
    best_params = study.best_params
    
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model_lgbm = LGBMClassifier(**best_params, verbose=1000).fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)
else:
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model_lgbm = LGBMClassifier(        
        random_state = 2022, learning_rate = 0.010735756929247671, n_estimators = 1778, max_depth = 20, 
        min_child_weight = 8.703291969551636, reg_lambda = 6.849268797524583, reg_alpha = 4.290515300840544, 
        num_leaves = 90, subsample = 0.25248008329519706).fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)
    
clear_output()

In [None]:
plot_feature_importance(model_lgbm.feature_importances_,X.columns,'LGBM')

In [None]:
X_test = df_data[df_data.Transported.isnull() == True].drop(['PassengerId','Transported'],axis=1)
perm = PermutationImportance(model_lgbm, random_state=1).fit(X, y)
pred = model_lgbm.predict(X_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>6.3 | XGBoost</b></p>
</div>

In [None]:
def objective(trial):
    params = {
        # Only parameters that XGBoost recognises are tuned here
        "random_state":trial.suggest_categorical("random_state", [2022]),           # categorical for concrete values
        'learning_rate' : trial.suggest_loguniform('learning_rate', 0.01, 1),   # loguniform for continuous values
        "n_estimators": trial.suggest_int('n_estimators',50,2000),                 # int for discrete values, sampled from [50, 2000]
        "max_depth" : trial.suggest_int("max_depth", 1, 20),
        "alpha" : trial.suggest_loguniform('alpha',0.9,1)                          # L1 regularisation (alias of reg_alpha)
    }

    model = XGBClassifier(**params, tree_method='gpu_hist', predictor='gpu_predictor')
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(
        X_train_tmp, y_train_tmp,
        eval_set=[(X_valid_tmp, y_valid_tmp)],
        early_stopping_rounds=35, verbose=0
    )
        
    y_train_pred = model.predict(X_train_tmp)
    y_valid_pred = model.predict(X_valid_tmp)
    train_ac_score = accuracy_score(y_train_tmp, y_train_pred)
    valid_ac_score = accuracy_score(y_valid_tmp, y_valid_pred)
    
    print(f'Accuracy Score of Train: {train_ac_score}')
    print(f'Accuracy Score of Validation: {valid_ac_score}')
    
    return valid_ac_score

allow_optimize = 1
TRIALS = 40
TIMEOUT = 3600

if allow_optimize:
    sampler = TPESampler(seed=42)

    study = optuna.create_study(
        study_name = 'xgb_parameter_opt',
        direction = 'maximize',
        sampler = sampler,
    )
    study.optimize(objective, n_trials=TRIALS, timeout=TIMEOUT)
    print("Best Score:",study.best_value)
    print("Best trial",study.best_trial.params)
    
    best_params = study.best_params
    
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model_xgb = XGBClassifier(**best_params, tree_method='gpu_hist', predictor='gpu_predictor').fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)
else:
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    model_xgb = XGBClassifier(        
        random_state = 2022, learning_rate = 0.010735756929247671, n_estimators = 1778, max_depth = 20, 
        min_child_weight = 8.703291969551636, reg_lambda = 6.849268797524583, reg_alpha = 4.290515300840544, 
        subsample = 0.25248008329519706, tree_method='gpu_hist', predictor='gpu_predictor').fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)
    
clear_output()

In [None]:
plot_feature_importance(model_xgb.feature_importances_,X.columns,'XGB')

In [None]:
X_test = df_data[df_data.Transported.isnull() == True].drop(['PassengerId','Transported'],axis=1)
perm = PermutationImportance(model_xgb, random_state=1).fit(X, y)
pred = model_xgb.predict(X_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>6.4 | Voting Classifier</b></p>
</div>

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

### 6.4.1 | Majority (Hard) Voting

In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier. For example, if the prediction for a given sample is:

* classifier 1 -> class 1
* classifier 2 -> class 1
* classifier 3 -> class 2

the VotingClassifier (with voting='hard') would classify the sample as “class 1” based on the majority class label. In the cases of a tie, the VotingClassifier will select the class based on the ascending sort order. For example, in the following scenario:

* classifier 1 -> class 2
* classifier 2 -> class 1

the class label 1 will be assigned to the sample.
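
The two scenarios above can be reproduced with a few lines of NumPy. This is a minimal sketch of the majority rule with the tie-break described in the text, not the VotingClassifier itself; the helper name `hard_vote` is made up for the example.

```python
import numpy as np

def hard_vote(labels):
    """Majority vote over non-negative integer class labels.

    np.argmax returns the first maximum, so ties go to the smallest label,
    which matches the ascending sort order rule described above.
    """
    counts = np.bincount(np.asarray(labels))
    return int(np.argmax(counts))

majority = hard_vote([1, 1, 2])   # two votes for class 1, one for class 2
tie = hard_vote([2, 1])           # one vote each

print(majority, tie)
```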

### 6.4.2 | Soft Voting

In contrast to majority voting (hard voting), soft voting returns the class label as argmax of the sum of predicted probabilities. Specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged. The final class label is then derived from the class label with the highest average probability. Suppose three models receive the same input and predict probabilities (0.30, 0.47, 0.53) for class A and (0.20, 0.32, 0.40) for class B. The average is 0.4333 for class A and 0.3067 for class B, so class A wins because it has the highest probability averaged over the classifiers.
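
The arithmetic of the example above can be checked with NumPy. The unweighted part uses the numbers from the text; the weight vector in the second part is a hypothetical choice added only to show how weights change the averages.

```python
import numpy as np

# Rows are the three classifiers, columns are classes A and B (numbers from the text).
proba = np.array([
    [0.30, 0.20],
    [0.47, 0.32],
    [0.53, 0.40],
])
classes = ["A", "B"]

# Unweighted soft vote: average the probabilities per class, then take the argmax.
avg = proba.mean(axis=0)
winner = classes[int(np.argmax(avg))]
print(np.round(avg, 4), winner)

# Weighted soft vote with hypothetical weights that trust the third classifier twice as much.
weights = np.array([1.0, 1.0, 2.0])
weighted_avg = np.average(proba, axis=0, weights=weights)
weighted_winner = classes[int(np.argmax(weighted_avg))]
print(np.round(weighted_avg, 4), weighted_winner)
```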

In [None]:
from sklearn.ensemble import VotingClassifier
votingC = VotingClassifier(estimators=[('cat', model_cat), ('lgbm', model_lgbm),
('xgb', model_xgb)], voting='hard', n_jobs=-1)

X_test = df_data[df_data.Transported.isnull() == True].drop(['PassengerId','Transported'],axis=1)
votingC.fit(X,y)
perm = PermutationImportance(votingC, random_state=1).fit(X, y)
pred = votingC.predict(X_test)
clear_output()
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

In [None]:
data_dir = Path("../input/spaceship-titanic")
df_test = pd.read_csv(data_dir / "test.csv")
submit = pd.DataFrame({'PassengerId': df_test['PassengerId'], 'Transported':pred}).set_index('PassengerId')
submit['Transported'].replace([0,1], [False, True], inplace=True)
submit.to_csv('./submission.csv')

In progress ... 

* Working on EDA
* Feature Engineering
* Model Comparison
* Ensembling