# **Spaceship Titanic**

# **Project Description**

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension! To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system. Help save them and change history!

# **Goals**

Achieve the highest classification accuracy, i.e. the percentage of passengers whose `Transported` label is predicted correctly.

# **1.Imports**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import optuna

import torch
import torch.nn as nn
import torch.optim as optim

#from ydata_profiling import ProfileReport

import xgboost as xgb
import lightgbm as lgbm
import catboost as cb

from tqdm import tqdm

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, RocCurveDisplay
from sklearn.feature_selection import chi2, SelectKBest
# from sklearn.experimental import enable_iterative_imputer 'try in future'
# from sklearn.impute import IterativeImputer

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
optuna.logging.set_verbosity(optuna.logging.WARNING)

Because the data fits comfortably in memory, we read the entire CSV files into DataFrames.

In [None]:
TRAIN_DATA_PATH = 'data/train.csv'
TEST_DATA_PATH = 'data/test.csv'

In [None]:
train = pd.read_csv(TRAIN_DATA_PATH, index_col=False)
test = pd.read_csv(TEST_DATA_PATH, index_col=False)

# **2.Data processing**

**Data Field Descriptions**

```PassengerId``` - A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.

```HomePlanet``` - The planet the passenger departed from, typically their planet of permanent residence.

```CryoSleep``` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

```Cabin``` - The cabin number where the passenger is staying. Takes the form `deck/num/side`, where `side` can be either `P` for Port or `S` for Starboard.

```Destination``` - The planet the passenger will be debarking to.

```Age``` - The age of the passenger.

```VIP``` - Whether the passenger has paid for special VIP service during the voyage.

```RoomService```, ```FoodCourt```, ```ShoppingMall```, ```Spa```, ```VRDeck``` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

```Name``` - The first and last names of the passenger.

```Transported``` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

**A few steps before we start the data analysis**

As mentioned above, each ```PassengerId``` takes the form `gggg_pp`, ```Cabin``` takes the form `deck/num/side`, and ```Name``` contains the first and last name. My first step is to create new columns from those features.

In [None]:
temp = pd.concat([train, test]).copy()

In [None]:
def split_passenger_id(df):
    df[['Group', 'pp']] = df.PassengerId.str.split('_', expand=True)
    return df

In [None]:
def split_cabin(df):
    df[['Deck', 'Num', 'Side']] = df.Cabin.str.split('/', expand=True)
    df.drop(columns='Cabin', inplace=True)
    return df

In [None]:
def split_name(df):
    df[['FirstName', 'LastName']] = df.Name.str.split(' ', expand=True)
    df.drop(columns='Name', inplace=True)
    return df

In [None]:
temp = (temp
        .pipe(split_passenger_id)
        .pipe(split_cabin)
        .pipe(split_name)
        )

## **2.1 Background information**

In [None]:
def get_number_of_null_cells_in_row(df):
    return df.isnull().sum(axis=1).value_counts()

In [None]:
temp

In [None]:
print(f"The number of rows in the combined train and test data is {temp.shape[0]}, and the number of columns is {temp.shape[1]}")

In [None]:
temp.info()

In [None]:
temp.describe()

In [None]:
temp.isna().sum().plot(kind='bar')

In [None]:
get_number_of_null_cells_in_row(temp)

Approximately 25% of rows include null values that must be filled in.
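The share of incomplete rows can be computed directly rather than read off the value counts. The following is a minimal sketch on a made-up `toy` frame (its values are illustrative only). Applied to `temp`, the `Transported` column would probably need to be dropped first, since it is empty for every test row after the concat.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `temp`; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 31.0, 52.0],
    "VIP": [False, False, None, True],
    "Spa": [0.0, 10.0, 5.0, 3.0],
})

# A row counts as incomplete when at least one of its cells is null.
rows_with_null = toy.isnull().any(axis=1)
share_with_null = rows_with_null.mean()
print(f"{share_with_null:.0%} of rows contain at least one null value")
```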

In [None]:
temp['Transported'].value_counts().plot(kind='pie', autopct='%1.1f%%')

As we can see, the train dataset is almost perfectly balanced.

### **2.1.1 Age --> Other**

In [None]:
temp

In [None]:
fig, ax = plt.subplots(15,figsize=(18,36))
sns.countplot(x='Age',hue='Transported',data=temp, ax=ax[14])
sns.countplot(x='Age',hue='HomePlanet',data=temp, ax=ax[0])
sns.countplot(x='Age',hue='CryoSleep',data=temp, ax=ax[1])
sns.countplot(x='Age',hue='Destination',data=temp, ax=ax[2])
sns.countplot(x='Age',hue='VIP',data=temp, ax=ax[3])
sns.countplot(x='Age',hue='Deck',data=temp, ax=ax[4])
sns.countplot(x='Age',hue='Side',data=temp, ax=ax[5])

temp.groupby('Age')['RoomService'].sum().plot(kind='bar', ax=ax[6], legend=True)
temp.groupby('Age')['FoodCourt'].sum().plot(kind='bar', ax=ax[7], legend=True)
temp.groupby('Age')['ShoppingMall'].sum().plot(kind='bar', ax=ax[8], legend=True)
temp.groupby('Age')['Spa'].sum().plot(kind='bar', ax=ax[9], legend=True)
temp.groupby('Age')['VRDeck'].sum().plot(kind='bar', ax=ax[10], legend=True)
temp.groupby('Age')['Group'].count().plot(kind='bar', ax=ax[11], legend=True)
temp.groupby('Age')['pp'].count().plot(kind='bar', ax=ax[12], legend=True)
temp.groupby('Age')['Num'].count().plot(kind='bar', ax=ax[13], legend=True)
fig.tight_layout()
plt.show()

Age is a continuous feature that needs to be discretized. I group ages based on the visualization above.<br>
Groups:<br>
0 - age 0: many unborn/newborn passengers were transported, which may matter for training<br>
1 - ages 1-4: high transportation ratio<br>
2 - ages 5-12: few passengers in this age group compared to the others<br>
3 - ages 13-17: the first group in which expenses appear<br>
4 - ages 18-24: only 4 VIPs exist up to the age of 24<br>
5 - ages 25-65: others<br>
6 - ages 66 and above: the number of passengers in this range drops sharply

In [None]:
def group_age(df):
    df['Age'] = pd.cut(df['Age'], bins=[0,1,5,13,18,25,66,110], labels=[0,1,2,3,4,5,6], right=False)
    return df
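To check that the bin edges match the groups listed above, the same `pd.cut` call can be run on a few boundary ages. This is a small sketch; the `ages` series is made up for the check. With `right=False` each bin includes its left edge and excludes its right edge, so age 5 falls into group 2 and age 66 into group 6.

```python
import pandas as pd

# Boundary ages chosen to probe each bin edge (illustrative values).
ages = pd.Series([0, 1, 4, 5, 12, 13, 17, 18, 24, 25, 65, 66, 79])

groups = pd.cut(ages, bins=[0, 1, 5, 13, 18, 25, 66, 110],
                labels=[0, 1, 2, 3, 4, 5, 6], right=False)

print(pd.DataFrame({"Age": ages, "Group": groups}))
```

Ages of 110 or more, as well as missing ages, end up as `nan` under these bins.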

### **2.1.2 HomePlanet --> other**

In [None]:
fig, ax = plt.subplots(15,figsize=(20,40))
sns.countplot(x='HomePlanet',hue='Transported',data=temp, ax=ax[14])
sns.countplot(x='HomePlanet',hue='HomePlanet',data=temp, ax=ax[0])
sns.countplot(x='HomePlanet',hue='CryoSleep',data=temp, ax=ax[1])
sns.countplot(x='HomePlanet',hue='Destination',data=temp, ax=ax[2])
sns.countplot(x='HomePlanet',hue='VIP',data=temp, ax=ax[3])
sns.countplot(x='HomePlanet',hue='Deck',data=temp, ax=ax[4])
sns.countplot(x='HomePlanet',hue='Side',data=temp, ax=ax[5])

temp.groupby('HomePlanet')['RoomService'].sum().plot(kind='bar', ax=ax[6], legend=True)
temp.groupby('HomePlanet')['FoodCourt'].sum().plot(kind='bar', ax=ax[7], legend=True)
temp.groupby('HomePlanet')['ShoppingMall'].sum().plot(kind='bar', ax=ax[8], legend=True)
temp.groupby('HomePlanet')['Spa'].sum().plot(kind='bar', ax=ax[9], legend=True)
temp.groupby('HomePlanet')['VRDeck'].sum().plot(kind='bar', ax=ax[10], legend=True)
temp.groupby('HomePlanet')['Group'].count().plot(kind='bar', ax=ax[11], legend=True)
temp.groupby('HomePlanet')['pp'].count().plot(kind='bar', ax=ax[12], legend=True)
temp.groupby('HomePlanet')['Num'].count().plot(kind='bar', ax=ax[13], legend=True)
fig.tight_layout()
plt.show()

In [None]:
temp[(temp.VIP == True) & (temp.HomePlanet == 'Earth')]

In [None]:
temp[(temp.Destination == 'PSO J318.5-22') & (temp.HomePlanet == 'Europa')].shape

In [None]:
temp.groupby('HomePlanet')['Deck'].unique()

An important observation that can help fill `nan` values: no passenger from Earth has VIP status, only passengers from Earth are on Deck 'G', and only passengers from Europa are on Decks 'B', 'A' and 'C'.<br>

Only 29 of the roughly 2000 passengers departing from Europa have PSO J318.5-22 as their destination.
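Relationships like these can be confirmed with a cross-tabulation instead of reading them off the charts. The sketch below uses a made-up `toy` frame, so the counts are not the dataset's; on the real data the same calls would be applied to `temp`.

```python
import pandas as pd

# Made-up rows that mimic the column layout of `temp`.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Europa", "Europa", "Mars", "Mars"],
    "Deck":       ["G",     "F",     "B",      "A",      "F",    "E"],
    "VIP":        [False,   False,   True,     False,    True,   False],
})

# Rows: planets, columns: decks, cells: passenger counts.
deck_by_planet = pd.crosstab(toy["HomePlanet"], toy["Deck"])
print(deck_by_planet)

# Decks whose passengers all come from a single planet.
is_single = ((deck_by_planet > 0).sum(axis=0) == 1).to_numpy()
single_planet_decks = list(deck_by_planet.columns[is_single])
print(single_planet_decks)

# Number of VIP passengers per planet.
vip_count = toy.groupby("HomePlanet")["VIP"].sum()
print(vip_count)
```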

### **2.1.3 VIP --> expenses**

In [None]:
temp['Expenses'] = temp['RoomService'] +temp['FoodCourt'] + temp['ShoppingMall'] + temp['Spa'] + temp['VRDeck']

In [None]:
temp.groupby('VIP')['Expenses'].mean().plot(kind='bar', legend=True)

As the `Age` and `HomePlanet` charts show, the individual bill columns do not reveal distinct patterns, so they will be combined into a single `Expenses` column.<br>

Missing `Expenses` values can be filled with the mean expenses for the passenger's VIP status.

In [None]:
EXPENSES_COLUMNS = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
def combine_expenses(df):
    df['Expenses'] = 0
    for column in EXPENSES_COLUMNS:
        df['Expenses'] = df['Expenses'] + df[column]
    #temp.drop(EXPENSES_COLUMNS, axis=1, inplace=True)
    return df

### **2.1.4 CryoSleep**

In [None]:
temp.groupby('CryoSleep')['Expenses'].sum().plot(kind='bar', legend=True)

In [None]:
temp[(temp.CryoSleep == True) & (temp.VIP == True)]

Passengers in `CryoSleep` have no expenses, and those travelling from `Europa` are VIPs.

### **2.1.5 Outliers**

In [None]:
plt.figure(figsize=(10,10))
for i, feature in enumerate(EXPENSES_COLUMNS + ["Expenses"]):
    plt.subplot(3,2,i+1)
    sns.boxplot(x = temp[feature])


In [None]:
print(temp['RoomService'].mean())
print(temp[temp['RoomService'] > 0]['RoomService'].median())

In [None]:
print(temp['FoodCourt'].mean())
print(temp[temp['FoodCourt'] > 0]['FoodCourt'].median())

In [None]:
print(temp['ShoppingMall'].mean())
print(temp[temp['ShoppingMall'] > 0]['ShoppingMall'].median())

In [None]:
print(temp['Spa'].mean())
print(temp[temp['Spa'] > 0]['Spa'].median())

In [None]:
print(temp['VRDeck'].mean())
print(temp[temp['VRDeck'] > 0]['VRDeck'].median())

`Expenses` - because of the many outliers, `nan` values will be filled with the median.
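The effect of outliers on the two statistics can be seen on a handful of made-up bills. This is only a sketch of a group-wise median fill under simplified assumptions; the cleaning code later in the notebook additionally restricts the median to positive bills and treats `CryoSleep` passengers and children separately.

```python
import numpy as np
import pandas as pd

# Made-up bills: one extreme value among the non-VIP passengers.
toy = pd.DataFrame({
    "VIP": [False, False, False, False, True, True, True],
    "Spa": [10.0, 20.0, 5000.0, np.nan, 300.0, 500.0, np.nan],
})

# The single large bill pulls the mean far away from a typical value,
# while the median stays close to it.
print(toy.groupby("VIP")["Spa"].agg(["mean", "median"]))

# Fill each missing bill with the median of its own VIP group.
group_median = toy.groupby("VIP")["Spa"].transform("median")
toy["Spa_filled"] = toy["Spa"].fillna(group_median)
print(toy)
```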

In [None]:
#TODO Check similar LastNames for HomePlanet 

In [None]:
sns.heatmap(temp.select_dtypes('number').corr(), annot=True, cbar=False)
plt.show()

## **2.2 Data cleaning**

#Describe steps based on previous analysis

In [None]:
train_test = pd.concat([train, test])

In [None]:
train_test = (train_test
              .pipe(split_cabin)
              .pipe(split_name)
              .pipe(split_passenger_id)
              )

In [None]:
train_test[train_test.duplicated()]

There are no duplicated rows between the train and test sets.

### **2.2.1 VIP**

In [None]:
train_test.VIP.isnull().sum()

In [None]:
train_test['VIP'] = np.where(train_test.VIP.isnull() & (train_test.Age < 25), False, train_test.VIP)

In [None]:
train_test['VIP'] = np.where(train_test.VIP.isnull() & train_test.HomePlanet.str.contains('Earth'), False, train_test.VIP)
train_test['VIP'] = np.where(train_test.VIP.isnull() & train_test.HomePlanet.str.contains('Europa'), True, train_test.VIP)

In [None]:
train_test.VIP.isnull().sum()

### **2.2.2 HomePlanet**

In [None]:
#group_age after fill nans

In [None]:
train_test.HomePlanet.isnull().sum()

In [None]:
train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('G'), 'Earth', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('B'), 'Europa', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('A'), 'Europa', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('C'), 'Europa', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.VIP & train_test.Deck.str.contains('F'), 'Mars', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & (train_test.VIP==False), 'Earth', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.CryoSleep & train_test.VIP, 'Europa', train_test.HomePlanet)

In [None]:
train_test.HomePlanet.isnull().sum()

There are still 2 `nan` values, only 1 of which is in the test set.

### **2.2.3 Expenses**

In [None]:
for column in EXPENSES_COLUMNS:
    if_VIP_true = train_test[(train_test[column] > 0) & (train_test.VIP)][column].median()
    if_VIP_false = train_test[(train_test[column] > 0) & (train_test.VIP == False)][column].median()
    train_test[column] = np.where(train_test[column].isnull() & train_test.CryoSleep, 0, train_test[column])
    train_test[column] = np.where((train_test[column].isnull() )& (train_test.Age < 13), 0, train_test[column])
    train_test[column] = np.where(train_test[column].isnull() & (train_test.CryoSleep == False) & train_test.VIP, if_VIP_true, train_test[column])
    train_test[column] = np.where(train_test[column].isnull() & (train_test.CryoSleep == False), if_VIP_false, train_test[column])

In [None]:
train_test = combine_expenses(train_test)

In [None]:
train_test.Expenses.isnull().sum()

### **2.2.4 CryoSleep**

In [None]:
train_test.CryoSleep.isnull().sum()

In [None]:
train_test['CryoSleep'] = np.where(train_test.CryoSleep.isnull() & (train_test.Expenses == 0), True, train_test.CryoSleep)
train_test['CryoSleep'] = np.where(train_test.CryoSleep.isnull() & (train_test.Expenses > 0), False, train_test.CryoSleep)

In [None]:
train_test.CryoSleep.isnull().sum()

### **2.2.5 Age**

In [None]:
train_test.Age.isnull().sum()

In [None]:
median_without_expenses = train_test[(train_test.Age < 13)]['Age'].median()

In [None]:
median_with_expenses = train_test[(train_test.Age > 12)]['Age'].median()

In [None]:
train_test['Age'] = np.where(train_test.Age.isnull() & (train_test.Expenses == 0), median_without_expenses, train_test.Age)
train_test['Age'] = np.where(train_test.Age.isnull() & (train_test.Expenses > 0), median_with_expenses, train_test.Age)

In [None]:
train_test.Age.isnull().sum()

### **2.2.6 Destination**

No correlation with other features was found, so `nan` values are filled with the `mode`.

In [None]:
train_test.Destination.isnull().sum()

In [None]:
train_test['Destination'] = train_test.Destination.fillna(value=train_test.Destination.mode()[0])

In [None]:
train_test.Destination.isnull().sum()

In [None]:
#ProfileReport(train, title='Spaceship Titanic').to_file('Spaceship_Titanic.html')

### **2.2.7 Deck/Num/Side**

In [None]:
train_test.Deck.isnull().sum()

The number of `nan` values in `Side` and `Num` is the same as in `Deck`, because all three were created from one feature.<br>
First I will try to restore `Deck` values from `HomePlanet`: 'G' for 'Earth' and 'B', 'A', 'C' for 'Europa'.<br>

**Deck**

In [None]:
train_test[train_test.Deck.isnull()]

In [None]:
train_test.groupby(['HomePlanet', 'VIP', 'Deck'])['Destination'].count()

The most common `Deck` for 'Mars' is 'F', which is why I fill the remaining `nan` values with 'F'.

In [None]:
train_test['Deck'] = np.where(train_test.Deck.isnull() & train_test.HomePlanet.str.contains('Earth'), 'G', train_test.Deck)
train_test['Deck'] = np.where(train_test.Deck.isnull() & train_test.HomePlanet.str.contains('Europa'), 'B', train_test.Deck)
train_test['Deck'] = np.where(train_test.Deck.isnull() & train_test.HomePlanet.str.contains('Mars'), 'F', train_test.Deck)

In [None]:
train_test.Deck.isnull().sum()

**Num**

`Num` will be filled based on the preceding row when it shares the same `LastName`.

In [None]:
train_test.Num.isnull().sum()

`Num` and `Side` are the same when `Group` and `LastName` are the same.

In [None]:
train_test['Num'] = np.where(train_test.Num.isnull() & train_test.LastName.eq(train_test.LastName.shift()), train_test.Num.shift(), train_test.Num)
train_test['Side'] = np.where(train_test.Side.isnull() & train_test.LastName.eq(train_test.LastName.shift()), train_test.Side.shift(), train_test.Side)

In [None]:
print(train_test.Num.isnull().sum())
print(train_test.Side.isnull().sum())
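The cell above compares each row only with the row directly before it. As an alternative, the rule "same `Group` and `LastName` implies same `Num` and `Side`" could be applied with a group-wise fill. This is a sketch of that swapped-in technique on made-up rows, not the approach used in this notebook; the names and values are invented.

```python
import numpy as np
import pandas as pd

# Made-up passengers; two of them share both Group and LastName.
toy = pd.DataFrame({
    "Group":    ["0001", "0001", "0002", "0002", "0003"],
    "LastName": ["Smith", "Smith", "Jones", "Brown", "Clark"],
    "Num":      ["0", np.nan, "1", np.nan, np.nan],
    "Side":     ["P", np.nan, "S", np.nan, np.nan],
})

keys = ["Group", "LastName"]
for col in ["Num", "Side"]:
    # First known value among passengers sharing both keys (nan if none is known).
    first_known = toy.groupby(keys)[col].transform("first")
    toy[col] = toy[col].fillna(first_known)

print(toy)
```

Rows without any known companion stay `nan` and would still need a fallback such as the mode.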

The remaining `nan` values will be filled with the `mode`.

In [None]:
train_test['Num'] = train_test.Num.fillna(value=train_test.Num.mode()[0])
train_test['Side'] = train_test.Side.fillna(value=train_test.Side.mode()[0])

### **2.2.8 LastName**

In [None]:
train_test['LastName'] = np.where(train_test.LastName.isnull(), train_test.LastName.shift(), train_test.LastName)


In [None]:
train_test.LastName.isnull().sum()

### **2.2.9 Repeat steps**
The steps are repeated because `nan` values filled in later steps provide the basis for filling gaps left by the earlier ones.

**VIP**

In [None]:
train_test['VIP'] = np.where(train_test.VIP.isnull() & (train_test.Age < 25), False, train_test.VIP)
train_test['VIP'] = np.where(train_test.VIP.isnull() & train_test.HomePlanet.str.contains('Earth'), False, train_test.VIP)

**HomePlanet**

In [None]:
train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('G'), 'Earth', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('B'), 'Europa', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('A'), 'Europa', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.Deck.str.contains('C'), 'Europa', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.VIP & train_test.Deck.str.contains('F'), 'Mars', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & (train_test.VIP==False), 'Earth', train_test.HomePlanet)

train_test['HomePlanet'] = np.where(train_test.HomePlanet.isnull() & train_test.CryoSleep & train_test.VIP, 'Europa', train_test.HomePlanet)

**Expenses**

In [None]:
for column in EXPENSES_COLUMNS:
    if_VIP_true = train_test[(train_test[column] > 0) & (train_test.VIP)][column].median()
    if_VIP_false = train_test[(train_test[column] > 0) & (train_test.VIP == False)][column].median()
    train_test[column] = np.where(train_test[column].isnull() & train_test.CryoSleep, 0, train_test[column])
    train_test[column] = np.where((train_test[column].isnull() )& (train_test.Age < 13), 0, train_test[column])
    train_test[column] = np.where(train_test[column].isnull() & (train_test.CryoSleep == False) & train_test.VIP, if_VIP_true, train_test[column])
    train_test[column] = np.where(train_test[column].isnull() & (train_test.CryoSleep == False), if_VIP_false, train_test[column])

In [None]:
train_test = combine_expenses(train_test)

**CryoSleep**

In [None]:
train_test['CryoSleep'] = np.where(train_test.CryoSleep.isnull() & (train_test.Expenses == 0), True, train_test.CryoSleep)
train_test['CryoSleep'] = np.where(train_test.CryoSleep.isnull() & (train_test.Expenses > 0), False, train_test.CryoSleep)

In [None]:
train_test.drop('Transported', axis=1).isna().sum().plot(kind='bar')

In [None]:
train_test.drop(['FirstName'], axis=1, inplace=True)

### **2.2.10 Drop `nan`**

In [None]:
get_number_of_null_cells_in_row(train_test.drop('Transported', axis=1))

`nan` in test set

In [None]:
get_number_of_null_cells_in_row(train_test[train_test.Transported.isnull()])

In [None]:
train = train_test[~train_test.Transported.isnull()].dropna().copy()
test = train_test[train_test.Transported.isnull()].copy()

In [None]:
get_number_of_null_cells_in_row(train)

1841 rows have been restored by filling `nan` values in the train dataset.

## **2.3 Feature Engineering**

In [None]:
train_test = pd.concat([train, test])

In [None]:
train_test.columns

In [None]:
train_test.info()

In [None]:
train_test = group_age(train_test)

In [None]:
train_test.VIP = train_test.VIP.replace({True : 1, False : 0})
train_test.CryoSleep = train_test.CryoSleep.replace({True : 1, False : 0})
train_test.Side = train_test.Side.replace({'P' : 1, 'S' : 0}) 
# train_test.Transported = train_test.Transported.astype('bool') 

In [None]:
categorical = ['Age', 'LastName', 'Num','Group', 'pp']  # deleted pp
for column in categorical:
    encoder = LabelEncoder()
    train_test[column] = encoder.fit_transform(train_test[column])

In [None]:
train_test = pd.get_dummies(train_test, columns=['HomePlanet', 'Destination', 'Deck'])

In [None]:
train_test.info()

### **2.3.1 Data splitting**

In [None]:
train_test.reset_index(drop=True, inplace=True)
train = train_test[~train_test.Transported.isnull()].copy().astype('float64')

test = train['Transported'].copy().astype('float64')
train = train.drop(labels=['PassengerId', 'Transported'], axis=1)

submission_test = train_test[train_test.Transported.isnull()].copy()
submission = submission_test.PassengerId.copy()
submission_test = submission_test.drop(labels =['PassengerId', 'Transported'], axis=1)

In [None]:
X_train, X_test,y_train, y_test = train_test_split(train,test, test_size=0.2, random_state=42)

X_train, X_val,y_train, y_val = train_test_split(X_train,y_train, test_size=0.2, random_state=42)
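Two consecutive 80/20 splits leave 64% of the rows for training, 16% for validation and 20% for testing (0.8 × 0.8 = 0.64 and 0.8 × 0.2 = 0.16). The sketch below confirms the sizes on 1000 synthetic rows; the `_demo` names are made up so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic rows, only used to count how many land in each split.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

X_rest, X_te_demo, y_rest, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_tr_demo, X_va_demo, y_tr_demo, y_va_demo = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42)

split_sizes = (len(X_tr_demo), len(X_va_demo), len(X_te_demo))
print(split_sizes)
```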

### **2.3.2 Data Normalization**

In [None]:
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)
# X_val = scaler.transform(X_val)

In [None]:
train_test[~train_test.Transported.isnull()]['Transported'].value_counts().plot(kind='pie', autopct='%1.1f%%')

After data cleaning, the dataset balance stays the same as before.

# **3.Models**

The algorithms used:<br>
    XGBoost<br>
    LightGBM<br>
    CatBoost<br>
    
Neural Network:<br>
    PyTorch logistic regression

## **3.1 Machine Learning algorithms**

In [None]:
def kfold_mean(model, n_splits):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    scores = []

    # NOTE: `model` is scored as passed in and is not refitted per fold, so each
    # fold may contain rows the model was already trained on.
    for train_idx, test_idx in kf.split(train):

        # `test` holds the target column (see 2.3.1); only the held-out part is scored
        X_fold, y_fold = train.iloc[test_idx], test.iloc[test_idx]

        preds = model.predict(X_fold)
        scores.append(accuracy_score(y_fold, preds))
    accuracy = np.mean(scores)
    print(f"KFold mean score: {accuracy}")
    return accuracy
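`kfold_mean` only calls `predict` on the model it receives. For comparison, a cross-validation loop that refits an unfitted copy on every fold could look like the sketch below. The helper `kfold_refit_mean`, the synthetic data and the use of scikit-learn's `LogisticRegression` are assumptions made for illustration; boosted models configured with early stopping would also need their `eval_set` passed to `fit`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def kfold_refit_mean(model, X, y, n_splits=5):
    """Mean accuracy over folds, refitting a fresh copy of `model` on each fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        fold_model = clone(model)  # unfitted copy with the same hyperparameters
        fold_model.fit(X[train_idx], y[train_idx])
        preds = fold_model.predict(X[test_idx])
        scores.append(accuracy_score(y[test_idx], preds))
    return float(np.mean(scores))


# Synthetic binary classification data, used only to exercise the helper.
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)
cv_accuracy = kfold_refit_mean(LogisticRegression(max_iter=1000), X_cv, y_cv)
print(f"KFold mean score: {cv_accuracy:.3f}")
```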

In [None]:
def calculate_metrics(y_test, preds):
    accuracy = accuracy_score(y_test, preds)
    recall = recall_score(y_test, preds)
    precision = precision_score(y_test, preds)
    f1 = f1_score(y_test, preds)
    return accuracy, recall, precision, f1

### **3.1.1 XGBoost**

In [None]:
#loss function and evaluation metric serve two different purposes. 
#The loss function is used by the model to learn the relationship between input and output. 
#The evaluation metric is used to assess how good the learned relationship is. 
#Here is a link to a discussion of model https://scikit-learn.org/stable/modules/model_evaluation.html
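The difference between the two can be seen on a few made-up predictions: both sets of probabilities below classify every sample correctly, so accuracy is identical, yet the log loss separates the confident predictions from the hesitant ones. This is only an illustrative sketch using scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])
hesitant = np.array([0.6, 0.4, 0.55, 0.45])

results = {}
for name, proba in [("confident", confident), ("hesitant", hesitant)]:
    acc = accuracy_score(y_true, (proba >= 0.5).astype(int))  # evaluation metric
    ll = log_loss(y_true, proba)                               # loss the model minimises
    results[name] = (acc, ll)
    print(f"{name}: accuracy={acc:.2f}, logloss={ll:.3f}")
```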

In [None]:
def objective(trial):

    params = {
        #Parameters to tune
        'n_estimators': trial.suggest_int('n_estimators', 50, 3000),
        'max_depth': trial.suggest_int('max_depth', 1, 20),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        
        #Parameters for faster speed
        'colsample_bytree' : trial.suggest_loguniform('colsample_bytree', 0.01, 1.0),
        'subsample': trial.suggest_loguniform('subsample', 0.01, 1.0),
        
        #Parameters to control overfitting
        'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 1.0),
        #'gamma': trial.suggest_loguniform('gamma',1, 10),
        'early_stopping_rounds' : trial.suggest_int('early_stopping_rounds',5,30),
        
        #Regularizers for bigger dataset
        'alpha' : trial.suggest_int('alpha', 0, 5),
        #'lambda' : trial.suggest_int('lambda', 1, 5),
        
        #Loss function and evaluation metric
        'objective' : 'binary:logistic', # represents cross entropy loss function
        'eval_metric': 'logloss', # log loss matches the 'binary:logistic' objective used for binary classification
        
        #'tree_method' : 'gpu_hist'
    
    }
    
    
    # Fit the model
    optuna_model = xgb.XGBClassifier(**params)
    optuna_model.fit(X_train, y_train, verbose=False, eval_set=[(X_val, y_val)])

    # Make predictions
    y_pred = optuna_model.predict(X_test)

    # Evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

In [None]:
%%time
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

In [None]:
xgbc = xgb.XGBClassifier(**study.best_trial.params)
xgbc.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False) 
kfold_acc = kfold_mean(xgbc,10)


In [None]:
accuracy, recall, precision, f1 = calculate_metrics(y_test, xgbc.predict(X_test))

xgbc_metrics = {'Model': 'XGBoost','KFold cv accuracy': kfold_acc ,'Accuracy': accuracy, 'recall': recall, 'precision': precision, 'f1': f1}

### **3.1.2 LightGBM**

In [None]:
def objective(trial):

    params={
    #Parameters to tune
    'num_leaves' : trial.suggest_int('num_leaves',2,800), #should be less than 2^max_depth lower better acc
    'max_depth' : trial.suggest_int('max_depth', 1, 15),
    'min_data_in_leaf' : trial.suggest_int('min_data_in_leaf', 0, 400),
    
    #Parameters for better accuracy
    'max_bin' : trial.suggest_int('max_bin',100,600), #small number of bins may reduce training accuracy 
                                                    #but may increase general power (deal with over-fitting)
    'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
    
    #Parameters to control over-fitting
    'min_gain_to_split' : trial.suggest_loguniform('min_gain_to_split', 1.0, 10.0),
    'early_stopping' : trial.suggest_int('early_stopping_rounds',5,30),
    
    #Loss function and evaluation metric
    'objective' : 'binary', 
    'metric': 'binary_logloss', 
        
    #Regularizers
    'lambda_l1' : trial.suggest_loguniform('lambda_l1', 0.01, 5.0),
    #'lambda_l2' : trial.suggest_loguniform('lambda_l2', 0.01, 5.0),
        
    'device_type' : 'CPU',
    'n_jobs' : -1,
    'verbose' : -1,
    'verbose_eval' : -1
}
    
    
    # Fit the model
    optuna_model = lgbm.LGBMClassifier(**params)
    optuna_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

    # Make predictions
    y_pred = optuna_model.predict(X_test)

    # Evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
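The comment on `num_leaves` refers to the fact that a binary tree of depth `d` has at most `2^d` leaves, so larger values cannot be used by a depth-limited tree. The arithmetic for the search ranges above can be sketched as follows; `max_useful_leaves` is a hypothetical helper, not part of LightGBM or Optuna.

```python
def max_useful_leaves(max_depth: int) -> int:
    """Largest number of leaves a binary tree of depth `max_depth` can have."""
    return 2 ** max_depth


for depth in (1, 4, 9, 15):
    print(depth, max_useful_leaves(depth))

# The upper end of the num_leaves range (800) only fits once the depth is large enough.
smallest_depth_for_800 = next(
    d for d in range(1, 16) if max_useful_leaves(d) >= 800
)
print(smallest_depth_for_800)
```

Trials that pair a small `max_depth` with a large `num_leaves` therefore explore settings where the leaf limit has no effect.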

In [None]:
%%time
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, n_jobs=-1)

In [None]:
trial = study.best_trial
lgbmc = lgbm.LGBMClassifier(**trial.params)
lgbmc.fit(X_train, y_train, eval_set=[(X_val, y_val)],verbose=False,)
kfold_acc = kfold_mean(lgbmc, 10)

In [None]:
accuracy, recall, precision, f1 = calculate_metrics(y_test, lgbmc.predict(X_test))

lgbmc_metrics = {'Model': 'LightGBM','KFold cv accuracy': kfold_acc ,'Accuracy': accuracy, 'recall': recall, 'precision': precision, 'f1': f1}

In [None]:
lgbmc_metrics

In [None]:
# sub_pred_lgbmc = lgbmc.predict(submission_test)

# sub_pred_lgbmc = sub_pred_lgbmc.astype('bool')

### **3.1.3 CatBoost**

In [None]:
def objective(trial):
# https://practicaldatascience.co.uk/machine-learning/how-to-tune-a-catboostclassifier-model-with-optuna
    params={
    #Parameters to tune
    'iterations' : trial.suggest_int('iterations',100,2000),
    'depth' : trial.suggest_int('depth', 1, 15),    
    'min_data_in_leaf' : trial.suggest_int('min_data_in_leaf', 0, 200),
    
    #Parameters to control overfitting
    'early_stopping_rounds': trial.suggest_int('early_stopping_rounds',5,30),
    'od_type' : trial.suggest_categorical("od_type", ["IncToDec", "Iter"]),
    'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 1.0),
        
    #Regularization
    'l2_leaf_reg' : trial.suggest_float("l2_leaf_reg", 1e-8, 100.0),
        
    #loss function and evaluation metric
    'objective' : trial.suggest_categorical("objective", ['Logloss']),
    'eval_metric' : trial.suggest_categorical('eval_metric',['Accuracy']),
    
    #'task_type' : 'GPU,'
    'verbose' : False
}
    
    
    # Fit the model
    optuna_model = cb.CatBoostClassifier(**params)
    optuna_model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

    # Make predictions
    y_pred = optuna_model.predict(X_test)

    # Evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

In [None]:
%%time
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, n_jobs=-1)

In [None]:
cbc = cb.CatBoostClassifier(**study.best_trial.params)
cbc.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
kfold_acc = kfold_mean(cbc, 10)

In [None]:
accuracy, recall, precision, f1 = calculate_metrics(y_test, cbc.predict(X_test))

cbc_metrics = {'Model': 'CatBoost','KFold cv accuracy': kfold_acc ,'Accuracy': accuracy, 'recall': recall, 'precision': precision, 'f1': f1}

## **Model comparison**

In [None]:
metrics = pd.DataFrame([xgbc_metrics, lgbmc_metrics, cbc_metrics]).set_index('Model', drop=True)
(metrics.style
    .highlight_min('f1', color='red')
    .highlight_max('f1', color='lightgreen')
    .highlight_min('precision', color='red')
    .highlight_max('precision', color='lightgreen')
    .highlight_min('recall', color='red')
    .highlight_max('recall', color='lightgreen')
    .highlight_min('Accuracy', color='red')
    .highlight_max('Accuracy', color='lightgreen')
    .highlight_min('KFold cv accuracy', color='red')
    .highlight_max('KFold cv accuracy', color='lightgreen')
     
)

In [None]:
fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(lgbmc, X_test, y_test, ax=ax)
RocCurveDisplay.from_estimator(xgbc, X_test, y_test, ax=ax)
RocCurveDisplay.from_estimator(cbc, X_test, y_test, ax=ax)
plt.show()

## **Submission**

In [None]:
sub_pred_cbc = cbc.predict(submission_test)

sub_pred_cbc = sub_pred_cbc.astype('bool')

In [None]:
sub = pd.read_csv('data/sample_submission.csv')

sub['Transported'] = sub_pred_cbc

sub.to_csv('data/submission.csv', index=False)

## **Conclusions**

## **PyTorch**

In [None]:
device = (
    "cuda"
    if torch.cuda.is_available()
#     else "mps"
#     if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

In [None]:
# Model
class LogisticRegression(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(input_size, output_size)

    def forward(self, x):
        outputs = torch.sigmoid(self.linear(x))
        return outputs
        
#Create model
model = LogisticRegression(X_train.shape[1], 1).to(device)
print(model)

In [None]:
#loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

In [None]:
#k-fold parameters
num_folds = 10
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

In [None]:
%%time
#Cross-Validation
scores = []
epochs = 20000
for train_index, test_index in kf.split(train):
    X_train, X_test = torch.Tensor(train.iloc[train_index,:].values), torch.Tensor(train.iloc[test_index,:].values)
    y_train, y_test = torch.Tensor(test.iloc[train_index].values), torch.Tensor(test.iloc[test_index].values)

    # Start every fold from a fresh model and optimizer so that weights learned
    # on one fold's training rows are not carried over into the next fold
    model = LogisticRegression(X_train.shape[1], 1).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.003)

    # Model Training (full batch)
    x = X_train.to(device)
    labels = y_train.to(device)
    for epoch in range(int(epochs)):
        optimizer.zero_grad() # Setting our stored gradients equal to zero
        outputs = model(x)
        loss = criterion(torch.squeeze(outputs), labels)

        loss.backward() # Computes the gradient of the loss w.r.t. the weights/bias
        optimizer.step() # Updates weights and biases with the optimizer (Adam)

    with torch.no_grad():
        # Calculating the accuracy for the held-out fold
        outputs_test = torch.squeeze(model(X_test.to(device)))
        predicted_test = outputs_test.cpu().round().numpy()
        accuracy_test = 100 * np.mean(predicted_test == y_test.numpy())
        scores.append(accuracy_test)

accuracy = np.mean(scores)
print(f"{device} KFold mean accuracy score: {accuracy}\n")