# Predicting whether a passenger was transported to an alternate dimension🚀

## 1. Problem Definition

Our task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly.

## 2. Data
The data is downloaded from the Kaggle Spaceship Titanic competition.

 - **train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    - `PassengerId` - A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.
    - `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
    - `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    - `Cabin` - The cabin number where the passenger is staying. Takes the form `deck/num/side`, where `side` can be either `P` for *Port* or `S` for *Starboard*.
    - `Destination` - The planet the passenger will be debarking to.
    - `Age` - The age of the passenger.
    - `VIP` - Whether the passenger has paid for special VIP service during the voyage.
    - `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    - `Name` - The first and last names of the passenger.
    - `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
    
- **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

- **sample_submission.csv** - A submission file in the correct format.
    - `PassengerId` - Id for each passenger in the test set.
    - `Transported` - The target. For each passenger, predict either True or False.
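
Because `PassengerId` and `Cabin` each pack several fields into one string, they can be split into separate columns. The sketch below is illustrative only: the sample rows merely follow the dataset's format, and the new column names (`Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are assumptions, not part of the dataset.

```python
import pandas as pd

# Sample values in the dataset's format; the last Cabin is deliberately missing.
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# gggg_pp -> group id and number within the group
sample[["Group", "NumberInGroup"]] = sample["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three columns; a missing Cabin stays missing in all three
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

# Number of passengers travelling in the same group
sample["GroupSize"] = sample.groupby("Group")["PassengerId"].transform("count")

print(sample)
```

Whether these derived columns actually help the model would have to be checked on the validation set.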

## 3. Evaluation

The evaluation metric is classification accuracy: the percentage of predicted labels that are correct.
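
As a concrete illustration of the metric, accuracy is simply the mean of an element-wise comparison between predictions and true labels. The labels below are made up for the example; `accuracy_score` from scikit-learn gives the same number.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: three of the five predictions match
y_true = np.array([True, False, True, True, False])
y_pred = np.array([True, False, False, True, True])

manual_accuracy = (y_true == y_pred).mean()   # fraction of matching labels
sk_accuracy = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn

print(manual_accuracy, sk_accuracy)
```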

## 4. Features

We will explore all the features and shortlist the important ones.

For more information, see https://www.kaggle.com/competitions/spaceship-titanic/overview/evaluation

## Exploratory Data Analysis

In [280]:
# Import all the required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [296]:
df = pd.read_csv("data/train.csv")

In [297]:
df

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


In [298]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [299]:
df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [300]:
df.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [301]:
df.shape

(8693, 14)

In [302]:
df.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')

In [303]:
df["HomePlanet"].unique()

array(['Europa', 'Earth', 'Mars', nan], dtype=object)

In [304]:
df["CryoSleep"].unique()

array([False, True, nan], dtype=object)

In [305]:
len(df["Cabin"].unique())

6561

In [306]:
df["Destination"].unique()

array(['TRAPPIST-1e', 'PSO J318.5-22', '55 Cancri e', nan], dtype=object)

In [307]:
df["VIP"].unique()

array([False, True, nan], dtype=object)

In [308]:
df["Transported"].unique()

array([False,  True])

In [309]:
df.corr(numeric_only=True)


Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
Age,1.0,0.068723,0.130421,0.033133,0.12397,0.101007,-0.075026
RoomService,0.068723,1.0,-0.015889,0.05448,0.01008,-0.019581,-0.244611
FoodCourt,0.130421,-0.015889,1.0,-0.014228,0.221891,0.227995,0.046566
ShoppingMall,0.033133,0.05448,-0.014228,1.0,0.013879,-0.007322,0.010141
Spa,0.12397,0.01008,0.221891,0.013879,1.0,0.153821,-0.221131
VRDeck,0.101007,-0.019581,0.227995,-0.007322,0.153821,1.0,-0.207075
Transported,-0.075026,-0.244611,0.046566,0.010141,-0.221131,-0.207075,1.0


In [310]:
df["Destination"].value_counts()

TRAPPIST-1e      5915
55 Cancri e      1800
PSO J318.5-22     796
Name: Destination, dtype: int64

In [311]:
df_tmp = df.copy()

In [312]:
df_tmp.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


### Filling missing numeric values

In [313]:
def fillMissingValues(df):
    """
    Fill the missing values of the float-type spending columns with each column's median.
    """
    spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
    for col in spend_cols:
        median = df[col].median()
        print(f"{col} Median : ", median)
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        df[col] = df[col].fillna(median)

    return df

In [314]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [315]:
df_tmp.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [316]:
df_tmp = fillMissingValues(df_tmp)

RoomService Median :  0.0
FoodCourt Median :  0.0
ShoppingMall Median :  0.0
Spa Median :  0.0
VRDeck Median :  0.0
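
The medians are all `0.0` because at least half of the passengers spent nothing on each amenity. An alternative to the hand-written function is scikit-learn's `SimpleImputer`, which learns the medians from one frame and reuses them on another (for example, learn on the training data and apply to the test data). This is a minimal sketch on made-up numbers; the two small frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

spend_cols = ["RoomService", "Spa"]

# Hypothetical miniature train/test frames with gaps
train = pd.DataFrame({"RoomService": [0.0, 0.0, 109.0, np.nan],
                      "Spa": [0.0, 549.0, np.nan, 6715.0]})
test = pd.DataFrame({"RoomService": [np.nan, 10.0],
                     "Spa": [181.0, np.nan]})

imputer = SimpleImputer(strategy="median")
# Learn the medians on the training frame only...
train[spend_cols] = imputer.fit_transform(train[spend_cols])
# ...and reuse those same medians to fill the test frame
test[spend_cols] = imputer.transform(test[spend_cols])

print(imputer.statistics_)  # the per-column medians learned from `train`
```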


In [317]:
df_tmp.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
dtype: int64

In [318]:
df_tmp["HomePlanet"].unique()

array(['Europa', 'Earth', 'Mars', nan], dtype=object)

In [319]:
pd.api.types.is_categorical_dtype(df_tmp["HomePlanet"])

False

In [320]:
for label,content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

PassengerId
HomePlanet
CryoSleep
Cabin
Destination
VIP
Name


In [321]:
df_tmp["VIP"].unique()

array([False, True, nan], dtype=object)

### Converting `PassengerId` from string to integer

In [322]:
def setPID(df_tmp):
    # Strip the underscore and cast the whole column at once; this avoids the
    # chained assignment df_tmp["PassengerId"][i] = ... and its SettingWithCopyWarning
    df_tmp["PassengerId"] = df_tmp["PassengerId"].str.replace("_", "", regex=False).astype("int64")
    return df_tmp

In [327]:
df_tmp = setPID(df_tmp)


In [328]:
df_tmp

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,101,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,201,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,301,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,302,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,401,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,927601,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,927801,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,927901,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,928001,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


In [329]:
df_tmp.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8693.0,8693.0,8693.0,8693.0,8693.0
mean,28.82793,220.009318,448.434027,169.5723,304.588865,298.26182
std,14.489021,660.51905,1595.790627,598.007164,1125.562559,1134.126417
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,41.0,61.0,22.0,53.0,40.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [330]:
df_tmp.to_csv("data/train_tmp.csv",index=False)

In [331]:
df_tmp.loc[0]

PassengerId                 101
HomePlanet               Europa
CryoSleep                 False
Cabin                     B/0/P
Destination         TRAPPIST-1e
Age                        39.0
VIP                       False
RoomService                 0.0
FoodCourt                   0.0
ShoppingMall                0.0
Spa                         0.0
VRDeck                      0.0
Name            Maham Ofracculy
Transported               False
Name: 0, dtype: object

In [332]:
df_tmp = pd.read_csv("data/train_tmp.csv")

In [333]:
df_tmp.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
dtype: int64

In [334]:
df_tmp["HomePlanet"].unique()

array(['Europa', 'Earth', 'Mars', nan], dtype=object)

In [335]:
pd.api.types.is_string_dtype(df_tmp["HomePlanet"])

True

In [336]:
df_tmp["HomePlanet"].dtypes

dtype('O')

In [337]:
pd.Categorical(df_tmp["HomePlanet"])

['Europa', 'Earth', 'Europa', 'Europa', 'Earth', ..., 'Europa', 'Earth', 'Earth', 'Europa', 'Europa']
Length: 8693
Categories (3, object): ['Earth', 'Europa', 'Mars']

In [338]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   int64  
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8693 non-null   float64
 8   FoodCourt     8693 non-null   float64
 9   ShoppingMall  8693 non-null   float64
 10  Spa           8693 non-null   float64
 11  VRDeck        8693 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), int64(1), object(6)
memory usage: 891.5+ KB


### Converting string columns into Categorical and integers

Note: `pd.Categorical(...).codes` encodes missing values as `-1`. Because we add 1 to the codes, missing values in these columns end up as `0` and the real categories start at `1`.
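
To see how the encoding treats a missing value, here is a small sketch on a made-up `HomePlanet`-style column (the sample values are illustrative):

```python
import numpy as np
import pandas as pd

planets = pd.Series(["Europa", "Earth", np.nan, "Mars"])
cat = pd.Categorical(planets)

print(list(cat.categories))  # categories are sorted alphabetically
print(cat.codes)             # the missing entry gets code -1
print(cat.codes + 1)         # shifting by one turns the missing code into 0
```

The shifted codes mean that `0` always stands for "missing", which is why `HomePlanet` later shows the values `0` to `3`.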

In [339]:
def to_category(df):
    """
    Convert string columns (except Name and PassengerId) to integer category codes in place; missing values become 0.
    """
    for label,content in df.items():
        if pd.api.types.is_string_dtype(content) and label != "Name" and label != "PassengerId":
            df[label] = pd.Categorical(df[label]).codes + 1
            print(label)
    return df

In [340]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   int64  
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8693 non-null   float64
 8   FoodCourt     8693 non-null   float64
 9   ShoppingMall  8693 non-null   float64
 10  Spa           8693 non-null   float64
 11  VRDeck        8693 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), int64(1), object(6)
memory usage: 891.5+ KB


In [341]:
to_category(df_tmp)

HomePlanet
CryoSleep
Cabin
Destination
VIP


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,101,2,1,150,3,39.0,1,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,201,1,1,2185,3,24.0,1,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,301,2,1,2,3,58.0,2,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,302,2,1,2,3,33.0,1,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,401,1,1,2187,3,16.0,1,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,927601,2,1,147,1,41.0,2,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,927801,1,2,5281,2,18.0,1,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,927901,1,1,5286,3,26.0,1,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,928001,2,1,2132,1,32.0,1,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


In [342]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   int64  
 1   HomePlanet    8693 non-null   int8   
 2   CryoSleep     8693 non-null   int8   
 3   Cabin         8693 non-null   int16  
 4   Destination   8693 non-null   int8   
 5   Age           8514 non-null   float64
 6   VIP           8693 non-null   int8   
 7   RoomService   8693 non-null   float64
 8   FoodCourt     8693 non-null   float64
 9   ShoppingMall  8693 non-null   float64
 10  Spa           8693 non-null   float64
 11  VRDeck        8693 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), int16(1), int64(1), int8(4), object(1)
memory usage: 602.9+ KB


In [343]:
df_tmp

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,101,2,1,150,3,39.0,1,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,201,1,1,2185,3,24.0,1,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,301,2,1,2,3,58.0,2,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,302,2,1,2,3,33.0,1,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,401,1,1,2187,3,16.0,1,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,927601,2,1,147,1,41.0,2,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,927801,1,2,5281,2,18.0,1,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,927901,1,1,5286,3,26.0,1,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,928001,2,1,2132,1,32.0,1,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


In [344]:
df_tmp["CryoSleep"]

0       1
1       1
2       1
3       1
4       1
       ..
8688    1
8689    2
8690    1
8691    1
8692    1
Name: CryoSleep, Length: 8693, dtype: int8

In [345]:
len(df_tmp["Cabin"].unique())

6561

Dropping the `Name` column, as it is not an important feature.

In [346]:
df_tmp.drop("Name",axis=1,inplace=True)

In [347]:
df_tmp

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,101,2,1,150,3,39.0,1,0.0,0.0,0.0,0.0,0.0,False
1,201,1,1,2185,3,24.0,1,109.0,9.0,25.0,549.0,44.0,True
2,301,2,1,2,3,58.0,2,43.0,3576.0,0.0,6715.0,49.0,False
3,302,2,1,2,3,33.0,1,0.0,1283.0,371.0,3329.0,193.0,False
4,401,1,1,2187,3,16.0,1,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,927601,2,1,147,1,41.0,2,0.0,6819.0,0.0,1643.0,74.0,False
8689,927801,1,2,5281,2,18.0,1,0.0,0.0,0.0,0.0,0.0,False
8690,927901,1,1,5286,3,26.0,1,0.0,0.0,1872.0,1.0,0.0,True
8691,928001,2,1,2132,1,32.0,1,0.0,1049.0,0.0,353.0,3235.0,False


### Converting the Transported Column to integer type

In [348]:
df_tmp["Transported"].dtypes

dtype('bool')

In [349]:
df_tmp["Transported"] = df_tmp["Transported"].astype("int")

In [350]:
df_tmp

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,101,2,1,150,3,39.0,1,0.0,0.0,0.0,0.0,0.0,0
1,201,1,1,2185,3,24.0,1,109.0,9.0,25.0,549.0,44.0,1
2,301,2,1,2,3,58.0,2,43.0,3576.0,0.0,6715.0,49.0,0
3,302,2,1,2,3,33.0,1,0.0,1283.0,371.0,3329.0,193.0,0
4,401,1,1,2187,3,16.0,1,303.0,70.0,151.0,565.0,2.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,927601,2,1,147,1,41.0,2,0.0,6819.0,0.0,1643.0,74.0,0
8689,927801,1,2,5281,2,18.0,1,0.0,0.0,0.0,0.0,0.0,0
8690,927901,1,1,5286,3,26.0,1,0.0,0.0,1872.0,1.0,0.0,1
8691,928001,2,1,2132,1,32.0,1,0.0,1049.0,0.0,353.0,3235.0,0


In [351]:
df_tmp.isna().sum()

PassengerId       0
HomePlanet        0
CryoSleep         0
Cabin             0
Destination       0
Age             179
VIP               0
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Transported       0
dtype: int64

In [352]:
df_tmp["Age"] = df_tmp["Age"].fillna(df_tmp["Age"].median())

In [353]:
df_tmp.isna().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64

In [354]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   int64  
 1   HomePlanet    8693 non-null   int8   
 2   CryoSleep     8693 non-null   int8   
 3   Cabin         8693 non-null   int16  
 4   Destination   8693 non-null   int8   
 5   Age           8693 non-null   float64
 6   VIP           8693 non-null   int8   
 7   RoomService   8693 non-null   float64
 8   FoodCourt     8693 non-null   float64
 9   ShoppingMall  8693 non-null   float64
 10  Spa           8693 non-null   float64
 11  VRDeck        8693 non-null   float64
 12  Transported   8693 non-null   int32  
dtypes: float64(6), int16(1), int32(1), int64(1), int8(4)
memory usage: 560.4 KB


In [355]:
df_tmp["HomePlanet"].unique()

array([2, 1, 3, 0], dtype=int8)

Let's save the data :)

In [None]:
df_tmp.to_csv("data/train_tmp_final.csv",index=False)

In [357]:
df_tmp.corr()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
PassengerId,1.0,-0.006395,-0.009191,-0.056606,-0.00451,-0.009099,0.002667,0.000375,-0.0092,0.017795,-0.005198,0.015945,0.021491
HomePlanet,-0.006395,1.0,0.070923,-0.458618,0.025604,0.128368,0.083712,0.204435,0.071952,0.09894,0.054783,0.038626,0.110443
CryoSleep,-0.009191,0.070923,1.0,0.097183,-0.070785,-0.062622,-0.050223,-0.224077,-0.189451,-0.188655,-0.180836,-0.176592,0.424362
Cabin,-0.056606,-0.458618,0.097183,1.0,0.116973,-0.239753,-0.113143,-0.087099,-0.261358,-0.063271,-0.194378,-0.208371,-0.052685
Destination,-0.00451,0.025604,-0.070785,0.116973,1.0,-0.012561,-0.0219,0.043846,-0.09694,0.025577,-0.054825,-0.062423,-0.099737
Age,-0.009099,0.128368,-0.062622,-0.239753,-0.012561,1.0,0.070657,0.068629,0.12739,0.033148,0.120946,0.09959,-0.074233
VIP,0.002667,0.083712,-0.050223,-0.113143,-0.0219,0.070657,1.0,0.029505,0.08842,0.026579,0.047827,0.085979,-0.027802
RoomService,0.000375,0.204435,-0.224077,-0.087099,0.043846,0.068629,0.029505,1.0,-0.015126,0.052337,0.009244,-0.018624,-0.241124
FoodCourt,-0.0092,0.071952,-0.189451,-0.261358,-0.09694,0.12739,0.08842,-0.015126,1.0,-0.013717,0.221468,0.224572,0.045583
ShoppingMall,0.017795,0.09894,-0.188655,-0.063271,0.025577,0.033148,0.026579,0.052337,-0.013717,1.0,0.014542,-0.007849,0.009391


Now that we have converted all the string columns into numbers and filled the missing values, we can choose and fit a model.

## Choosing a model

In [358]:
df_tmp = pd.read_csv("data/train_tmp_final.csv")

In [359]:
df_tmp

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,101,2,1,150,3,39.0,1,0.0,0.0,0.0,0.0,0.0,0
1,201,1,1,2185,3,24.0,1,109.0,9.0,25.0,549.0,44.0,1
2,301,2,1,2,3,58.0,2,43.0,3576.0,0.0,6715.0,49.0,0
3,302,2,1,2,3,33.0,1,0.0,1283.0,371.0,3329.0,193.0,0
4,401,1,1,2187,3,16.0,1,303.0,70.0,151.0,565.0,2.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,927601,2,1,147,1,41.0,2,0.0,6819.0,0.0,1643.0,74.0,0
8689,927801,1,2,5281,2,18.0,1,0.0,0.0,0.0,0.0,0.0,0
8690,927901,1,1,5286,3,26.0,1,0.0,0.0,1872.0,1.0,0.0,1
8691,928001,2,1,2132,1,32.0,1,0.0,1049.0,0.0,353.0,3235.0,0


### Import RandomForestClassifier and implement the model

In [360]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

In [361]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split


In [363]:
X = df_tmp.drop("Transported",axis=1)
y = df_tmp["Transported"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [364]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((6954, 12), (1739, 12), (6954,), (1739,))

In [74]:
model.fit(X_train,y_train)
model.score(X_train,y_train)

1.0

In [75]:
y_preds = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_preds)

The default model reaches an accuracy of `0.7860839562967222` on the validation set, compared with `1.0` on the training set, so it is overfitting. Now let's tune the hyperparameters.
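
A single train/validation split gives one accuracy number, which can shift with the random split. Cross-validation averages over several splits instead. This is a minimal sketch, assuming synthetic data from `make_classification` in place of `X` and `y`; the names `X_demo`, `y_demo` and `clf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
# Five accuracy scores, one per held-out fold
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

print(cv_scores.mean(), cv_scores.std())
```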

## Tuning the hyperparameters

### i. Using RandomizedSearchCV
Let's first use `RandomizedSearchCV` to tune the hyperparameters.

In [77]:
from sklearn.model_selection import RandomizedSearchCV

In [78]:
param_distributions = {
    "n_estimators":np.arange(10,300,20),
    "max_depth":np.arange(1,10,1),
    "min_samples_split":np.arange(2,20,2),
    "min_samples_leaf":np.arange(1,20,2)
}

In [79]:
%%time
np.random.seed(42)
rs_model = RandomizedSearchCV(RandomForestClassifier(),param_distributions,n_iter=50,cv=5,verbose=True)
rs_model.fit(X_train,y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
CPU times: total: 2min 33s
Wall time: 2min 34s


In [29]:
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [80]:
[np.arange(1,10,1)]

[array([1, 2, 3, 4, 5, 6, 7, 8, 9])]

In [81]:
rs_model.score(X_train,y_train)

0.8326143226919759

In [82]:
rs_model.best_params_

{'n_estimators': 90,
 'min_samples_split': 12,
 'min_samples_leaf': 7,
 'max_depth': 9}
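
Besides `best_params_`, a fitted search object also exposes `best_score_` (the mean cross-validated score of the best candidate) and `cv_results_` (scores for every candidate). This is a minimal sketch on synthetic data with a deliberately tiny search space; `X_demo`, `y_demo`, `search` and `results` are hypothetical names, and the numbers it prints are unrelated to the run above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20, 30], "max_depth": [2, 4, 6]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the best candidate (not a training-set score)
print(search.best_score_)

# One row per sampled candidate, best first
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]])
```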

In [83]:
rs_model.score(X_test,y_test)

0.7872340425531915

`RandomizedSearchCV` brings little improvement: the validation accuracy is only `0.7872340425531915`, against `0.7860839562967222` for the default model.

Let's try `GridSearchCV`.

### ii. Using GridSearchCV

In [370]:
from sklearn.model_selection import GridSearchCV

In [371]:
param_grids = {
    "n_estimators":np.arange(10,300,20),
    "max_depth":np.arange(1,10,1)
}

In [87]:
# %%time
# gs_model = GridSearchCV(GradientBoosing(),param_grids,cv=5,verbose=True)

# gs_model.fit(X_train,y_train)

Fitting 5 folds for each of 135 candidates, totalling 675 fits
CPU times: total: 7min 6s
Wall time: 7min 6s
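
For reference, here is a minimal sketch of an exhaustive search with `GridSearchCV`. It assumes a `RandomForestClassifier` and synthetic data, and uses a much smaller grid than `param_grids` so that it runs in seconds; it does not reproduce the run whose output is shown above. The names `X_demo`, `y_demo`, `small_grid` and `grid_search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {"n_estimators": [10, 30], "max_depth": [3, 9]}  # 2 x 2 = 4 candidates

grid_search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
grid_search.fit(X_demo, y_demo)  # 4 candidates x 5 folds = 20 fits, plus one final refit

print(grid_search.best_params_)
print(grid_search.best_score_)
```

Unlike the randomized search, every combination in the grid is evaluated, so the cost grows with the product of the list lengths.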


In [91]:
gs_model.score(X_train,y_train)

0.8550474547023296

In [92]:
gs_model.score(X_test,y_test)

0.7872340425531915

In [94]:
gs_model.best_params_

{'max_depth': 9, 'n_estimators': 250}

GridSearchCV score : `0.7872340425531915`

Best Params
- `n_estimators` : (90 - 110)
- `max_depth` : 9
- `min_samples_leaf` : 7,14
- `min_samples_split` : 12

#### I tested more hyperparameters

In [376]:
param_grids = {
    "n_estimators":[90,100,110],
    "min_samples_split":np.arange(5,14,1),
    "min_samples_leaf":np.arange(5,14,1)
}

In [379]:
from sklearn.ensemble import GradientBoostingClassifier
gs_model1 = GridSearchCV(GradientBoostingClassifier(),param_grids,cv=5,verbose=True)
gs_model1.fit(X_train,y_train)

Fitting 5 folds for each of 243 candidates, totalling 1215 fits


In [380]:
gs_model1.best_params_

{'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 90}

In [381]:
gs_model1.score(X_train,y_train)

0.8189531205061835

In [383]:
gs_model1.score(X_test,y_test)

0.7901092581943646

In [384]:
y_preds = gs_model1.predict(X_test)

In [385]:
accuracy_score(y_test,y_preds)

0.7901092581943646

The refined grid search gives a slightly better score:

- Base score : `0.7860839562967222`
- RandomizedSearchCV score : `0.7872340425531915`
- GridSearchCV score : `0.7872340425531915`
- GridSearchCV score (refined grid) : `0.7901092581943646`

In conclusion, the refined `GridSearchCV` with hyperparameters `{'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 90}` gives the best validation accuracy: `0.7901092581943646`.

# Aftermath (Submission Process)

In [226]:
# for i in range(len(X_test["PassengerId"])):
#     val = X_test["PassengerId"][i]
#     val = str(val)
#     X_test["PassengerId"][i] = val[:-2] + "_" + val[-2:]

In [218]:
list(X_test["PassengerId"])[0]

33702

In [217]:
def extractPID(df):
    """Rebuild the original gggg_pp string Ids from the integer PassengerId column."""
    pid = []
    for val in df["PassengerId"]:
        # Restore the leading zeros that were lost in the integer conversion
        val = str(val).zfill(6)
        pid.append(val[:-2] + "_" + val[-2:])
    return pid
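
The same round trip can be written without Python loops using pandas string methods. This is a sketch, assuming every Id follows the `gggg_pp` pattern (four-digit group, two-digit number); the sample Ids and the names `original`, `as_int`, `padded` and `restored` are illustrative.

```python
import pandas as pd

original = pd.Series(["0001_01", "0337_02", "9280_01"])

# Forward: drop the underscore and cast to integer (leading zeros are lost)
as_int = original.str.replace("_", "", regex=False).astype("int64")

# Backward: pad to six digits, then re-insert the underscore before the last two
padded = as_int.astype(str).str.zfill(6)
restored = padded.str[:4] + "_" + padded.str[4:]

print(as_int.tolist())
print(restored.tolist())
```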

In [135]:
pid[:5]

['337_02', '2891_01', '8998_01', '1771_01', '9034_02']

In [136]:
y_preds

array([0, 1, 1, ..., 0, 1, 0], dtype=int64)

In [137]:
y_preds = list(y_preds)

In [412]:
# Map the 0/1 predictions to False/True
y_preds = [bool(p == 1) for p in y_preds]

In [147]:
final_result = pd.DataFrame({"PassengerId":pid,"Transported":y_preds})

In [148]:
final_result

Unnamed: 0,PassengerId,Transported
0,337_02,False
1,2891_01,True
2,8998_01,True
3,1771_01,True
4,9034_02,True
...,...,...
1734,7656_01,True
1735,3437_02,True
1736,1384_01,False
1737,6300_01,True


In [150]:
final_result.to_csv("predictions.csv",index=False)
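
Before writing a submission file it can help to check its format against `sample_submission.csv`. This is a minimal sketch, assuming hypothetical Ids and predictions (`pid_demo`, `preds_demo`) in place of `pid` and `y_preds`; it also shows the 0/1 to boolean conversion done in one step with `astype(bool)`.

```python
import numpy as np
import pandas as pd

# Hypothetical Ids and 0/1 predictions standing in for `pid` and `y_preds`
pid_demo = ["0337_02", "2891_01", "8998_01"]
preds_demo = np.array([0, 1, 1])

submission = pd.DataFrame({
    "PassengerId": pid_demo,
    "Transported": preds_demo.astype(bool),  # 0/1 -> False/True in one step
})

# Basic format checks before writing the file
assert list(submission.columns) == ["PassengerId", "Transported"]
assert submission["Transported"].dtype == bool
assert submission["PassengerId"].is_unique

print(submission.to_csv(index=False))
```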

In [151]:
final_result.shape

(1739, 2)

In [152]:
X_test.shape


(1739, 12)

In [386]:
df = pd.read_csv("data/test.csv")

In [387]:
df = fillMissingValues(df)

RoomService Median :  0.0
FoodCourt Median :  0.0
ShoppingMall Median :  0.0
Spa Median :  0.0
VRDeck Median :  0.0


In [388]:
to_category(df)

HomePlanet
CryoSleep
Cabin
Destination
VIP


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,1,2,2785,3,27.0,1,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,1,1,1868,3,19.0,1,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,2,2,258,1,31.0,1,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,2,1,260,3,38.0,1,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,1,1,1941,3,20.0,1,10.0,0.0,635.0,0.0,0.0,Brence Harperez
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4272,9266_02,1,2,2680,3,34.0,1,0.0,0.0,0.0,0.0,0.0,Jeron Peter
4273,9269_01,1,1,0,3,42.0,1,0.0,847.0,17.0,10.0,144.0,Matty Scheron
4274,9271_01,3,2,603,1,,1,0.0,0.0,0.0,0.0,0.0,Jayrin Pore
4275,9273_01,2,1,604,0,,1,0.0,2680.0,0.0,0.0,523.0,Kitakan Conale
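
Calling `to_category` separately on the training and test frames builds the category list from whichever frame is passed in, so the same string can receive different codes in each frame (this matters most for `Cabin`, which has thousands of distinct values). One way to keep the codes aligned, sketched here with made-up values, is to take the categories from the training column and reuse them for the test column. The names `train_col`, `test_col`, `independent` and `shared` are hypothetical.

```python
import pandas as pd

train_col = pd.Series(["Europa", "Earth", "Mars", None])
test_col = pd.Series(["Mars", "Europa", None, "Mars"])  # "Earth" never appears here

# Encoding the test column on its own: categories come from the test values only
independent = pd.Categorical(test_col).codes + 1

# Encoding with the categories learned from the training column
train_categories = pd.Categorical(train_col).categories
shared = pd.Categorical(test_col, categories=train_categories).codes + 1

print(list(independent))  # codes based on the test column's own categories
print(list(shared))       # codes consistent with the training encoding
```

With shared categories, a value that never occurred in the training column gets code `-1`, which becomes `0` after the shift, the same as a missing value.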


In [390]:
df = setPID(df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp["PassengerId"][i] = int(df_tmp["PassengerId"][i].replace("_",""))
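
The warning above is triggered by the chained indexing `df_tmp["PassengerId"][i] = ...` inside `setPID`. A minimal sketch of a vectorized alternative follows, assuming `setPID` only strips the underscore and casts the result to an integer (its definition is not shown here). The name `set_pid_vectorized` and the `demo` frame are hypothetical, not part of the notebook.

```python
import pandas as pd


def set_pid_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where PassengerId such as '0013_01' becomes the integer 1301."""
    out = df.copy()  # work on a copy so the caller's frame is left untouched
    out["PassengerId"] = (
        out["PassengerId"]
        .str.replace("_", "", regex=False)  # '0013_01' -> '001301'
        .astype(int)                        # '001301'  -> 1301
    )
    return out


# Toy data using two IDs that appear in the tables above.
demo = pd.DataFrame({"PassengerId": ["0013_01", "9266_02"]})
converted = set_pid_vectorized(demo)
print(converted["PassengerId"].tolist())  # [1301, 926602]
```

Assigning the whole column at once avoids the per-row loop and writes to the frame directly rather than to a temporary slice.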


In [391]:
df

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,1301,1,2,2785,3,27.0,1,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,1801,1,1,1868,3,19.0,1,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,1901,2,2,258,1,31.0,1,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,2101,2,1,260,3,38.0,1,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,2301,1,1,1941,3,20.0,1,10.0,0.0,635.0,0.0,0.0,Brence Harperez
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4272,926602,1,2,2680,3,34.0,1,0.0,0.0,0.0,0.0,0.0,Jeron Peter
4273,926901,1,1,0,3,42.0,1,0.0,847.0,17.0,10.0,144.0,Matty Scheron
4274,927101,3,2,603,1,,1,0.0,0.0,0.0,0.0,0.0,Jayrin Pore
4275,927301,2,1,604,0,,1,0.0,2680.0,0.0,0.0,523.0,Kitakan Conale


In [392]:
df.drop("Name",axis=1,inplace=True)

In [393]:
df

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,1301,1,2,2785,3,27.0,1,0.0,0.0,0.0,0.0,0.0
1,1801,1,1,1868,3,19.0,1,0.0,9.0,0.0,2823.0,0.0
2,1901,2,2,258,1,31.0,1,0.0,0.0,0.0,0.0,0.0
3,2101,2,1,260,3,38.0,1,0.0,6652.0,0.0,181.0,585.0
4,2301,1,1,1941,3,20.0,1,10.0,0.0,635.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
4272,926602,1,2,2680,3,34.0,1,0.0,0.0,0.0,0.0,0.0
4273,926901,1,1,0,3,42.0,1,0.0,847.0,17.0,10.0,144.0
4274,927101,3,2,603,1,,1,0.0,0.0,0.0,0.0,0.0
4275,927301,2,1,604,0,,1,0.0,2680.0,0.0,0.0,523.0


In [394]:
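# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not update df.
df["Age"] = df["Age"].fillna(df["Age"].median())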

In [395]:
df.isna().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64
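
The `Age` fill above uses the median of the test file itself. One common alternative, sketched here on toy data, is to compute the fill value on the training rows and reuse it for the test rows so that both sets are preprocessed with the same statistic. `train_df` and `test_df` are hypothetical stand-ins, not the notebook's frames.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the training and test data.
train_df = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test_df = pd.DataFrame({"Age": [np.nan, 25.0]})

# The median is computed on the training rows only (NaN is skipped by default).
age_median = train_df["Age"].median()

# Assign back to the column rather than using inplace=True on a selection.
train_df["Age"] = train_df["Age"].fillna(age_median)
test_df["Age"] = test_df["Age"].fillna(age_median)

print(age_median)               # 30.0
print(test_df["Age"].tolist())  # [30.0, 25.0]
```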

In [396]:
y_preds = gs_model1.predict(df)

In [409]:
y_preds = gbc_clf.predict(df)

In [410]:
y_preds = list(y_preds)

In [411]:
y_preds

[1,
 0,
 1,
 1,
 1,
 ...]


In [399]:
pid = extractPID(df)
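
`extractPID` is defined elsewhere in the notebook. The sketch below shows one way the `gggg_pp` form could be recovered from the integer IDs produced earlier, assuming the group is always four digits and the passenger number two digits, as the data description states. `restore_pid` is a hypothetical name and may differ from what `extractPID` actually does.

```python
import pandas as pd


def restore_pid(ids: pd.Series) -> pd.Series:
    """Turn integer IDs such as 1301 back into strings such as '0013_01'."""
    padded = ids.astype(str).str.zfill(6)         # 1301 -> '001301'
    return padded.str[:4] + "_" + padded.str[4:]  # '001301' -> '0013_01'


# Two integer IDs taken from the converted table above.
demo_ids = pd.Series([1301, 926602])
restored = restore_pid(demo_ids)
print(restored.tolist())  # ['0013_01', '9266_02']
```

Zero-padding is what restores the leading zeros lost in the integer conversion.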

In [413]:
final_result = pd.DataFrame({"PassengerId":pid,"Transported":y_preds})

In [414]:
final_result

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True
...,...,...
4272,9266_02,True
4273,9269_01,True
4274,9271_01,True
4275,9273_01,True


In [415]:
final_result.to_csv("data/predictions.csv",index=False)
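
The predictions printed earlier are 0/1 integers, while the submission table shows `Transported` as `True`/`False`. A minimal sketch of making that conversion explicit and running two sanity checks before writing the file is given below. `pid_demo` and `preds_demo` are hypothetical toy values, and the CSV is rendered to a string here instead of being written to disk.

```python
import pandas as pd

# Hypothetical toy inputs: three IDs and their 0/1 predictions.
pid_demo = ["0013_01", "0018_01", "0019_01"]
preds_demo = [1, 0, 1]

submission = pd.DataFrame({
    "PassengerId": pid_demo,
    "Transported": pd.Series(preds_demo).astype(bool),  # 1 -> True, 0 -> False
})

# Sanity checks: one row per passenger and no missing values.
assert submission["PassengerId"].is_unique
assert not submission.isna().any().any()

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```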

In [365]:
from sklearn.ensemble import GradientBoostingClassifier

gbc_clf = GradientBoostingClassifier(random_state=42)

In [367]:
gbc_clf.fit(X_train,y_train)

In [368]:
gbc_clf.score(X_train,y_train)

0.819672131147541

In [408]:
gbc_clf.score(X_test,y_test)

0.7918343875790684

In [407]:
gs_model1.score(X_test,y_test)

0.7901092581943646
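
The two held-out scores above differ by less than 0.002, which a single train/test split cannot reliably distinguish. A hedged sketch of comparing candidates with k-fold cross-validation follows. It runs on synthetic data from `make_classification`, and the random forest is only a stand-in for a second model, since the definition of `gs_model1` is not shown in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 12 features, matching the width of X_test above.
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=42)

# Hypothetical candidates; the second model is an assumption, not gs_model1.
candidates = {
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Five-fold accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Looking at the spread across folds alongside the mean gives a rough sense of whether a gap between two models is larger than fold-to-fold noise.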