# Introduction

Welcome to this comprehensive guide on **binary classification** with the **Spaceship Titanic** dataset. The objective is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with a spacetime anomaly.

*We will cover:*
* Data Exploration and Preparation
* Encoding, Scaling and Preprocessing
* Training Machine Learning Models
* Cross Validation and Ensembling Predictions

# Data Exploration and Preparation

In [None]:
import numpy as np 
import pandas as pd 


Load the datasets used to train and test the model, using PassengerId as the index.

In [None]:

train = pd.read_csv('../input/spaceship-titanic/train.csv', index_col='PassengerId')
test = pd.read_csv('../input/spaceship-titanic/test.csv', index_col='PassengerId')

# Shape and preview
print('Train set shape:', train.shape)
print('Test set shape:', test.shape)
train.head()

Train set shape: (8693, 13)
Test set shape: (4277, 12)


| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


*Feature descriptions:*
> * **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
> * **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
> * **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
> * **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
> * **Destination** - The planet the passenger will be debarking to.
> * **Age** - The age of the passenger.
> * **VIP** - Whether the passenger has paid for special VIP service during the voyage.
> * **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
> * **Name** - The first and last names of the passenger.
> * **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
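
Because PassengerId has the form gggg_pp, the group a passenger travels with can be recovered with a string split. The following is a minimal sketch on a toy frame, assuming the IDs sit in the index as they do in this notebook. The column names `Group`, `NumInGroup` and `GroupSize` are illustrative and are not used elsewhere in this guide.

```python
import pandas as pd

# Toy frame indexed by PassengerId, using IDs from the preview above
df = pd.DataFrame(
    {"HomePlanet": ["Europa", "Earth", "Europa", "Europa"]},
    index=pd.Index(["0001_01", "0002_01", "0003_01", "0003_02"], name="PassengerId"),
)

# Split 'gggg_pp' into the group id and the passenger's number within the group
parts = df.index.to_series().str.split("_", expand=True)
df["Group"] = parts[0]
df["NumInGroup"] = parts[1].astype(int)

# Group size: how many passengers share the same group id
df["GroupSize"] = df.groupby("Group")["Group"].transform("count")
print(df)
```

Whether such group features help the models in this guide is untested here; the sketch only shows how the ID could be decomposed.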

Drop the "Name" column from both the train and test datasets.

In [None]:
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)

In the Transported column, replace False with 0 and True with 1.

In [None]:
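# Cast the boolean target to integers (False -> 0, True -> 1);
# this avoids in-place replace on a column selection, which newer pandas versions warn about
train['Transported'] = train['Transported'].astype(int)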

Split the Cabin column into three columns: deck, num and side.

In [None]:
train[['deck','num', 'side']] = train['Cabin'].str.split('/', expand=True)
test[['deck','num', 'side']] = test['Cabin'].str.split('/', expand=True)

train.drop('Cabin', axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True)
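
`str.split('/', expand=True)` returns one column per piece of the cabin string. The sketch below uses a toy Series rather than the real data and suggests what happens to rows where Cabin is missing: all three new columns become missing too, which would be consistent with `deck`, `num` and `side` sharing the same missing count in the `info()` output later on.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["B/0/P", "F/1/S", np.nan], name="Cabin")

# expand=True returns one column per piece; a missing Cabin gives NaN in every piece
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]
print(parts)

# The pieces are strings, including the cabin number
print(type(parts.loc[0, "num"]))
```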

Create a SumSpends column holding the total of ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'].

In [None]:
col_to_sum = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

train['SumSpends'] = train[col_to_sum].sum(axis=1)
test['SumSpends'] = test[col_to_sum].sum(axis=1)
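
`DataFrame.sum(axis=1)` skips missing values by default, which is why SumSpends has no missing values even though the individual spending columns do. A small sketch on made-up numbers shows this, together with `min_count=1` as one possible way to keep the total missing when every amount is missing. The notebook itself uses the default behaviour.

```python
import numpy as np
import pandas as pd

spend = pd.DataFrame({
    "RoomService": [109.0, np.nan, np.nan],
    "Spa": [549.0, 10.0, np.nan],
})

# Default skipna=True: NaN is ignored, so the row sum is never missing
total = spend.sum(axis=1)
print(total.tolist())                  # [658.0, 10.0, 0.0]

# min_count=1 requires at least one non-missing value, otherwise the sum is NaN
total_strict = spend.sum(axis=1, min_count=1)
print(total_strict.isna().tolist())    # [False, False, True]
```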

In [None]:
train

| PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | deck | num | side | SumSpends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001_01 | Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | B | 0 | P | 0.0 |
| 0002_01 | Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 | F | 0 | S | 736.0 |
| 0003_01 | Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 | A | 0 | S | 10383.0 |
| 0003_02 | Europa | False | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 0 | A | 0 | S | 5176.0 |
| 0004_01 | Earth | False | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 1 | F | 1 | S | 1091.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9276_01 | Europa | False | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | 0 | A | 98 | P | 8536.0 |
| 9278_01 | Earth | True | PSO J318.5-22 | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | G | 1499 | S | 0.0 |
| 9279_01 | Earth | False | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | 1 | G | 1500 | S | 1873.0 |
| 9280_01 | Europa | False | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | 0 | E | 608 | S | 4637.0 |


In [None]:
test

| PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | deck | num | side | SumSpends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0013_01 | Earth | True | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | G | 3 | S | 0.0 |
| 0018_01 | Earth | False | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | F | 4 | S | 2832.0 |
| 0019_01 | Europa | True | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | C | 0 | S | 0.0 |
| 0021_01 | Europa | False | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | C | 1 | S | 7418.0 |
| 0023_01 | Earth | False | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | F | 5 | S | 645.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9266_02 | Earth | True | TRAPPIST-1e | 34.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | G | 1496 | S | 0.0 |
| 9269_01 | Earth | False | TRAPPIST-1e | 42.0 | False | 0.0 | 847.0 | 17.0 | 10.0 | 144.0 | NaN | NaN | NaN | 1018.0 |
| 9271_01 | Mars | True | 55 Cancri e | NaN | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | D | 296 | P | 0.0 |
| 9273_01 | Europa | False | NaN | NaN | False | 0.0 | 2680.0 | 0.0 | 0.0 | 523.0 | D | 297 | P | 3203.0 |


Print the data type of every column in the dataset. There are three dtypes: float64 (7 columns), int64 (1 column) and object (7 columns).

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8693 entries, 0001_01 to 9280_02
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8492 non-null   object 
 1   CryoSleep     8476 non-null   object 
 2   Destination   8511 non-null   object 
 3   Age           8514 non-null   float64
 4   VIP           8490 non-null   object 
 5   RoomService   8512 non-null   float64
 6   FoodCourt     8510 non-null   float64
 7   ShoppingMall  8485 non-null   float64
 8   Spa           8510 non-null   float64
 9   VRDeck        8505 non-null   float64
 10  Transported   8693 non-null   int64  
 11  deck          8494 non-null   object 
 12  num           8494 non-null   object 
 13  side          8494 non-null   object 
 14  SumSpends     8693 non-null   float64
dtypes: float64(7), int64(1), object(7)
memory usage: 1.1+ MB


Use the following command to check for missing values. Both datasets contain them, as shown below.

In [None]:
train.isna().sum()

HomePlanet      201
CryoSleep       217
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Transported       0
deck            199
num             199
side            199
SumSpends         0
dtype: int64

In [None]:
test.isna().sum()

HomePlanet       87
CryoSleep        93
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
deck            100
num             100
side            100
SumSpends         0
dtype: int64

In [None]:
null_cols = train.isnull().sum().sort_values(ascending=False)
# Keep every column that has at least one missing value
null_cols = list(null_cols[null_cols > 0].index)
null_cols

['CryoSleep',
 'ShoppingMall',
 'VIP',
 'HomePlanet',
 'deck',
 'num',
 'side',
 'VRDeck',
 'FoodCourt',
 'Spa',
 'Destination',
 'RoomService',
 'Age']

# Cleaning, Encoding, Scaling and Preprocessing

`object_cols` holds the columns of dtype object or category, and `numeric_cols` holds the columns of dtype float64.

In [None]:
object_cols = [col for col in train.columns if train[col].dtype == 'object' or train[col].dtype == 'category']
numeric_cols = [col for col in train.columns if train[col].dtype == 'float64']

print(f'Object cols -- {object_cols}')
print(f'Numeric cols -- {numeric_cols}')

Object cols -- ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'deck', 'num', 'side']
Numeric cols -- ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'SumSpends']
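
The same grouping by dtype can also be written with `DataFrame.select_dtypes`. This is only an alternative sketch on a toy frame, not what the notebook runs; the frame `df` and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "HomePlanet": pd.Series(["Europa", "Earth"], dtype="object"),
    "Age": [39.0, 24.0],
    "Transported": [0, 1],
})

# Same selection as the list comprehensions above, expressed with select_dtypes
object_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
numeric_cols = df.select_dtypes(include=["float64"]).columns.tolist()

print(object_cols)    # ['HomePlanet']
print(numeric_cols)   # ['Age']
```

As in the notebook, the integer target column falls into neither list.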


Convert the `object_cols` columns of train and test to the category dtype.

In [None]:
train[object_cols] = train[object_cols].astype('category')
test[object_cols] = test[object_cols].astype('category')

Encode the categorical values, converting them from objects or strings to numeric (float) codes.

In [None]:
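from sklearn.preprocessing import OrdinalEncoder

oc = OrdinalEncoder()

# Encode train and test together so both share one category-to-code mapping
n_train = len(train)
df_for_encode = pd.concat([train, test])

df_for_encode[object_cols] = df_for_encode[object_cols].astype('category')

df_for_encode[object_cols] = oc.fit_transform(df_for_encode[object_cols])

del train, test

# Split back by row count; .copy() avoids SettingWithCopyWarning on later assignments
train = df_for_encode.iloc[:n_train, :].copy()
test = df_for_encode.iloc[n_train:, :].copy()

del df_for_encode

# The concat gave the test rows a Transported column filled with NaN
test.drop('Transported', inplace=True, axis=1)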

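The Transported column is dropped from the test set because it is the value we want to predict; it only appeared there, filled with NaN, as a side effect of concatenating train and test.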

In [None]:
test

| PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | deck | num | side | SumSpends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0013_01 | 0.0 | 1.0 | 2.0 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1117.0 | 1.0 | 0.0 |
| 0018_01 | 0.0 | 0.0 | 2.0 | 19.0 | 0.0 | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | 5.0 | 1228.0 | 1.0 | 2832.0 |
| 0019_01 | 1.0 | 1.0 | 0.0 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 |
| 0021_01 | 1.0 | 0.0 | 2.0 | 38.0 | 0.0 | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | 2.0 | 1.0 | 1.0 | 7418.0 |
| 0023_01 | 0.0 | 0.0 | 2.0 | 20.0 | 0.0 | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | 5.0 | 1339.0 | 1.0 | 645.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9266_02 | 0.0 | 1.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 553.0 | 1.0 | 0.0 |
| 9269_01 | 0.0 | 0.0 | 2.0 | 42.0 | 0.0 | 0.0 | 847.0 | 17.0 | 10.0 | 144.0 | NaN | NaN | NaN | 1018.0 |
| 9271_01 | 2.0 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1113.0 | 0.0 | 0.0 |
| 9273_01 | 1.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 2680.0 | 0.0 | 0.0 | 523.0 | 3.0 | 1114.0 | 0.0 | 3203.0 |
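
In the encoded output, cabin number 3 (passenger 0013_01) has become 1117.0. A likely explanation is that `num` was still a string column when it was encoded, and string categories are ordered lexicographically, so '10' sorts before '2'. The sketch below illustrates that ordering with pandas category codes on a handful of made-up values. Converting the column with `pd.to_numeric` before modeling is shown as one possible alternative; it is not what this notebook does.

```python
import pandas as pd

num = pd.Series(["0", "1", "2", "3", "10", "11", "100"])

# Codes follow the sorted order of the strings: '0' < '1' < '10' < '100' < '11' < '2' < '3'
codes = num.astype("category").cat.codes
print(codes.tolist())                   # [0, 1, 5, 6, 2, 4, 3]

# Converting to numbers first keeps the natural order of the cabin numbers
as_numbers = pd.to_numeric(num)
print(as_numbers.tolist())              # [0, 1, 2, 3, 10, 11, 100]
```

Tree-based models can still split on the lexicographic codes, but the codes no longer reflect how close two cabin numbers are to each other.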


Replace each missing value with the mean of its column.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Mean-impute every column that has missing values
ct = ColumnTransformer([("imp", SimpleImputer(strategy='mean'), null_cols)])

# Learn the column means on train only, then reuse them for test
train[null_cols] = ct.fit_transform(train[null_cols])
test[null_cols] = ct.transform(test[null_cols])
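
Imputation statistics are usually learned from the training rows and then reused unchanged on the test rows, so that the test set does not influence preprocessing. A minimal sketch of that pattern with `SimpleImputer` on toy data follows; the frames `train_df` and `test_df` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_df = pd.DataFrame({"Age": [20.0, 40.0, np.nan]})
test_df = pd.DataFrame({"Age": [np.nan, 90.0]})

imp = SimpleImputer(strategy="mean")

# fit_transform learns the mean from the training rows only (here 30.0)
train_filled = imp.fit_transform(train_df)

# transform reuses that stored mean; the test rows do not change it
test_filled = imp.transform(test_df)

print(imp.statistics_)         # [30.]
print(test_filled.ravel())     # [30. 90.]
```

Note that for the ordinal-encoded categorical columns a mean can fall between two category codes; a strategy such as `most_frequent` may be worth considering for those columns.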

# Preparing dataset for modeling

Define the variables used for training: X is the train dataset without the target and y is Transported. Then split the train dataset once more into 75% for training and 25% for evaluation (the `train_test_split` default, since `test_size` is not set).

In [None]:
X = train.copy()
y = X.pop('Transported')

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)
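
`train_test_split` is called above without `test_size`, so scikit-learn's default of 0.25 applies. The sketch below uses a stand-in array with the same number of rows as the train set to show the resulting sizes, and how an explicit 70/30 split could be requested instead. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 8693 training rows
rows = np.arange(8693)

# No test_size given, so the default of 0.25 is used
rows_train, rows_test = train_test_split(rows, random_state=23)
print(len(rows_train), len(rows_test))        # 6519 2174

# An explicit 70/30 split, if that is what is wanted
rows_train70, rows_test30 = train_test_split(rows, test_size=0.3, random_state=23)
print(len(rows_train70), len(rows_test30))    # 6085 2608
```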

Install catboost, which is needed to train the CatBoostClassifier model.

In [None]:
!pip3 install catboost


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1-cp37-none-manylinux1_x86_64.whl (76.8 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1


# Modeling, Cross Validation and Ensembling Predictions: 7 Models



1.  RandomForestClassifier
2.  AdaBoostClassifier
3.  LGBMClassifier
4.  XGBClassifier
5.  CatBoostClassifier
6.  KNeighborsClassifier
7.  DecisionTreeClassifier




*  Import all the models that will be used for training.
*  The helper function `predict_and_acc` makes it easier to train each model by combining `fit` and `cross_val_score` in a single call.




In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier

def predict_and_acc(model, verbose=None):
    """Fit a model class, then print its hold-out accuracy and 4-fold CV scores."""
    # Pass `verbose` only when it is given (used to silence CatBoost's training log)
    model = model() if verbose is None else model(verbose=verbose)

    # Accuracy on the hold-out part of the train/test split
    model.fit(X_train, y_train)
    predict = model.predict(X_test)

    # cross_val_score refits a clone of the model on each of 4 folds of the full X, y
    cvs = cross_val_score(model, X, y, cv=4)
    print(f'The accuracy score of {str(model)} is {float(accuracy_score(y_test, predict))}')
    print(f'The cross validation of {str(model)} is:{cvs} with mean of {cvs.mean()}')



In [None]:
predict_and_acc(RandomForestClassifier, None)


The accuracy score of RandomForestClassifier() is 0.796688132474701
The cross validation of RandomForestClassifier() is:[0.76310948 0.76760239 0.79889554 0.79153244] with mean of 0.7802849620943832


In [None]:
predict_and_acc(AdaBoostClassifier)


The accuracy score of AdaBoostClassifier() is 0.7980680772769089
The cross validation of AdaBoostClassifier() is:[0.74931003 0.78416935 0.79337322 0.80901979] with mean of 0.7839680959471239


In [None]:
predict_and_acc(LGBMClassifier)

The accuracy score of LGBMClassifier() is 0.8031278748850046
The cross validation of LGBMClassifier() is:[0.75620975 0.76944317 0.80625863 0.79291302] with mean of 0.7812061424583974


In [None]:
predict_and_acc(CatBoostClassifier, verbose=False)

The accuracy score of <catboost.core.CatBoostClassifier object at 0x7fcab70137d0> is 0.8054277828886844
The cross validation of <catboost.core.CatBoostClassifier object at 0x7fcab70137d0> is:[0.76310948 0.78324896 0.82144501 0.7947538 ] with mean of 0.7906393109208905


In [None]:
from sklearn.neighbors import KNeighborsClassifier
predict_and_acc(KNeighborsClassifier)


The accuracy score of KNeighborsClassifier() is 0.764949402023919
The cross validation of KNeighborsClassifier() is:[0.72033119 0.72112287 0.77956742 0.76806259] with mean of 0.7472710157401343


In [None]:
from sklearn.tree import DecisionTreeClassifier
predict_and_acc(DecisionTreeClassifier)


The accuracy score of DecisionTreeClassifier() is 0.7433302667893285
The cross validation of DecisionTreeClassifier() is:[0.71297148 0.7257248  0.71790152 0.73722964] with mean of 0.7234568601609365
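
The recorded cross-validation outputs above can be put side by side to compare the models. The sketch below copies the fold scores from those outputs into a dictionary (the name `cv_scores` is illustrative) and ranks the models by mean accuracy, with the spread across folds as a rough indication of stability.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation outputs recorded above
cv_scores = {
    "RandomForestClassifier": [0.76310948, 0.76760239, 0.79889554, 0.79153244],
    "AdaBoostClassifier": [0.74931003, 0.78416935, 0.79337322, 0.80901979],
    "LGBMClassifier": [0.75620975, 0.76944317, 0.80625863, 0.79291302],
    "CatBoostClassifier": [0.76310948, 0.78324896, 0.82144501, 0.7947538],
    "KNeighborsClassifier": [0.72033119, 0.72112287, 0.77956742, 0.76806259],
    "DecisionTreeClassifier": [0.71297148, 0.7257248, 0.71790152, 0.73722964],
}

# Mean and standard deviation across the 4 folds, best mean first
summary = sorted(
    ((name, mean(s), pstdev(s)) for name, s in cv_scores.items()),
    key=lambda row: row[1],
    reverse=True,
)
for name, avg, spread in summary:
    print(f"{name:<24} mean={avg:.4f} std={spread:.4f}")
```

The differences between the top models are small relative to the fold-to-fold spread, so the ranking should be read with some caution.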


Apply backward feature selection with CatBoostClassifier, since it achieved the best scores among the models above (hold-out accuracy 0.805, mean cross-validation accuracy 0.791).

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

model_fs = CatBoostClassifier(verbose=False)
sf = SequentialFeatureSelector(model_fs, scoring='accuracy', direction = 'backward')
sf.fit(X,y)

SequentialFeatureSelector(direction='backward',
                          estimator=<catboost.core.CatBoostClassifier object at 0x7fcab33c7690>,
                          scoring='accuracy')

List the features the selector kept, that is, the ones that contribute most to the model (not every feature is needed for prediction).

In [None]:
best_features = list(sf.get_feature_names_out())
best_features

['CryoSleep', 'RoomService', 'Spa', 'VRDeck', 'deck', 'side', 'SumSpends']
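
Backward selection starts from all features and removes one per round, each time dropping the feature whose removal costs the least cross-validated score. The call above does not set `n_features_to_select`, and keeping 7 of the 14 features is consistent with scikit-learn's default of selecting half. The sketch below shows the same procedure on synthetic data with a small decision tree standing in for CatBoost so that it runs quickly. The data, the estimator and the explicit `n_features_to_select=3` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 6 features, of which 3 are informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Backward selection: start from all 6 features and drop one per round,
# removing the feature whose absence hurts the cross-validated accuracy least
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=3,
    direction="backward",
    scoring="accuracy",
    cv=3,
)
selector.fit(X_demo, y_demo)

print(selector.get_support())             # boolean mask over the 6 columns
print(selector.transform(X_demo).shape)   # (200, 3)
```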

Retrain the model on the `best_features` columns of X (the full train set).

In [None]:
model = CatBoostClassifier(verbose=False, eval_metric='Accuracy')
model.fit(X[best_features], y)
prediction = model.predict(test[best_features])

# Ensembling Predictions

Inspect the model's predictions and convert them back to booleans: 0 becomes False and 1 becomes True.


In [None]:
final = pd.DataFrame(index=test.index)
# Map the numeric predictions back to booleans (1 -> True, 0 -> False)
final['Transported'] = (prediction == 1)
final

| PassengerId | Transported |
|---|---|
| 0013_01 | True |
| 0018_01 | False |
| 0019_01 | True |
| 0021_01 | True |
| 0023_01 | True |
| ... | ... |
| 9266_02 | True |
| 9269_01 | False |
| 9271_01 | True |
| 9273_01 | True |


Save the model's predictions to a file for submission.

In [None]:
final.to_csv('submission.csv')
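
Before submitting, it can help to confirm the file layout: a `PassengerId` column followed by a boolean `Transported` column, which is the layout the `final` frame above produces. The sketch below round-trips a toy frame through an in-memory buffer instead of the real `submission.csv`; the frame `final_demo` is made up for illustration.

```python
import io
import pandas as pd

# Toy predictions indexed by PassengerId, mirroring the `final` frame above
final_demo = pd.DataFrame(
    {"Transported": [True, False]},
    index=pd.Index(["0013_01", "0018_01"], name="PassengerId"),
)

# Write to an in-memory buffer instead of disk for this sketch
buffer = io.StringIO()
final_demo.to_csv(buffer)
buffer.seek(0)

# Read PassengerId as text so the leading zeros survive the round trip
check = pd.read_csv(buffer, dtype={"PassengerId": str})
print(check.columns.tolist())          # ['PassengerId', 'Transported']
print(check["Transported"].tolist())   # [True, False]
```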