# Spaceship Titanic classification
A simple classification example built on a Kaggle data set.

There was a disaster on board the spaceship.  
Some passengers were transported to another dimension.  
The model predicts which passengers were transported.  


Link to the data set: https://www.kaggle.com/c/spaceship-titanic


In [108]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [109]:
#importing test and train sets (previously split)
df_train = pd.read_csv(r"data\spaceship titanic\train.csv")
df_test = pd.read_csv(r"data\spaceship titanic\test.csv")
df_submission = pd.read_csv(r"data\spaceship titanic\sample_submission.csv")
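The raw strings above use backslashes, which only resolve on Windows. A minimal sketch of a portable alternative with `pathlib`, assuming the same folder layout (the names `DATA_DIR`, `train_path`, `test_path` and `submission_path` are hypothetical, introduced only for illustration):

```python
from pathlib import Path

# Hypothetical constant: same folder names as in the cell above.
DATA_DIR = Path("data") / "spaceship titanic"

train_path = DATA_DIR / "train.csv"
test_path = DATA_DIR / "test.csv"
submission_path = DATA_DIR / "sample_submission.csv"

# pd.read_csv accepts Path objects directly, e.g. pd.read_csv(train_path).
print(train_path.as_posix())
```

The `/` operator joins path components with the separator of the current operating system, so the same cell would run unchanged on Linux or macOS.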

In [110]:
df_train

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


# analyzing data

In [111]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [112]:
df_train.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
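Raw counts are easier to judge as a share of all rows. In the output above the worst column (CryoSleep, 217 of 8693) is missing about 2.5% of its values. A small sketch of the same check on a toy frame (the frame `toy` is made up for illustration, it is not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [False, True, False, True],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share.
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share)
```

With shares this small in the real data, simple imputation is a reasonable first choice over dropping rows.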

In [113]:
#checking categorical data for typos that would create spurious categories and for other unexpected values
#(the first selected column, PassengerId, is skipped because every value is unique)
def check_categorical(df):
    for col in df.columns[1:]:
        print(df[col].value_counts(), "\n")
           

In [114]:
check_categorical(df_train[list(df_train.select_dtypes(include=['bool', "object"]).columns)])

Earth     4602
Europa    2131
Mars      1759
Name: HomePlanet, dtype: int64 

False    5439
True     3037
Name: CryoSleep, dtype: int64 

G/734/S     8
G/109/P     7
B/201/P     7
G/1368/P    7
G/981/S     7
           ..
G/556/P     1
E/231/S     1
G/545/S     1
G/543/S     1
F/947/P     1
Name: Cabin, Length: 6560, dtype: int64 

TRAPPIST-1e      5915
55 Cancri e      1800
PSO J318.5-22     796
Name: Destination, dtype: int64 

False    8291
True      199
Name: VIP, dtype: int64 

Gollux Reedall        2
Elaney Webstephrey    2
Grake Porki           2
Sus Coolez            2
Apix Wala             2
                     ..
Jamela Griffy         1
Hardy Griffy          1
Salley Mckinn         1
Mall Frasp            1
Propsh Hontichre      1
Name: Name, Length: 8473, dtype: int64 

True     4378
False    4315
Name: Transported, dtype: int64 
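The target is almost perfectly balanced, so accuracy is a sensible metric here. As a reference point, the accuracy of always predicting the majority class can be worked out from the counts above (plain arithmetic, nothing is assumed beyond those two numbers):

```python
# Class counts copied from the value_counts() output above.
counts = {True: 4378, False: 4315}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total
print(f"{majority_class} {baseline_accuracy:.3f}")
```

Any model worth keeping should score clearly above this roughly 50% baseline.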



# data preprocessing

### splitting data set 

In [115]:
X_train, y_train = df_train[df_train.columns[:-1]], df_train["Transported"]
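A frame obtained by selecting columns from `df_train` is a derived object, and modifying it in place later is the pattern pandas may flag with a `SettingWithCopyWarning`. A minimal sketch of a warning-free split, using a made-up toy frame (`toy`, `X` and `y` are illustrative names only):

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "Name": ["Maham Ofracculy", "Juanna Vines"],
    "Transported": [False, True],
})

# drop(columns=...) returns a new frame, so later edits never touch `toy`.
X = toy.drop(columns=["Transported"])
y = toy["Transported"]

# Reassign the result instead of using inplace=True on a derived frame.
X = X.drop(columns=["Name"])

print(list(X.columns))
```

The original frame keeps all of its columns, which is handy when a dropped feature turns out to be useful after all.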


### dropping useless(?) features 

In [116]:
#The Name attribute doesn't seem to affect the target, so it is dropped.
#Reassigning (instead of inplace=True) avoids pandas' SettingWithCopyWarning,
#because X_train was created by selecting columns from df_train.
X_train = X_train.drop("Name", axis=1)
X_test = df_test.drop("Name", axis=1)


### extracting data transformers

In [117]:
#The first column looks useless because it is just a unique passenger ID... but!
#Its first four characters identify the group (family or companions) a passenger travels with,
#so the group size can be derived from it - this could be an important feature.
from sklearn.base import BaseEstimator, TransformerMixin

class passenger_id_transformer(BaseEstimator, TransformerMixin):
    """Replace PassengerId with the size of the passenger's travel group."""

    def fit(self, X, y=None):
        #stateless: group sizes are counted in whatever frame is transformed
        return self

    def transform(self, X, y=None):
        X_ = X.copy()
        #PassengerId has the form gggg_pp; the first four characters are the group number
        group = X_["PassengerId"].str[:4]
        #replace the ID with the number of passengers sharing that group number
        X_["PassengerId"] = group.map(group.value_counts())
        return X_
    

#As with PassengerId, the Cabin column (deck/num/side) holds the deck and the side of the passenger's cabin - this could be important.
#This transformer extracts both into separate columns.

class cabin_transformer(BaseEstimator, TransformerMixin):
    """Replace Cabin with two columns: Deck (first character) and Side (last character)."""

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_ = X.copy()
        #note: a missing cabin becomes the string "nan", so Deck and Side are both "n" for it
        deck = X_["Cabin"].apply(lambda x: str(x)[0])
        side = X_["Cabin"].apply(lambda x: str(x)[-1])
        X_ = X_.drop("Cabin", axis=1)
        X_["Deck"] = deck
        X_["Side"] = side
        return X_
    

In [118]:
#quick look if it works
passenger_transformer = passenger_id_transformer()
passenger_transformer.fit_transform(X_train)




Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,1,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0
1,1,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0
2,2,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0
3,2,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0
4,1,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,1,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0
8689,1,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0
8690,1,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0
8691,2,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0


In [119]:
#quick look if it works
cabin_transform = cabin_transformer()
cabin_transform.fit_transform(X_train)




Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Deck,Side
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,B,P
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,F,S
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,A,S
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,A,S
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,F,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,A,P
8689,9278_01,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,G,S
8690,9279_01,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,G,S
8691,9280_01,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,E,S


### getting it all together into pipelines

In [120]:
#building preprocessing pipeline 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler, StandardScaler, Normalizer

cat_features = ["PassengerId", "HomePlanet", "CryoSleep", "Destination", "VIP", "Deck", "Side"]
num_features_age = ["Age"]
num_features_service = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]



preprocess = Pipeline([
    ("PassengerId_transform", passenger_id_transformer()),
    ("Cabin_transform", cabin_transformer()),
    
    ("cleaning", ColumnTransformer(transformers=[
        
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy='most_frequent')),
            ("encoder", OneHotEncoder())]), cat_features),
        
        ("num_age", SimpleImputer(strategy="mean"), num_features_age),
        
        ("num_service", SimpleImputer(strategy="mean"), num_features_service)
    ])),
    
    ("scaler", MinMaxScaler())
])
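To see what the `cleaning` step does, here is a reduced sketch with one categorical and one numeric column on a made-up frame. `handle_unknown="ignore"` is an optional safeguard added for this sketch (it is not in the pipeline above); it keeps the encoder from failing on categories that appear only at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "HomePlanet": ["Earth", np.nan, "Mars", "Earth"],
    "Age": [39.0, np.nan, 58.0, 23.0],
})

cleaning = ColumnTransformer(transformers=[
    # categorical: fill gaps with the most frequent value, then one-hot encode
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["HomePlanet"]),
    # numeric: fill gaps with the column mean
    ("num_age", SimpleImputer(strategy="mean"), ["Age"]),
])

out = cleaning.fit_transform(toy)
print(out.shape)  # (4, 3): two one-hot columns plus Age
```

The output columns follow the order of the `transformers` list, so the encoded categories come first and the numeric features last.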






In [121]:
#let's visualize the pipeline to make sure that everything is in right place :)
from sklearn import set_config
set_config(display="diagram")
preprocess

# Model building and evaluation

In [122]:
#importing most popular and basic estimators 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

knn_clf = KNeighborsClassifier()
sgd_clf = SGDClassifier()
rf_clf = RandomForestClassifier()
lr_clf = LogisticRegression()
dt_clf = DecisionTreeClassifier()


In [123]:
#finally building whole pipeline with estimator
model_pipeline = Pipeline([
    ("preprocessing", preprocess),
    ("estimator", dt_clf)
])
model_pipeline
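Nested pipelines expose their settings under names built from the step names joined with a double underscore, which is the naming used for grid search keys. A small sketch on a two-step pipeline (a simplified stand-in for `model_pipeline`, not the pipeline itself):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("estimator", DecisionTreeClassifier()),
])

# Every tunable setting is addressable as <step name>__<parameter name>.
param_names = sorted(pipe.get_params().keys())
print([p for p in param_names if p.startswith("estimator__max")])

# The same names work with set_params; whole steps can be swapped by step name.
pipe.set_params(estimator__max_depth=3)
```

Calling `get_params().keys()` on the real pipeline is a quick way to find the exact key for a deeply nested step.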

In [124]:
#To choose the best estimator and hyperparameters I use GridSearchCV.
#After many different attempts, the final grid search parameters looked like this:
from sklearn.model_selection import GridSearchCV

params = [
#     {
#     "estimator__C": np.logspace(-4, 4, 25),
#     "estimator__penalty":['l1', 'l2'],
#     "estimator": [lr_clf]
#     },
#     {
#         "estimator__max_depth": [3,4,5,6,7],
#         "estimator": [dt_clf]
#     },
    {
        "preprocessing__cleaning__num_service__strategy": ["median", "mean"],
        "preprocessing__cleaning__num_age__strategy": ["median", "mean"],
        "preprocessing__scaler": [MinMaxScaler(), Normalizer(), StandardScaler()],
        "estimator__max_depth": [9,10,11],
        "estimator": [rf_clf]
    }
]

model = GridSearchCV(model_pipeline, params, cv=3, scoring = "accuracy", n_jobs=5)
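Grid search cost grows multiplicatively, so it helps to count the work before starting it. The sketch below mirrors the active grid above with plain Python; the scalers are written as strings and the dictionary keys are shortened, both only for illustration:

```python
from itertools import product

# Option lists mirror the active (uncommented) grid above.
grid = {
    "num_service_strategy": ["median", "mean"],
    "num_age_strategy": ["median", "mean"],
    "scaler": ["MinMaxScaler", "Normalizer", "StandardScaler"],
    "max_depth": [9, 10, 11],
}
cv_folds = 3

# Every combination of options is one candidate; each is fitted once per fold.
candidates = list(product(*grid.values()))
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)  # 36 108
```

Uncommenting the logistic regression block would add 25 x 2 = 50 more candidates, which is why narrowing the grid step by step pays off.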


In [126]:
model.fit(X_train.copy(), y_train.copy())

In [127]:
grid_search_table = pd.DataFrame(data=model.cv_results_)
grid_search_table

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_estimator,param_estimator__max_depth,param_preprocessing__cleaning__num_age__strategy,param_preprocessing__cleaning__num_service__strategy,param_preprocessing__scaler,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.779843,0.022018,0.104322,0.009657,RandomForestClassifier(max_depth=10),9,median,median,MinMaxScaler(),{'estimator': RandomForestClassifier(max_depth...,0.780538,0.785024,0.807387,0.790983,0.011743,22
1,1.210296,0.014797,0.110287,0.006654,RandomForestClassifier(max_depth=10),9,median,median,Normalizer(),{'estimator': RandomForestClassifier(max_depth...,0.783644,0.788475,0.807387,0.793169,0.010246,9
2,0.789837,0.010493,0.120748,0.008022,RandomForestClassifier(max_depth=10),9,median,median,StandardScaler(),{'estimator': RandomForestClassifier(max_depth...,0.780193,0.787095,0.80428,0.790523,0.010128,28
3,0.810215,0.016331,0.118194,0.002328,RandomForestClassifier(max_depth=10),9,median,mean,MinMaxScaler(),{'estimator': RandomForestClassifier(max_depth...,0.780883,0.782264,0.806697,0.789948,0.011857,34
4,1.216062,0.003302,0.113,0.004315,RandomForestClassifier(max_depth=10),9,median,mean,Normalizer(),{'estimator': RandomForestClassifier(max_depth...,0.783299,0.792616,0.807732,0.794549,0.010068,2
5,0.786029,0.014262,0.116667,0.021058,RandomForestClassifier(max_depth=10),9,median,mean,StandardScaler(),{'estimator': RandomForestClassifier(max_depth...,0.779848,0.784679,0.808768,0.791098,0.012649,21
6,0.783019,0.022404,0.104995,0.010611,RandomForestClassifier(max_depth=10),9,mean,median,MinMaxScaler(),{'estimator': RandomForestClassifier(max_depth...,0.782264,0.785369,0.804625,0.790753,0.009891,26
7,1.159037,0.019642,0.088115,0.013548,RandomForestClassifier(max_depth=10),9,mean,median,Normalizer(),{'estimator': RandomForestClassifier(max_depth...,0.783644,0.7902,0.808768,0.794204,0.01064,3
8,0.726778,0.024579,0.10234,0.004113,RandomForestClassifier(max_depth=10),9,mean,median,StandardScaler(),{'estimator': RandomForestClassifier(max_depth...,0.779848,0.79089,0.808423,0.793054,0.011765,11
9,0.760217,0.034618,0.113702,0.009235,RandomForestClassifier(max_depth=10),9,mean,mean,MinMaxScaler(),{'estimator': RandomForestClassifier(max_depth...,0.783299,0.783644,0.806697,0.791213,0.010949,20
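A long results table is easier to read when sorted by rank or aggregated by one parameter. The sketch below uses only the ten rows displayed above (all with `max_depth=9`), so it says nothing about the rest of the grid; the frame `shown` is rebuilt by hand from those rows:

```python
import pandas as pd

# Values copied from the ten rows displayed above.
shown = pd.DataFrame({
    "scaler": ["MinMaxScaler", "Normalizer", "StandardScaler",
               "MinMaxScaler", "Normalizer", "StandardScaler",
               "MinMaxScaler", "Normalizer", "StandardScaler",
               "MinMaxScaler"],
    "mean_test_score": [0.790983, 0.793169, 0.790523, 0.789948, 0.794549,
                        0.791098, 0.790753, 0.794204, 0.793054, 0.791213],
    "rank_test_score": [22, 9, 28, 34, 2, 21, 26, 3, 11, 20],
})

# Best candidates first.
top = shown.sort_values("rank_test_score").head(3)

# Average score per scaler among the displayed rows.
by_scaler = shown.groupby("scaler")["mean_test_score"].mean().round(4)
print(top)
print(by_scaler)
```

Note that the per-fold standard deviation in the table is around 0.01, larger than most differences between candidates, so small gaps in rank should not be over-interpreted.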


In [128]:
#the best hyperparameters are as follows:
#(best_score_ is the mean cross-validated accuracy as a fraction between 0 and 1, not a percentage)
def grid_search_results(grid):
    print(f"best params: {grid.best_params_}")
    print(f"best accuracy: {grid.best_score_:.3f}")
    
grid_search_results(model)


best params: {'estimator': RandomForestClassifier(max_depth=10), 'estimator__max_depth': 10, 'preprocessing__cleaning__num_age__strategy': 'median', 'preprocessing__cleaning__num_service__strategy': 'mean', 'preprocessing__scaler': MinMaxScaler()}
best accuracy: 0.796


### building final model with the best hyperparameters

In [130]:
from sklearn.base import clone
final_model = clone(model.best_estimator_)
final_model.fit(X_train, y_train)
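Cloning and refitting works, though with the default `refit=True` the search object has already retrained the best candidate on the full training set. A small sketch on synthetic data (generated with `make_classification`, unrelated to the Kaggle set) showing that `best_estimator_` is ready to predict:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 3]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# refit=True is the default, so best_estimator_ is already trained on all of X.
preds = search.best_estimator_.predict(X)
same_as_search = bool((preds == search.predict(X)).all())
print(search.refit, same_as_search)
```

An explicit clone is still useful when the final model should be trained on different data than the search used.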





# preparation for Kaggle evaluation

In [132]:
df_submission

Unnamed: 0,PassengerId,Transported
0,0013_01,False
1,0018_01,False
2,0019_01,False
3,0021_01,False
4,0023_01,False
...,...,...
4272,9266_02,False
4273,9269_01,False
4274,9271_01,False
4275,9273_01,False


In [137]:
#prediction for test set
y_pred = final_model.predict(X_test)

In [139]:
#making submission
df_submission.loc[:, "Transported"] = y_pred
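Before writing the file it is cheap to verify that the predictions line up with the submission template. A sketch of such checks on a three-row toy frame (`submission`, `y_pred_toy` and `checks` are illustrative names; the predictions are made up):

```python
import numpy as np
import pandas as pd

submission = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": [False, False, False],
})
y_pred_toy = np.array([True, False, True])  # stand-in for model output

# One prediction per row, in the same order as the template.
assert len(y_pred_toy) == len(submission)
submission["Transported"] = y_pred_toy.astype(bool)

checks = {
    "unique_ids": submission["PassengerId"].is_unique,
    "no_missing": not submission.isnull().any().any(),
    "bool_target": submission["Transported"].dtype == bool,
}
print(checks)
```

These checks rely on the test set rows keeping the same order as `sample_submission.csv`, which is assumed here rather than verified.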

In [140]:
df_submission

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True
...,...,...
4272,9266_02,True
4273,9269_01,False
4274,9271_01,True
4275,9273_01,True


In [145]:
df_submission.to_csv(r"data\spaceship titanic\final_submission.csv", index=False)

In [146]:
pd.read_csv(r"data\spaceship titanic\final_submission.csv")

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True
...,...,...
4272,9266_02,True
4273,9269_01,False
4274,9271_01,True
4275,9273_01,True


And after submission on Kaggle...
### NICE! 79.307% accuracy. Not bad :)