<a href="https://www.kaggle.com/code/potongpasir/titanic?scriptVersionId=153551982" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview of Data
First, we load the relevant libraries and files, then inspect the data.

In [1]:
import numpy as np
import pandas as pd
titanic_train = pd.read_csv("../input/titanic/train.csv")
print(titanic_train.head())
print(titanic_train.info())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

In [2]:
titanic_train.isna().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [3]:
percent_missing = titanic_train.isnull().sum() * 100 / len(titanic_train)
missing= pd.DataFrame({'percent_missing': percent_missing})
print(missing)

             percent_missing
PassengerId         0.000000
Survived            0.000000
Pclass              0.000000
Name                0.000000
Sex                 0.000000
Age                19.865320
SibSp               0.000000
Parch               0.000000
Ticket              0.000000
Fare                0.000000
Cabin              77.104377
Embarked            0.224467


In [4]:
print(titanic_train["Sex"].value_counts(), end="\n\n")
print(titanic_train["Embarked"].value_counts(), end="\n\n")
print(titanic_train["Pclass"].value_counts(), end="\n\n")

male      577
female    314
Name: Sex, dtype: int64

S    644
C    168
Q     77
Name: Embarked, dtype: int64

3    491
1    216
2    184
Name: Pclass, dtype: int64



The takeaways here are:
1. There are more males (577) than females (314)
2. Embarked takes 3 distinct values (S, C, Q)
3. Pclass 3 (491 passengers) is far more common than Pclass 1 (216) or Pclass 2 (184)

# Preprocessing
We see that we need to:
1. Extract the PassengerId and Survived columns
2. Drop the Name, Cabin, and Fare columns (Cabin is about 77% NaN, and Fare is highly correlated with Pclass)
3. Fill the remaining NaN values (Age and Embarked)
4. Deal with the Ticket values, which have no consistent format (the code below simply drops the column)
5. Convert the remaining categorical data into binary (0, 1) or integer (0, 1, 2, ...) codes
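
The claim in step 2 that Fare carries much the same information as Pclass can be checked before dropping the column. The following is a minimal sketch on a tiny hand-made frame; the values are illustrative assumptions, not rows computed from train.csv, and on Kaggle one would use titanic_train instead. Spearman rank correlation is chosen here (an assumption, not the notebook's method) because Pclass is ordinal.

```python
import pandas as pd

# Illustrative toy data only; substitute titanic_train on Kaggle.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [71.3, 53.1, 80.0, 13.0, 26.0, 7.25, 7.9, 8.05],
})

# Spearman works on ranks, which suits an ordinal column such as Pclass.
rho = toy.corr(method="spearman").loc["Pclass", "Fare"]

# Median fare per class gives a quick, readable view of the same relationship.
median_fare = toy.groupby("Pclass")["Fare"].median()

print(f"Spearman correlation: {rho:.3f}")
print(median_fare)
```

A strongly negative correlation (higher class number, lower fare) would support dropping Fare; a weak one would suggest keeping it.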

We also realise that we can combine SibSp and Parch into one column representing the number of family members on board, so let's get started.

In [5]:
def preprocess(t, mode="default", extract=True):
    # work on a copy so the caller's DataFrame is not modified
    t = t.copy()

    # keep the ids; keep the labels only when they exist (training data)
    t_id = t["PassengerId"]
    
    if not extract:
        t_y = None
    else:
        t_y = t["Survived"]
        t = t.drop(columns="Survived")
    
    # combine SibSp and Parch into Family, then drop the unused columns
    t["Family"] = t["SibSp"] + t["Parch"]
    t = t.drop(columns=["PassengerId", "Name", "Cabin", "SibSp", "Parch", "Fare", "Ticket"])
    
    # fill NaN: median for Age, most frequent value for Embarked
    meds = t["Age"].median()
    t["Age"] = t["Age"].fillna(meds)
    
    modes = t["Embarked"].mode()[0]
    t["Embarked"] = t["Embarked"].fillna(modes)
    
    # convert categorical data: one-hot columns, or integer codes by default
    if mode == "onehot":
        t = pd.get_dummies(t, columns=["Sex", "Embarked"])
    else:
        t["Sex"] = t["Sex"].map({"male": 0, "female": 1})
        t["Embarked"] = t["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    
    print(t.head())
    return t, t_id, t_y

titanic_train_X, _, titanic_train_y = preprocess(titanic_train)

   Pclass  Sex   Age  Embarked  Family
0       3    0  22.0         0       1
1       1    1  38.0         1       1
2       3    1  26.0         0       0
3       1    1  35.0         0       1
4       3    0  35.0         0       0


In [6]:
titanic_train_X.isna().any()

Pclass      False
Sex         False
Age         False
Embarked    False
Family      False
dtype: bool
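
Note that preprocess computes the Age median and the Embarked mode from whichever frame it is given, so the test set ends up filled with its own statistics rather than the training set's. One common alternative is to learn the fill values once on the training data and reuse them everywhere. A minimal sketch, where the helper names fit_fill_values and apply_fill_values are hypothetical and the two small frames are toy stand-ins for train.csv and test.csv:

```python
import numpy as np
import pandas as pd

def fit_fill_values(train):
    """Learn fill values from the training frame only (hypothetical helper)."""
    return {
        "Age": train["Age"].median(),
        "Embarked": train["Embarked"].mode()[0],
    }

def apply_fill_values(df, fill_values):
    """Return a copy of df with NaNs replaced column by column."""
    return df.fillna(value=fill_values)

# Toy frames standing in for the real training and test data.
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0],
                      "Embarked": ["S", "C", "S", np.nan]})
test = pd.DataFrame({"Age": [np.nan, 47.0],
                     "Embarked": ["Q", np.nan]})

fills = fit_fill_values(train)
train_filled = apply_fill_values(train, fills)
test_filled = apply_fill_values(test, fills)
print(test_filled)
```

Whether this changes the leaderboard score is not something the notebook measures; the point is only that train and test are then transformed identically.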

# Model Selection
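Now, we select our model. In consideration are:
1. LogisticRegression from sklearn
2. XGBClassifier from xgboost
3. CatBoostClassifier from catboost
4. SVC from sklearn
5. Soft-voting ensembles (VotingClassifier from sklearn) of XGBClassifier and CatBoostClassifier, with and without SVC

We will use 5-fold cross-validation (cross_val_score from sklearn) to measure accuracy.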
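
For a classifier, passing cv=5 to cross_val_score uses stratified folds without shuffling. A minimal sketch of making the splitter explicit, here with shuffling and a fixed seed so that repeated runs use the same folds; the synthetic data from make_classification and the seed value are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=400), X, y,
                         cv=cv, scoring="accuracy")
print(scores)
print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean helps judge whether small differences between models are larger than the fold-to-fold spread.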

In [7]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from catboost import CatBoostClassifier as cbc

#Initialise models
lr = LogisticRegression(tol=1e-8, solver="lbfgs", max_iter=400, C=1.5)
xgb = XGBClassifier(eta=0.009, n_estimators=400, subsample=0.25, max_depth=6, booster="gbtree")
cat = cbc(loss_function="Logloss", n_estimators=400, eta=0.009, max_depth=6, eval_metric="Accuracy", subsample=0.25, verbose=False)
svc = SVC(kernel="rbf", max_iter=-1, gamma="auto", C=2, probability=True)
vote = VotingClassifier([("xgb", xgb), ("cat", cat), ("svc", svc)], voting="soft")
nope = VotingClassifier([("xgb", xgb), ("cat", cat)], voting="soft")

models = [("lr", lr), ("xgb", xgb), ("cat", cat), ("svc", svc), ("vote", vote), ("vote without svc", nope)]

for m, model in models:
    print(m)
    scores = cross_val_score(model, titanic_train_X, titanic_train_y, cv=5, scoring="accuracy")
    print(scores)
    print(scores.mean())
    print("-" * 20)

lr
[0.79329609 0.79213483 0.79775281 0.7752809  0.82022472]
0.7957378695624883
--------------------
xgb
[0.81564246 0.83146067 0.80898876 0.79213483 0.85393258]
0.8204318624066287
--------------------
cat
[0.81564246 0.8258427  0.81460674 0.79213483 0.85393258]
0.8204318624066287
--------------------
svc
[0.77653631 0.79775281 0.8258427  0.82022472 0.83146067]
0.8103634423451134
--------------------
vote
[0.79888268 0.83146067 0.83707865 0.79213483 0.84831461]
0.8215742891218379
--------------------
vote without svc
[0.82122905 0.83146067 0.81460674 0.78651685 0.84831461]
0.8204255853367648
--------------------
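
With voting="soft", VotingClassifier averages the class probabilities predicted by its members and picks the class with the highest average. A minimal sketch that reproduces this by hand on synthetic data, using two scikit-learn models as stand-ins for the ones above (the data and the choice of models are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

lr = LogisticRegression(max_iter=400)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# Soft voting as provided by scikit-learn (it fits clones of lr and tree).
vote = VotingClassifier([("lr", lr), ("tree", tree)], voting="soft").fit(X, y)

# The same thing by hand: fit each model, average predict_proba, take argmax.
probas = [m.fit(X, y).predict_proba(X) for m in (lr, tree)]
avg_proba = np.mean(probas, axis=0)
manual_pred = vote.classes_[np.argmax(avg_proba, axis=1)]

print((manual_pred == vote.predict(X)).all())
```

This is also why SVC is created with probability=True above: soft voting needs predict_proba from every member.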


# Model Prediction
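The soft-voting ensemble of XGBClassifier, CatBoostClassifier, and SVC (vote) has the highest mean accuracy (about 0.822), narrowly ahead of XGBClassifier and CatBoostClassifier on their own (about 0.820 each), so for a preliminary attempt let's use it.

Now, we preprocess the test data and let our model predict on it.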

In [8]:
model = vote

In [9]:
model.fit(titanic_train_X, titanic_train_y)

titanic_test = pd.read_csv("../input/titanic/test.csv")
titanic_tests, ids, _ = preprocess(titanic_test, extract=False)

res = model.predict(titanic_tests)

final = pd.DataFrame({"Survived": res}, index=ids)
print(final)
print(final.isna().any())

   Pclass  Sex   Age  Embarked  Family
0       3    0  34.5         2       0
1       3    1  47.0         0       1
2       2    0  62.0         2       0
3       3    0  27.0         0       0
4       3    1  22.0         0       2
             Survived
PassengerId          
892                 0
893                 0
894                 0
895                 0
896                 1
...               ...
1305                0
1306                1
1307                0
1308                0
1309                0

[418 rows x 1 columns]
Survived    False
dtype: bool


In [10]:
final.to_csv("sya.csv")
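
Because final is indexed by PassengerId, to_csv writes the index as the first column, giving the two-column layout PassengerId,Survived. A minimal sketch of a sanity check before submitting, using a toy frame and an in-memory buffer instead of a file on disk (the toy ids and predictions are assumptions for illustration):

```python
import io
import pandas as pd

# Toy stand-in for the `final` frame built above.
ids = pd.Series([892, 893, 894], name="PassengerId")
final = pd.DataFrame({"Survived": [0, 1, 0]}, index=ids)

# Write to an in-memory buffer rather than disk, then read it back.
buffer = io.StringIO()
final.to_csv(buffer)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic checks: expected header, binary predictions, one row per passenger.
assert list(check.columns) == ["PassengerId", "Survived"]
assert check["Survived"].isin([0, 1]).all()
assert check["PassengerId"].is_unique
print(check)
```

On the real data the same checks can be run on the saved file, together with a row-count check against the 418 rows shown in the output above.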