# Titanic - Machine Learning from Disaster <a id='top'></a>
Welcome! This notebook builds a model that predicts survival on the Titanic. The data comes from [Kaggle's Titanic competition](https://www.kaggle.com/c/titanic/overview).

The data is split into two groups: the training set (train.csv) and the test set (test.csv). The training set should be used to build the machine learning models. The test set should only be used for evaluating the model's performance.

Data dictionary:
| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Table of contents:
1. [Exploratory Data Analysis](#eda)
2. [Data Visualization](#data-visualization)
3. [Handle Missing Data](#handle-na)
4. [Preprocessing](#preprocessing)
5. [Building Models](#build-models)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


In [None]:
sns.set_style("darkgrid")
plt.style.use("seaborn-v0_8-darkgrid")  # named "seaborn-darkgrid" before matplotlib 3.6
pio.renderers.default = "notebook_connected"


## Exploratory Data Analysis <a id='eda'></a>
[Back to top](#top)

In [None]:
df = pd.read_csv('./data/train.csv')
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe().transpose()

## Data Visualization <a id='data-visualization'></a>
[Back to top](#top)

In [None]:
plt.figure(figsize=(6, 5), dpi=200)
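# numeric_only=True skips text columns such as Name and Sex (required on pandas 2.0+)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='viridis')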
plt.show()


In [None]:
fig = px.histogram(
    data_frame=df, x="Survived", title="How Many Died, How Many Survived?"
)
fig.update_layout(xaxis={"type": "category"})
fig.show()


Age distribution of male and female passengers aboard the Titanic:

In [None]:
fig = px.histogram(
    data_frame=df,
    x="Age",
    color="Sex",
    barmode="group",
    marginal="box",
    title="Age Distribution of Male and Female Passengers aboard the Titanic",
)
fig.show()


The majority of people aboard the Titanic were in their 20s and 30s.

Age distribution of passengers who did not survive:

In [None]:
fig = px.histogram(
    data_frame=df[df["Survived"] == 0],
    x="Age",
    color="Sex",
    barmode="group",
    marginal="box",
    title="Age Distribution of Passengers Who Did Not Survive the Disaster",
)
fig.show()


Passengers who died were mostly between their late teens and early 40s, and most of them were male.

In [None]:
fig = px.histogram(data_frame=df[df['Survived'] == 0], x='Sex', title='Number of Deceased Passengers')
fig.show()

Far more male passengers died with the ship: nearly 6 times as many as female passengers.

Let's look at survival status for each class.

In [None]:
fig = px.box(
    data_frame=df,
    x="Pclass",
    y="Age",
    color="Survived",
    title="Age distribution per Class split by Survival status",
)
fig.show()


The same pattern appears in all three classes: passengers who did not survive tended to be older.

In [None]:
fig = px.histogram(
    data_frame=df,
    x="Pclass",
    color="Survived",
    barmode="group",
    title="Number of Passengers per Class Split by Survival status",
)
fig.show()


And the third class accounted for more deaths than the other two classes.
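
The counts behind a grouped histogram like the one above can also be read as a table. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Titanic data) to show `pd.crosstab` for the counts and a grouped mean for the survival rate per class.

```python
import pandas as pd

# Made-up rows for illustration only; these are not the real Titanic records
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
    }
)

# Rows are classes, columns are the survival flag, cells are passenger counts
counts = pd.crosstab(toy["Pclass"], toy["Survived"])
print(counts)

# Survived is 0/1, so its mean within each class is that class's survival rate
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```

Running the same two calls on `df` should give the numbers that the plot shows as bars.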

Now I want to gain insight on the ports of embarkation. Which port has the most passengers?

In [None]:
fig = px.histogram(
    data_frame=df,
    x="Embarked",
    color="Survived",
    title="Number of Passengers at each Port Split by Survival status",
)
fig.show()


A lot of passengers boarded the Titanic at Southampton port.

Finally, probably not relevant, but I want to check if `SibSp`, `Parch` and `Fare` affect `Survived` or not.

In [None]:
unrelevant_features = df[["SibSp", "Parch", "Fare", "Survived"]].corr()
unrelevant_features['Survived']

In [None]:
plt.figure(figsize=(6, 5), dpi=200)
sns.heatmap(unrelevant_features, annot=True, linewidths=0.3)
plt.show()


So it looks like `Fare` actually has a positive relationship with `Survived`. It is not a strong one, but it is worth noticing. Let's draw a box plot to visualize this more clearly.

In [None]:
fig = px.box(
    data_frame=df,
    x="Survived",
    y="Fare",
    title="Does Fare affect Passengers' Survival Status?",
)
fig.show()


There appears to be a divide between wealthier and poorer passengers. For example, far more third-class passengers died than first- or second-class passengers, and many passengers who bought higher-fare tickets survived.

## Handle Missing Data <a id='handle-na'></a>
[Back to top](#top)

Let's look back at our data to see which columns contain missing values.

In [None]:
df.isna().sum()

### The `Embarked` column
First, I want to drop the 2 rows with missing `Embarked` values, since they account for only about 0.2 percent of the data.

In [None]:
df = df.dropna(subset=['Embarked'])
df.isna().sum()
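
The share of missing values per column can be computed directly rather than by hand. This is a minimal sketch on a made-up frame (`toy` is hypothetical); the last lines redo the arithmetic for 2 missing rows out of the 891 rows in Kaggle's `train.csv`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
        "Embarked": ["S", "C", None, "S", "Q"],
        "Fare": [7.25, 71.28, 7.92, 53.1, 8.46],
    }
)

# isna() yields booleans, so the column mean is the fraction of missing values
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# The same arithmetic for 2 missing rows out of 891
embarked_share = round(2 / 891 * 100, 2)
print(embarked_share)
```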


### The `Age` column
Next is the `Age` column. I want to look at the statistics for this column before deciding how to fill it.

In [None]:
df['Age'].describe()

In [None]:
fig = px.histogram(data_frame=df, x="Age", marginal="box")
fig.show()


The mean and the median of `Age` column are pretty close together. I'm not done here. I want to see which class those passengers with missing age values are from.

In [None]:
df[df['Age'].isna()]

In [None]:
fig = px.histogram(
    data_frame=df[df["Age"].isna()],
    x="Pclass",
    title="Numbers of Passengers with Missing Age Values per Class",
)
fig.update_layout(xaxis={"categoryorder": "category ascending", "type": "category"})
fig.show()


Passengers with missing age values are mostly from third class. I want to fill in those missing values with the median age of the corresponding class.

Meaning:
* Missing ages of first-class passengers will be filled with the median age of first class.
* Missing ages of second-class passengers will be filled with the median age of second class.
* Missing ages of third-class passengers will be filled with the median age of third class.

In [None]:
age_median_pclass = df.groupby('Pclass')['Age'].median()
age_median_pclass

In [None]:
fig = px.box(data_frame=df, x="Pclass", y="Age", title="Age Distribution per Class before Filling Missing Values")
fig.show()


So:
* Missing age values of passengers from first class will be filled with 37.
* Missing age values of passengers from second class will be filled with 29.
* Missing age values of passengers from third class will be filled with 24.


In [None]:
def fill_age(pclass, age):
    if np.isnan(age):
        return age_median_pclass[pclass]
    else:
        return age


In [None]:
df["Age"] = df.apply(lambda table: fill_age(table["Pclass"], table["Age"]), axis=1)
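
The row-wise `apply` above works. As an alternative, the same per-class median fill can be written without a Python-level loop by combining `groupby(...).transform("median")` with `fillna`. The sketch below runs on a made-up frame (`toy` is hypothetical) so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
        "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 10.0, 30.0, np.nan],
    }
)

# transform("median") returns one value per row: the median age of that row's class
class_median = toy.groupby("Pclass")["Age"].transform("median")

# fillna with a Series aligns on the index, so only the NaN rows are replaced
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

On the real data this would be `df["Age"].fillna(df.groupby("Pclass")["Age"].transform("median"))`, which should give the same result as `fill_age`.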


In [None]:
df.isna().sum()

In [None]:
fig = px.box(data_frame=df, x="Pclass", y="Age", title="Age Distribution per Class after Filling Missing Values")
fig.show()


### The `Cabin` column

In [None]:
df['Cabin'].sort_values()

In [None]:
fig = px.histogram(x=df["Cabin"].sort_values(), title="Cabin Value Counts")
fig.update_layout(xaxis_title="Unique Cabin Values")
fig.show()


I notice the naming convention is a letter followed by two or three digits. I also notice there are some strange cabin name entries such as `B57 B59 B63 B66`, `C23 C25 C27`, `D10 D12`, `C62 C64`, etc. Those cabins were probably occupied by members of the same family.

In [None]:
df[df["Cabin"] == "B57 B59 B63 B66"]


The Ryerson family had 5 members: 3 siblings and 2 parents. They were staying in 4 cabins: B57, B59, B63 and B66. Only two sisters survived; the other members were missing.

In [None]:
df[df['Cabin'] == 'D10 D12']

Mr. William Bertram had one parent/child aboard the Titanic. He probably stayed in one room and the other family member stayed in the other. Unfortunately, we don't know who the other family member is; it is probably the one that has `NaN` for the cabin number.

I want to create another column called "Cabin Class" which contains the first letter of each value in the "Cabin" column. This way, I can treat the "Cabin Class" column as the high-level class and the "Cabin" column as the detailed subclass of the "Cabin Class" column.

In [None]:
df['Cabin Class'] = df['Cabin'].map(lambda x: x[0], na_action='ignore')
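
A possible alternative to `map` with a lambda is the pandas string accessor, which takes the first character and leaves missing values untouched. The sketch below uses a few cabin strings quoted earlier in this notebook (`cabins` is a hypothetical stand-in for `df['Cabin']`).

```python
import numpy as np
import pandas as pd

# A few cabin values, including multi-cabin entries and a missing one
cabins = pd.Series(["C23 C25 C27", np.nan, "B57 B59 B63 B66", "D10 D12"])

# .str[0] slices the first character of each string; missing entries stay missing
cabin_class = cabins.str[0]
print(cabin_class)
```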


In [None]:
df[['Cabin', 'Cabin Class']]

In [None]:
df['Cabin Class'].unique()

For the `NaN` in the "Cabin Class", I want to assign **U** to them (U stands for unknown cabin class).

In [None]:
df['Cabin Class'] = df['Cabin Class'].fillna('U')

In [None]:
df['Cabin Class'].unique()

Now let's drop the "Cabin" column since it is not necessary anymore.

In [None]:
df = df.drop('Cabin', axis=1)

In [None]:
df.isna().sum()

## Preprocessing <a id='preprocessing'></a>
[Back to top](#top)

In [None]:
df.sample(n=10)

In [None]:
X = df.drop(["PassengerId", "Survived", "Name", "Ticket", "SibSp", "Parch"], axis=1)
y = df["Survived"]


In [None]:
X.columns


In [None]:
categorical_columns = ["Sex", "Embarked", "Cabin Class"]

one_hot = OneHotEncoder(handle_unknown="ignore", drop='first')
column_transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_columns)], remainder="passthrough"
)

X = column_transformer.fit_transform(X)


In [None]:
column_transformer.get_feature_names_out()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)


In [None]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
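
One thing to note about the steps above: the one-hot encoder is fit on all rows before the split, while the scaler is fit on the training rows only. A common way to keep every preprocessing step inside the training data is to wrap it in a `Pipeline`. The sketch below is one possible arrangement, not the approach used in this notebook. It runs on made-up data (`toy_X` and `toy_y` are hypothetical), and it scales only the numeric columns, which differs from the cell above, where the one-hot columns are scaled too.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only
toy_X = pd.DataFrame(
    {
        "Sex": ["male", "female"] * 10,
        "Pclass": [1, 2, 3, 3] * 5,
        "Age": [float(a) for a in range(20, 40)],
        "Fare": [float(f) for f in range(10, 110, 5)],
    }
)
toy_y = pd.Series([0, 1] * 10)

# Encode the categorical column and scale the numeric ones in a single step
preprocess = ColumnTransformer(
    [
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
        ("scale", StandardScaler(), ["Pclass", "Age", "Fare"]),
    ]
)
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=7, stratify=toy_y
)

# fit() learns the encoder categories and scaler statistics from X_tr only
model.fit(X_tr, y_tr)
holdout_accuracy = model.score(X_te, y_te)
print(holdout_accuracy)
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the preprocessing is refit inside each cross-validation fold.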


## Building Models <a id='build-models'></a>
[Back to top](#top)

We are going to train 3 models and compare their results. The three models are SVC, RandomForestClassifier and LogisticRegression.

In [None]:
# SVC hyperparameters grid
params_grid_svc = {
    "C": np.logspace(-3, 0, 10),
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": np.arange(2, 5),
    "class_weight": ["balanced", None],
    "max_iter": [100, 1000, 5000],
}

# RandomForestClassifier hyperparameters grid
params_grid_rf_clf = {
    "n_estimators": np.arange(100, 1000, 100),
    "max_features": ["sqrt", "log2"],
}

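# LogisticRegression hyperparameters grid, split by penalty so that each
# penalty is only paired with solvers that support it
log_reg_shared = {
    "C": np.logspace(-3, 0, 10),
    "class_weight": ["balanced", None],
    "max_iter": [100, 1000, 5000],
}
params_grid_log_reg = [
    {"penalty": ["l2"], "solver": ["lbfgs", "liblinear", "saga"], **log_reg_shared},
    {"penalty": ["l1"], "solver": ["liblinear", "saga"], **log_reg_shared},
    {
        "penalty": ["elasticnet"],
        "solver": ["saga"],
        "l1_ratio": [0.1, 0.2, 0.5, 0.7, 0.9, 0.95, 1],
        **log_reg_shared,
    },
]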


In [None]:
gs_svc = GridSearchCV(SVC(), param_grid=params_grid_svc, cv=10, scoring="accuracy")

gs_rf_clf = GridSearchCV(
    RandomForestClassifier(), param_grid=params_grid_rf_clf, cv=10, scoring="accuracy"
)

gs_log_reg = GridSearchCV(
    LogisticRegression(), param_grid=params_grid_log_reg, cv=10, scoring="accuracy"
)
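
Before starting a grid search it can help to know how many fits it will run, since every parameter combination is trained once per fold. The sketch below counts the combinations in the SVC grid from this notebook with scikit-learn's `ParameterGrid` (`svc_grid` simply repeats that grid so the block runs on its own).

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the SVC grid defined earlier in this notebook
svc_grid = {
    "C": np.logspace(-3, 0, 10),
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": np.arange(2, 5),
    "class_weight": ["balanced", None],
    "max_iter": [100, 1000, 5000],
}

# ParameterGrid enumerates every combination: 10 * 4 * 3 * 2 * 3
n_candidates = len(ParameterGrid(svc_grid))

# With cv=10, each candidate is fit once per fold
n_fits = n_candidates * 10
print(n_candidates, n_fits)
```

Note that `degree` only affects the `poly` kernel, so some of these candidates repeat the same model for the other kernels.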


In [None]:
gs_svc.fit(X_train, y_train)


In [None]:
gs_rf_clf.fit(X_train, y_train)


In [None]:
gs_log_reg.fit(X_train, y_train)


In [None]:
gs_svc.best_score_


In [None]:
gs_rf_clf.best_score_


In [None]:
gs_log_reg.best_score_
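
`best_score_` is the mean cross-validated accuracy on the training split. The held-out split created earlier (`X_test`, `y_test`) is not used in the cells above, so a natural next step is to score the refit best estimator on it. The sketch below shows the pattern on synthetic data from `make_classification` (all names here are hypothetical stand-ins for the notebook's variables).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=7
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Best hyperparameters and their mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# score() uses the best estimator, refit on all of X_tr, with the same metric
holdout_accuracy = search.score(X_te, y_te)
print(holdout_accuracy)
```

In this notebook the equivalent call would be, for example, `gs_svc.score(X_test, y_test)`.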
