# Titanic - Machine Learning from Disaster

![](https://storage.googleapis.com/kaggle-competitions/kaggle/3136/logos/header.png)

<p align="center">
    <img src="https://kaggle.com/static/images/site-logo.svg" width="200">
</p>

- `survival` – Survival (0 = No, 1 = Yes)
- `pclass` – Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- `sex` – Sex
- `age` – Age in years
- `sibsp` – Number of siblings or spouses aboard the Titanic
- `parch` – Number of parents or children aboard the Titanic
- `ticket` – Ticket number
- `fare` – Passenger fare
- `cabin` – Cabin number
- `embarked` – Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Download dataset

In [None]:
!curl https://fully-connected-graph.github.io/datasets/titanic/titanic.csv -o titanic.csv

Get required libraries

In [None]:
%%capture
!pip install matplotlib

Import required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import plot_model

Set the random seeds for NumPy and TensorFlow so that results are reproducible across runs

In [None]:
from numpy.random import seed
from tensorflow.random import set_seed

seed(42)
set_seed(42)
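
To see why seeding matters, here is a minimal NumPy-only sketch. The helper name `draw` is made up for this example. Reseeding with the same value gives the same pseudo-random numbers, which is what makes a run repeatable.

```python
import numpy as np

def draw(seed_value, n=3):
    """Seed NumPy's global generator, then draw n uniform numbers."""
    np.random.seed(seed_value)
    return np.random.rand(n)

first = draw(42)
second = draw(42)   # same seed as `first`
other = draw(7)     # different seed

print(np.array_equal(first, second))  # same seed, same numbers
print(np.array_equal(first, other))   # different seed, different numbers
```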

## Data Preprocessing

Load Dataset

In [None]:
data = pd.read_csv("titanic.csv", index_col="PassengerId")

data.head()

Rename the columns

In [None]:
data.columns = [
    "survived", 
    "p_class", 
    "name", 
    "sex", 
    "age", 
    "sib_sp", 
    "parch", 
    "ticket", 
    "fare", 
    "cabin", 
    "embarked"
]

In [None]:
data.head()

In [None]:
data.info()

Three columns have missing values: `age`, `cabin`, and `embarked` (only two rows)
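
Reading the counts off `data.info()` works, but the missing values can also be counted directly. The sketch below uses a small made-up frame (`toy` is not the Titanic data) to show the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "embarked": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 8.05, 53.1],
})

# Count missing entries per column and keep only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```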

### Cleaning data

In [None]:
len(data['name'].unique())

In [None]:
len(data['ticket'].unique())

- `cabin` is missing most of its values
- `name` and `ticket` are unique or nearly unique for each passenger, so they carry little signal as raw features
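
Both observations can be summarised in one small table. This is a sketch on a made-up frame (`toy` and `report` are names chosen for the example): a high `missing_frac` or a `unique_frac` close to 1 suggests the column is a candidate for dropping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "sex": ["male", "female", "female", "male"],
})

report = pd.DataFrame({
    "missing_frac": toy.isna().mean(),        # share of missing rows per column
    "unique_frac": toy.nunique() / len(toy),  # distinct non-missing values / rows
})
print(report)
```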

In [None]:
data.drop(
    ['cabin', 'name', 'ticket'], 
    axis=1, inplace=True
)

### Fill missing values

In [None]:
# fill with the median (the middle age value)
data['age'] = data['age'].fillna(
    value=data['age'].median()
)

# fill with the mode (most frequently occurring value)
data['embarked'] = data['embarked'].fillna(
    value=data['embarked'].value_counts().idxmax()
)
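
The same two fills on a tiny made-up example (the series `ages` and `ports` are invented here). The median is less sensitive to an outlier such as 80 than the mean would be, which is one common reason to prefer it for ages.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, only for illustration
ages = pd.Series([20.0, np.nan, 30.0, 80.0])
ports = pd.Series(["S", "C", None, "S"])

# Numerical column: fill with the median of the observed values (20, 30, 80)
ages_filled = ages.fillna(ages.median())

# Categorical column: fill with the most frequent observed value
ports_filled = ports.fillna(ports.mode()[0])

print(ages_filled.tolist())
print(ports_filled.tolist())
```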

In [None]:
data.info()

### Encode categorical data

In [None]:
data["p_class"] = data["p_class"].astype('str')

One hot encode categorical data

In [None]:
categorical_col = [
    "sex",
    "embarked",
    "p_class",
    "survived"
]

In [None]:
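# `survived` is already numeric, so get_dummies passes it through unchanged
ohe_data = pd.get_dummies(
    data[categorical_col],
    drop_first=True,
    dtype=int  # newer pandas defaults to bool dummies; ints keep every column numeric
)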

ohe_data.head()
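
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up column. With three categories, two indicator columns are enough: a row with zeros in both must belong to the dropped category.

```python
import pandas as pd

# Hypothetical toy column, only for illustration
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(toy, dtype=int)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```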

Isolate numerical data

In [None]:
num_data = data.drop(
    categorical_col, 
    axis=1
)

num_data.head()

Recombine data

In [None]:
prep_data = pd.concat([
    ohe_data,
    num_data
], axis=1)

prep_data.head()

Create training and testing data

In [None]:
features = prep_data.drop('survived', axis=1)
target = prep_data['survived']

features_train, features_test, target_train, target_test = train_test_split(
    features,
    target,
    test_size=0.2,
    random_state=1
)
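
Conceptually, an 80/20 split shuffles the row indices and cuts them in two. The sketch below illustrates that idea with NumPy only. `simple_split` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import numpy as np

def simple_split(n_rows, test_size=0.2, random_state=1):
    """Shuffle row indices and return (train_indices, test_indices)."""
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = simple_split(10)
print(len(train_idx), len(test_idx))
```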

#### Scale numerical data

Scaling numerical features to a comparable range usually helps gradient-based training converge faster.

<p align="center">
<img src="https://fully-connected-graph.github.io/GMLM-2022/lecture3/assets/feature_scaling.png"/>
</p>
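
Standardization subtracts each column's mean and divides by its standard deviation. A minimal NumPy sketch on made-up numbers (the arrays `train` and `test` are invented) shows the arithmetic. The statistics come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# Hypothetical toy data: two numerical features
train = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 60.0]])
test = np.array([[30.0, 30.0]])

# Statistics are computed on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```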

In [None]:
scaler = StandardScaler()

features_train[num_data.columns] = scaler.fit_transform(features_train[num_data.columns])

features_train.head()

In [None]:
features_test[num_data.columns] = scaler.transform(features_test[num_data.columns])

features_test.head()

## Model Creation

In [None]:
model = Sequential()

In [None]:
# Add the first hidden layer; input_dim sets the size of the input layer
model.add(Dense(12, input_dim=features_train.shape[1], activation="relu"))

In [None]:
# Add second hidden layer (linear activation, so no non-linearity is applied here)
model.add(Dense(12, activation='linear'))

In [None]:
# Add output layer
model.add(Dense(1, activation='sigmoid'))

View model summary

In [None]:
model.summary()
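
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 9 input features (5 one-hot columns plus 4 numerical ones after the preprocessing above); `dense_params` is a helper name made up for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 9            # assumption: number of columns in features_train
layer_sizes = [12, 12, 1]  # units per Dense layer, as in the model above

counts = []
n_in = n_features
for units in layer_sizes:
    counts.append(dense_params(n_in, units))
    n_in = units  # this layer's output feeds the next layer

print(counts, sum(counts))
```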

In [None]:
plot_model(model, show_shapes=True, show_layer_names=False)

Compile the model by setting its:
- loss function
- optimizer
- metrics

In [None]:
model.compile(loss='binary_crossentropy', optimizer="Adam", metrics=['acc'])
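
For intuition about the loss, here is a NumPy sketch of the binary cross-entropy formula. It illustrates the math and is not Keras's internal code. The function name, the clipping constant `eps`, and the example arrays are assumptions made for this example.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0])
confident = np.array([0.9, 0.1, 0.8])  # mostly correct, confident predictions
unsure = np.array([0.5, 0.5, 0.5])     # always predicts 50%

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Confident, correct predictions give a lower loss than always predicting 0.5, whose loss equals log(2).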

Train the model

In [None]:
history = model.fit(
    features_train, target_train, 
    epochs=20, batch_size=10, 
    validation_split=0.2, 
    verbose=1
)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

Get accuracy on the testing set

In [None]:
scores = model.evaluate(features_test, target_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
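
The accuracy metric can be reproduced by hand from predicted probabilities: threshold them at 0.5 and compare with the true labels. The numbers below are made up for illustration and are not model output.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
probs = np.array([0.91, 0.20, 0.55, 0.40])
labels = np.array([1, 0, 0, 1])

preds = (probs > 0.5).astype(int)            # predicted class per passenger
accuracy = float((preds == labels).mean())   # share of correct predictions

print(preds, accuracy)
```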

# thank you 🧸

Let's move on to the practical!

In [None]:
print("Yay")