# Solving the Titanic surival problem

There are many methods to solving the Titanic survival problem, but none of them matter if the data is not analysed thoroughly. The dataset that is given is very small in comparison to other datasets used in machine learning problems, and most would say deep learning is a futile method to use, so extracting the necessary features is the most important step to solving this problem.



## Import basic libraries

The first step is to import the libraries used in all machine learning problems, *NumPy* and *Pandas*.

In [1]:
import numpy as np
import pandas as pd

## Load the data

Next, we need to load the data by using the previously imported *Pandas* library. We load the training and test data into separate variables and then combine them into one. We also save the *'PassengerId'* column of the test set so we can separate the datasets after the feature engineering is complete.

In [2]:
train_data = pd.read_csv('../input/titanic/train.csv')
test_data = pd.read_csv('../input/titanic/test.csv')

test_ids = test_data['PassengerId']

data = pd.concat([train_data, test_data], axis=0)

data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Feature engineering

Before getting deep into analysing the data, we will use the info function to get a basic idea of the values that are contained in each of the feature vectors.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB


We can observe that the columns *'Age', 'Cabin'* and *'Embarked'* have some missing values, which means we will have to find a way to deal with them.

### Drop columns that do not carry valuable information

The *'Cabin'* and *'Ticket'* columns do not carry any useful information about the survival chance of the passengers, so we can just remove them from the dataset.

In [4]:
data = data.drop(['Cabin', 'Ticket'], axis=1)

### Drop the passengers that do not have values for 'Embarked' column

Seeing as there are only two rows which have NaN values in the *'Embarked'* column, they can be safely removed as they won't affect the classification.


In [5]:
data = data[data['Embarked'].notna()]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1307 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1307 non-null   int64  
 1   Survived     889 non-null    float64
 2   Pclass       1307 non-null   int64  
 3   Name         1307 non-null   object 
 4   Sex          1307 non-null   object 
 5   Age          1044 non-null   float64
 6   SibSp        1307 non-null   int64  
 7   Parch        1307 non-null   int64  
 8   Fare         1306 non-null   float64
 9   Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(3)
memory usage: 112.3+ KB


### Extract useful features from the 'Name' column

The *'Name'* column in itself is not very telling to the survival chance of the passenger, but some features may be extracted from it that will definitely aid us in our classification problem. From the name of the passenger we can extract the title as well as the surname, which will help in distinguishing family relations between the passengers.

Additionally, seeing as the surname is a string variable, we encode it using the *LabelEncoder* function from *sklearn*.

In [6]:
from sklearn.preprocessing import LabelEncoder

data['Title'] = [value.split(', ')[1].split('.')[0] for value in data['Name'].values]
data.loc[(data['Title'] == 'Lady') | (data['Title'] == 'Mme') | (data['Title'] == 'Ms') | (data['Title'] == 'the Countess') | (data['Title'] == 'Mlle'), 'Title'] = 'Miss'
data.loc[(data['Title'] != 'Mr') & (data['Title'] != 'Mrs') & (data['Title'] != 'Miss') & (data['Title'] != 'Master'), 'Title'] = 'Other'

encoder = LabelEncoder()
data['Surname'] = encoder.fit_transform([value.split(', ')[0] for value in data['Name'].values])

### Extract the 'Alone' and 'FamilyCount' features

From the *'Parch'* and *'SibSp'* features we can extract information about the passenger's companionship aboard the Titanic, specifically whether they are alone, and if they are not, how many family members they have on-board.

In [7]:
data['Alone'] = np.zeros(data.shape[0], dtype=np.int)
data.loc[(data['SibSp'] == 0) & (data['Parch'] == 0), 'Alone'] = 1

data['FamilyCount'] = data['SibSp'] + data['Parch']

data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title,Surname,Alone,FamilyCount
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,Mr,100,0,1
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,Mrs,182,0,1
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,Miss,329,1,0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,Mrs,267,0,1
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,Mr,15,1,0


### Fill missing values in 'Age' column

The only feature vector that has NaN values left is the *'Age'* column. We can simply fill the missing values with the mean age value of the entire dataset, but to achieve greater accuracy in the predicted age, we use the mean value for each title. Additionally, for the titles of 'Mr', 'Mrs' and 'Miss' we add a gaussian distribution of numbers to the mean value to further improve the accuracy of the predicted age.

In [8]:
np.random.seed(seed=157)

# Find mean values of 'Age' for each 'Title' variable
mean_ages_title = data[['Title', 'Age']].loc[data['Age'].notna()].groupby('Title').mean()
age_nan_titles = data['Title'].loc[data['Age'].isnull()].unique()

# Fill 'Age' values that are missing with mean value for each 'Title' class
for i in range(len(age_nan_titles)):
    if age_nan_titles[i] == 'Mr' or age_nan_titles[i] == 'Mrs' or age_nan_titles[i] == 'Miss':
        data.loc[(data['Age'].isnull()) & (data['Title'] == age_nan_titles[i]), 'Age'] = [np.ceil(mean_ages_title.loc[age_nan_titles[i]].values[0]) + np.random.randint(-6, 6)
                                                                                          for _ in data['Age'].loc[(data['Age'].isnull()) & (data['Title'] == age_nan_titles[i])]]
    else:
        data.loc[(data['Age'].isnull()) & (data['Title'] == age_nan_titles[i]), 'Age'] = [np.ceil(mean_ages_title.loc[age_nan_titles[i]].values[0])
                                                                                          for _ in data['Age'].loc[(data['Age'].isnull()) & (data['Title'] == age_nan_titles[i])]]

### Extract 'AgeRange' feature to transform the numerical 'Age' column to categorical

Now that the missing values for *'Age'* have been filled, we can transform the numerical feature to a more useful categorical one, by dividing the ages into five distinct groups.

In [9]:
_, age_categories = pd.cut(data['Age'], 5, retbins=True)
age_categories = [np.floor(value) for value in age_categories]

data['AgeRange'] = np.zeros(len(data), dtype=np.int)

data.loc[data['Age'] <= age_categories[1], 'AgeRange'] = 0
data.loc[(data['Age'] > age_categories[1]) & (data['Age'] <= age_categories[2]), 'AgeRange'] = 1
data.loc[(data['Age'] > age_categories[2]) & (data['Age'] <= age_categories[3]), 'AgeRange'] = 2
data.loc[(data['Age'] > age_categories[3]) & (data['Age'] <= age_categories[4]), 'AgeRange'] = 3
data.loc[data['Age'] > age_categories[4], 'AgeRange'] = 4

### Extract 'FareRange' feature to transform the numerical 'Fare' column to categorical

We use the same procedure as before for the numerical *'Fare'* feature.

In [10]:
_, fare_categories = pd.qcut(data['Fare'], 4, retbins=True)

data['FareRange'] = np.zeros(len(data), dtype=np.int)
data.loc[data['Fare'] <= fare_categories[1], 'FareRange'] = 0
data.loc[(data['Fare'] > fare_categories[1]) & (data['Fare'] <= fare_categories[2]), 'FareRange'] = 1
data.loc[(data['Fare'] > fare_categories[2]) & (data['Fare'] <= fare_categories[3]), 'FareRange'] = 2
data.loc[data['Fare'] > fare_categories[3], 'FareRange'] = 3

### One-Hot encoding

We convert the categorical features into binary ones to achieve a better and more refined classification.

In [11]:
data = pd.get_dummies(data=data, columns=['Sex', 'Embarked', 'Pclass', 'AgeRange', 'FareRange', 'Title'], dtype=np.int)

### Drop columns that are no longer useful

For the final step we remove the columns that are no longer of use, and with that the feature engineering of the dataset is finally complete.

In [12]:
data = data.drop(['Name', 'Age', 'Fare', 'SibSp', 'Parch'], axis=1)

### Separate data into train, validation and test subsets

We separate the training and test data again, and further extract a new validation dataset from the training set.

In [13]:
from sklearn.model_selection import train_test_split

X_train = data.loc[data['PassengerId'].isin(test_ids) == False]
y_train = X_train['Survived']

X_test = data.loc[data['PassengerId'].isin(test_ids)]

X_train = X_train.drop(['PassengerId', 'Survived'], axis=1)
X_test = X_test.drop(['PassengerId', 'Survived'], axis=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=157, stratify=y_train)

X_train.head()

Unnamed: 0,Surname,Alone,FamilyCount,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,...,AgeRange_4,FareRange_0,FareRange_1,FareRange_2,FareRange_3,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other
14,819,1,0,1,0,0,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
709,542,0,2,0,1,1,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
427,641,1,0,1,0,0,0,1,0,1,...,0,0,0,1,0,0,1,0,0,0
198,473,1,0,1,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,0
442,638,0,1,0,1,0,0,1,0,0,...,0,1,0,0,0,0,0,1,0,0


### Standardization

Before training the MLP Neural Network, to achieve better results it is advised to standardize the data. Even though most of our data is binary, this step still helps with achieving higher accuracy of the final classification.

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

column_names = X_train.columns
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=column_names)

column_names = X_val.columns
X_val = pd.DataFrame(scaler.fit_transform(X_val), columns=column_names)

column_names = X_test.columns
X_test = pd.DataFrame(scaler.fit_transform(X_test), columns=column_names)

### Building the Keras MLP neural network

In [15]:
from keras.models import Sequential
from keras.layers import BatchNormalization, Activation, Dense, Dropout
from keras.initializers import glorot_uniform
from keras.activations import tanh
from keras.constraints import maxnorm

init_tanh = glorot_uniform(seed=157)    # used for tanh and softmax

model = Sequential()

# input and first Dense layer
model.add(Dense(units=X_train.shape[1], kernel_initializer=init_tanh, kernel_constraint=maxnorm(3)))
model.add(BatchNormalization())
model.add(Activation(tanh))

# second Dense layer
model.add(Dense(units=X_train.shape[1] * 2, kernel_initializer=init_tanh, kernel_constraint=maxnorm(3)))
model.add(BatchNormalization())
model.add(Activation(tanh))

# third Dense layer
model.add(Dense(units=X_train.shape[1] * 4, kernel_initializer=init_tanh, kernel_constraint=maxnorm(3)))
model.add(BatchNormalization())
model.add(Activation(tanh))

# fourth Dense layer
model.add(Dense(units=X_train.shape[1], kernel_initializer=init_tanh, kernel_constraint=maxnorm(3)))
model.add(BatchNormalization())
model.add(Activation(tanh))

# fifth Dense layer
model.add(Dense(units=X_train.shape[1] * 4, kernel_initializer=init_tanh, kernel_constraint=maxnorm(3)))
model.add(BatchNormalization())
model.add(Activation(tanh))

# output Dense layer
model.add(Dense(units=1, activation='sigmoid', kernel_initializer=init_tanh, kernel_constraint=maxnorm(3)))

### Setting all random variables to a single seed for reproducibility

In [16]:
import os
import random
import tensorflow as tf

# 1. Set `PYTHONHASHSEED` environment variable at a fixed value

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ['PYTHONHASHSEED'] = str(157)

# 2. Set `python` built-in pseudo-random generator at a fixed value
random.seed(157)

# 3. Set `numpy` pseudo-random generator at a fixed value
np.random.seed(157)

# 4. Set `tensorflow` pseudo-random generator at a fixed value
tf.compat.v1.set_random_seed(157)

# 5. Configure a new global `tensorflow` session
from tensorflow.python.keras import backend as k

session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                        inter_op_parallelism_threads=1)

sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
k.set_session(sess)

### Training the network

In [17]:
from keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard
from keras.optimizers import Adam
from keras.models import save_model

opt = Adam(learning_rate=0.0009, amsgrad=True)

model.compile(optimizer=opt,
            loss='binary_crossentropy',
            metrics=['accuracy'])

checkpoint = ModelCheckpoint('neural_network_checkpoint_training.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')

tensorboard = TensorBoard(log_dir='./logs',
                          histogram_freq=0,
                          write_graph=True,
                          write_images=True)

model.fit(x=X_train, y=y_train,
        epochs=40,
        validation_data=(X_val, y_val),
        batch_size=50,
        shuffle=True,
        callbacks=[tensorboard, checkpoint],
        verbose=1)

save_model(checkpoint.model, 'neural_network_latest_saved.h5')

Epoch 1/40
Epoch 00001: val_loss improved from inf to 0.47811, saving model to neural_network_checkpoint_training.h5
Epoch 2/40
 1/16 [>.............................] - ETA: 0s - loss: 0.5566 - accuracy: 0.7400
Epoch 00002: val_loss improved from 0.47811 to 0.46387, saving model to neural_network_checkpoint_training.h5
Epoch 3/40
 1/16 [>.............................] - ETA: 0s - loss: 0.4855 - accuracy: 0.7200
Epoch 00003: val_loss improved from 0.46387 to 0.44295, saving model to neural_network_checkpoint_training.h5
Epoch 4/40
 1/16 [>.............................] - ETA: 0s - loss: 0.4725 - accuracy: 0.7400
Epoch 00004: val_loss improved from 0.44295 to 0.41858, saving model to neural_network_checkpoint_training.h5
Epoch 5/40
 1/16 [>.............................] - ETA: 0s - loss: 0.4594 - accuracy: 0.8000
Epoch 00005: val_loss did not improve from 0.41858
Epoch 6/40
 1/16 [>.............................] - ETA: 0s - loss: 0.5139 - accuracy: 0.7600
Epoch 00006: val_loss improved f

### The final prediction and submission

In [18]:
from keras.models import load_model

clf = load_model('neural_network_checkpoint_training.h5')

threshold = 0.5

test_prediction = clf.predict(X_test.values).flatten()

test_prediction[test_prediction <= threshold] = 0
test_prediction[test_prediction > threshold] = 1

test_prediction = pd.DataFrame(test_prediction, columns=['Survived'], dtype=np.int)

prediction = pd.concat([test_ids, test_prediction], axis=1)

prediction.to_csv('submission_nn.csv', index=False)

### Conclusion

With this method we achieve a final score of 78.9% on the test dataset according to Kaggle, which puts this classifier in the top 11%. Even though the dataset is considerably small, the MLP Neural Network still manages to produce a satisfying accuracy, which can be further improved with more detailed feature engineering of the data.