# **Titanic**

The objective of this notebook is to predict whether a passenger on board the Titanic survived, given features such as their gender, age and class.

## 1. Preparations

We will begin by installing all of the necessary libraries and loading in the data. Since this project is for a Kaggle competition, I will be using the Kaggle API to download the data and submit my results.

### 1.1. Dependencies

In [None]:
!pip install kaggle 
!pip install scikit-learn

In [None]:
import kaggle 
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

api.competition_download_file('titanic', 'train.csv')
api.competition_download_file('titanic', 'test.csv')

In [71]:
import re
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

### 1.2. Importing Data

In [174]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [175]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [155]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [176]:
train.shape, test.shape

((891, 12), (418, 11))

## 2. Preprocessing Training Data

We can now begin cleaning the data so all of the relevant features are included and formatted in a suitable way to run a machine learning model.

### 2.1. Removing Extraneous Features

In [177]:
train.drop(['PassengerId', 'Name', 'Ticket'], axis = 1, inplace = True)
train.head(10)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S
5,0,3,male,,0,0,8.4583,,Q
6,0,1,male,54.0,0,0,51.8625,E46,S
7,0,3,male,2.0,3,1,21.075,,S
8,1,3,female,27.0,0,2,11.1333,,S
9,1,2,female,14.0,1,0,30.0708,,C


### 2.2. Remove Missing Data

In [178]:
print(train.shape)
train.isnull().sum()

(891, 9)


Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

As can be seen, only 2 passengers have an unknown port of embarkation, but the ages of 177 passengers (about 20% of the rows) and the cabins of 687 passengers (about 77%) are also missing. Dropping every incomplete row would discard most of the dataset, so missing value imputation needs to be conducted instead. But first, I will remove the two passengers with missing embarkation locations.
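The proportions quoted above can be checked directly. The following is a minimal sketch: the `toy` frame is a made-up stand-in for `train`, and `reported` simply re-does the arithmetic on the counts printed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = toy.isnull().mean()
print(missing_share)

# The same arithmetic applied to the counts reported above (891 rows)
reported = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2}) / 891
print(reported.round(3))
```

On the real data, `train.isnull().mean()` would give the per-column shares in one call.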

In [179]:
train.dropna(subset = ['Embarked'], inplace = True)
train.shape

(889, 9)

In [180]:
train.Cabin.unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'C83', 'F33', 'F G73',
       'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101', 'F E69',
       'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4', 'A32',
       'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35', 'C87',
       'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19', 'B49',
       'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54', 'B57 B59 B63 B66',
       'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40', 'T', 'C128',
       'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44', 'A34', 'C104',
       'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14', 'B37', 'C30',
       'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38', 'B39', 'B22',
       'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68', 'B41', 'A20',
       'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48', 'E58', 'C126',
       'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63', 'C62 C64',
       'E24',

In [181]:
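def clean_cabin(string):
    # NaN is the only value that is not equal to itself
    if string != string:
        return 'NaN'
    # keep the first "letter + digits" code, e.g. 'C23 C25 C27' -> 'C23', 'F G73' -> 'G73'
    match = re.search(r"[A-Za-z]\d+", string)
    # cabins without a number (e.g. 'D', 'T') are returned unchanged instead of None
    return match.group() if match else string

def deck_floor(string):
    if string != string:
        return 'NaN'
    # strip the digits to leave the deck letter
    return re.sub(r"[0-9]", "", string)

def room_number(string):
    if string != string:
        return 'NaN'
    # strip everything except the digits to leave the room number
    return re.sub(r"[^0-9]", "", string)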



In [183]:
train['clean_cabin'] = train['Cabin'].apply(lambda x: clean_cabin(x))
#train['Deck'] = train['clean_cabin'].apply(lambda x: deck_floor(x))
#train['Room'] = pd.to_numeric(train['clean_cabin'].apply(lambda x: room_number(x)), errors = 'coerce')
#train.drop(['Cabin', 'clean_cabin'], axis = 1)
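As an alternative to the helper functions, the deck letter and room number can be pulled out with pandas string methods. This is only a sketch on a hand-made `cabins` series; the names `first_code`, `deck` and `room` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# A few cabin values of the kinds seen in train.Cabin.unique()
cabins = pd.Series([np.nan, "C85", "C23 C25 C27", "F G73", "D"])

# First "letter + digits" code in each entry; NaN where there is none
first_code = cabins.str.extract(r"([A-Za-z]\d+)", expand=False)

# Deck letter: taken from the code, falling back to a bare letter such as 'D'
deck = first_code.str[0].fillna(cabins.str.strip().str[0])

# Room number as a float so that missing rooms can stay NaN
room = pd.to_numeric(first_code.str[1:], errors="coerce")

print(pd.DataFrame({"Cabin": cabins, "Deck": deck, "Room": room}))
```

Keeping missing cabins as real NaN values, rather than the string 'NaN', lets `isnull()` and the imputer recognise them later.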


### 2.3. One-Hot Encoding

Three of the variables provided are categorical ('Pclass', 'Sex', 'Embarked') and therefore need to be converted into dummy variables. This gives 12 features in the final machine learning model.

In [23]:
one_hot = pd.get_dummies(train['Pclass'])
train = train.drop(['Pclass'],axis = 1)
train = train.join(one_hot)

one_hot = pd.get_dummies(train['Sex'])
train = train.drop(['Sex'],axis = 1)
train = train.join(one_hot)

one_hot = pd.get_dummies(train['Embarked'])
train = train.drop(['Embarked'],axis = 1)
train = train.join(one_hot)
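The three drop-and-join steps above can also be written as a single `pd.get_dummies` call. A minimal sketch on a made-up `toy` frame follows; note that this form prefixes the new column names (e.g. `Sex_female`), which differs from the bare names produced above.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration)
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 14.0],
})

# Encode all three categorical columns at once; dtype=float gives 0.0/1.0 columns
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex", "Embarked"], dtype=float)
print(encoded.columns.tolist())
```

The prefixed, string-typed column names also avoid the integer column names (1, 2, 3) that the 'Pclass' dummies would otherwise receive.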


### 2.4. Data Imputation

Multiple imputation by chained equations (MICE) was used to fill in the unknown ages of 177 passengers. The method runs several rounds of imputation. The first round fills each missing age with a placeholder (e.g. the mean of the known ages) and then fits a regression model, Bayesian ridge regression by default in scikit-learn's IterativeImputer, that predicts age from the other variables. The model's predictions replace the placeholders, and the process repeats until the estimates settle or an iteration limit is reached. Since there is no missing data for any of the other variables, this method should work quite well.

In [24]:
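# MICE via scikit-learn's IterativeImputer
# the imputer needs purely numeric input, so drop the raw cabin columns if they are still present
train = train.drop(columns = ['Cabin', 'clean_cabin'], errors = 'ignore')
# the Pclass dummy columns are named with integers; scikit-learn expects string column names
train.columns = train.columns.astype(str)
mice_imputer = IterativeImputer()
train_arr = mice_imputer.fit_transform(train)

# convert numpy array back to a DataFrame with the original column names
full_train = pd.DataFrame(train_arr, columns = train.columns)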
full_train.head(10)

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,1,2,3,female,male,C,Q,S
0,0.0,22.0,1.0,0.0,7.25,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,1.0,38.0,1.0,0.0,71.2833,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,1.0,26.0,0.0,0.0,7.925,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
3,1.0,35.0,1.0,0.0,53.1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.0,35.0,0.0,0.0,8.05,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
5,0.0,31.324242,0.0,0.0,8.4583,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
6,0.0,54.0,0.0,0.0,51.8625,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
7,0.0,2.0,3.0,1.0,21.075,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
8,1.0,27.0,0.0,2.0,11.1333,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
9,1.0,14.0,1.0,0.0,30.0708,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
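To see the imputation step in isolation, here is a small sketch on synthetic data. The linear relation between `age` and `fare` is an assumption made purely so the imputer has something to learn; it is not a claim about the Titanic data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: age depends linearly on fare plus noise (illustrative assumption)
rng = np.random.default_rng(0)
fare = rng.uniform(5, 100, size=50)
age = 20 + 0.3 * fare + rng.normal(0, 2, size=50)
X = np.column_stack([age, fare])

# Hide the first ten ages, as if they were unknown
X_missing = X.copy()
X_missing[:10, 0] = np.nan

# Each round regresses the incomplete column on the others and refills the gaps
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Compare the imputed ages with the values that were hidden
mean_abs_error = np.abs(X_filled[:10, 0] - X[:10, 0]).mean()
print(mean_abs_error)
```

Observed entries pass through unchanged; only the hidden ages are replaced by model estimates.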


## 3. Preprocessing Test Data

We now need to prepare the test dataset for the models. Many of the previous steps will be repeated; however, the two datasets are not identical: the test set has no 'Survived' column and is missing one 'Fare' value.

In [59]:
test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

### 3.1. Removing Extraneous Features

In [None]:
test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)
test.head(10)
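One way to carry the remaining steps over to the test set is to align its columns with the training features and reuse an imputer fitted on the training data. The sketch below uses made-up frames `train_X` and `test_X` (training features without 'Survived'); it is an assumed approach, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy training features after one-hot encoding (values are made up)
train_X = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
    "C": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Q": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    "S": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})

# Toy test features: different column order, no 'Q' passengers, a missing fare
test_X = pd.DataFrame({
    "Fare": [7.83, np.nan, 9.69],
    "Age": [34.5, 47.0, np.nan],
    "S": [0.0, 1.0, 1.0],
    "C": [1.0, 0.0, 0.0],
})

# Same columns in the same order as training; absent dummy columns become 0
test_X = test_X.reindex(columns=train_X.columns)
dummy_cols = ["C", "Q", "S"]
test_X[dummy_cols] = test_X[dummy_cols].fillna(0.0)

# Fit on the training features only, then apply the fitted imputer to the test set
imputer = IterativeImputer(random_state=0).fit(train_X)
test_filled = pd.DataFrame(imputer.transform(test_X), columns=train_X.columns)
print(test_filled)
```

Fitting on the training features alone keeps 'Survived' out of the imputer, since that column does not exist in the test set.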

## 4. Modelling

Since we are predicting whether a passenger survived or not (1 or 0), the response variable is binary. We can therefore apply a range of classification techniques and submit the results of the model with the highest accuracy in the cross-validation process. Before we can do that, we need to split our training data further into a training set and a held-out validation set (named `testing` in the code, not to be confused with the Kaggle test set). I will use 75% of the data for training and 25% for validation.

In [34]:
training, testing = train_test_split(full_train, test_size=0.25)
x_training = training.drop('Survived', axis = 1)
y_training = training.Survived

x_testing = testing.drop('Survived', axis = 1)
y_testing = testing.Survived


Finally, we will store each model's accuracy in a dictionary so that the models can be compared.

In [46]:
comparison_dict ={'model':[],
                  'params': [],
                  'k-fold accuracy': [],
                  'validate accuracy': []}

### 4.1. Gradient Boosting

The first model I attempt will be a gradient boosting model, fitted here with scikit-learn's GradientBoostingClassifier rather than the separate XGBoost library. I think it is the most likely to perform best because it is an ensemble model that can be regularised through the learning rate and subsampling.

In [51]:
params = {
    "loss":["deviance"],  # named "log_loss" in newer scikit-learn releases
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
    "min_samples_split": np.linspace(0.1, 0.5, 12),
    "min_samples_leaf": np.linspace(0.1, 0.5, 12),
    "max_depth":[3,5,8],
    "max_features":["log2","sqrt"],
    "criterion": ["friedman_mse"],
    "subsample":[0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
    "n_estimators":[10]
    }

# fitting the model
model = GridSearchCV(GradientBoostingClassifier(), params, cv = 10, n_jobs = -1)
model.fit(x_training, y_training)
# model.score uses the best estimator refit on all of the training data, so this is a training accuracy
print("The training accuracy of the best model was {}".format(model.score(x_training, y_training)))
# mean 10-fold cross-validation accuracy of the best parameter combination
print(model.best_score_)
print(model.best_params_)

# validation on the held-out 25%
y_pred = model.predict(x_testing)
validate_accuracy = accuracy_score(y_testing, y_pred)
print("The accuracy of the validation data is {}".format(validate_accuracy))

0.8063063063063063
{'criterion': 'friedman_mse', 'learning_rate': 0.2, 'loss': 'deviance', 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 10, 'subsample': 1.0}
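The results of a search can then be recorded in the comparison dictionary. Below is a minimal sketch on synthetic data from `make_classification`, with a deliberately small grid; the `search` object, the grid values and the stratified split are illustrative assumptions rather than the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for full_train
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A small grid so the example runs quickly
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    {"learning_rate": [0.05, 0.2], "max_depth": [2, 3]},
    cv=5,
)
search.fit(X_tr, y_tr)

# Same structure as the comparison dictionary defined earlier
comparison_dict = {'model': [], 'params': [],
                   'k-fold accuracy': [], 'validate accuracy': []}

# best_score_ is the mean cross-validated accuracy of the best parameters
comparison_dict['model'].append('GradientBoostingClassifier')
comparison_dict['params'].append(search.best_params_)
comparison_dict['k-fold accuracy'].append(search.best_score_)
comparison_dict['validate accuracy'].append(
    accuracy_score(y_val, search.predict(X_val))
)

comparison = pd.DataFrame(comparison_dict)
print(comparison)
```

Appending one entry per model keeps the four lists the same length, so the dictionary converts cleanly to a DataFrame for side-by-side comparison.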
