Predicting Survival on the Titanic
History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes such as gender, age, and social status, we can make reasonably accurate predictions about which passengers would survive. Some groups, such as women, children, and the upper class, were more likely to survive than others. We can therefore learn about society's priorities and privileges at the time.

Assignment:
Build a Machine Learning pipeline to engineer the features in the data set and predict who is more likely to survive the catastrophe.

Follow the Jupyter notebook below and complete the missing bits of code to achieve each of the pipeline steps.


In [81]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib


In [3]:
# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)
# Prepare the data set
# load the data - it is available open source and online

data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

In [4]:
# display data
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [5]:
# replace interrogation marks by NaN values

data = data.replace('?', np.nan)
# retain only the first cabin if more than
# 1 are available per passenger

In [6]:
def get_first_cabin(row):
    # missing cabins are NaN (a float), which has no .split() method
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        return np.nan
    
data['cabin'] = data['cabin'].apply(get_first_cabin)
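
The same "keep only the first cabin" step can also be written with pandas string methods instead of a row-wise function. This is a minimal sketch on made-up cabin values, not the notebook's data; it assumes cabins are separated by whitespace, as in the rows displayed earlier.

```python
import numpy as np
import pandas as pd

# Toy cabin values (made up for illustration), including a missing entry
cabins = pd.Series(["C22 C26", np.nan, "B5"])

# Split on whitespace and keep the first token; missing values stay missing
first_cabin = cabins.str.split().str[0]
print(first_cabin.tolist())
```

Both approaches leave missing cabins as missing values, so the later imputation step still applies.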
# extracts the title (Mr, Ms, etc) from the name variable


In [8]:

def get_title(passenger):
    # check 'Mrs' before 'Mr', because the pattern 'Mr' also matches inside 'Mrs'
    line = passenger
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
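
As an alternative to a chain of substring checks, the title can be pulled out by position, since the names shown earlier follow the pattern "Surname, Title. Given names". The sketch below assumes that pattern holds; `extract_title` is a hypothetical helper, not part of the notebook. Unlike `get_title`, it returns whatever title it finds rather than grouping the less common ones under 'Other'.

```python
import re

def extract_title(name):
    """Return the text between the comma and the first period, or 'Other'."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Other"

# Names taken from the rows displayed earlier in the notebook
names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
]
titles = [extract_title(n) for n in names]
print(titles)
```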

In [9]:
   
data['title'] = data['name'].apply(get_title)
# cast numerical variables as floats

data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')
# drop unnecessary variables

data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)

In [10]:

# display data
data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,1,female,29.0,0,0,211.3375,B5,S,Miss
1,1,1,male,0.9167,1,2,151.55,C22,S,Master
2,1,0,female,2.0,1,2,151.55,C22,S,Miss
3,1,0,male,30.0,1,2,151.55,C22,S,Mr
4,1,0,female,25.0,1,2,151.55,C22,S,Mrs


In [11]:
# save the data set

data.to_csv('titanic.csv', index=False)


Data Exploration
Find numerical and categorical variables

In [132]:
df=pd.read_csv("titanic.csv")

In [133]:
target = df['survived']
df.drop('survived',axis=1,inplace=True)

vars_cat_cardinality={}
vars_num_cardinality={}
vars_num=[]
vars_cat=[]
integerFeatures=[] 
binaryFeatures=[]

features=list(df)

for fea in features:
    uniqueValues=df[fea].nunique()
    if df[fea].dtype == "float64":  # only float columns are treated as numerical; integer columns count as categorical
        vars_num.append(fea)
        vars_num_cardinality.update({fea:uniqueValues})        
    else:
        vars_cat.append(fea)
        vars_cat_cardinality.update({fea:uniqueValues})
    if uniqueValues==2:
        binaryFeatures.append(fea)
    else:
        integerFeatures.append(fea)
        

print('Number of numerical variables: {}'.format(len(vars_num)))
print('Number of categorical variables: {}'.format(len(vars_cat)))      

Number of numerical variables: 2
Number of categorical variables: 7


Find missing values in variables



In [134]:
# first in numerical variables
for var in vars_num:
    print(var)
    print(df[var].isnull().mean()*100)
    

age
20.091673032849503
fare
0.07639419404125286


In [135]:
# now in categorical variables
for var in vars_cat:
    print(var)
    print(df[var].isnull().mean()*100)
    

pclass
0.0
sex
0.0
sibsp
0.0
parch
0.0
cabin
77.46371275783041
embarked
0.15278838808250572
title
0.0


# Determine cardinality of categorical variables


In [136]:
print('Cardinality of numerical variables: {}'.format(vars_num_cardinality))
print('Cardinality of categorical variables: {}'.format(vars_cat_cardinality))

Cardinality of numerical variables: {'age': 98, 'fare': 281}
Cardinality of categorical variables: {'pclass': 3, 'sex': 2, 'sibsp': 7, 'parch': 8, 'cabin': 181, 'embarked': 3, 'title': 5}


Determine the distribution of numerical variables
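
No code accompanies this step in the notebook, so here is one possible way to approach it. The sketch uses synthetic stand-ins for `age` and `fare` (random numbers, not Titanic data) and prints summary statistics, skewness, and histogram bin counts as text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the numerical columns (not real Titanic data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(30, 12, size=500).clip(0.2, 80),
    "fare": rng.exponential(30, size=500),
})

# Summary statistics: count, mean, std, min, quartiles, max
summary = toy.describe()
print(summary)

# Skewness hints at whether a variable is roughly symmetric or long-tailed
skewness = toy.skew()
print(skewness)

# Bin counts give a text-only view of each histogram
for col in toy.columns:
    counts, edges = np.histogram(toy[col], bins=10)
    print(col, counts.tolist())
```

In the notebook itself, `df[vars_num].hist(bins=30)` followed by `plt.show()` would draw the corresponding plots.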
 
Separate data into train and test
Use the code below for reproducibility. Don't change it.

In [137]:
X_train, X_test, y_train, y_test = train_test_split(
    df,  # predictors
    target,  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility



 

In [138]:
X_train.shape, X_test.shape

((1047, 9), (262, 9))

# Feature Engineering
Extract only the letter (and drop the number) from the variable Cabin

In [139]:
# first, impute the NaN values with the string '0'
df['cabin'] = df['cabin'].fillna('0')

In [140]:
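df['CabinType'] = df['cabin'].str.slice(0,1)
# note: drop() returns a new DataFrame and the result is not assigned here,
# so 'cabin' stays in df and is encoded later (see the cabin_* columns in the output further down)
df.drop('cabin',axis=1)
vars_cat.append("CabinType")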

# Fill in Missing data in numerical variables:
Add a binary missing indicator
Fill NA in original variable with the median

In [141]:
# note: this cell only performs the median imputation; it does not add a missing indicator
for num in vars_num:
    df[num] = df[num].fillna(df[num].median())
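
The heading above lists two steps: a binary missing indicator, then median imputation. A minimal sketch of both steps on a toy frame follows. The values are made up, and the `_na` suffix for the indicator columns is an arbitrary naming choice, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are made up)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "fare": [7.25, 8.05, np.nan, 10.0],
})

for col in ["age", "fare"]:
    # 1 where the value was missing, 0 otherwise
    toy[col + "_na"] = toy[col].isnull().astype(int)
    # fill the gaps with the column median (computed on the non-missing values)
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The indicator has to be created before the fill, otherwise the information about which rows were missing is lost.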

# Fill in Missing data in categorical variables:

 
Replace Missing data in categorical variables with the string Missing

In [142]:
for cat in vars_cat:
    df[cat] = df[cat].fillna("Missing")
    
    

Remove rare labels in categorical variables
Replace labels present in less than 5% of the passengers with a single 'RareValues' label

In [143]:
totalRows=len(df)
for fea in vars_cat:
    valuePercent=df[fea].value_counts()*100/totalRows
    df.loc[df[fea].isin(valuePercent[valuePercent < 5].index), fea] = 'RareValues'
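
To see the rare-label grouping in isolation, here is a small sketch on a single made-up column. It uses `value_counts(normalize=True)` to get fractions directly; the placeholder name `'Rare'` is arbitrary.

```python
import pandas as pd

# Toy column: 'a' is 60%, 'b' is 36%, 'c' is 4% of 50 rows (made-up data)
labels = pd.Series(["a"] * 30 + ["b"] * 18 + ["c"] * 2)

# Relative frequency of each label
freq = labels.value_counts(normalize=True)

# Labels that fall below the 5% threshold
rare = freq[freq < 0.05].index

# Keep frequent labels, replace rare ones with a single placeholder
grouped = labels.where(~labels.isin(rare), "Rare")
print(grouped.value_counts())
```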
    

In [144]:
for fea in features:
    print(df[fea].value_counts()*100/totalRows)

3    54.163484
1    24.675325
2    21.161192
Name: pclass, dtype: float64
male      64.400306
female    35.599694
Name: sex, dtype: float64
28.0000    22.536287
24.0000     3.590527
22.0000     3.284950
21.0000     3.132162
30.0000     3.055768
             ...    
0.3333      0.076394
22.5000     0.076394
70.5000     0.076394
0.6667      0.076394
26.5000     0.076394
Name: age, Length: 98, dtype: float64
0             68.067227
1             24.369748
RareValues     7.563025
Name: sibsp, dtype: float64
0             76.546982
1             12.987013
2              8.632544
RareValues     1.833461
Name: parch, dtype: float64
8.0500     4.583652
13.0000    4.507257
7.7500     4.201681
26.0000    3.819710
7.8958     3.743316
             ...   
15.0500    0.076394
9.6875     0.076394
15.5792    0.076394
12.0000    0.076394
7.8750     0.076394
Name: fare, Length: 281, dtype: float64
0             77.463713
RareValues    22.536287
Name: cabin, dtype: float64
S             69.824293
C      


Perform one hot encoding of categorical variables into k-1 binary variables
k-1 means that if the variable contains 9 different categories, we create 8 binary variables
Remember to drop the original categorical variable (the one with the strings) after the encoding
 

In [145]:
dfE = pd.get_dummies(df, columns=vars_cat)
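
Called without `drop_first`, `pd.get_dummies` creates k columns per variable rather than k-1. A minimal sketch of the k-1 variant on made-up data, using the `drop_first=True` option:

```python
import pandas as pd

# Toy frame with two categorical columns (made-up rows)
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],
    "sex": ["male", "female", "female", "male"],
})

# drop_first=True drops the first level of each variable, leaving k-1 columns;
# the original string columns are removed automatically
encoded = pd.get_dummies(toy, columns=["embarked", "sex"], drop_first=True)
print(list(encoded.columns))
```

Here `embarked` has 3 categories and yields 2 columns, and `sex` has 2 categories and yields 1 column.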




# Scale the variables
Use the standard scaler from Scikit-learn

In [146]:
# define which columns to transform and which to leave unchanged
ct = ColumnTransformer([('scale', StandardScaler(), vars_num)], remainder='passthrough')

# apply the transformation to the DataFrame
dfS = ct.fit_transform(dfE)

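# convert the transformed array back to a DataFrame
# (ColumnTransformer outputs the scaled columns first, then the passthrough ones;
# this matches list(dfE) only because 'age' and 'fare' are already the first columns of dfE)
dfS = pd.DataFrame(dfS, columns=list(dfE))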
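
The cell above fits the scaler on the full data set before the train/test split. A common alternative, sketched here on synthetic data as an assumption rather than as the assignment's required approach, is to learn the mean and standard deviation from the training rows only and reuse them for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numerical columns (not real Titanic data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(30, 10, 100),
    "fare": rng.exponential(30, 100),
})

train, test = train_test_split(toy, test_size=0.2, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows
test_scaled = scaler.transform(test)        # apply the same parameters to the test rows

print(train_scaled.mean(axis=0).round(6))
print(test_scaled.shape)
```

With this arrangement the test set does not influence the scaling parameters.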



In [147]:
list(dfS)

['age',
 'fare',
 'pclass_1',
 'pclass_2',
 'pclass_3',
 'sex_female',
 'sex_male',
 'sibsp_0',
 'sibsp_1',
 'sibsp_RareValues',
 'parch_0',
 'parch_1',
 'parch_2',
 'parch_RareValues',
 'cabin_0',
 'cabin_RareValues',
 'embarked_C',
 'embarked_Q',
 'embarked_RareValues',
 'embarked_S',
 'title_Miss',
 'title_Mr',
 'title_Mrs',
 'title_RareValues',
 'CabinType_0',
 'CabinType_C',
 'CabinType_RareValues']

Train the Logistic Regression model
Set the regularization parameter to 0.0005
Set the seed to 0

In [148]:
X_train, X_test, y_train, y_test = train_test_split(
    dfS,  # predictors
    target,  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility



 

In [149]:
dfS

Unnamed: 0,age,fare,pclass_1,pclass_2,pclass_3,sex_female,sex_male,sibsp_0,sibsp_1,sibsp_RareValues,...,embarked_Q,embarked_RareValues,embarked_S,title_Miss,title_Mr,title_Mrs,title_RareValues,CabinType_0,CabinType_C,CabinType_RareValues
0,-0.039005,3.442584,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,-2.215952,2.286639,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,-2.131977,2.286639,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.038512,2.286639,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,-0.349075,2.286639,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,-1.163009,-0.364003,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1305,-0.116523,-0.364003,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1306,-0.232799,-0.503774,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1307,-0.194040,-0.503774,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [150]:
# Set the seed
np.random.seed(0)

# Initialize the logistic regression model
logreg = LogisticRegression(penalty='l2', C=1/0.0005, random_state=0)

# Fit the model to your data
logreg.fit(X_train, y_train)


LogisticRegression(C=2000.0, random_state=0)
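
In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. The instruction "set the regularization parameter to 0.0005" can therefore be read either as `C=0.0005` or as a strength of 0.0005, i.e. `C=1/0.0005`, which is the reading the cell above takes. The sketch below compares the two on synthetic data; `max_iter=1000` is an added assumption to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (not Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C: strong regularization, coefficients shrink towards zero
strong = LogisticRegression(C=0.0005, max_iter=1000, random_state=0).fit(X, y)

# Large C: weak regularization, coefficients are freer to grow
weak = LogisticRegression(C=1 / 0.0005, max_iter=1000, random_state=0).fit(X, y)

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(norm_strong, norm_weak)
```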

# Make predictions and evaluate model performance

In [151]:
# Make predictions on new data
y_pred = logreg.predict(X_test)


Determine:

## roc-auc



In [156]:
# Compute the AUC-ROC score
y_prob = logreg.predict_proba(X_test)[:,1]
auc_roc = roc_auc_score(y_test, y_prob)
print("The auc_roc score of the model is :",auc_roc)

The auc_roc score of the model is : 0.8584259259259259


## accuracy
Important: to determine the accuracy, you need the predicted outcome (0 or 1, that is, did not survive or survived), whereas to determine the roc-auc you need the predicted probability of survival.

In [157]:
# Compute the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("The accuracy of the model is :",accuracy)

The accuracy of the model is : 0.8053435114503816
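
The difference between the inputs of the two metrics can be seen on a four-row toy example. The probabilities are made up, and the 0.5 threshold is the usual default, stated here as an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # made-up probabilities of class 1

# accuracy needs hard 0/1 labels, obtained here by thresholding at 0.5
y_label = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)     # uses the ranking of the probabilities
acc = accuracy_score(y_true, y_label)   # uses the thresholded labels
print(auc, acc)
```

Three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75, and three of the four thresholded labels are correct, giving an accuracy of 0.75.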



 
That's it! Well done.

Keep this code safe, as we will use this notebook later on to build production code in our next assignment!