## Predicting Survival on the Titanic

### History
One of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 people on board. By analysing the probability of survival based on a few attributes such as gender, age, and social status, we can make reasonably accurate predictions about which passengers would survive. Some groups, such as women, children, and the upper class, were more likely to survive than others, which tells us something about society's priorities and privileges at the time.

### Assignment:

Build a machine learning pipeline that engineers the features in the data set and predicts who was more likely to survive the catastrophe.

Follow the Jupyter notebook below and complete the missing bits of code to carry out each of the pipeline steps.

In [2]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

## Prepare the data set

In [3]:
# load the data - it is available open source and online

data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

# display data
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [4]:
# replace question marks with NaN values

data = data.replace('?', np.nan)

In [4]:
# retain only the first cabin if more than
# 1 are available per passenger

def get_first_cabin(row):
    try:
        return row.split()[0]
    except (AttributeError, IndexError):  # NaN has no .split(); a blank string splits to []
        return np.nan
    
data['cabin'] = data['cabin'].apply(get_first_cabin)
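To make the behaviour of this step concrete, here is a minimal sketch on a few hand-written values (the `toy_cabins` series is hypothetical, not taken from the data set). It assumes the same logic as `get_first_cabin` above: strings are split on whitespace and only the first piece is kept, while missing values pass through as NaN.

```python
import numpy as np
import pandas as pd

def get_first_cabin(value):
    # strings are split on whitespace; NaN is a float and has no .split()
    try:
        return value.split()[0]
    except (AttributeError, IndexError):
        return np.nan

# hypothetical sample values: two cabins, one cabin, missing
toy_cabins = pd.Series(['C22 C26', 'B5', np.nan])
first_cabins = toy_cabins.apply(get_first_cabin)
print(first_cabins.tolist())
```

The first entry is reduced to `C22`, the second is unchanged, and the missing entry stays missing.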

In [5]:
# extract the title (Mr, Mrs, etc.) from the name variable

def get_title(passenger):
    line = passenger
    # check 'Mrs' before 'Mr': the pattern 'Mr' also matches inside 'Mrs'
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
    
data['title'] = data['name'].apply(get_title)
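The order of the checks in `get_title` matters, because the pattern `Mr` is also found inside `Mrs`. The sketch below repeats the same logic on four names from the table above plus one hypothetical name (`Smith, Dr. John`) to illustrate the `Other` branch.

```python
import re

def get_title(passenger):
    # same order of checks as in the notebook cell
    if re.search('Mrs', passenger):
        return 'Mrs'
    elif re.search('Mr', passenger):
        return 'Mr'
    elif re.search('Miss', passenger):
        return 'Miss'
    elif re.search('Master', passenger):
        return 'Master'
    return 'Other'

names = [
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Allison, Mr. Hudson Joshua Creighton',
    'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
    'Smith, Dr. John',  # hypothetical name, not from the data set
]
titles = [get_title(name) for name in names]
print(titles)

# 'Mr' matches a 'Mrs' name too, which is why 'Mrs' is tested first
mr_matches_mrs = re.search('Mr', names[3]) is not None
print(mr_matches_mrs)
```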

In [6]:
# cast numerical variables as floats

data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')

In [7]:
# drop unnecessary variables

data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)

# display data
data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,1,female,29.0,0,0,211.3375,B5,S,Miss
1,1,1,male,0.9167,1,2,151.55,C22 C26,S,Master
2,1,0,female,2.0,1,2,151.55,C22 C26,S,Miss
3,1,0,male,30.0,1,2,151.55,C22 C26,S,Mr
4,1,0,female,25.0,1,2,151.55,C22 C26,S,Mrs


In [8]:
# save the data set

data.to_csv('titanic.csv', index=False)

## Data Exploration

### Find numerical and categorical variables

In [9]:
target = 'survived'

In [12]:
vars_cat = [var for var in data.columns if data[var].dtype == 'O']
vars_num = data.select_dtypes(include=['number']).columns.tolist()

print('Number of numerical variables: {}'.format(len(vars_num)))
print('Number of categorical variables: {}'.format(len(vars_cat)))

Number of numerical variables: 6
Number of categorical variables: 4
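Note that the count of 6 numerical variables includes the target `survived`. If the lists are later used to select predictors, one option is to exclude the target explicitly. The sketch below shows this on a tiny hypothetical frame (`toy`); it uses `select_dtypes` for both lists, which is an alternative to the dtype comparison in the cell above, not the notebook's own code.

```python
import pandas as pd

# hypothetical miniature of the data set
toy = pd.DataFrame({
    'survived': [1, 0],
    'age': [29.0, 2.0],
    'sex': ['female', 'male'],
})
target = 'survived'

# everything that is not numeric is treated as categorical
vars_cat = toy.select_dtypes(exclude='number').columns.tolist()

# numeric columns, leaving the target out of the predictor list
vars_num = [
    var for var in toy.select_dtypes(include='number').columns
    if var != target
]

print(vars_cat, vars_num)
```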


### Find missing values in variables

In [23]:
# first in numerical variables
missing_nums = data[data.isnull().any(axis=1)][data.dtypes[data.dtypes != 'object'].index]
missing_nums



Unnamed: 0,pclass,survived,age,sibsp,parch,fare
9,1,0,71.0,0,0,49.5042
13,1,1,26.0,0,0,78.8500
15,1,0,,0,0,25.9250
23,1,1,42.0,0,0,227.5250
25,1,0,25.0,0,0,26.0000
...,...,...,...,...,...,...
1304,3,0,14.5,1,0,14.4542
1305,3,0,,1,0,14.4542
1306,3,0,26.5,0,0,7.2250
1307,3,0,27.0,0,0,7.2250


In [44]:
data['fare']

0       0
1       0
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: fare, Length: 1309, dtype: int32

In [24]:
# now in categorical variables
missing_cats = data[data.isnull().any(axis=1)][data.dtypes[data.dtypes == 'object'].index]
missing_cats


Unnamed: 0,sex,cabin,embarked,title
9,male,,C,Mr
13,female,,S,Miss
15,male,,S,Mr
23,female,,C,Miss
25,male,,C,Mr
...,...,...,...,...
1304,female,,C,Miss
1305,female,,C,Miss
1306,male,,C,Mr
1307,male,,C,Mr
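The two cells above list the rows that contain at least one missing value. A more compact summary, sketched here on a hypothetical frame (`toy`), is the share of missing values per variable. `isnull().mean()` works because the boolean mask is averaged column by column.

```python
import numpy as np
import pandas as pd

# hypothetical values, chosen only to illustrate the calculation
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'cabin': ['B5', np.nan, np.nan, np.nan],
    'fare': [211.3, 151.55, 151.55, 26.0],
})

# fraction of missing values in each column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)

# names of the variables that have at least one missing value
vars_with_na = missing_share[missing_share > 0].index.tolist()

print(missing_share)
print(vars_with_na)
```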


### Determine cardinality of categorical variables


In [27]:
#cat_vars = df.select_dtypes(include=['object']).columns
cardinality = {var: data[var].nunique() for var in vars_cat}
cardinality

{'sex': 2, 'cabin': 186, 'embarked': 3, 'title': 5}

### Determine the distribution of numerical variables

In [29]:
num_stats = data[vars_num].describe()
num_stats


Unnamed: 0,pclass,survived,age,sibsp,parch,fare
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668
min,1.0,0.0,0.1667,0.0,0.0,0.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958
50%,3.0,0.0,28.0,0.0,0.0,14.4542
75%,3.0,1.0,39.0,1.0,0.0,31.275
max,3.0,1.0,80.0,8.0,9.0,512.3292


## Separate data into train and test

Use the code below for reproducibility. Don't change it.

In [52]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1047, 16), (262, 16))

## Feature Engineering

### Extract only the letter (and drop the number) from the variable Cabin

In [31]:
data['cabin'] = data['cabin'].str.extract('([A-Za-z])', expand=False)
data['cabin']


0         B
1         C
2         C
3         C
4         C
       ... 
1304    NaN
1305    NaN
1306    NaN
1307    NaN
1308    NaN
Name: cabin, Length: 1309, dtype: object
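As a quick check of the regular expression, the sketch below applies the same `str.extract` call to a few hypothetical cabin strings. The pattern `([A-Za-z])` captures the first letter found in each string, and missing values remain NaN.

```python
import numpy as np
import pandas as pd

# hypothetical cabin values, including a missing one
toy_cabins = pd.Series(['C22', 'B5', np.nan, 'F G63'])

# keep only the first letter that appears in each string
cabin_letters = toy_cabins.str.extract('([A-Za-z])', expand=False)
print(cabin_letters.tolist())
```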

### Fill in Missing data in numerical variables:

- Add a binary missing indicator
- Fill NA in original variable with the median

In [39]:
# add a binary missing indicator
data['age_missing'] = np.where(data['age'].isnull(), 1, 0)
data['fare_missing'] = np.where(data['fare'].isnull(), 1, 0)

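# fill NA with the median
# (assign the result back; calling fillna with inplace=True on a
# selected column is deprecated in recent pandas versions)
median_age = data['age'].median()
data['age'] = data['age'].fillna(median_age)

median_fare = data['fare'].median()
data['fare'] = data['fare'].fillna(median_fare)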

# print the modified variables
print(data[['age', 'age_missing']].head(10))
print(data[['fare', 'fare_missing']].head(10))

       age  age_missing
0  29.0000            0
1   0.9167            0
2   2.0000            0
3  30.0000            0
4  25.0000            0
5  48.0000            0
6  63.0000            0
7  39.0000            0
8  53.0000            0
9  71.0000            0
   fare  fare_missing
0     0             0
1     0             0
2     0             0
3     0             0
4     0             0
5     0             0
6     0             0
7     0             0
8     0             0
9     0             0
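The cell above computes the medians on the full `data` frame. A common alternative, sketched here as an assumption rather than as the notebook's method, is to learn the median from the training set only and apply it to both sets, so that no information from the test set leaks into the imputation. The frames `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical train and test sets with missing ages
toy_train = pd.DataFrame({'age': [20.0, np.nan, 40.0, 30.0]})
toy_test = pd.DataFrame({'age': [np.nan, 50.0]})

for var in ['age']:
    # the median is learned from the training set only
    median_val = toy_train[var].median()

    for df in (toy_train, toy_test):
        # binary indicator: 1 where the value was missing
        df[var + '_missing'] = df[var].isnull().astype(int)
        # fill the gaps with the training median
        df[var] = df[var].fillna(median_val)

print(toy_train)
print(toy_test)
```

The indicator is created before the fill, because after imputation the missing positions can no longer be recovered.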


### Replace Missing data in categorical variables with the string **Missing**

In [40]:
data['sex'] = data['sex'].fillna('Missing')
data['sex']


0       female
1         male
2       female
3         male
4       female
         ...  
1304    female
1305    female
1306      male
1307      male
1308      male
Name: sex, Length: 1309, dtype: object

In [35]:
data['cabin'] = data['cabin'].fillna('Missing')

In [37]:
data['embarked'] = data['embarked'].fillna('Missing')

### Remove rare labels in categorical variables

- group labels present in fewer than 5% of the passengers under the label **Rare**

In [45]:
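for var in vars_cat:
    # relative frequency of each label in the variable
    freq = data[var].value_counts(normalize=True)
    # labels shared by fewer than 5% of the passengers
    rare_labels = freq[freq < 0.05].index
    data[var] = data[var].replace(rare_labels, 'Rare')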

# print the modified variables
print(data[vars_cat].head(10))

      sex    cabin embarked title
0  female     Rare        S  Miss
1    male        C        S  Rare
2  female        C        S  Miss
3    male        C        S    Mr
4  female        C        S   Mrs
5    male     Rare        S    Mr
6  female     Rare        S  Miss
7    male     Rare        S    Mr
8  female        C        S   Mrs
9    male  Missing        C    Mr
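The loop above decides which labels are rare from the full data set. A variant that fits the train/test setup, given here as a sketch under that assumption, learns the frequent labels from the training set and maps everything else to `Rare` in both sets. This also handles labels that only appear in the test set. The frames and label counts are hypothetical.

```python
import pandas as pd

# hypothetical training set: 'T' appears in 1 of 30 rows (about 3.3%)
toy_train = pd.DataFrame({'cabin': ['Missing'] * 20 + ['C'] * 9 + ['T']})
# hypothetical test set: 'G' never appears in the training set
toy_test = pd.DataFrame({'cabin': ['C', 'G', 'T']})

# labels carried by at least 5% of the training passengers
freq = toy_train['cabin'].value_counts(normalize=True)
frequent_labels = freq[freq >= 0.05].index

for df in (toy_train, toy_test):
    # keep frequent labels, replace everything else with 'Rare'
    df['cabin'] = df['cabin'].where(df['cabin'].isin(frequent_labels), 'Rare')

print(toy_test['cabin'].tolist())
```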


### Perform one hot encoding of categorical variables into k-1 binary variables

- k-1 means that if the variable contains 9 different categories, we create 8 binary variables
- Remember to drop the original categorical variable (the one with the strings) after the encoding

In [46]:
for var in vars_cat:
    encoded_vars = pd.get_dummies(data[var], prefix=var, drop_first=True)
    data = pd.concat([data, encoded_vars], axis=1)
    data.drop(var, axis=1, inplace=True)

# print the modified dataset
print(data.head(10))

   pclass  survived      age  sibsp  parch  fare  age_missing  fare_missing  \
0       1         1  29.0000      0      0     0            0             0   
1       1         1   0.9167      1      2     0            0             0   
2       1         0   2.0000      1      2     0            0             0   
3       1         0  30.0000      1      2     0            0             0   
4       1         0  25.0000      1      2     0            0             0   
5       1         1  48.0000      0      0     0            0             0   
6       1         1  63.0000      1      0     0            0             0   
7       1         0  39.0000      0      0     0            0             0   
8       1         1  53.0000      2      0     0            0             0   
9       1         0  71.0000      0      0     0            0             0   

   sex_male  cabin_Missing  cabin_Rare  embarked_Q  embarked_Rare  embarked_S  \
0         0              0           1           
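When the encoding is applied to train and test sets separately, the two sets can end up with different dummy columns if a category is absent from one of them. The sketch below shows one way to handle this, assuming hypothetical frames `toy_train` and `toy_test`: encode the training set with `drop_first=True`, then encode the test set in full and reindex it to the training columns.

```python
import pandas as pd

# hypothetical sets: the test set only contains the category 'S'
toy_train = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
toy_test = pd.DataFrame({'embarked': ['S', 'S']})

# k-1 encoding on the training set ('C' is the dropped category)
train_enc = pd.get_dummies(
    toy_train, columns=['embarked'], drop_first=True, dtype=int
)

# encode all test categories, then align to the training columns;
# columns missing from the test set are filled with 0
test_enc = pd.get_dummies(
    toy_test, columns=['embarked'], dtype=int
).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Using `drop_first=True` directly on this test set would drop its only category and leave no dummy columns at all, which is why the test set is encoded in full before aligning.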

### Scale the variables

- Use the standard scaler from Scikit-learn

In [50]:
from sklearn.preprocessing import StandardScaler

# create an instance of the StandardScaler
scaler = StandardScaler()

# select the numerical variables to scale
x_ = ['age', 'fare'] # only numerical variables can be scaled

# fit the scaler to the data and transform the selected variables
data[x_] = scaler.fit_transform(data[x_])
data[x_]

Unnamed: 0,age,fare
0,-0.039005,-0.02765
1,-2.215952,-0.02765
2,-2.131977,-0.02765
3,0.038512,-0.02765
4,-0.349075,-0.02765
...,...,...
1304,-1.163009,-0.02765
1305,-0.116523,-0.02765
1306,-0.232799,-0.02765
1307,-0.194040,-0.02765
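The cell above fits the scaler on the full `data` frame. A frequently used alternative, sketched here as an assumption and not as the notebook's method, is to fit the scaler on the training set and reuse the learned mean and standard deviation for the test set. The frames `toy_train` and `toy_test` and the list `num_vars` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical train and test sets
toy_train = pd.DataFrame({'age': [20.0, 30.0, 40.0], 'fare': [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({'age': [30.0, 50.0], 'fare': [20.0, 5.0]})
num_vars = ['age', 'fare']

# learn mean and standard deviation from the training set only
scaler = StandardScaler()
scaler.fit(toy_train[num_vars])

# apply the same transformation to both sets
toy_train[num_vars] = scaler.transform(toy_train[num_vars])
toy_test[num_vars] = scaler.transform(toy_test[num_vars])

print(scaler.mean_)
print(toy_test)
```

A test value equal to the training mean is mapped to 0, while the scaled test set as a whole need not have zero mean.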


## Train the Logistic Regression model

- Set the regularization parameter `C` (the inverse of the regularization strength) to 0.0005
- Set the seed to 0

In [53]:
from sklearn.linear_model import LogisticRegression

# Create a logistic regression object; C is the inverse of the
# regularization strength, so C=0.0005 regularizes very strongly
logreg = LogisticRegression(C=0.0005, random_state=0)

# Fit the model to the training data
logreg.fit(X_train, y_train)

# Predict the target variable for the testing data
y_pred = logreg.predict(X_test)
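In scikit-learn, `C` is the inverse of the regularization strength, so a value as small as 0.0005 shrinks the coefficients heavily. The sketch below illustrates this on synthetic data (the arrays `X` and `y` are generated here and have nothing to do with the Titanic data) by comparing the total absolute coefficient size for two values of `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data: the first feature drives the outcome, plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

coef_size = {}
for C in (0.0005, 1.0):
    model = LogisticRegression(C=C, random_state=0).fit(X, y)
    # sum of absolute coefficients as a simple measure of shrinkage
    coef_size[C] = np.abs(model.coef_).sum()

print(coef_size)
```

The smaller `C` gives the smaller coefficients, which pulls the predicted probabilities towards the base rate of the target.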

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important: to determine the accuracy you need the predicted class (0 or 1, i.e. did not survive or survived), but to determine the roc-auc you need the predicted probability of survival.**

In [54]:
from sklearn.metrics import roc_auc_score, accuracy_score

# Predict the target variable and probabilities for the testing data
y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]

# Compute the roc-auc score and accuracy
roc_auc = roc_auc_score(y_test, y_prob)
accuracy = accuracy_score(y_test, y_pred)

# Print the roc-auc score and accuracy
print("ROC-AUC score: {:.2f}".format(roc_auc))
print("Accuracy score: {:.2f}".format(accuracy))

ROC-AUC score: 0.84
Accuracy score: 0.62
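The two metrics can diverge because accuracy depends on the 0.5 threshold applied to the probabilities, whereas roc-auc only depends on how the probabilities rank the passengers. The reported accuracy of 0.62 is close to the share of non-survivors (the mean of `survived` above is about 0.38), which would be consistent with a strongly regularized model predicting class 0 for almost everyone; that reading is an inference, not something the notebook verifies. The sketch below reproduces the effect with hypothetical labels and probabilities.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# hypothetical outcomes and predicted survival probabilities:
# survivors get the highest probabilities, but all stay below 0.5
y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([0.30, 0.32, 0.35, 0.40, 0.45])

# class predictions at the usual 0.5 threshold: everyone is class 0
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)      # ranking quality
acc = accuracy_score(y_true, y_pred)     # share of correct classes

print("ROC-AUC score: {:.2f}".format(auc))
print("Accuracy score: {:.2f}".format(acc))
```

Here the ranking is perfect, yet the accuracy equals the share of class 0, because no probability crosses the threshold.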


That's it! Well done.

**Keep this code safe, as we will use this notebook later on to build production code in our next assignment!**