## Predicting Survival on the Titanic

### History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes such as gender, age, and social status, we can predict fairly accurately which passengers survived. Some groups of people, such as women, children, and the upper class, were more likely to survive than others. The data therefore also tells us something about society's priorities and privileges at the time.

### Assignment:

Build a machine learning pipeline to engineer the features in the data set and predict who is more likely to survive the catastrophe.

Follow the Jupyter notebook below and complete the missing bits of code to achieve each of the pipeline steps.

In [1]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

## Prepare the data set

In [2]:
# load the data - it is available open source and online

data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

# display data
data.head()


In [None]:
# replace question marks with NaN values

data = data.replace('?', np.nan)
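
As an alternative to replacing the placeholder after loading, `pd.read_csv` can treat `'?'` as missing while parsing via its `na_values` argument. The sketch below uses a small in-memory CSV (the rows are made up for illustration, not taken from the Titanic file):

```python
import io

import pandas as pd

# illustrative CSV text; '?' marks the missing entries
csv_text = "age,cabin\n29,B5\n?,C22 C26\n2,?\n"

# na_values tells the parser which strings to read as NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

n_missing = toy.isnull().sum()
```

A side effect of this approach is that `age` is parsed as a float column straight away, because it no longer contains any strings.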

In [None]:
# retain only the first cabin if more than
# 1 are available per passenger

def get_first_cabin(row):
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        # missing values are floats (NaN) and have no .split()
        return np.nan

data['cabin'] = data['cabin'].apply(get_first_cabin)

In [None]:
# extract the title (Mr, Mrs, etc.) from the name variable

def get_title(passenger):
    line = passenger
    # check 'Mrs' before 'Mr', because 'Mr' is a substring of 'Mrs'
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
    
data['title'] = data['name'].apply(get_title)
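
The order of the checks in `get_title` matters because a plain search for `'Mr'` also matches names containing `'Mrs'`. A possible alternative, sketched below, captures whatever sits between the comma and the first period. The helper name `extract_title` and the sample names are assumptions for illustration; the sketch relies on names following the `Surname, Title. Given names` pattern.

```python
import re

def extract_title(name):
    # hypothetical helper: capture the token between the comma and the first period
    match = re.search(r',\s*([^.,]+)\.', name)
    return match.group(1).strip() if match else 'Other'

# illustrative names in the "Surname, Title. Given names" format
names = [
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
    'Anderson, Mr. Harry',
    'Brewe, Dr. Arthur Jackson',
]
titles = [extract_title(n) for n in names]

# the pitfall that the if/elif ordering guards against
mr_matches_mrs = re.search('Mr', 'Allison, Mrs. Hudson J C') is not None
```

Unlike `get_title`, this variant returns every title it finds (for example `Dr`), so uncommon titles would still need to be grouped afterwards.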

In [None]:
# cast numerical variables as floats

data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')

In [None]:
# drop unnecessary variables

data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)

# display data
data.head()

In [None]:
# save the data set

data.to_csv('titanic.csv', index=False)

## Data Exploration

### Find numerical and categorical variables

In [None]:
target = 'survived'

In [None]:
# numerical variables, excluding the target
vars_num = [var for var in data.columns if data[var].dtypes != 'O' and var != target]

# categorical variables
vars_cat = [var for var in data.columns if data[var].dtypes == 'O']

print('Number of numerical variables: {}'.format(len(vars_num)))
print('Number of categorical variables: {}'.format(len(vars_cat)))
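
The same split can be written with `DataFrame.select_dtypes`, which does not depend on string columns being stored with the `'O'` (object) dtype. A minimal sketch on a made-up frame:

```python
import pandas as pd

# illustrative frame with one target, one numerical and one categorical column
toy = pd.DataFrame({
    'survived': [1, 0],
    'age': [29.0, 2.0],
    'sex': ['female', 'male'],
})
target = 'survived'

# numeric columns, leaving the target out of the predictors
vars_num = [c for c in toy.select_dtypes(include='number').columns if c != target]

# everything that is not numeric is treated as categorical
vars_cat = list(toy.select_dtypes(exclude='number').columns)
```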

### Find missing values in variables

In [None]:
# first in numerical variables

vars_num_with_na = [var for var in vars_num if data[var].isnull().sum() > 0]
data[vars_num_with_na].isnull().mean()

In [None]:
# now in categorical variables

vars_cat_with_na = [var for var in vars_cat if data[var].isnull().sum() > 0]
data[vars_cat_with_na].isnull().mean()

### Determine cardinality of categorical variables

In [None]:
data[vars_cat].nunique()

### Determine the distribution of numerical variables

In [None]:
data[vars_num].hist(bins=30, figsize=(10, 10))
plt.show()

## Separate data into train and test

Use the code below for reproducibility. Don't change it.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

## Feature Engineering

### Extract only the letter (and drop the number) from the variable Cabin

In [None]:
def get_letter_cabin(df):
    df = df.copy()
    
    # take the first letter of each cabin; missing values stay NaN
    df['cabin'] = df['cabin'].str[0]
    return df

In [None]:
X_train = get_letter_cabin(X_train)
X_test = get_letter_cabin(X_test)
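
Both cabin steps (keeping the first cabin, then keeping only its letter) can be expressed with pandas' vectorised `.str` accessor, which passes missing values through untouched. A small sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# illustrative cabin values, including one passenger with two cabins
cabins = pd.Series(['C22 C26', 'B5', np.nan])

# first cabin of each passenger; NaN stays NaN
first_cabin = cabins.str.split().str[0]

# deck letter only; NaN stays NaN
cabin_letter = first_cabin.str[0]
```

Because NaN is never turned into the string `'nan'`, there is no need to map a letter back to a missing value afterwards.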

In [None]:
X_test.head()

### Fill in Missing data in numerical variables:

- Add a binary missing indicator

In [None]:
def add_binary_missing_indicator(df, var_with_na):
    df = df.copy()
    
    # add binary missing indicator (in train and test)
    for var in var_with_na:
        df[var+'_na'] = np.where(df[var].isnull(), 1, 0)
    return df

In [None]:
X_train = add_binary_missing_indicator(X_train, vars_num_with_na)
X_test = add_binary_missing_indicator(X_test, vars_num_with_na)

- Fill NA in original variable with the median

In [None]:
def fill_na_with_median(df, var_with_na, medians):
    df = df.copy()
    
    # impute missing values with the given medians
    for var in var_with_na:
        df[var] = df[var].fillna(medians[var])
    return df

In [None]:
# learn the medians from the train set only, then apply them to both sets
medians = X_train[vars_num_with_na].median()

X_train = fill_na_with_median(X_train, vars_num_with_na, medians)
X_test = fill_na_with_median(X_test, vars_num_with_na, medians)
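
The missing indicator and the median imputation can also be done in one step with scikit-learn's `SimpleImputer`, which stores the medians it learns in `statistics_`. This is a sketch on made-up numbers, not a replacement for the functions in this notebook; the imputer is fitted on the training rows only and then applied to both sets.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# illustrative data with gaps in both columns
train = pd.DataFrame({'age': [20.0, np.nan, 40.0, 60.0],
                      'fare': [7.0, 8.0, np.nan, 30.0]})
test = pd.DataFrame({'age': [np.nan, 35.0],
                     'fare': [10.0, np.nan]})

# add_indicator=True appends one 0/1 column per feature that had gaps at fit time
imputer = SimpleImputer(strategy='median', add_indicator=True)
imputer.fit(train)

train_imp = imputer.transform(train)
test_imp = imputer.transform(test)  # columns: age, fare, age_na, fare_na
```

Here the test rows are filled with the training medians (40 for `age`, 8 for `fare`), so no information from the test set leaks into the preprocessing.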

### Replace Missing data in categorical variables with the string **Missing**

In [None]:
def fill_na_with_string(df, var_with_na, string):
    df = df.copy()
    
    # impute missing value with string
    for var in var_with_na:
        df[var] = df[var].fillna(string)
    return df

In [None]:
X_train = fill_na_with_string(X_train, vars_cat_with_na, 'Missing')
X_test = fill_na_with_string(X_test, vars_cat_with_na, 'Missing')

### Remove rare labels in categorical variables

- group labels present in less than 5% of the passengers under a new label, 'Rare'

In [None]:
def find_rare_labels(df, vars_, target, rare_perc):
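    # finds the labels that are shared by less than
    # a certain fraction (rare_perc) of the passengers in the dataset;
    # `target` is only used to count rows, so any column
    # without missing values works
    df = df.copy()
    
    rare_label_dict = {}
    for var in vars_:
        # fraction of passengers per label
        tmp = df.groupby(var)[target].count() / len(df)
        # keep only the labels below the threshold
        tmp = tmp[tmp < rare_perc].reset_index()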
        rare_label = tmp[var].unique()
        if len(rare_label) > 0:
            rare_label_dict[var] = rare_label.tolist()
    
    print(rare_label_dict)
    return rare_label_dict

In [None]:
rare_label_train_dict = find_rare_labels(X_train, vars_cat, 'pclass', 0.05)
rare_label_test_dict = find_rare_labels(X_test, vars_cat, 'pclass', 0.05)

In [None]:
X_train['cabin'].unique()

In [None]:
for var in rare_label_train_dict:
    X_train[var] = X_train[var].replace(rare_label_train_dict[var], 'Rare')
    X_test[var] = X_test[var].replace(rare_label_train_dict[var], 'Rare')
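
The idea behind the rare-label step can be shown on a single column with `value_counts(normalize=True)`. The data below is made up so that one label falls under the 5% threshold; as in the loop above, the list of rare labels comes from the train set and is applied to both sets.

```python
import pandas as pd

# illustrative data: 'T' appears in 1 of 40 training rows (2.5%)
train = pd.DataFrame({'cabin': ['C'] * 30 + ['B'] * 9 + ['T']})
test = pd.DataFrame({'cabin': ['C', 'T', 'B']})

# relative frequency of each label, computed on the train set only
freq = train['cabin'].value_counts(normalize=True)
rare_labels = freq[freq < 0.05].index.tolist()

# apply the train-derived list to both sets
train['cabin'] = train['cabin'].replace(rare_labels, 'Rare')
test['cabin'] = test['cabin'].replace(rare_labels, 'Rare')
```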

In [None]:
X_train['cabin'].unique()

In [None]:
X_train.head()

### Perform one hot encoding of categorical variables into k-1 binary variables

- k-1, means that if the variable contains 9 different categories, we create 8 different binary variables
- Remember to drop the original categorical variable (the one with the strings) after the encoding

In [None]:
def one_hot_encoding(df, vars_):
    df = df.copy()
    
    # drop_first=True yields k-1 binary variables per categorical variable
    _one_hot = pd.get_dummies(df[vars_], prefix=vars_, drop_first=True)
    df = df.drop(vars_, axis = 1)
    df = df.join(_one_hot)
    return df

In [None]:
X_train = one_hot_encoding(X_train, vars_cat)
X_test = one_hot_encoding(X_test, vars_cat)
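
When train and test are encoded separately, the two sets can end up with different dummy columns if a category is absent from one of them. One way to handle this, sketched below on made-up data, is to encode the train set into k-1 columns and then reindex the encoded test set to the train columns, filling the gaps with 0. The `dtype=int` argument is an assumption made here to get 0/1 integers instead of booleans.

```python
import pandas as pd

# illustrative data: the test set lacks the 'Q' and 'Rare' categories
train = pd.DataFrame({'embarked': ['C', 'Q', 'S', 'Rare']})
test = pd.DataFrame({'embarked': ['S', 'C']})

# k-1 encoding on train: the first category ('C') becomes the baseline
train_enc = pd.get_dummies(train, columns=['embarked'], drop_first=True, dtype=int)

# full encoding on test, then keep exactly the train columns in the train order
test_enc = pd.get_dummies(test, columns=['embarked'], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

Encoding the test set without `drop_first` and letting `reindex` discard the baseline column avoids dropping a different category when the test set happens to lack the first one.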

### Scale the variables

- Use the standard scaler from Scikit-learn

In [None]:
def add_diff_column(df_train, df_test, value=0):
    '''Add to each DataFrame the columns that exist only in the other one.

    Each missing column is filled with `value` and inserted at the position
    it has in the other DataFrame. For example, if train has example_column
    at index 3, it is added to test at index 3 with value=0.'''
    
    df_train = df_train.copy()
    df_test = df_test.copy()
    
    # get column in each df
    vars_train = np.array([var for var in df_train.columns])
    vars_test = np.array([var for var in df_test.columns])
    
    add_col_train = np.setdiff1d(vars_test, vars_train) # columns missing from train
    add_col_test = np.setdiff1d(vars_train, vars_test) # columns missing from test
    
    print('Columns not in train: ', add_col_train)
    print('Columns not in test: ', add_col_test)


    if len(add_col_train) != 0:
        for col in add_col_train:
            idx = vars_test.tolist().index(col)
            df_train.insert(idx, col, value)
            
    if len(add_col_test) != 0:
        for col in add_col_test:
            idx = vars_train.tolist().index(col)
            df_test.insert(idx, col, value)
    
    return df_train, df_test

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
X_train, X_test = add_diff_column(X_train, X_test, 0)

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
X_train['embarked_Rare'].unique()

In [None]:
def fit_transform_std_scaler(df_train, df_test):
    df_train = df_train.copy()
    df_test = df_test.copy()
    
    # get all variables from the train set passed to the function
    vars_train = list(df_train.columns)
    
    # fit scaler on train data
    scaler = StandardScaler()
    scaler.fit(df_train[vars_train])
    
    df_train[vars_train] = scaler.transform(df_train[vars_train])
    df_test[vars_train] = scaler.transform(df_test[vars_train])
    return df_train, df_test  

In [None]:
X_train, X_test = fit_transform_std_scaler(X_train, X_test)
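
What the scaler does can be checked on a tiny example: it learns the mean and standard deviation from the train set and reuses them for the test set. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# the statistics come from the train set only
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # uses the train mean and std
```

After scaling, the train column has mean 0 and standard deviation 1, while the test values are simply expressed on the same scale and need not have those statistics.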

## Train the Logistic Regression model

- Set the regularization parameter `C` (the inverse of the regularization strength) to 0.0005
- Set the seed to 0

In [None]:
log_model = LogisticRegression(C=0.0005, random_state=0)

In [None]:
log_model.fit(X_train, y_train)
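
In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a value as small as 0.0005 penalizes the coefficients heavily. The sketch below compares it with the default `C=1.0` on synthetic data (generated with `make_classification`, not the Titanic set) to show the shrinking effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic, standardized data used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

strong = LogisticRegression(C=0.0005, random_state=0).fit(X, y)  # heavy penalty
weak = LogisticRegression(C=1.0, random_state=0).fit(X, y)       # default penalty

# size of the coefficient vector under each setting
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
```

The coefficients fitted with `C=0.0005` are much closer to zero, which is why scaling the features beforehand matters: the penalty treats all coefficients alike.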

## Make predictions and evaluate model performance

**Important: to determine the accuracy, you need the predicted outcome (0 or 1, i.e. did not survive or survived). To determine the ROC-AUC, you need the predicted probability of survival.**

- accuracy

In [None]:
y_hat_train = log_model.predict(X_train)
y_hat_test = log_model.predict(X_test)

In [None]:
print('Accuracy of train data: ', accuracy_score(y_train, y_hat_train))
print('Accuracy of test data: ',accuracy_score(y_test, y_hat_test))

- roc-auc

In [None]:
# extract the probability of the positive class from the predicted probability
y_hat_prob_train = log_model.predict_proba(X_train)[:, 1]
y_hat_prob_test = log_model.predict_proba(X_test)[:, 1]

In [None]:
# the arrays already hold the positive-class probabilities, so no further indexing is needed
print('ROC AUC score of train data: ', roc_auc_score(y_train, y_hat_prob_train))
print('ROC AUC score of test data: ', roc_auc_score(y_test, y_hat_prob_test))
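
The difference between the two metrics is easy to see on a toy example (made-up labels and probabilities). Accuracy is computed from hard 0/1 predictions, here obtained with a 0.5 threshold, while ROC-AUC is computed from the probabilities and only depends on how well they rank the positives above the negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# illustrative true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.2, 0.3, 0.4, 0.9])

# hard predictions at the 0.5 threshold: [0, 0, 0, 1]
labels = (proba >= 0.5).astype(int)

acc = accuracy_score(y_true, labels)  # one positive is missed at this threshold
auc = roc_auc_score(y_true, proba)    # every positive is ranked above every negative
```

Here the accuracy is 0.75 while the ROC-AUC is 1.0: the ranking is perfect, but the fixed threshold misclassifies the positive with probability 0.4.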

That's it! Well done.

**Keep this code safe, as we will use this notebook later on to build production code in our next assignment!**
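
`joblib` was imported at the top to persist the model and the scaler, although this notebook never calls it. A minimal sketch of how that could look is given below; the file names and the toy data are assumptions, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# illustrative single-feature data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler().fit(X)
model = LogisticRegression(random_state=0).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file names
    scaler_path = os.path.join(tmp_dir, 'scaler.joblib')
    model_path = os.path.join(tmp_dir, 'logistic_regression.joblib')

    # save both fitted objects, then load them back
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# the reloaded pair reproduces the original predictions
same_predictions = np.array_equal(
    model.predict(scaler.transform(X)),
    loaded_model.predict(loaded_scaler.transform(X)),
)
```

The scaler has to be saved together with the model, because new data must be transformed with the same training statistics before it is scored.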