## Predicting Survival on the Titanic

### History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.

### Assignment:

Build a Machine Learning Pipeline, to engineer the features in the data set and predict who is more likely to Survive the catastrophe.

Follow the Jupyter notebook below, and complete the missing bits of code, to achieve each one of the pipeline steps.

In [61]:
#!pip3 install feature_engine

In [62]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib

# ========== NEW IMPORTS ========
# Respect to notebook 02-Predicting-Survival-Titanic-Solution

# pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin


# for the preprocessors
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# for imputation
import sklearn
from sklearn import impute
from sklearn.impute import SimpleImputer

# for encoding categorical variables
from sklearn.preprocessing import OrdinalEncoder

import feature_engine
from feature_engine import encoding

from feature_engine.imputation import AddMissingIndicator




## Prepare the data set

In [63]:
# load the data - it is available open source and online

data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

# display data
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [64]:
# replace interrogation marks by NaN values

data = data.replace('?', np.nan)

In [65]:
# retain only the first cabin if more than
# 1 are available per passenger

def get_first_cabin(row):
    try:
        return row.split()[0]
    except:
        return np.nan
    
data['cabin'] = data['cabin'].apply(get_first_cabin)

In [66]:
# extracts the title (Mr, Ms, etc) from the name variable

def get_title(passenger):
    line = passenger
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
    
data['title'] = data['name'].apply(get_title)

In [67]:
# cast numerical variables as floats

data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')

In [68]:
# drop unnecessary variables

data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)

# display data
data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,1,female,29.0,0,0,211.3375,B5,S,Miss
1,1,1,male,0.9167,1,2,151.55,C22,S,Master
2,1,0,female,2.0,1,2,151.55,C22,S,Miss
3,1,0,male,30.0,1,2,151.55,C22,S,Mr
4,1,0,female,25.0,1,2,151.55,C22,S,Mrs


In [69]:
# # save the data set

# data.to_csv('titanic.csv', index=False)

# Begin Assignment

## Configuration

In [70]:
# list of variables to be used in the pipeline's transformers

NUMERICAL_VARIABLES = ['age', 'fare']

CATEGORICAL_VARIABLES = ['pclass', 'sibsp', 'parch', 'embarked', 'title']

CABIN = 'cabin'

## Separate data into train and test

In [71]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1047, 9), (262, 9))

In [72]:
X_train.dtypes

pclass        int64
sex          object
age         float64
sibsp         int64
parch         int64
fare        float64
cabin        object
embarked     object
title        object
dtype: object

In [73]:
for var in CATEGORICAL_VARIABLES:
    X_train[var] = X_train[var].astype(str)
    X_test[var] = X_test[var].astype(str)

## Preprocessors

### Class to extract the letter from the variable Cabin

In [74]:
X_train['cabin']

1118     NaN
44       E40
1072     NaN
1130     NaN
574      NaN
1217       F
500      NaN
958      NaN
269      A19
322      C32
703      NaN
668      NaN
1221     NaN
333        D
465        D
195      B79
955      NaN
1265     NaN
70       NaN
682      NaN
10       C62
828      NaN
1008     NaN
893      NaN
1259     NaN
1148     NaN
479      NaN
64        E8
862      NaN
1164     NaN
536      NaN
987      NaN
440      NaN
1073     NaN
1247     NaN
1102     NaN
799      NaN
1252     NaN
530      NaN
726      NaN
1241     NaN
722      NaN
1257     NaN
688      NaN
308      C32
1298     NaN
1013     NaN
990      NaN
131      NaN
15       NaN
1115     NaN
432      NaN
789      NaN
575      NaN
1092     NaN
187      D28
191      NaN
103      C45
970      NaN
60       C46
692      NaN
364      NaN
494      NaN
708      NaN
646      NaN
1278     NaN
747      NaN
565      NaN
82       B22
567      NaN
863      NaN
818      NaN
399      NaN
923      NaN
252      B57
742      NaN
311      NaN

In [78]:
X_train['cabin'].str[0]

1118    NaN
44        E
1072    NaN
1130    NaN
574     NaN
1217      F
500     NaN
958     NaN
269       A
322       C
703     NaN
668     NaN
1221    NaN
333       D
465       D
195       B
955     NaN
1265    NaN
70      NaN
682     NaN
10        C
828     NaN
1008    NaN
893     NaN
1259    NaN
1148    NaN
479     NaN
64        E
862     NaN
1164    NaN
536     NaN
987     NaN
440     NaN
1073    NaN
1247    NaN
1102    NaN
799     NaN
1252    NaN
530     NaN
726     NaN
1241    NaN
722     NaN
1257    NaN
688     NaN
308       C
1298    NaN
1013    NaN
990     NaN
131     NaN
15      NaN
1115    NaN
432     NaN
789     NaN
575     NaN
1092    NaN
187       D
191     NaN
103       C
970     NaN
60        C
692     NaN
364     NaN
494     NaN
708     NaN
646     NaN
1278    NaN
747     NaN
565     NaN
82        B
567     NaN
863     NaN
818     NaN
399     NaN
923     NaN
252       B
742     NaN
311     NaN
811     NaN
113       C
672     NaN
249       B
529     NaN
263       E
1158

In [79]:
class ExtractLetterTransformer(BaseEstimator, TransformerMixin):
    # Extract fist letter of variable

    def __init__(self, variable):
        self.variable = variable


    def fit(self, X, y=None):
        return self

        

    def transform(self, X):
        X = X.copy()
        X[self.variable] = X[self.variable].str[0]
        X[self.variable] = np.where(X[self.variable].isnull(), 'missing', X[self.variable] )
        return X



In [80]:
class AddMissingIndicator(BaseEstimator, TransformerMixin):
    # Extract fist letter of variable

    def __init__(self, variables, missing_indicator=None):
        self.variables = variables
        if missing_indicator is not None:
            self.missing_indicator=missing_indicator
        else:
            self.missing_indicator=None


    def fit(self, X, y=None):
        return self

        

    def transform(self, X):
        X = X.copy()
        for variable in self.variables:
            if self.missing_indicator is not None:
                X[variable + '_missing_indicator'] = np.where(X[variable] == self.missing_indicator, 1, 0)
            else:
                X[variable + '_missing_indicator'] = np.where(X[variable].isnull(), 1, 0)
        #print("At the end of Add Missing Indicator : ", X.columns)
        #print("At the end of Add Missing Indicator : ", X.shape)
        return X
    
    

In [81]:
class RareValueIndicator(BaseEstimator, TransformerMixin):
    # Extract fist letter of variable

    def __init__(self, variables, rare_indicator='Rare'):
        self.variables = variables
        self.rare_indicator=rare_indicator


    def fit(self, X, y=None):
        return self

        

    def transform(self, X):
        X = X.copy()
        for variable in self.variables:
            temp_df = pd.DataFrame(X_train[variable].value_counts(normalize=True))
            rare_values_list = temp_df[temp_df[variable] <= 0.05].index.tolist()
            if len(rare_values_list) > 0:
                X[variable] = X[variable]
                X[variable] = np.where(X[variable].isin(rare_values_list), self.rare_indicator, X[variable])
        #print("At the end of rare Value Encoder : ", X.columns)
        #print("At the end of rare Value Encoder : ", X.shape)
        return X
    
    

In [82]:
# class MissingMedianTransformer(BaseEstimator, TransformerMixin):
#     # Extract fist letter of variable

#     def __init__(self, variable):
#         if not isinstance(variables, list):
#             raise ValueError("variables should be a list")
#         else:
#             self.variables = variables[0]


#     def fit(self):
#         X = X.copy()
#         impute_dict_ = X[variables].median().to_dict()

        

#     def transform():
#         X = X.copy()
#         for var in self.variables:
#             X[var]
#         data[self.variable] = data[self.variable].str[0]

In [83]:
class DropObjVarsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        if not isinstance(variables, list):
            raise ValueError("variables should be a list")
        self.variables = variables
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X.drop(columns = self.variables, inplace=True)
        #print("At the end of Drop Variable : ", X.columns)
        #print("At the end of Drop Variable : ", X.shape)
        return X

In [84]:
class PandasCatMissingImputer(BaseEstimator, TransformerMixin):
    def __init__(self, variables, strategy='constant', fill_value = 'missing'):
        if not isinstance(variables, list):
            raise ValueError("variables should be a list")
        self.variables = variables
        self.strategy = strategy
        if fill_value is not None:
            self.fill_value = fill_value
    def fit(self, X, y=None):
        si = SimpleImputer(strategy = self.strategy, fill_value=self.fill_value)
        self.si_fitted = si.fit(X[self.variables])
        return self
    
    def transform(self, X):
        X = X.copy()
        X[self.variables] = self.si_fitted.transform(X[self.variables])
        #print("At the end of Cat Missing Imputer: ", X.columns)
        #print("At the end of Cat Missing Imputer: ", X.shape)
        return X

In [85]:
class PandasNumMedianMissingImputer(BaseEstimator, TransformerMixin):
    def __init__(self, variables, strategy='median'):
        if not isinstance(variables, list):
            raise ValueError("variables should be a list")
        self.variables = variables
        self.strategy = strategy
    def fit(self, X, y=None):
        si = SimpleImputer(strategy = self.strategy)
        self.si_fitted = si.fit(X[self.variables])
        return self
    
    def transform(self, X):
        X = X.copy()
        X[self.variables] = self.si_fitted.transform(X[self.variables])
        #print("At the end of Num Median Missing Imputer: ", X.columns)
        #print("At the end of Num Median Missing Imputer: ", X.shape)
        return X

In [86]:
class PandasOHEImputer(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        if not isinstance(variables, list):
            raise ValueError("variables should be a list")
        self.variables = variables
    def fit(self, X, y=None):
        self.ohe = sklearn.preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore')
        self.ohe_fitted = self.ohe.fit(X[self.variables])
        return self
    
    def transform(self, X):
        X = X.copy()
        X_ohe = pd.DataFrame(self.ohe_fitted.transform(X[self.variables]))
        #print(X_ohe.shape)
        #print('feature_names: ',  self.ohe.get_feature_names(self.variables))
        #print('type: ',  type(self.ohe.get_feature_names(self.variables)))
        X_ohe.columns = list(self.ohe.get_feature_names(self.variables))
        X=X.reset_index(drop=True)
        X_ohe=X_ohe.reset_index(drop=True)
        X = pd.concat([X, X_ohe], axis=1)
        ## DROP ALL OHE COLUMNS
        X.drop(columns=self.variables, inplace=True)
        #print("done")
        #print("At the end of OHE : ", X.columns)
        #print("At the end of OHE : ", X.shape)
        return X

In [94]:
class PandasStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        self.sc = sklearn.preprocessing.StandardScaler()
        #print('Standard Scaler columns :', X.columns)
        self.sc_fitted = self.sc.fit(X)
        return self
    
    def transform(self, X):
        X = X.copy()
        X_sc = pd.DataFrame(self.sc_fitted.transform(X))
        X_sc.columns = X.columns
        #print("At the end of Standard Scaler : ", X_sc.columns)
        #print(X_sc.describe())
        #print(X_sc.info())
        #print(X_sc.isnull().sum())
        return X

## Pipeline

- Impute categorical variables with string missing
- Add a binary missing indicator to numerical variables with missing data
- Fill NA in original numerical variable with the median
- Extract first letter from cabin
- Group rare Categories
- Perform One hot encoding
- Scale features with standard scaler
- Fit a Logistic regression

In [95]:
cat_missing_imputer = PandasCatMissingImputer(CATEGORICAL_VARIABLES) #simple imputer returns numpy which affects the subsequent trannsformations in pipeline
add_missing_indicator = AddMissingIndicator(NUMERICAL_VARIABLES)
num_median_imputer = PandasNumMedianMissingImputer(NUMERICAL_VARIABLES) #simple imputer returns numpy which affects the subsequent trannsformations in pipeline
cabin_letter_extractor = ExtractLetterTransformer('cabin')
rare_value_encoder = RareValueIndicator(CATEGORICAL_VARIABLES)
ohe_encoder = PandasOHEImputer(CATEGORICAL_VARIABLES)
drop_vars_preprocessor = DropObjVarsTransformer(variables = ['sex', 'cabin'])
standard_scaler = PandasStandardScaler()


In [96]:
X_train.shape

(1047, 9)

<font size=5> https://www.kaggle.com/alexisbcook/pipelines </font>

In [97]:
preprocessor = Pipeline(steps=[
    ('categorical_missing_imputing_transformer', cat_missing_imputer),
    ('numerical_missing_indicating_transformer', add_missing_indicator),
    ('numerical_median_missing_imputer_transformer', num_median_imputer),
    ('letter_extracting_transformer', cabin_letter_extractor),
    ("rare_label_encoding_transformer", rare_value_encoder),
    ("ohe_encoding_transformer", ohe_encoder),
    ("variables_dropping_transformer", drop_vars_preprocessor),
    ("variable_scaling_transformer", standard_scaler)
    
    
])

In [98]:
titanic_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(C=0.0005, random_state=0))
])

In [99]:
titanic_pipe.steps

[('preprocessor',
  Pipeline(steps=[('categorical_missing_imputing_transformer',
                   PandasCatMissingImputer(variables=['pclass', 'sibsp', 'parch',
                                                      'embarked', 'title'])),
                  ('numerical_missing_indicating_transformer',
                   AddMissingIndicator(variables=['age', 'fare'])),
                  ('numerical_median_missing_imputer_transformer',
                   PandasNumMedianMissingImputer(variables=['age', 'fare'])),
                  ('letter_ext...
                  ('rare_label_encoding_transformer',
                   RareValueIndicator(variables=['pclass', 'sibsp', 'parch',
                                                 'embarked', 'title'])),
                  ('ohe_encoding_transformer',
                   PandasOHEImputer(variables=['pclass', 'sibsp', 'parch',
                                               'embarked', 'title'])),
                  ('variables_dropping_transformer',

In [100]:
X_train.shape

(1047, 9)

In [101]:
titanic_pipe.fit(X=X_train, y=y_train)

Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('categorical_missing_imputing_transformer',
                                  PandasCatMissingImputer(variables=['pclass',
                                                                     'sibsp',
                                                                     'parch',
                                                                     'embarked',
                                                                     'title'])),
                                 ('numerical_missing_indicating_transformer',
                                  AddMissingIndicator(variables=['age',
                                                                 'fare'])),
                                 ('numerical_median_missing_imputer_transformer',
                                  PandasNumMedianMissingImputer(variabl...
                                  RareValueIndicator(variables=['pclass',
                                   

In [102]:
titanic_pipe.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

In [103]:
titanic_pipe.predict_proba(X_test)

array([[0.71667086, 0.28332914],
       [0.61555049, 0.38444951],
       [0.66697342, 0.33302658],
       [0.68069899, 0.31930101],
       [0.61756116, 0.38243884],
       [0.5926995 , 0.4073005 ],
       [0.6829898 , 0.3170102 ],
       [0.66923966, 0.33076034],
       [0.73575797, 0.26424203],
       [0.69716998, 0.30283002],
       [0.66664614, 0.33335386],
       [0.71456222, 0.28543778],
       [0.58881462, 0.41118538],
       [0.6508628 , 0.3491372 ],
       [0.07993617, 0.92006383],
       [0.67301864, 0.32698136],
       [0.65372831, 0.34627169],
       [0.647935  , 0.352065  ],
       [0.671809  , 0.328191  ],
       [0.70595327, 0.29404673],
       [0.68453397, 0.31546603],
       [0.58445255, 0.41554745],
       [0.69589494, 0.30410506],
       [0.31114171, 0.68885829],
       [0.60208471, 0.39791529],
       [0.687601  , 0.312399  ],
       [0.66184535, 0.33815465],
       [0.70564624, 0.29435376],
       [0.67495537, 0.32504463],
       [0.64713738, 0.35286262],
       [0.

In [104]:
titanic_pipe.named_steps['preprocessor'].fit_transform(X_train)

Unnamed: 0,age,fare,age_missing_indicator,fare_missing_indicator,pclass_1,pclass_2,pclass_3,sibsp_0,sibsp_1,sibsp_Rare,parch_0,parch_1,parch_2,parch_Rare,embarked_C,embarked_Q,embarked_Rare,embarked_S,title_Miss,title_Mr,title_Mrs,title_Rare
0,25.0,7.925,0,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,41.0,134.5,0,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,28.0,7.7333,1,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,18.0,7.775,0,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,29.0,21.0,0,0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
5,19.0,7.65,0,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
6,46.0,26.0,0,0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
7,28.0,25.4667,1,0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
8,28.0,26.0,1,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
9,36.0,135.6333,0,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


<font size=5> Feature Engine Package Based Pipeline </font>

In [105]:
## FEATURE ENGINE BASED TRANSFORMATIONS
numerical_missing_indicator = feature_engine.imputation.AddMissingIndicator(variables=NUMERICAL_VARIABLES)
rare_label_indicator = feature_engine.encoding.RareLabelEncoder(tol=0.05, variables=CATEGORICAL_VARIABLES)
cat_missing_transformer = feature_engine.imputation.CategoricalImputer(variables=CATEGORICAL_VARIABLES, fill_value='missing')
missing_median_tranformer = feature_engine.imputation.MeanMedianImputer(imputation_method='median',variables=NUMERICAL_VARIABLES)
ohe_ecoder = feature_engine.encoding.OneHotEncoder(variables = CATEGORICAL_VARIABLES, drop_last=True)

In [106]:
## custom transformer
cabin_letter_extractor = ExtractLetterTransformer(variable=CABIN)

In [107]:
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 1000)

In [108]:
drop_vars = DropObjVarsTransformer(variables = ['sex', 'cabin'])

In [109]:
standard_scaler = sklearn.preprocessing.StandardScaler()

In [110]:
# set up the pipeline
titanic_pipe_2 = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    ('categorical_imputation', cat_missing_transformer ),

    # add missing indicator to numerical variables
    ('missing_indicator', numerical_missing_indicator ),

    # impute numerical variables with the median
    ('median_imputation', missing_median_tranformer),


    # Extract first letter from cabin
    ('extract_letter', cabin_letter_extractor ),


    # == CATEGORICAL ENCODING ======
    # remove categories present in less than 5% of the observations (0.05)
    # group them in one category called 'Rare'
    ('rare_label_encoder', rare_label_indicator),


    # encode categorical variables using one hot encoding into k-1 variables
    ('categorical_encoder', ohe_ecoder ),
    
    ('drop_vars', drop_vars),

    # scale using standardization
    ('scaler', standard_scaler ),

    # logistic regression (use C=0.0005 and random_state=0)
    ('Logit', LogisticRegression(C=0.0005, random_state=0) ),
])

In [111]:
preprocessor = Pipeline(steps = [
    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    ('categorical_imputation', cat_missing_transformer ),

    # add missing indicator to numerical variables
    ('missing_indicator', numerical_missing_indicator ),

    # impute numerical variables with the median
    ('median_imputation', missing_median_tranformer),


    # Extract first letter from cabin
    ('extract_letter', cabin_letter_extractor ),


    # == CATEGORICAL ENCODING ======
    # remove categories present in less than 5% of the observations (0.05)
    # group them in one category called 'Rare'
    ('rare_label_encoder', rare_label_indicator),


    # encode categorical variables using one hot encoding into k-1 variables
    ('categorical_encoder', ohe_ecoder ),
    
    ('drop_vars', drop_vars),

    # scale using standardization
    ('scaler', standard_scaler ),
])

In [112]:
# set up the pipeline
titanic_pipe_2 = Pipeline([
    ('preprocessor', preprocessor),
    ('Logit', LogisticRegression(C=0.0005, random_state=0) ),
])
    

In [113]:
titanic_pipe_2.steps

[('preprocessor',
  Pipeline(steps=[('categorical_imputation',
                   CategoricalImputer(fill_value='missing',
                                      variables=['pclass', 'sibsp', 'parch',
                                                 'embarked', 'title'])),
                  ('missing_indicator',
                   AddMissingIndicator(variables=['age', 'fare'])),
                  ('median_imputation',
                   MeanMedianImputer(variables=['age', 'fare'])),
                  ('extract_letter', ExtractLetterTransformer(variable='cabin')),
                  ('rare_label_encoder',
                   RareLabelEncoder(variables=['pclass', 'sibsp', 'parch',
                                               'embarked', 'title'])),
                  ('categorical_encoder',
                   OneHotEncoder(drop_last=True,
                                 variables=['pclass', 'sibsp', 'parch',
                                            'embarked', 'title'])),
             

In [114]:
# train the pipeline
titanic_pipe_2.fit(X=X_train, y=y_train)




Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('categorical_imputation',
                                  CategoricalImputer(fill_value='missing',
                                                     variables=['pclass',
                                                                'sibsp',
                                                                'parch',
                                                                'embarked',
                                                                'title'])),
                                 ('missing_indicator',
                                  AddMissingIndicator(variables=['age',
                                                                 'fare'])),
                                 ('median_imputation',
                                  MeanMedianImputer(variables=['age', 'fare'])),
                                 ('extract_letter',
                                  ExtractLetterTransformer(v...
     

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important, remember that to determine the accuracy, you need the outcome 0, 1, referring to survived or not. But to determine the roc-auc you need the probability of survival.**

In [115]:
y_train.shape

(1047,)

<font size=4> Predictions based on our custom transformer pipeline </font>

In [57]:
# make predictions for train set
class_ = titanic_pipe.predict(X_train)
pred = titanic_pipe.predict_proba(X_train)
#print(pred)

#print(class_.shape)
#print(pred.shape)
# determine mse and rmse
print('train roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('train accuracy: {}'.format(accuracy_score(y_train, class_)))
print()

# make predictions for test set
class_ = titanic_pipe.predict(X_test)
pred = titanic_pipe.predict_proba(X_test)

# determine mse and rmse
print('test roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
print('test accuracy: {}'.format(accuracy_score(y_test, class_)))
print()

train roc-auc: 0.7300946676970633
train accuracy: 0.667621776504298

test roc-auc: 0.7652777777777777
test accuracy: 0.6526717557251909



<font size=4> Predictions based on feature engine package based pipeline </font>

In [58]:
# make predictions for train set
class_ = titanic_pipe_2.predict(X_train)
pred = titanic_pipe_2.predict_proba(X_train)
#print(pred)

#print(class_.shape)
#print(pred.shape)
# determine mse and rmse
print('train roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('train accuracy: {}'.format(accuracy_score(y_train, class_)))
print()

# make predictions for test set
class_ = titanic_pipe_2.predict(X_test)
pred = titanic_pipe_2.predict_proba(X_test)

# determine mse and rmse
print('test roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
print('test accuracy: {}'.format(accuracy_score(y_test, class_)))
print()

train roc-auc: 0.834345054095827
train accuracy: 0.6752626552053486

test roc-auc: 0.8316975308641974
test accuracy: 0.6755725190839694



In [59]:
titanic_pipe.named_steps['preprocessor'].fit_transform(X_train)

Standard Scaler columns : Index(['age', 'fare', 'age_missing_indicator', 'fare_missing_indicator',
       'pclass_1', 'pclass_2', 'pclass_3', 'sibsp_0', 'sibsp_1', 'sibsp_Rare',
       'parch_0', 'parch_1', 'parch_2', 'parch_Rare', 'embarked_C',
       'embarked_Q', 'embarked_Rare', 'embarked_S', 'title_Miss', 'title_Mr',
       'title_Mrs', 'title_Rare'],
      dtype='object')


Unnamed: 0,age,fare,age_missing_indicator,fare_missing_indicator,pclass_1,pclass_2,pclass_3,sibsp_0,sibsp_1,sibsp_Rare,parch_0,parch_1,parch_2,parch_Rare,embarked_C,embarked_Q,embarked_Rare,embarked_S,title_Miss,title_Mr,title_Mrs,title_Rare
0,25.0,7.925,0,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,41.0,134.5,0,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,28.0,7.7333,1,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,18.0,7.775,0,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,29.0,21.0,0,0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
5,19.0,7.65,0,0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
6,46.0,26.0,0,0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
7,28.0,25.4667,1,0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
8,28.0,26.0,1,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
9,36.0,135.6333,0,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [60]:
titanic_pipe_2.named_steps['preprocessor'].fit_transform(X_train)



array([[-0.37016209, -0.50478215, -0.49492069, ..., -0.50089526,
        -0.43562912, -0.16269784],
       [ 0.90402864,  1.97155505, -0.49492069, ...,  1.99642538,
        -0.43562912, -0.16269784],
       [-0.13125133, -0.5085326 ,  2.02052574, ..., -0.50089526,
        -0.43562912, -0.16269784],
       ...,
       [-0.13125133, -0.5085326 ,  2.02052574, ...,  1.99642538,
        -0.43562912, -0.16269784],
       [-0.7683467 ,  0.05915559, -0.49492069, ...,  1.99642538,
        -0.43562912, -0.16269784],
       [ 0.18729636, -0.35658342, -0.49492069, ..., -0.50089526,
         2.29553067, -0.16269784]])

That's it! Well done

**Keep this code safe, as we will use this notebook later on, to build production code, in our next assignement!!**