# Titanic - Machine Learning From Disaster

## Scikit-Learn Workflow Version

The goal of this notebook is to analyze the Titanic passenger dataset provided in the Titanic - Machine Learning From Disaster Kaggle competition. The data is then used to predict the survival of a separate set of passengers described by the same features.

This version of the project focuses on building a scikit-learn workflow and pipelining the process in a scalable manner.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='darkgrid')
%matplotlib inline

## Importing and Splitting Into Train/Test

In [2]:
from sklearn.model_selection import train_test_split

# Read data
X = pd.read_csv('data/train.csv', index_col='PassengerId')
X_test_full = pd.read_csv('data/test.csv', index_col='PassengerId')

# Remove rows with missing target, print num rows dropped
rows_full = len(X.index)
X.dropna(axis=0, subset=['Survived'], inplace=True)
print(f'Dropped {rows_full - len(X.index)} rows missing target variable')

# Separate target from predictors
y = X.Survived
X.drop(['Survived'], axis=1, inplace=True)


# Select numeric and categoric columns
numeric_cols = [cname for cname in X.columns 
                if X[cname].dtype in ['int64', 'float64']]
categoric_cols = [cname for cname in X.columns
                  if X[cname].dtype == 'object']

Dropped 0 rows missing target variable


In [3]:
X.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Handling Missing Values

In [4]:
# Checking entire training dataset for nulls
X.isnull().sum()

Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [5]:
# Checking competition test dataset for nulls
X_test_full.isnull().sum()

Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64

The test data has 1 null for 'Fare', though the training data does not. For this reason, I will build the pipeline with an imputer for 'Fare'. The baseline model will include no feature engineering. 'Age' and 'Embarked' will be imputed, and 'Cabin' will be dropped.
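The reasoning for imputing 'Fare' can be illustrated with a small sketch. The two toy frames below are made up for illustration and are not the competition data. The point is that an imputer fitted on complete training data still fills a gap that first appears at prediction time, using the statistic it learned from the training set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): training 'Fare' is complete,
# while the test 'Fare' has one missing entry.
train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05]})
test = pd.DataFrame({'Fare': [np.nan, 10.0]})

# The mean is learned from the training data only.
imputer = SimpleImputer(strategy='mean')
imputer.fit(train)

# The missing test value is replaced with the training mean.
filled = imputer.transform(test)
print(filled)
```

Without the imputer, a model that cannot handle NaN values would typically raise an error on the single incomplete test row.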

### Selecting Features for Baseline Model

The baseline model will only include low-cardinality categorical features, that is, categorical features with fewer than 10 unique values.

In [6]:
# Select categorical columns with low cardinality
categoric_cols = [cname for cname in X.columns if
                    X[cname].nunique() < 10 and 
                    X[cname].dtype == "object"]

For the purpose of this model, the feature 'Pclass' will be considered a categoric feature.

In [7]:
categoric_cols.append('Pclass')

In [8]:
categoric_cols

['Sex', 'Embarked', 'Pclass']

In [9]:
numeric_cols.remove('Pclass')

In [10]:
numeric_cols

['Age', 'SibSp', 'Parch', 'Fare']
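Treating 'Pclass' as categorical means its integer codes are one-hot encoded rather than interpreted as a magnitude. A minimal sketch of that effect follows, using a made-up column of class values rather than the notebook's data. It also shows what `handle_unknown='ignore'` does with a category that was not seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical 'Pclass' column with the three ticket classes.
pclass_train = np.array([[1], [2], [3], [3]])

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(pclass_train).toarray()
print(encoded)  # one column per class, exactly one 1 per row

# A class value never seen during fit is encoded as all zeros
# instead of raising an error.
unseen = encoder.transform(np.array([[4]])).toarray()
print(unseen)
```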

In [11]:
# Drop unused features
X = X[categoric_cols + numeric_cols]

In [12]:
# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
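As a quick check on what an 80/20 split produces, the sketch below applies the same `train_test_split` arguments to a plain index array. The row count of 891 is an assumption about the size of `train.csv`, not a value printed in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # assumed number of rows in train.csv
indices = np.arange(n_rows)

# Same split arguments as the cell above.
train_idx, valid_idx = train_test_split(
    indices, train_size=0.8, test_size=0.2, random_state=0)

print(len(train_idx), len(valid_idx))
```

Under that assumption, the validation set holds 179 rows, which agrees with the total support shown in the classification reports later in the notebook.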

## Creating Preprocessing Pipeline

### Transformers

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

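# Numeric columns with gaps ('Age', 'Fare') are imputed with the column mean;
# categorical columns with gaps ('Embarked') with the most frequent value.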

# Create numerical column transformers
numerical_transformer = SimpleImputer(strategy='mean')

# Create categorical column transformers
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numeric_cols),
        ('cat', categorical_transformer, categoric_cols)
    ])

# Scale all features
scaler = StandardScaler()
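To make the effect of the bundled preprocessor concrete, here is a minimal sketch on a four-row toy frame. The values and the names `toy` and `toy_preprocessor` are hypothetical. It mirrors the structure above: mean imputation for the numeric columns, and most-frequent imputation followed by one-hot encoding for the categorical ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made-up values) with one missing 'Age' and one missing 'Embarked'.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.93, 53.10],
    'Sex': pd.Series(['male', 'female', 'female', 'female'], dtype=object),
    'Embarked': pd.Series(['S', 'C', np.nan, 'S'], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['Age', 'Fare']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Sex', 'Embarked']),
])

# 2 numeric columns + 2 'Sex' categories + 2 'Embarked' categories.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
```

The missing 'Embarked' value is filled with 'S' before encoding, so no extra category is created for it.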

## Modeling

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Select Model
model = RandomForestClassifier(n_estimators=100)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('scaler', scaler),
                             ('model', model)])

# Fit model to training set
my_pipeline.fit(X_train, y_train)

# Create predictions on validation set
preds = my_pipeline.predict(X_valid)

## Model Evaluation

In [15]:
from sklearn.metrics import classification_report

print(classification_report(y_valid, preds))

              precision    recall  f1-score   support

           0       0.85      0.91      0.88       110
           1       0.84      0.74      0.78        69

    accuracy                           0.84       179
   macro avg       0.84      0.82      0.83       179
weighted avg       0.84      0.84      0.84       179
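For readers unfamiliar with how the columns of the report relate to each other, the sketch below recomputes precision, recall, F1 and accuracy from confusion-matrix counts. The counts are hypothetical: they were chosen to be consistent with the rounded figures above and are not taken from the notebook's output.

```python
# Hypothetical confusion-matrix counts for the 179 validation rows
# (chosen to agree with the rounded report, not read from the notebook).
tn, fp, fn, tp = 100, 10, 18, 51

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 (survived): positives are the survivors.
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)

# Class 0 (did not survive): the roles of the counts are mirrored.
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}')
print(f'class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}')
print(f'accuracy={accuracy:.2f}')
```

The lower recall for class 1 reflects survivors that the model predicts as non-survivors, which is the main weakness visible in the report.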



## Model Tuning

In [16]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Select range of parameters for search
param_distributions = {'n_estimators': randint(1, 500),
                       'max_depth': randint(5, 100)}

# Optimize parameters for random forest classifier
search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)

# Use the randomized search as the final pipeline step
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('scaler', scaler),
                             ('model', search)])

# Fit model to training set
my_pipeline.fit(X_train, y_train)

# Print optimized parameters
print(search.best_params_)

# Create predictions on validation set
preds = my_pipeline.predict(X_valid)

{'max_depth': 14, 'n_estimators': 212}
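An alternative arrangement, not the one used in this notebook, is to wrap the whole pipeline in the search instead of placing the search inside the pipeline. The preprocessing steps are then refit within each cross-validation fold, and parameters are addressed with the `<step name>__<parameter>` convention. The sketch below uses synthetic data from `make_classification` and small, arbitrary parameter ranges purely for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are prefixed with the step name.
demo_distributions = {
    'model__n_estimators': randint(10, 50),
    'model__max_depth': randint(2, 10),
}

demo_search = RandomizedSearchCV(
    estimator=demo_pipeline,
    param_distributions=demo_distributions,
    n_iter=3,
    cv=3,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

With the default `refit=True`, the fitted `demo_search` object can be used directly for prediction, since it refits the best pipeline on all of the data passed to `fit`.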


### Evaluate Tuned Model

In [17]:
print(classification_report(y_valid, preds))

              precision    recall  f1-score   support

           0       0.84      0.92      0.88       110
           1       0.85      0.72      0.78        69

    accuracy                           0.84       179
   macro avg       0.84      0.82      0.83       179
weighted avg       0.84      0.84      0.84       179



## Create Submission Data

In [18]:
preds_test = my_pipeline.predict(X_test_full)
output = pd.DataFrame({'PassengerId': X_test_full.index,
                       'Survived': preds_test})
#output.to_csv('data/submission_4.csv', index=False)