# Titanic Survival Part 3: Training Classifiers for Accuracy

In Part 1 of this project I conduct Exploratory Data Analysis (EDA) of the Titanic training data using R. That exploration can be found [here](http://rpubs.com/BigBangData/512981).

In Part 2 I continue the exploration in Python and build a couple of basic models. That notebook is not aimed at the competition itself; it is just an exploration of modeling in Python.

In Part 3 (this notebook) I create a pre-processing pipeline, train several models in Python with the scikit-learn library, and submit my predictions to the competition.


## Pre-Processing

In [1]:
# import modules
import pandas as pd
import numpy as np

# custom pre-processing module
import processing_pipeline as pp  

# load datasets
train_data = pd.read_csv("../input/train.csv")
test_data = pd.read_csv("../input/test.csv")

# separate target from predictors in training set
survived_labels = train_data['Survived'].copy()
train_data_nolabel = train_data.drop('Survived', axis=1)

# get processed training data and labels
X = pp.process_train(train_data_nolabel)
y = survived_labels.to_numpy()
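
The `processing_pipeline` module is custom and its contents are not shown in this notebook. As a rough illustration of what a step like `pp.process_train` might do, here is a minimal sketch built from standard scikit-learn transformers. The column choices, the imputation strategies, the `build_preprocessor` helper and the small `demo` frame are all assumptions made for the example, not the actual module.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column choices; the real processing_pipeline module may differ.
NUMERIC_COLS = ["Age", "Fare", "SibSp", "Parch"]
CATEGORICAL_COLS = ["Pclass", "Sex", "Embarked"]


def build_preprocessor():
    """Return an unfitted transformer that imputes, scales and one-hot encodes."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])


# Tiny made-up frame with Titanic column names, only to show the shapes involved.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

preprocessor = build_preprocessor()
X_demo = preprocessor.fit_transform(demo)
print(X_demo.shape)
```

Fitting the transformer on the training data only and reusing the fitted object on the test set keeps the imputation values and scaling statistics from leaking test information into training.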

## Modeling

### Stochastic Gradient Descent

In [2]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

sgd_clf = SGDClassifier(random_state=42)

# 100-fold cross-validation: with 891 training rows, each fold holds out only about 9 passengers
accuracies = cross_val_score(estimator=sgd_clf, X=X, y=y, cv=100)

In [3]:
mn, sd = round(accuracies.mean(),4), round(accuracies.std(),4)
mn, sd

(0.7824, 0.1362)

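As seen in the previous notebook, the SGD model is too variable: the standard deviation of the fold accuracies (0.1362) is large relative to their mean (0.7824). Some of that spread likely comes from the 100-fold split itself, which leaves only about nine passengers in each held-out fold.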
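
To see how much the number of folds can influence the reported spread, the same classifier can be scored with different `cv` values. The sketch below uses synthetic data from `make_classification` as a stand-in for the processed Titanic features (an assumption, since `X` and `y` come from the custom pipeline), so the numbers it prints will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same number of rows as the Titanic training set (891).
X_syn, y_syn = make_classification(n_samples=891, n_features=10, random_state=42)

sgd = SGDClassifier(random_state=42)

fold_summary = {}
for n_folds in (5, 10, 100):
    scores = cross_val_score(sgd, X_syn, y_syn, cv=n_folds)
    # Each held-out fold has roughly 891 / n_folds samples.
    fold_summary[n_folds] = (round(scores.mean(), 4), round(scores.std(), 4), len(scores))
    print(n_folds, fold_summary[n_folds])
```

Smaller held-out folds tend to give noisier per-fold accuracies, so comparing models on the standard deviation is most meaningful when the same `cv` is used for each of them.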

### Random Forests

In [4]:
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [5]:
# 5 x 2 x 2 = 20 candidate parameter combinations
grid_param = {
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}

In [6]:
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=grid_param,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

In [7]:
grid_search.fit(X,y)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'bo

In [8]:
best_parameters = grid_search.best_params_
print(best_parameters)

{'bootstrap': True, 'criterion': 'entropy', 'n_estimators': 300}


In [9]:
best_result = grid_search.best_score_
print(best_result)

0.819304152637486


In [10]:
# rebuild the forest with the best parameters found by the grid search
forest_clf = RandomForestClassifier(bootstrap=True,
                                    criterion='entropy',
                                    n_estimators=300,
                                    random_state=42)

In [11]:
accuracies = cross_val_score(forest_clf, X, y, cv=10)
mn, sd = round(accuracies.mean(),4), round(accuracies.std(),4)
mn, sd

(0.8249, 0.0308)
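
Retyping the best parameters by hand works, but it can drift out of sync if the grid changes. One alternative, sketched here on synthetic data with a deliberately small grid (both are assumptions made to keep the example fast), is to reuse what `GridSearchCV` already stores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data; small grid to keep the demo fast.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 30], "criterion": ["gini", "entropy"]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_syn, y_syn)

# refit=True (the default) retrains the best configuration on all the data.
best_forest = search.best_estimator_
print(search.best_params_)

# Unpacking best_params_ builds a fresh, unfitted model with the same settings.
fresh_forest = RandomForestClassifier(random_state=42, **search.best_params_)
scores = cross_val_score(fresh_forest, X_syn, y_syn, cv=10)
print(round(scores.mean(), 4), round(scores.std(), 4))
```

`best_estimator_` is ready for `predict` on new data, while the unfitted copy is the one to pass to `cross_val_score`, which clones and refits the model on each fold anyway.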