In this notebook, I use a random forest classifier to predict survival from the Titanic data. The exploratory data analysis and cleaning are the same as in the logit notebook, so I just copy and paste the setup.

In [30]:
import numpy as np
import pandas as pd
from math import log
%matplotlib inline 
import matplotlib.pyplot as plt
import warnings
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder,OneHotEncoder
warnings.filterwarnings('ignore')


train_data = pd.read_csv("Titanic/train.csv")
test_data = pd.read_csv("Titanic/test.csv")

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
train_data = train_data.copy()
test_data = test_data.copy()
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")
train_data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [31]:
features = ['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilyOnBoard', 'Age']
train_data['FamilyOnBoard'] = train_data['Parch'] + train_data['SibSp']
test_data['FamilyOnBoard'] = test_data['Parch'] + test_data['SibSp']
train_data_features = train_data[features]
train_data_features.groupby('Pclass', as_index = False)['Age'].describe()

,Pclass,count,mean,std,min,25%,50%,75%,max
0,1,186.0,38.233441,14.802856,0.92,27.0,37.0,49.0,80.0
1,2,173.0,29.87763,14.001077,0.67,23.0,29.0,36.0,70.0
2,3,355.0,25.14062,12.495398,0.42,18.0,24.0,32.0,74.0


In [32]:
y = train_data_features['Survived']
X = train_data_features.drop('Survived', axis = 1)

In [33]:
OHE_encoder = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy = 'most_frequent')), 
        ("encoder", OneHotEncoder()) 
    ]
)
num_pipeline = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy = 'mean')),
        ("scaler", StandardScaler())
    ]
)
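# Note: 'Pclass' is in X but is not listed in any transformer below, so with
# the default remainder='drop' it is not passed on to the model.
preprocessor = ColumnTransformer(transformers = [
    ("num", num_pipeline, ['Fare', 'Age', 'FamilyOnBoard']),
    ("ord", OrdinalEncoder(), ['Sex']),
    ("OHE", OHE_encoder, ['Embarked'])
])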

In the following, I use train_test_split and GridSearchCV for hyperparameter tuning. GridSearchCV already splits its input into training and validation folds for cross-validation, but I have read that it is still good practice to hold out a separate test set that the search never sees. Otherwise, we might tune the hyperparameters to the validation folds and overfit to them, and the cross-validated score would overstate how well the model generalizes.
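To make the two levels of splitting concrete, here is a minimal pure-Python sketch of how a plain 5-fold split partitions row indices. It is an illustration only: the helper name `kfold_indices` is hypothetical, and for a classifier with an integer `cv`, GridSearchCV uses stratified folds, which additionally keep the class proportions similar in each fold. The 801 rows are the size of the training split implied by the training confusion matrix further down.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (fit_idx, val_idx) pairs for a plain, unshuffled k-fold split."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        fit_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield fit_idx, val_idx
        start += size


# Assumed size of the training split (rows left after the hold-out is removed).
n_train = 801
fold_sizes = [(len(fit), len(val)) for fit, val in kfold_indices(n_train, 5)]
for fit_size, val_size in fold_sizes:
    print(f"fit on {fit_size} rows, validate on {val_size} rows")
```

Every training row is used for validation exactly once across the five folds, while the held-out test rows never enter this loop.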

I also jump straight into hyperparameter tuning. n_estimators is the number of trees in the forest, max_depth is the maximum depth of each tree, and max_samples is the number of rows drawn from the training data to build each tree. Lowering max_depth and max_samples constrains the individual trees and tends to reduce overfitting, but I don't want them to get too small, otherwise the model will underfit. n_estimators is different: more trees mostly make the averaged prediction more stable at the cost of training time, so lowering it is not expected to reduce overfitting.


In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.9, random_state = 42)

parameters={'model__n_estimators' : [75,100,125,150,200], 'model__max_depth' : range(3,8), 'model__max_samples': [350,400,450,500]}
model=RandomForestClassifier(random_state=42)
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
gs = GridSearchCV(my_pipeline, param_grid = parameters, scoring = 'accuracy', cv=5)
gs.fit(X_train,y_train)
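As a rough cost check for this search, the sketch below counts the candidate combinations using only the standard library. The dictionary repeats the grid from the cell above so that the snippet runs on its own; the names `param_grid`, `candidates`, and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as above, written out so the snippet runs on its own.
param_grid = {
    "model__n_estimators": [75, 100, 125, 150, 200],
    "model__max_depth": list(range(3, 8)),
    "model__max_samples": [350, 400, 450, 500],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per fold; the best one is then refitted
# on the whole training split (GridSearchCV's default refit=True).
n_fits = n_candidates * cv_folds + 1

print(n_candidates, "candidates,", n_fits, "pipeline fits in total")
```

The `model__` prefix in each key is how a Pipeline routes a parameter to the step named `model`, which is why the grid does not use the bare RandomForestClassifier argument names.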

In [35]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score
best_model = gs.best_estimator_
print(gs.best_params_)
forest_scores = cross_val_score(best_model, X_train, y_train, cv=5)
print(forest_scores.mean())
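# In-sample predictions on the training split
y_pred = best_model.predict(X_train)
# y_pred is passed first, so rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred, y_train))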

{'model__max_depth': 4, 'model__max_samples': 350, 'model__n_estimators': 125}
0.8214751552795031
[[446  85]
 [ 49 221]]


In [36]:
y_pred = best_model.predict(X_test)
Acc_score=accuracy_score(y_test,y_pred)
print(Acc_score)
# Rows are predicted labels, columns are true labels (y_pred is passed first)
print(confusion_matrix(y_pred, y_test))

0.8111111111111111
[[45  8]
 [ 9 28]]
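Because the calls above pass `y_pred` as the first argument, the rows of the printed matrices are predicted classes and the columns are true classes. As a worked check, the following sketch recomputes accuracy, precision, and recall for the Survived class from the two printed matrices. The helper `summarize` is hypothetical and only does the arithmetic; it assumes the row/column orientation just described.

```python
def summarize(matrix):
    """Return accuracy, precision and recall for class 1.

    Assumes rows are predicted labels and columns are true labels,
    i.e. the matrix came from confusion_matrix(y_pred, y_true).
    """
    (pred0_true0, pred0_true1), (pred1_true0, pred1_true1) = matrix
    total = pred0_true0 + pred0_true1 + pred1_true0 + pred1_true1
    accuracy = (pred0_true0 + pred1_true1) / total
    # Of the passengers predicted to survive, the share that actually did.
    precision = pred1_true1 / (pred1_true1 + pred1_true0)
    # Of the passengers who actually survived, the share the model found.
    recall = pred1_true1 / (pred1_true1 + pred0_true1)
    return accuracy, precision, recall


train_matrix = [[446, 85], [49, 221]]  # in-sample predictions on the training split
test_matrix = [[45, 8], [9, 28]]       # predictions on the held-out split

for name, matrix in [("train", train_matrix), ("hold-out", test_matrix)]:
    acc, prec, rec = summarize(matrix)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

The hold-out accuracy works out to 73/90, which matches the accuracy_score printed above, and the in-sample training accuracy is 667/801. On the training split the model misses more true survivors (85) than it falsely predicts (49), so recall for the Survived class is lower than precision there.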


In [37]:
predictions=best_model.predict(test_data)
my_submission = pd.DataFrame({'PassengerId': test_data.index, 'Survived': predictions})
my_submission.to_csv('RFClassifierWithTuning.csv', index=False)


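This scored 0.76794 on submission to the competition, noticeably below both the cross-validated accuracy (0.821) and the hold-out accuracy (0.811), which suggests the model is overfit.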

Potential improvements include:

* Better feature selection. I have seen example notebooks that extract titles from the names, which could be useful as a feature. Titles might also help with imputing missing Age values.
* Not scaling the features. Random forests do not require feature scaling, so it might make sense to drop the StandardScaler step and just give the model the original data.
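
As a sketch of the title idea in the first bullet, the snippet below pulls the title out of a few of the names shown in the `train_data.head()` output, using only the standard library. The function name `extract_title` and the regular expression are my own assumptions about how such a feature could be built, not taken from any particular notebook.

```python
import re

# Title = text between the comma after the surname and the next period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title in a 'Surname, Title. Given names' string, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
]
titles = [extract_title(name) for name in names]
print(titles)
```

On the full data this could be applied with `train_data['Name'].map(extract_title)` and the resulting column one-hot encoded like `Embarked`. Rare titles would probably need to be grouped first, since a category that appears only in the test data would otherwise cause trouble for the encoder.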
