<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Exploration</a></span><ul class="toc-item"><li><span><a href="#Dataset-balance" data-toc-modified-id="Dataset-balance-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataset balance</a></span></li><li><span><a href="#Gender-influence" data-toc-modified-id="Gender-influence-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Gender influence</a></span></li></ul></li><li><span><a href="#Feature-engineering" data-toc-modified-id="Feature-engineering-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Feature engineering</a></span><ul class="toc-item"><li><span><a href="#Filling-missing-values" data-toc-modified-id="Filling-missing-values-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Filling missing values</a></span></li><li><span><a href="#New-child-feature" data-toc-modified-id="New-child-feature-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>New child feature</a></span></li><li><span><a href="#Categorical-features" data-toc-modified-id="Categorical-features-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Categorical features</a></span></li><li><span><a href="#Split-Survived-output" data-toc-modified-id="Split-Survived-output-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Split Survived output</a></span></li><li><span><a href="#Data-overview" data-toc-modified-id="Data-overview-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Data overview</a></span></li></ul></li><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Random Forest</a></span><ul class="toc-item"><li><span><a href="#Quick-try" data-toc-modified-id="Quick-try-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Quick try</a></span></li><li><span><a href="#RandomizedSearch" data-toc-modified-id="RandomizedSearch-3.2"><span 
class="toc-item-num">3.2&nbsp;&nbsp;</span>RandomizedSearch</a></span></li></ul></li></ul></div>

In [20]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

pd.options.mode.chained_assignment = None

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [17]:
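# Write test-set predictions to a CSV file with PassengerId and Survived columns
# (relies on the global `test` DataFrame for the passenger IDs)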
def make_submission(ytest, filename="Titanic Prediction.csv"):
    submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':ytest})
    submission.to_csv(filename,index=False)
    print('Saved file: ' + filename)

# Data Exploration

In [3]:
print(train.shape)
print(test.shape)
train.head(n=20)

(891, 12)
(418, 11)


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


## Dataset balance

In [4]:
print(train["Survived"].value_counts(normalize=True))

0    0.616162
1    0.383838
Name: Survived, dtype: float64
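
With about 62% of passengers in class 0, a model that always predicts "did not survive" would already reach roughly 0.62 accuracy on the training set, which gives a floor for judging later scores. A minimal sketch of that baseline, using a hypothetical toy series in place of `train["Survived"]`:

```python
import pandas as pd

# Hypothetical toy labels standing in for train["Survived"]
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="Survived")

# Share of each class, as in the cell above
class_share = survived.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```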


## Gender influence

In [5]:
print("Survival proportions for male passengers:")
print(train["Survived"][train["Sex"] == "male"].value_counts(normalize=True))

Survival proportions for male passengers:
0    0.811092
1    0.188908
Name: Survived, dtype: float64


In [6]:
print("Survival proportions for female passengers:")
print(train["Survived"][train["Sex"] == "female"].value_counts(normalize=True))

Survival proportions for female passengers:
1    0.742038
0    0.257962
Name: Survived, dtype: float64
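
The two cells above can be condensed into a single grouped computation. The sketch below assumes a hypothetical toy DataFrame with the same `Sex` and `Survived` column names as `train.csv`; because `Survived` is a 0/1 column, its mean per group is the survival rate.

```python
import pandas as pd

# Hypothetical toy data with the same column names as train.csv
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```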


# Feature engineering

## Filling missing values

In [7]:
# Fill in missing values
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna("S")

test["Age"] = test["Age"].fillna(test["Age"].median())
test["Embarked"] = test["Embarked"].fillna("S")
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

## New child feature

In [8]:
# Create the binary Child feature: 1 if Age < 16, else 0
train["Child"] = (train["Age"] < 16).astype(int)

test["Child"] = (test["Age"] < 16).astype(int)

In [9]:
print("Survival proportions for children (Age < 16):")
train["Survived"][train["Child"] == 1].value_counts(normalize=True)

Survival proportions for children (Age < 16):


1    0.590361
0    0.409639
Name: Survived, dtype: float64

## Categorical features

In [10]:
# Encode the categorical features as integers
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

train["Sex"] = train["Sex"].map(sex_codes)
train["Embarked"] = train["Embarked"].map(embarked_codes)

test["Sex"] = test["Sex"].map(sex_codes)
test["Embarked"] = test["Embarked"].map(embarked_codes)
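
The integer codes 0, 1, 2 for `Embarked` suggest an ordering between ports that has no real meaning. Tree-based models usually cope with this, but one-hot encoding is a common alternative. The sketch below is not what this notebook does; it uses a hypothetical toy DataFrame and `pd.get_dummies` to show the idea.

```python
import pandas as pd

# Hypothetical toy column standing in for train["Embarked"] before encoding
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 indicator column per port instead of a single ordered code
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)
encoded = pd.concat([df.drop(columns="Embarked"), dummies], axis=1)
print(encoded)
```

With this encoding, the feature list passed to the model would contain the three indicator columns instead of `Embarked`.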

## Split Survived output

In [11]:
ytrain = train["Survived"]
train = train.drop(columns="Survived")

## Data overview

In [12]:
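# Only the last expression of a cell is displayed, so the output below
# is the missing-value count per column of the test set
train.head()
train.isnull().sum()
test.isnull().sum()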

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
Child            0
dtype: int64

# Random Forest

- RandomForestClassifier with 500 estimators: **0.73205**
- Hyperparameters tuned with RandomizedSearchCV: **0.77033**

## Quick try

In [19]:
# Keep only the numeric feature columns (string columns such as Name, Ticket and Cabin are left out)
forest_columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Child']

# Create Classifier
rf = RandomForestClassifier(
    n_estimators=500)

# Fit and predict
rf.fit(train[forest_columns], ytrain)
ytest_rf = rf.predict(test[forest_columns])

# Make Submission
make_submission(ytest_rf, "Titanic Predictions RF500.csv")

Saved file: Titanic Predictions RF500.csv
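
The cell above only writes a submission file, so its score is known only after upload. A local estimate can be obtained with cross-validation. The sketch below uses `cross_val_score` on hypothetical synthetic data; `X` and `y` stand in for `train[forest_columns]` and `ytrain`, and the small `n_estimators` is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for train[forest_columns] and ytrain
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Fixed random_state so repeated runs give the same forest
model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validated accuracy: one score per held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```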


## RandomizedSearch

In [23]:
# Classifier
rf2 = RandomForestClassifier(n_jobs=-1)

# Create hyperparameter options
rf_max_depth=[2, 3, 5, 7, 10, 20, 35, 60, 100]
rf_n_estimators=[100, 200, 500, 1000, 1200]
rf_min_samples_split=[2, 4, 6, 10]
rf_criterion=['gini', 'entropy']

hyperparameters = dict(
    max_depth=rf_max_depth,
    min_samples_split=rf_min_samples_split,
    n_estimators=rf_n_estimators,
    criterion=rf_criterion)

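# Create the randomized search: 300 sampled combinations, 5-fold cross-validation
clf = RandomizedSearchCV(rf2, hyperparameters, random_state=1, n_iter=300, cv=5, verbose=10, n_jobs=-1)

# Fit the randomized search (fit returns the search object itself, so best_model is clf)
best_model = clf.fit(train[forest_columns], ytrain)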

Fitting 5 folds for each of 300 candidates, totalling 1500 fits



In [24]:
best_model.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
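
Besides `best_estimator_`, a fitted search object exposes `best_params_` and `best_score_` (the mean cross-validated score of the best candidate). With the default `refit=True`, `best_estimator_` has already been refit on the full training data, so it can predict directly. A minimal sketch on hypothetical synthetic data, with a deliberately tiny search space to keep it fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data standing in for train[forest_columns] and ytrain
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Small illustrative search space (not the one used in this notebook)
param_distributions = {
    "max_depth": [2, 5, 10],
    "min_samples_split": [2, 10],
    "n_estimators": [20, 50],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=4,
    cv=3,
    random_state=1,
)
search.fit(X, y)

print(search.best_params_)           # sampled combination with the best mean CV score
print(f"{search.best_score_:.3f}")   # that mean cross-validated accuracy

# best_estimator_ is already refit on all of X, y (refit=True by default)
tuned_model = search.best_estimator_
predictions = tuned_model.predict(X[:5])
```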

In [25]:
# Create Classifier with the best hyperparameters found by the search.
# Only the tuned values are passed; max_features='auto' and
# min_impurity_split=None from the repr above were defaults at the time
# and are no longer accepted by recent scikit-learn versions.
rf = RandomForestClassifier(criterion='gini', max_depth=10,
                            min_samples_split=10, n_estimators=200,
                            n_jobs=-1)

# Fit and predict
rf.fit(train[forest_columns], ytrain)
ytest_rf = rf.predict(test[forest_columns])

# Make Submission
make_submission(ytest_rf, "Titanic Predictions RF-RandomSearch.csv")

Saved file: Titanic Predictions RF-RandomSearch.csv
