In [1]:
%autosave 0

Autosave disabled


In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from adam_prepare import titanic_pipeline

Let's read in our data from the titanic pipeline function!

In [3]:
train, val, test = titanic_pipeline()
train.shape, val.shape, test.shape

((623, 9), (134, 9), (134, 9))
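
The shapes above are consistent with a 70/15/15 split of the 891-row Titanic dataset. The internals of `titanic_pipeline` are not shown here, so the following is only a sketch of how such a split could be produced, assuming two stratified calls to `train_test_split`. The `demo` frame and its columns are made-up stand-ins, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 891-row Titanic data (random values, not the real dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({'fare': rng.uniform(0, 100, size=891),
                     'survived': rng.integers(0, 2, size=891)})

# First split: hold out 30% of the rows, keeping the survival ratio similar in both parts
demo_train, demo_rest = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo['survived'])

# Second split: divide the held-out rows evenly into validate and test
demo_val, demo_test = train_test_split(
    demo_rest, test_size=0.5, random_state=42, stratify=demo_rest['survived'])

print(demo_train.shape, demo_val.shape, demo_test.shape)
```

Stratifying on the target keeps the share of survivors roughly equal across the three subsets, which makes accuracy figures easier to compare.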

We need to split each dataset into features (X) and the target (y), which is the survived column.

In [4]:
X_train = train.drop(columns = 'survived')
y_train = train.survived

X_val = val.drop(columns = 'survived')
y_val = val.survived

In [5]:
X_train = pd.get_dummies(X_train)
X_val = pd.get_dummies(X_val)

X_train.head()

,age,sibsp,parch,fare,alone,sex_female,sex_male,class_First,class_Second,class_Third,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
748,19.0,1,0,53.1,0,0,1,1,0,0,0,0,1
45,29.0,0,0,8.05,1,0,1,0,0,1,0,0,1
28,29.0,0,0,7.8792,1,1,0,0,0,1,0,1,0
633,29.0,0,0,0.0,1,0,1,1,0,0,0,0,1
403,28.0,1,0,15.85,0,0,1,0,0,1,0,0,1
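
Calling `pd.get_dummies` on each split separately only works when every category appears in every split. That holds for the data here, but it is not guaranteed in general. A minimal sketch of one way to guard against a mismatch, using made-up toy frames, is to reindex the validate columns to the train columns:

```python
import pandas as pd

# Toy frames (hypothetical data): the validate split happens to contain no 'Queenstown' row
toy_train = pd.DataFrame({'age': [22.0, 38.0, 26.0],
                          'embark_town': ['Southampton', 'Cherbourg', 'Queenstown']})
toy_val = pd.DataFrame({'age': [35.0, 54.0],
                        'embark_town': ['Southampton', 'Cherbourg']})

toy_X_train = pd.get_dummies(toy_train)
toy_X_val = pd.get_dummies(toy_val)

# Give validate the same columns, in the same order, as train;
# any dummy column missing from validate is created and filled with 0
toy_X_val = toy_X_val.reindex(columns=toy_X_train.columns, fill_value=0)

print(list(toy_X_val.columns) == list(toy_X_train.columns))
```

Without this step, a model fit on the train columns would raise an error or misread the features when scoring a split with different columns.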


Before modeling, it's always important to define a baseline!

In [6]:
(y_train == 0).mean()

0.6163723916532905

Most passengers in the training set did not survive, so always predicting "did not survive" is correct about 62% of the time. Any model we build needs to beat that accuracy.
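
The same majority-class baseline can also be expressed as a model object, which is convenient when it should be scored with the same `.score()` call as the real models. This is a sketch on made-up labels using scikit-learn's `DummyClassifier`; the `toy_` names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 of 10 passengers did not survive (0)
toy_y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_X = np.zeros((len(toy_y), 1))  # placeholder features; this strategy ignores them

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X, toy_y)

# Accuracy equals the share of the majority class, the same quantity as (y == 0).mean()
baseline_accuracy = baseline.score(toy_X, toy_y)
print(baseline_accuracy)
```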

Now we are ready to create a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) object and start modeling!

In [7]:
seed = 42

rf = RandomForestClassifier(max_depth = 5, random_state = seed)

rf.fit(X_train, y_train)

Let's use the .score() method to evaluate the model's accuracy on the train dataset.

In [8]:
rf.score(X_train, y_train)

0.8507223113964687

That is well above the 62% baseline. Now let's check for overfitting by evaluating the model's accuracy on the validate dataset.

In [9]:
rf.score(X_val, y_val)

0.8582089552238806

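Validate accuracy (about 86%) is in line with train accuracy (about 85%), so there is no sign of overfitting at max_depth = 5. With only 134 validate rows, the small edge over train is within the noise we would expect, so we should not read it as the model doing better on unseen data.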
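
A single validate score on a small split can move by a few points from chance alone. One way to get a feel for that spread is k-fold cross-validation. The sketch below runs on synthetic data from `make_classification` rather than the Titanic data, and the `demo_` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic dataset)
demo_X, demo_y = make_classification(n_samples=600, n_features=8, random_state=42)

demo_rf = RandomForestClassifier(max_depth=5, random_state=42)

# Fit and score the model on 5 different train/holdout partitions
cv_scores = cross_val_score(demo_rf, demo_X, demo_y, cv=5)

# The standard deviation gives a rough sense of how much one holdout score can vary
print(cv_scores.mean(), cv_scores.std())
```

If the gap between train and validate accuracy is smaller than this spread, it is hard to call it evidence for or against overfitting.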

Let's check out the feature importances to see which features helped our model make accurate predictions.

In [10]:
rf.feature_importances_

array([0.09657746, 0.03239215, 0.03665931, 0.12740827, 0.01625359,
       0.25522965, 0.26063139, 0.0474091 , 0.01884012, 0.07483355,
       0.01194815, 0.00487356, 0.01694372])

With a little finesse, we can create a dataframe of the feature importances and sort by the importance.

In [11]:
fi = pd.DataFrame({'feature': X_train.columns,
                  'importance': rf.feature_importances_})

fi.sort_values(by = 'importance', ascending = False)

,feature,importance
6,sex_male,0.260631
5,sex_female,0.25523
3,fare,0.127408
0,age,0.096577
9,class_Third,0.074834
7,class_First,0.047409
2,parch,0.036659
1,sibsp,0.032392
8,class_Second,0.01884
12,embark_town_Southampton,0.016944
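
When one categorical column is spread over several dummy columns, its importance is spread over them too. A sketch of summing the importances back to the source column follows. The numbers are copied from the `rf.feature_importances_` output above, while the `source_column` helper and the `fi_demo` name are hypothetical additions for illustration.

```python
import pandas as pd

# Importances copied from the rf.feature_importances_ output above
fi_demo = pd.DataFrame({
    'feature': ['age', 'sibsp', 'parch', 'fare', 'alone',
                'sex_female', 'sex_male',
                'class_First', 'class_Second', 'class_Third',
                'embark_town_Cherbourg', 'embark_town_Queenstown',
                'embark_town_Southampton'],
    'importance': [0.09657746, 0.03239215, 0.03665931, 0.12740827, 0.01625359,
                   0.25522965, 0.26063139,
                   0.0474091, 0.01884012, 0.07483355,
                   0.01194815, 0.00487356, 0.01694372]})

def source_column(name):
    """Map a dummy column name back to the column it was created from (hypothetical helper)."""
    for prefix in ('sex', 'class', 'embark_town'):
        if name.startswith(prefix + '_'):
            return prefix
    return name  # numeric columns were not dummy-encoded

# Sum the importance of all dummy columns that share a source column
grouped = (fi_demo.assign(source=fi_demo['feature'].map(source_column))
                  .groupby('source')['importance'].sum()
                  .sort_values(ascending=False))
print(grouped)
```

Grouped this way, the two sex dummies together account for a little over half of the total importance.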


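It looks like sex is strongly influencing predictions! Because sex_male and sex_female are exact complements, they carry the same information, and the model splits the importance between them (about 0.26 each, roughly 0.52 combined). In future iterations of the project, I would consider setting drop_first = True when creating dummies for the column. Let's try it out!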

First, we'll redefine our X and y dataframes.

In [12]:
X_train = train.drop(columns = 'survived')
y_train = train.survived

X_val = val.drop(columns = 'survived')
y_val = val.survived

Next, we will try a slightly different approach when getting dummies.

In [13]:
X_train = pd.get_dummies(X_train, columns = ['sex'], drop_first = True)
X_train = pd.get_dummies(X_train)

X_val = pd.get_dummies(X_val, columns = ['sex'], drop_first = True)
X_val = pd.get_dummies(X_val)

X_train.head()

,age,sibsp,parch,fare,alone,sex_male,class_First,class_Second,class_Third,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
748,19.0,1,0,53.1,0,1,1,0,0,0,0,1
45,29.0,0,0,8.05,1,1,0,0,1,0,0,1
28,29.0,0,0,7.8792,1,0,0,0,1,0,1,0
633,29.0,0,0,0.0,1,1,1,0,0,0,0,1
403,28.0,1,0,15.85,0,1,0,0,1,0,0,1


Now that we only have one column regarding the sex of the passenger, let's go through the same workflow of model creation and evaluation.

In [14]:
seed = 42

rf = RandomForestClassifier(max_depth = 5, random_state = seed)

rf.fit(X_train, y_train)

In [15]:
rf.score(X_train, y_train), rf.score(X_val, y_val)

(0.8619582664526485, 0.8283582089552238)
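
Since the fit-then-score steps are now repeated for each encoding, they could be wrapped in a small helper. The function below is a hypothetical convenience, not part of the original notebook, and it is demonstrated on synthetic data so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def fit_and_score(X_tr, y_tr, X_v, y_v, max_depth=5, seed=42):
    """Fit a random forest and return it with its train and validate accuracy."""
    model = RandomForestClassifier(max_depth=max_depth, random_state=seed)
    model.fit(X_tr, y_tr)
    return model, model.score(X_tr, y_tr), model.score(X_v, y_v)

# Synthetic stand-in data (not the Titanic dataset)
demo_X, demo_y = make_classification(n_samples=500, n_features=8, random_state=42)
demo_X_tr, demo_X_v, demo_y_tr, demo_y_v = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y)

demo_model, train_acc, val_acc = fit_and_score(demo_X_tr, demo_y_tr, demo_X_v, demo_y_v)
print(train_acc, val_acc)
```

Keeping `max_depth` and `seed` as arguments with fixed defaults makes it easier to confirm that two runs differ only in how the features were encoded.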

In [16]:
fi = pd.DataFrame({'feature': X_train.columns,
                  'importance': rf.feature_importances_})

fi.sort_values(by = 'importance', ascending = False)

,feature,importance
5,sex_male,0.430622
3,fare,0.144291
0,age,0.114551
8,class_Third,0.097318
6,class_First,0.050478
2,parch,0.043098
1,sibsp,0.033442
4,alone,0.022413
9,embark_town_Cherbourg,0.018568
11,embark_town_Southampton,0.017891


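Compared with the first model, train accuracy rose slightly (85.1% to 86.2%) while validate accuracy fell (85.8% to 82.8%). That drop is about four of the 134 validate rows, so it may be noise rather than a real effect of the encoding. The single sex_male column is still the most important feature by a wide margin at 0.43, below the roughly 0.52 the two sex columns shared before, and the importances of fare, age, and class_Third all increased.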