## 3. Random Forest

We implemented a basic random forest classifier in a separate Python file, which we import below. It uses DecisionTreeClassifier from sklearn.tree to fit the individual decision trees. Each tree is fit on a bootstrap sample of random rows from the dataset, and we also use random feature selection.
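
To make the two sources of randomness concrete, here is a minimal sketch of how such a forest could be put together. It is not the code in random_forest_code: the class name `SketchRandomForest`, its parameters, the per-tree (rather than per-split) feature sampling, and the synthetic demo data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SketchRandomForest:
    """Illustrative only; a hypothetical stand-in, not the imported RandomForest."""

    def __init__(self, n_trees=10, max_depth=2, n_samples=0.25, seed=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.n_samples = n_samples  # fraction of rows drawn for each tree
        self.rng = np.random.default_rng(seed)
        self.trees = []  # (fitted tree, feature indices it was trained on)

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        n_rows, n_cols = X.shape
        n_draw = max(1, int(self.n_samples * n_rows))
        n_feat = max(1, int(np.sqrt(n_cols)))  # assumed meaning of 'sqrt'
        self.trees = []
        for _ in range(self.n_trees):
            # Bootstrapping: draw row indices with replacement.
            rows = self.rng.integers(0, n_rows, size=n_draw)
            # Random feature selection: each tree sees a random column subset.
            cols = self.rng.choice(n_cols, size=n_feat, replace=False)
            tree = DecisionTreeClassifier(max_depth=self.max_depth)
            tree.fit(X[rows][:, cols], y[rows])
            self.trees.append((tree, cols))
        return self

    def predict(self, X):
        X = np.asarray(X)
        votes = np.stack([tree.predict(X[:, cols]) for tree, cols in self.trees])
        # Majority vote for binary 0/1 labels; ties are resolved towards 1.
        return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic data, used only to show that the sketch runs end to end.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(200, 9))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

forest = SketchRandomForest(n_trees=25).fit(X_demo, y_demo)
pred_demo = forest.predict(X_demo)
```

Sampling features once per tree keeps the sketch short. Another common design samples a fresh feature subset at every split, which DecisionTreeClassifier supports through its `max_features` argument.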

In [63]:
from random_forest_code import RandomForest
import pandas as pd

In [64]:
train = pd.read_csv('data/train.csv')

As with the first decision tree classifier, we start by one-hot encoding some of the categorical features and dropping some others. We also drop 'Sex_male', since it is just the negation of 'Sex_female', so the two features are perfectly (negatively) correlated.
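
The redundancy of the two dummy columns can be checked on a small example. The four-row frame `toy` below is made up for illustration and is not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with the same two categories as the 'Sex' column.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
dummies = pd.get_dummies(toy, prefix=["Sex"], columns=["Sex"], dtype=int)

# Every row has exactly one of the two dummies set, so one determines the other.
redundant = (dummies["Sex_male"] == 1 - dummies["Sex_female"]).all()

# Pearson correlation between the two dummy columns.
corr = dummies["Sex_female"].corr(dummies["Sex_male"])
print(redundant, corr)
```

Because one column is always 1 minus the other, keeping both adds no information for the trees.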

In [65]:
train = pd.get_dummies(train, prefix=['Pclass', 'Sex', 'Embarked'], columns=['Pclass', 'Sex', 'Embarked'], dtype=int)
train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Sex_male'], axis=1, inplace=True)
train.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,38.0,1,0,71.2833,1,0,0,1,1,0,0
2,1,26.0,0,0,7.925,0,0,1,1,0,0,1
3,1,35.0,1,0,53.1,1,0,0,1,0,0,1
4,0,35.0,0,0,8.05,0,0,1,0,0,0,1


In [66]:
y_train = train['Survived']
train.drop(columns=['Survived'], inplace=True)

In [67]:
test = pd.read_csv('data/test.csv')
test = pd.get_dummies(test, prefix=['Pclass', 'Sex', 'Embarked'], columns=['Pclass', 'Sex', 'Embarked'], dtype=int)
test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Sex_male'], axis=1, inplace=True)
test.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Embarked_C,Embarked_Q,Embarked_S
0,34.5,0,0,7.8292,0,0,1,0,0,1,0
1,47.0,1,0,7.0,0,0,1,1,0,0,1
2,62.0,0,0,9.6875,0,1,0,0,0,1,0
3,27.0,0,0,8.6625,0,0,1,0,0,0,1
4,22.0,1,1,12.2875,0,0,1,1,0,0,1


In [68]:
random_forest = RandomForest(n_trees=100, max_depth=2, n_features='sqrt', n_samples=0.25)

In [69]:
random_forest.fit(train, y_train)

In [70]:
prediction = random_forest.predict(test)

In [71]:
tested = pd.read_csv('data/test.csv')
tested['Survived'] = prediction

predictionrf = tested[['PassengerId', 'Survived']]
predictionrf

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [72]:
# predictionrf.to_csv('predictionrf.csv', index=False)

Result: 0.77990 accuracy.

Let's try one with more trees:

In [73]:
random_forest2 = RandomForest(n_trees=500, max_depth=2, n_features='sqrt', n_samples=0.25)
random_forest2.fit(train, y_train)
prediction2 = random_forest2.predict(test)

tested['Survived'] = prediction2

predictionrf2 = tested[['PassengerId', 'Survived']]
predictionrf2

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [74]:
# predictionrf2.to_csv('predictionrf2.csv', index=False)

Result: the same accuracy, 0.77990.