The **random forest algorithm** has become one of the most widely used algorithms in ML competitions.

To understand the random forest algorithm, you first need to be familiar with **decision trees**.

**What are Decision Trees?**

*   Decision trees are predictive models that use a set of binary rules to calculate a target value.

*   There are two types of decision trees: classification trees and regression trees.

*   Classification trees predict categorical targets, such as land cover classes.

*   Regression trees predict continuous targets, such as biomass or percent tree cover.

*   Each individual tree is a fairly simple model that has branches, nodes and leaves.

*   The nodes contain the attributes which the objective function depends on.
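
To make the vocabulary above concrete, here is a minimal hand-written classification tree. The feature names (`sex`, `age`, `pclass`) and the split thresholds are hypothetical, chosen only for illustration; a real tree would learn them from data.

```python
def predict_survived(passenger):
    """Toy classification tree: every `if` is a node applying one binary rule."""
    if passenger["sex"] == "female":       # root node tests the `sex` attribute
        if passenger["pclass"] <= 2:       # internal node tests `pclass`
            return 1                       # leaf: predicted class "survived"
        return 0                           # leaf: predicted class "did not survive"
    if passenger["age"] < 10:              # internal node tests `age`
        return 1                           # leaf
    return 0                               # leaf


print(predict_survived({"sex": "female", "age": 30, "pclass": 1}))
print(predict_survived({"sex": "male", "age": 40, "pclass": 3}))
```

Each path from the root to a leaf is a branch, and the value returned at the leaf is the predicted target.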

**What is Random Forest?**

As Leo Breiman defined it in his research paper, “Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.”

Another definition: “A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x,Θk), k=1, …} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.”

Briefly, a random forest builds multiple decision trees and merges their predictions to get a more accurate and stable result.
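
The "unit vote" idea in the definition can be sketched by hand: train several trees, each on its own random resample of the data, and take the majority class. This is only an illustration under assumptions that are not in the original: the data is synthetic (`make_classification`), and 25 trees is an arbitrary choice. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Each tree sees its own bootstrap sample (rows drawn with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Every tree casts one vote per sample; the majority class wins.
votes = np.stack([tree.predict(X) for tree in trees])   # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)    # labels are 0/1, 25 trees so no ties

print("training accuracy of the hand-made forest:", (forest_pred == y).mean())
```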


**Import the needed dependencies:**

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


In [2]:
from google.colab import files
files.upload()

Saving TitanicPreprocessed.csv to TitanicPreprocessed.csv



**Load the preprocessed dataset:**

In [3]:
dataset = pd.read_csv('TitanicPreprocessed.csv')
dataset.head()

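|   | Sex | Age | SibSp | Parch | Fare | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | ... | Ticket_STONOQ | Ticket_SWPP | Ticket_WC | Ticket_WEP | Ticket_XXX | FamilySize | Singleton | SmallFamily | LargeFamily | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
| 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 |
| 2 | 0 | 26.0 | 0 | 0 | 7.925 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 1 |
| 4 | 1 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |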


In [0]:
y = dataset['Survived']
X = dataset.drop(['Survived'], axis = 1)

# Split the dataset into train and test data
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
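
As a quick illustration of what `test_size=0.25` does, the sketch below splits a small made-up array (100 hypothetical samples, not the Titanic data) and prints the resulting shapes. The `stratify=y` argument is an optional addition that is not used in the cell above; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with 2 features and a 60/40 class balance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(train_X.shape, test_X.shape)      # 75% of the rows for training, 25% for testing
print(train_y.mean(), test_y.mean())    # share of class 1 in each part
```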

**Set the parameters for the random forest model as a dictionary:**

In [0]:
parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 50, 
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

**bootstrap** : boolean, optional (default=True)

*   Whether bootstrap samples are used when building trees.
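
A bootstrap sample is drawn with replacement, so some rows appear several times and others not at all. The sketch below uses only the standard library; the sample size of 1000 and the seed are arbitrary assumptions.

```python
import random

random.seed(0)
n_samples = 1000

# Draw n_samples row indices with replacement: one bootstrap sample.
bootstrap_idx = [random.randrange(n_samples) for _ in range(n_samples)]

# Fraction of distinct rows that made it into the sample.
unique_fraction = len(set(bootstrap_idx)) / n_samples
print(unique_fraction)
```

For large samples this fraction tends toward 1 - 1/e, roughly 63%, so each tree is trained on a noticeably different view of the data.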

**min_samples_leaf** : int, float, optional (default=1)

The minimum number of samples required to be at a leaf node:

*   If int, then consider min_samples_leaf as the minimum number.
*   If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

**n_estimators** : integer, optional (default=10; 100 since scikit-learn 0.22)

*   The number of trees in the forest.

**min_samples_split** : int, float, optional (default=2)

The minimum number of samples required to split an internal node:

*   If int, then consider min_samples_split as the minimum number.
*   If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
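
The int/float rule shared by `min_samples_leaf` and `min_samples_split` can be worked through with a small helper. `resolve_min_samples` is a hypothetical function written for this example (it is not part of scikit-learn), and the training-set size of 668 is an assumed value.

```python
import math


def resolve_min_samples(value, n_samples):
    """Turn an int or float setting into an absolute sample count."""
    if isinstance(value, int):
        return value                        # int: used as given
    return math.ceil(value * n_samples)     # float: fraction of the training set


n_samples = 668  # assumed number of training rows
print(resolve_min_samples(3, n_samples))     # int setting, as in the parameters above
print(resolve_min_samples(0.01, n_samples))  # ceil(6.68)
print(resolve_min_samples(0.05, n_samples))  # ceil(33.4)
```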

**max_features** : int, float, string or None, optional (default=“auto”)

The number of features to consider when looking for the best split:

*   If int, then consider max_features features at each split.

*   If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

*   If “auto”, then max_features=sqrt(n_features). (Newer scikit-learn releases have deprecated and then removed this option; use “sqrt” instead.)

*   If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

*   If “log2”, then max_features=log2(n_features).

*   If None, then max_features=n_features.
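
The rules above can be turned into a few lines of arithmetic. `resolve_max_features` is a hypothetical helper written for this example, not a scikit-learn function; the feature count of 64 is an assumed value chosen to give round numbers, and the `max(1, ...)` guard is an added safety assumption so that at least one feature is always considered.

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features examined at each split, following the rules listed above."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    return max(1, int(max_features * n_features))  # float: fraction of the features


n_features = 64  # assumed number of columns
for setting in ("sqrt", "log2", None, 0.5, 10):
    print(setting, "->", resolve_max_features(setting, n_features))
```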

**max_depth** : integer or None, optional (default=None)

*   The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

**max_leaf_nodes** : int or None, optional (default=None)

*   Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
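
One way to see these limits in action is to fit a forest and inspect its individual trees. The sketch below is an illustration on synthetic data (`make_classification`), not the Titanic dataset; the value `max_leaf_nodes=8` is an arbitrary choice for the second model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed only for this sketch.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators sets how many trees are grown; max_depth caps each tree's depth.
forest = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0).fit(X, y)
depths = [tree.get_depth() for tree in forest.estimators_]
print(len(forest.estimators_), max(depths))

# max_leaf_nodes caps the number of leaves per tree instead.
capped = RandomForestClassifier(n_estimators=50, max_leaf_nodes=8, random_state=0).fit(X, y)
leaves = [tree.get_n_leaves() for tree in capped.estimators_]
print(max(leaves))
```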

**Define the model:**

In [0]:
RF_model = RandomForestClassifier(**parameters)
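
The `parameters` dictionary does not set `random_state`, so each run can grow a slightly different forest and the score may shift between runs. A minimal sketch of how a fixed seed makes the result repeatable, using synthetic data and an arbitrary seed of 42 (both are assumptions, not part of the original notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed only for this sketch.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

params = {"n_estimators": 50, "max_depth": 6, "max_features": "sqrt", "random_state": 42}

# Two forests built with the same seed draw the same bootstrap samples and feature subsets.
first = RandomForestClassifier(**params).fit(X, y)
second = RandomForestClassifier(**params).fit(X, y)

same = np.array_equal(first.predict(X), second.predict(X))
print(same)
```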

**Train the model:**

In [7]:
RF_model.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

**Test the trained model on test data:**

In [8]:
RF_predictions = RF_model.predict(test_X)
score = accuracy_score(test_y ,RF_predictions)
print(score)

0.820627802690583
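
Accuracy alone can hide how the errors are distributed between the classes. The imports at the top already include `f1_score`, so a natural follow-up is to look at it next to a confusion matrix. The labels below are made up for illustration; in the notebook they would be `test_y` and `RF_predictions`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true labels and predictions for 10 passengers.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)   # share of correct predictions
f1 = f1_score(y_true, y_pred)               # harmonic mean of precision and recall for class 1
cm = confusion_matrix(y_true, y_pred)       # rows: true class, columns: predicted class

print(accuracy)
print(f1)
print(cm)
```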


Our model’s accuracy is about 82% on the held-out test data, which is not bad at all.