### ALGORITHM 8: RANDOM FOREST CLASSIFIER

A random forest is a collection of decision trees. Each tree classifies a new data point based on its
attributes and votes for a class, and the forest assigns the class that receives the most votes
across all of its trees. If the training set contains N cases, each tree is grown on a sample of
N cases drawn at random with replacement (a bootstrap sample).
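
To make the sampling step concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy. The names `n_cases`, `bootstrap_idx`, and `out_of_bag` are illustrative and not part of this notebook, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n_cases = 1000                  # N: hypothetical number of training cases

# Draw N row indices at random WITH replacement: one bootstrap sample.
bootstrap_idx = rng.integers(0, n_cases, size=n_cases)

# Some rows are drawn several times, others not at all ("out-of-bag" rows).
unique_rows = np.unique(bootstrap_idx)
out_of_bag = np.setdiff1d(np.arange(n_cases), unique_rows)

unique_fraction = len(unique_rows) / n_cases
print("distinct rows in sample:", len(unique_rows))
print("rows left out          :", len(out_of_bag))
```

For large N, roughly 63% of the rows (1 - 1/e) are expected to appear in each sample, so every tree sees a slightly different view of the training set.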

![random-forest1](../docs/randomforest1.jpg)

Random Forest works in two phases: the first creates the forest by combining N decision trees, and the second makes a prediction with each tree created in the first phase.

The steps below, together with the diagram, explain the algorithm:

Step-1: Select K random data points from the training set.

Step-2: Build a decision tree from the selected data points (the subset).

Step-3: Choose the number N of decision trees that you want to build.

Step-4: Repeat Steps 1 and 2 until N trees have been built.

Step-5: For a new data point, get the prediction of each decision tree and assign the point to the category that wins the majority of votes.
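
The steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is an illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: the data is synthetic (`make_classification`), K is set to the training-set size, N is 25 trees, and `max_features="sqrt"` adds the per-split feature subsampling that random forests commonly use but that the steps do not mention.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for a real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25       # Step 3: N, odd so a two-class vote cannot tie
k = len(X_train)   # K data points per tree (assumed equal to the training-set size)

trees = []
for _ in range(n_trees):  # Step 4: repeat Steps 1 and 2
    # Step 1: select K random data points, with replacement.
    idx = rng.integers(0, len(X_train), size=k)
    # Step 2: build a decision tree on that subset.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 5: every tree votes, and the majority class wins for each test point.
all_votes = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
forest_pred = np.array([np.bincount(col).argmax() for col in all_votes.T])

accuracy = (forest_pred == y_test).mean()
print("accuracy of hand-built forest:", accuracy)
```

Choosing an odd number of trees is only a convenience for the two-class case; with more classes, a tie-breaking rule would be needed.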


In [9]:
# importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

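# read the dataset and split it into train and test sets
# (shuffle=False keeps row order, so the last 20% of rows form the test set)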
dataset = pd.read_csv("../data/titanic.csv")
train_data, test_data = train_test_split(dataset, test_size=0.2, shuffle=False)

# separate the features (X) and the target (y) for the train and test sets
train_X = train_data.drop("Survived", axis=1)
train_y = train_data["Survived"]

test_X = test_data.drop("Survived", axis=1)
test_y = test_data["Survived"]


In [10]:
# create the model and train with data
model = RandomForestClassifier()
model.fit(train_X, train_y)

print("test data :")
display(test_data.head())


test data :


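| Unnamed: 0 | Survived | Age | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | SibSp_0 | SibSp_1 | ... | Parch_0 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 712 | 0 | 35.0 | 7.125 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 713 | 0 | 20.0 | 7.05 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 714 | 0 | 26.0 | 7.8958 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 715 | 1 | 58.0 | 146.5208 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 716 | 1 | 35.0 | 83.475 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |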


In [11]:
# predict the results
pred_y = model.predict(test_X)
print("predicted survivors :", pred_y[:5])

# score of the model
score = accuracy_score(test_y, pred_y)
print("score of model      :", score * 100, "%")


predicted survivors : [0 0 0 1 1]
score of model      : 80.44692737430168 %
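
A note on how the fitted model combines its trees: scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the individual trees and picks the class with the highest mean, rather than counting hard votes. The sketch below checks this on synthetic data (an assumption, so that it runs without the Titanic CSV); `forest`, `sample`, and `mean_proba` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: stands in for the notebook's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

sample = X[:5]

# Average the per-tree class probabilities by hand.
mean_proba = np.mean(
    [tree.predict_proba(sample) for tree in forest.estimators_], axis=0
)

# The class with the highest mean probability is the forest's prediction.
soft_pred = forest.classes_[mean_proba.argmax(axis=1)]

print("matches forest.predict_proba:", np.allclose(mean_proba, forest.predict_proba(sample)))
print("matches forest.predict      :", (soft_pred == forest.predict(sample)).all())
```

With fully grown trees, which is the default, most leaves are pure, so each tree's probabilities are close to 0 or 1 and the averaged result usually agrees with a plain majority vote.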
