Owner:      Shilton Jonatan, Data & Analytics, Digitas Singapore, shilton.salindeho@digitas.com

Solution:   Decision Tree/Random Forest

Date of publication:  28 March 2022

## Decision Tree/Random Forest

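Classification algorithms assign each observation to a class or category. There are three common types: binary classification (two classes), multiclass classification (more than two classes, exactly one per observation), and multilabel classification (an observation can carry several labels at once).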

A decision tree grows by repeatedly splitting a node into two or more sub-nodes. Different tree algorithms use different splitting criteria (for example Gini impurity or entropy), but each one picks the split that makes the resulting sub-nodes more homogeneous. In other words, the purity of the nodes increases with respect to the target variable.
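To make "purity" concrete, the sketch below computes Gini impurity, one common splitting criterion, for a parent node and for two candidate splits. It is an illustration only, not the library's implementation; the helper names `gini` and `weighted_gini` and the toy labels are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Hypothetical parent node with a 50/50 class mix
parent = [0, 0, 0, 1, 1, 1]

# Two candidate splits of the same six rows
clean_split = ([0, 0, 0], [1, 1, 1])  # each child holds a single class
poor_split = ([0, 0, 1], [0, 1, 1])   # both children are still mixed

print("parent:", gini(parent))
print("clean split:", weighted_gini(*clean_split))
print("poor split:", weighted_gini(*poor_split))
```

The parent's impurity is 0.5, the clean split brings it down to 0, and the poor split only lowers it to about 0.44, so a tree using this criterion would prefer the clean split.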

The random forest is an ensemble algorithm, used here for classification, that consists of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
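The two ingredients named above can be sketched by hand: each tree is trained on a bootstrap sample of the rows (bagging) and considers only a random subset of features at each split, and the committee then votes. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally (scikit-learn combines trees by averaging their predicted class probabilities rather than counting hard votes). The data set, the number of trees, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration, not the mtcars set)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(0)
members = []
for i in range(25):
    # Bagging: draw a bootstrap sample (rows sampled with replacement)
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # Feature randomness: each split considers only sqrt(n_features) features
    member = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    members.append(member.fit(X_demo[rows], y_demo[rows]))

# Prediction by committee: every tree votes and the majority class wins
votes = np.array([m.predict(X_demo) for m in members])   # shape (25, 200)
committee_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority of 0/1 votes

print("votes shape:", votes.shape)
print("committee accuracy on the training rows:", (committee_pred == y_demo).mean())
```

An odd number of trees is used so that a 0/1 vote can never tie.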

<img src="./decision-tree-classification-algorithm.png" width="350" />
<img src="./Random_forest_diagram_complete.png" width="350" />

In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

Then, we load the sample data. In this example we use the mtcars data set, which lists car models along with their specifications.

In [3]:
# Load mtcars sample data set
mtcars = pd.read_csv("DatasetMtcars25032022.csv") #reads text data into data frame

In [4]:
#See first few lines of data
mtcars.head()

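| Unnamed: 0.1 | Unnamed: 0 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |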


In [5]:
mtcars.shape

(32, 12)
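The `Unnamed: 0` header in the preview is the name pandas gives to a CSV column that has no header of its own; here it holds the car names. One way to keep such a column out of the feature set is to load it as the row index with `index_col=0`. The sketch below assumes the car names sit in a nameless first column, and uses an in-memory CSV built from three rows of the preview as a stand-in for the real file.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV, using three rows from the preview above.
# Assumption: the car names sit in a first column with an empty header.
csv_text = """,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
"""

# Default read: the nameless column becomes "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns[0])

# index_col=0 uses that column as the row index instead of a data column
cars = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(cars.index.tolist())
print(cars.shape)
```

With the names in the index, every remaining column is numeric, so no name column has to be excluded by hand when the features are selected.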

In [9]:
mtcars.vs.unique()

array([0, 1], dtype=int64)

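Decision Tree and Random Forest are both supervised learning algorithms: they need examples with input variables and a known output variable to train on before they can predict the output for new data.

In this case, we define the variable 'vs' as the output variable and use the remaining numeric columns as inputs. In mtcars, 'vs' encodes the engine shape (0 = V-shaped, 1 = straight), so this is a binary classification problem. We then split the rows into training and test data.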

In [26]:
# Features X: every column except the target 'vs' and the car-name column 'Unnamed: 0'
X = mtcars.loc[:, ~mtcars.columns.isin(['vs', 'Unnamed: 0'])]
# Target y: the binary 'vs' column
y = mtcars['vs']

In [37]:
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
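With only 32 rows, a 25% test split holds just 8 cars, so the class mix in the test set can drift away from the mix in the full data. `train_test_split` accepts a `stratify` argument that keeps the class proportions similar in both subsets. The sketch below uses made-up stand-in arrays (`X_demo`, `y_demo`) rather than the mtcars frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 32 rows, 18 of class 0 and 14 of class 1
X_demo = np.arange(64).reshape(32, 2)
y_demo = np.array([0] * 18 + [1] * 14)

# stratify=y_demo keeps the class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print("test rows:", len(y_te))
print("class 1 share, full data:", y_demo.mean())
print("class 1 share, test set:", y_te.mean())
```

Whether stratification is worth it depends on the data; it matters most when a class is rare or the sample is as small as it is here.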

In [38]:
# Fit a decision tree and a random forest on the training data
tree = DecisionTreeClassifier().fit(X_train, y_train)
randomforest = RandomForestClassifier().fit(X_train, y_train)

# Predict the test-set labels with each fitted model
y_pred_tree = tree.predict(X_test)
y_pred_rf = randomforest.predict(X_test)
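Both classifiers involve randomness, so refitting them without a fixed seed can produce different trees and different predictions from run to run. Passing `random_state` makes a fit repeatable, and a fitted forest also exposes `feature_importances_`, which hints at which inputs drive the predictions. The sketch below runs on synthetic data; the column names `f0` to `f4` and the seed values are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up column names (not the mtcars features)
X_arr, y_demo = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3", "f4"])

# Fixing random_state makes the fitted forest repeatable between runs
forest_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
print((forest_a.predict(X_demo) == forest_b.predict(X_demo)).all())

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(forest_a.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances are a quick first look rather than a definitive ranking; they can favour features with many distinct values.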

In [39]:
# Compare the accuracy of both models on the test set
print("Decision Tree Accuracy:", metrics.accuracy_score(y_test, y_pred_tree))
print("Random Forest Accuracy:", metrics.accuracy_score(y_test, y_pred_rf))

Decision Tree Accuracy: 0.875
Random Forest Accuracy: 0.875
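The test set here holds 8 cars (25% of 32 rows), so an accuracy of 0.875 means 7 of 8 were classified correctly, and a single car moves the score by 12.5 points. On data this small, k-fold cross-validation usually gives a steadier estimate than one split. The sketch below shows the idea with `cross_val_score` on a synthetic 32-row stand-in; the data, the seeds, and the choice of 5 folds are assumptions, and the printed scores say nothing about mtcars.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 32 rows (an assumption; swap in the real X and y)
X_demo, y_demo = make_classification(
    n_samples=32, n_features=10, n_informative=4, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # 5-fold CV: every row is used for testing exactly once
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and the spread across folds gives a better sense of whether the forest really outperforms the single tree than two identical scores from one 8-row test set.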
