**Decision Tree Model vs. Random Forests**

Here we will learn more about Random Forests and compare them to a decision tree model, the CART model.  

Random Forest is a common default choice when no particular algorithm suggests itself for a data set, or when one wants to learn about the data before applying a more specialized model. It is sometimes described as a cure-all for data science problems, although no single algorithm suits every situation.

Random Forests can perform both regression and classification tasks. They also help with dimensionality reduction (through variable importance), can cope with missing values and outliers, and support other essential steps of data exploration, doing a fairly good job at each.

Let's get started by first looking at the Iris data and running the CART model to examine how decision trees work.  

Let's load in the Iris data and get familiar with the data set.  

Here, we can see that there are 150 observations (rows) with 5 variables (columns).  
It looks like there are three Iris flower species in the dataset.

In [None]:
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
# Load the packages needed to build a CART model (install them first if they are missing).
library(rpart)
library(caret)


data(iris)
str(iris)
summary(iris)
head(iris)

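# ggplot2 is already loaded above; qplot() is deprecated in recent ggplot2 releases, so use ggplot().
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) + geom_point()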

Loading required package: lattice


'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


Divide the data into training and testing sets, then compare the predictive power of the decision tree and the random forest on the testing set.

In [2]:
# Fix the random seed so the split is reproducible (the value 123 is arbitrary).
set.seed(123)

# flag holds the row numbers of a stratified 70% sample of iris (sampled within each Species).
# Those rows form the training set; the remaining rows of iris form the testing set.
flag <- createDataPartition(y=iris$Species,p=0.7,list=FALSE)

# training contains the rows of iris whose row numbers are in flag.
training <- iris[flag,]
nrow(training)

# testing contains the rows of iris that are not in flag.
testing <- iris[-flag,]
nrow(testing)

So we have 105 observations in training set and 45 in testing set.
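The 105/45 split follows from sampling 70% of the rows within each species (35 of 50 per class). The base-R sketch below reproduces that idea by hand. It illustrates stratified sampling only and is not caret's internal code; the names `train_sketch` and `test_sketch` and the seed value are assumptions made for this example.

```r
# Illustrative stratified 70/30 split in base R (not caret's implementation).
set.seed(1)  # arbitrary seed, assumed here for reproducibility

# Group the row numbers of iris by species.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Sample 70% of the row numbers within each species.
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}), use.names = FALSE)

train_sketch <- iris[train_rows, ]
test_sketch  <- iris[-train_rows, ]

# Each species contributes 35 rows to training and 15 to testing.
table(train_sketch$Species)
table(test_sketch$Species)
```

Sampling within each class keeps the three species equally represented in both sets, which a plain random sample of 105 rows would not guarantee.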

Build a CART model. The "caret" and "rpart" packages will be used to build the model. To draw the decision tree, the code below uses the "rpart.plot" package; the "rattle" package is an alternative that produces more polished trees which are easy to interpret.

In [None]:
fit <- train(Species~.,method="rpart",data=training)


In [None]:
# Code for generating decision tree plot
rpart_fit <- rpart(Species~.,method="class",data=training) 

library(rpart.plot)
rpart.plot(rpart_fit)

Now check the predictive power of the CART model that is just built. 

Use the number of misclassifications made by the tree as the evaluation criterion.

In [None]:
train.pred <- predict(fit, newdata = training)
conf <- table(train.pred, training$Species)

conf
sum(diag(conf))/sum(conf) #accuracy

There are just a few misclassifications out of 105 observations. 

Accuracy is calculated as the sum of the values on the diagonal of the confusion matrix divided by the sum of all its values.
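As a small worked example of that calculation, the sketch below uses a made-up confusion matrix. The counts are hypothetical and chosen only to sum to 105; they are not the output of the model above.

```r
# Hypothetical confusion matrix: rows are predicted classes, columns are true classes.
species <- c("setosa", "versicolor", "virginica")
conf_example <- matrix(c(35,  0,  0,
                          0, 33,  3,
                          0,  2, 32),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(predicted = species, actual = species))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(conf_example)) / sum(conf_example)

# Everything off the diagonal is a misclassification.
misclassified <- sum(conf_example) - sum(diag(conf_example))

accuracy       # 100 / 105, about 0.952
misclassified  # 5
```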

The misclassification rate indicates the model's predictive power. Once the model is built, it should be validated on a test set to see how well it performs on unseen data. This helps to determine whether the model is overfitted. If it is, validation will show a sharp decline in predictive power.

In [None]:
test.pred<-predict(fit,newdata=testing)
conf <- table(test.pred,testing$Species)

conf
sum(diag(conf))/sum(conf) #accuracy

The predictive power is usually lower on the testing set than on the training set. The reason is that the model is fitted to the training data and is then merely applied to the testing set.
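The gap between training and testing accuracy tends to be easier to see with a deliberately overgrown tree. The sketch below is an illustration under stated assumptions, not part of the original analysis: it builds its own split in base R, and the settings `cp = 0` and `minsplit = 2`, the helper `acc()`, and the seed are choices made for this example.

```r
library(rpart)

set.seed(42)  # arbitrary seed, assumed here for reproducibility

# Stratified split by hand: 35 rows per species for training, the rest for testing.
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(rows) sample(rows, 35)),
              use.names = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# cp = 0 and minsplit = 2 switch off the usual stopping rules, so the tree
# keeps splitting and follows the training data very closely.
deep_fit <- rpart(Species ~ ., data = train_set, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))

# Hypothetical helper: share of rows whose predicted class matches the true class.
acc <- function(model, data) {
  mean(predict(model, newdata = data, type = "class") == data$Species)
}

train_acc <- acc(deep_fit, train_set)
test_acc  <- acc(deep_fit, test_set)
c(train = train_acc, test = test_acc)
```

A training accuracy that is clearly higher than the testing accuracy is the pattern described above as overfitting.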

**Random Forest**

Run random forest algorithm on iris data to compare the results with CART model.


In [None]:
library(randomForest)

In [None]:
# randomForest() has no "method" argument; a factor response already implies classification.
RandomForest_fit <- randomForest(Species~.,data=training, ntree=100, importance=TRUE)

plot(RandomForest_fit)
legend("topright", colnames(RandomForest_fit$err.rate),col=1:4,cex=0.8,fill=1:4)

The plot shows how the error rate changes with the number of trees constructed. Experiment with the ntree argument to see its effect.

In [None]:
varImpPlot(RandomForest_fit)

**Gini importance:**

Every time a node is split on variable m, the Gini impurity of the two descendant nodes is lower than that of the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable importance measure that is often very consistent with the permutation importance measure.
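To make the Gini decrease concrete, the sketch below computes it for a single split of the full iris data. The helper `gini()` and the threshold 2.45 on Petal.Length are assumptions chosen for illustration; the randomForest package accumulates such decreases over every split in every tree, and its exact weighting may differ from this hand calculation.

```r
# Gini impurity of a set of class labels: 1 minus the sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Candidate split, chosen for illustration: Petal.Length < 2.45.
left  <- iris$Species[iris$Petal.Length <  2.45]
right <- iris$Species[iris$Petal.Length >= 2.45]

# Impurity before the split: three equally sized classes give 1 - 3 * (1/3)^2 = 2/3.
parent <- gini(iris$Species)

# Impurity after the split, weighted by the size of each child node.
children <- (length(left) * gini(left) + length(right) * gini(right)) / nrow(iris)

# Gini decrease credited to Petal.Length for this split.
decrease <- parent - children
decrease  # 1/3
```

The left node contains only setosa (impurity 0) and the right node holds versicolor and virginica in equal numbers (impurity 0.5), so the weighted impurity falls from 2/3 to 1/3.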

In [None]:
importance(RandomForest_fit)

In [None]:
# For comparison, fit a conditional inference tree with the "party" package.
# install.packages("party",repo="http://cran.mtu.edu/")

library('party')

ct = ctree(Species~., data = training)
plot(ct, main="Tree")

# Table of predicted vs. actual classes on the training set
table(predict(ct), training$Species)

# Estimated class probabilities for each training observation
train.pred = predict(ct, newdata=training, type="prob")

In [None]:
RF_fit <- train(Species~ .,method="rf",data=training)

In [None]:
train_RF_pred <- predict(RF_fit,training)

In [None]:
conf <- table(train_RF_pred,training$Species)

conf
sum(diag(conf))/sum(conf) #accuracy

The misclassification rate on the training data is 0/105. Validate on the test data to make sure that the model is not overfitted to the training data.

In [None]:
test_RF_pred<-predict(RF_fit,newdata=testing)

In [None]:
conf <- table(test_RF_pred,testing$Species)

conf
sum(diag(conf))/sum(conf) #accuracy

There are just a few misclassified observations out of 45, which is similar to the predictive power of the CART model. Compared to the training misclassification rate of 0/105, however, this is a noticeable drop in the model's predictive power.