## Random Forest Classifier

- **Install the prerequisite packages**<br>
install.packages("dplyr")<br>
install.packages("randomForest")<br>
The **stats** package ships with R, so it does not need to be installed.
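Re-running `install.packages()` on every session is unnecessary once the packages are present. A minimal sketch of checking first is shown below; `missing_packages` is a hypothetical helper written for this example, not a function from any package, and it relies only on base R's `requireNamespace()`.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages this notebook needs beyond base R
to_install <- missing_packages(c("dplyr", "randomForest"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means both packages are already available and the install step can be skipped.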

- **Load the required libraries**<br>
1] **stats** - provides R's core statistical functions (attached by default)<br>
2] **dplyr** - provides functions for data manipulation<br>
3] **randomForest** - provides the functions needed to build the random forest model

In [9]:
library(stats)
library(dplyr)
library(randomForest)

#Load the built-in iris dataset into the mydata object
mydata = iris

- __mydata__ has 150 observations of 5 variables

In [2]:
#inspect mydata
print(mydata)

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
13           4.8         3.0          1.4         0.1     setosa
14           4.3         3.0          1.1         0.1     setosa
15           5.8         
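Printing the whole data frame produces 150 rows of output. A shorter inspection is usually enough; the sketch below uses only base R functions (`head()`, `dim()`, `summary()`).

```r
mydata <- iris

# First six rows only, instead of all 150
head(mydata)

# Number of rows and columns
dim(mydata)

# Per-column summary: quartiles for numeric columns, counts for the factor
summary(mydata)
```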

Use **str()** to compactly display the internal structure of the object

In [3]:
#inspect the structure of mydata
str(mydata)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


**Factor** - Species<br>
A *factor* is R's way of representing a categorical variable<br><br>
**Predicted variable** - Species<br>
**Predictors** - Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
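Because `Species` is a factor, `randomForest()` treats the problem as classification rather than regression. A quick way to confirm the type and see the categories, using base R only:

```r
mydata <- iris

# Species is stored as a factor, i.e. a categorical variable
is.factor(mydata$Species)   # TRUE

# The three categories (factor levels)
levels(mydata$Species)      # "setosa" "versicolor" "virginica"

# Number of observations per category: 50 each
table(mydata$Species)
```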

-  Splitting the data into ***Training*** and ***Testing*** sets

In [4]:
#A vector that randomly assigns each row to group 1 (training, p = 0.7) or group 2 (testing, p = 0.3)
index = sample(2, nrow(mydata), replace = TRUE, prob = c(0.7, 0.3))

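- Training set - roughly 70% of the rows (each row is assigned at random, so the split is not exact)<br>
- Testing set - roughly 30% of the rows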
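Since `sample()` draws a group for every row independently, the group sizes change from run to run. The sketch below fixes a seed for reproducibility and also shows one way to get an exact 70/30 split. The seed value 42 and the name `train_rows` are assumptions made for this example.

```r
set.seed(42)  # arbitrary seed (an assumption); the original run was unseeded

# Per-row random assignment: group sizes vary around 70/30
index <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
table(index)
prop.table(table(index))

# Alternative: draw exactly 70% of the row numbers for training
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
Training <- iris[train_rows, ]
Testing  <- iris[-train_rows, ]
c(train = nrow(Training), test = nrow(Testing))   # 105 and 45
```

The first approach is what the notebook uses; the second is handy when fixed set sizes matter.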

In [5]:
#Training data
Training= mydata[index==1,]

#Testing data
Testing = mydata[index==2,]

#Random Forest Model (all four measurements as predictors of Species)
RFM = randomForest(Species~., data=Training)

#Predict the species of each row in the testing data
Species_Pred = predict(RFM, Testing)
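Before scoring the test set, it can help to look at the fitted model itself. The sketch below rebuilds the model in a self-contained way and prints its out-of-bag (OOB) error estimate and variable importance. The seed and the `importance = TRUE` argument are assumptions added for this example; the notebook's own call uses the defaults.

```r
library(randomForest)

set.seed(123)  # arbitrary seed (an assumption) so the run is reproducible
index <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
Training <- iris[index == 1, ]

# importance = TRUE additionally records permutation-based importance
RFM <- randomForest(Species ~ ., data = Training, importance = TRUE)

# OOB error estimate and the OOB confusion matrix on the training data
print(RFM)

# One row per predictor; larger values mean the variable matters more
importance(RFM)
```

The OOB estimate comes from the training data alone, so it complements rather than replaces the evaluation on the testing set.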

- The Species_Pred column is added to the **Testing** dataset

In [6]:
Testing$Species_Pred = Species_Pred
print(Testing)

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Species_Pred
1            5.1         3.5          1.4         0.2     setosa       setosa
3            4.7         3.2          1.3         0.2     setosa       setosa
4            4.6         3.1          1.5         0.2     setosa       setosa
6            5.4         3.9          1.7         0.4     setosa       setosa
8            5.0         3.4          1.5         0.2     setosa       setosa
11           5.4         3.7          1.5         0.2     setosa       setosa
13           4.8         3.0          1.4         0.1     setosa       setosa
18           5.1         3.5          1.4         0.3     setosa       setosa
19           5.7         3.8          1.7         0.3     setosa       setosa
21           5.4         3.4          1.7         0.2     setosa       setosa
23           4.6         3.6          1.0         0.2     setosa       setosa
25           4.8         3.4          1.9         0.2     setosa

In [7]:
#Build the confusion matrix (rows = actual species, columns = predicted species)
CFM = table(Testing$Species,Testing$Species_Pred)
CFM

            
             setosa versicolor virginica
  setosa         24          0         0
  versicolor      0         16         2
  virginica       0          0        15
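The confusion matrix also yields per-class measures. As a sketch, the block below copies the counts printed above into a matrix and derives recall and precision for each species; the variable names are chosen for this example.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the output above (rows = actual, columns = predicted)
CFM <- matrix(c(24,  0,  0,
                 0, 16,  2,
                 0,  0, 15),
              nrow = 3, byrow = TRUE,
              dimnames = list(actual = classes, predicted = classes))

# Recall: share of each actual class that was predicted correctly
recall <- diag(CFM) / rowSums(CFM)

# Precision: share of each predicted class that was actually correct
precision <- diag(CFM) / colSums(CFM)

round(rbind(recall, precision), 3)
```

Here the only errors are two versicolor flowers predicted as virginica, which lowers versicolor recall to 16/18 and virginica precision to 15/17.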

- **Calculate Accuracy**

In [8]:
Classification_Accuracy = sum(diag(CFM))/sum(CFM)
Classification_Accuracy
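For the confusion matrix shown earlier, the diagonal sums to 24 + 16 + 15 = 55 correct predictions out of 57 test rows. The sketch below rebuilds label vectors with those counts (an illustration, not the original test set) and shows that the table-based accuracy matches a direct comparison of labels.

```r
classes <- c("setosa", "versicolor", "virginica")

# Label vectors matching the counts in the confusion matrix above
actual    <- factor(rep(c("setosa", "versicolor", "versicolor", "virginica"),
                        times = c(24, 16, 2, 15)), levels = classes)
predicted <- factor(rep(c("setosa", "versicolor", "virginica", "virginica"),
                        times = c(24, 16, 2, 15)), levels = classes)

CFM <- table(actual, predicted)

# Accuracy from the table: correct predictions (diagonal) over all predictions
acc_table <- sum(diag(CFM)) / sum(CFM)

# Equivalent shortcut: share of rows where the prediction equals the actual label
acc_direct <- mean(actual == predicted)

c(acc_table, acc_direct)   # both 55/57, about 0.965
```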