## Random Forest
**Random Forest** is a widely used supervised machine learning algorithm. It can be applied to both classification (predicting a discrete-valued output, i.e. a class) and regression (predicting a continuous-valued output) tasks.

In this notebook, I show how to use it for a classification task on the popular Iris dataset.

A random forest is built from many decision trees, hence the name: together the trees form a forest.

### How do random forests work?
A random forest is a group (an ensemble) of individual decision trees, which is why the technique is a form of **Ensemble Learning**. A large group of relatively uncorrelated decision trees can produce more accurate and stable results than any individual decision tree.
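
One common way to make the trees less correlated is to train each one on a bootstrap sample, i.e. rows drawn with replacement from the training set (scikit-learn's `bootstrap=True` setting). The following is a minimal sketch of that sampling step using only the standard library. The helper name `bootstrap_indices` and the row count of 105 are illustrative assumptions, not part of any library.

```python
import random

def bootstrap_indices(n_samples, seed=None):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]

# Each hypothetical tree gets its own resampled view of the training rows.
n_rows = 105  # assumed training-set size, for illustration only
samples = [bootstrap_indices(n_rows, seed=s) for s in range(3)]

for i, idx in enumerate(samples):
    unique_fraction = len(set(idx)) / n_rows
    print(f"tree {i}: {unique_fraction:.0%} of the rows appear at least once")
```

On average roughly 63% of the rows show up in any one bootstrap sample, so each tree sees a somewhat different version of the data.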

When you train a random forest for a classification task, you actually train a group of decision trees. Then you obtain the predictions of all the individual trees and predict the class that gets the most votes. 
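
The voting step described above can be sketched in a few lines. The votes below are made up for illustration, and `majority_vote` is a hypothetical helper. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this shows the idea and not the library's exact procedure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical votes from five trees for a single flower
# (0 = setosa, 1 = versicolor, 2 = virginica)
votes = [2, 2, 1, 2, 1]
predicted = majority_vote(votes)
print(predicted)  # → 2
```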

Even if some individual trees predict incorrectly, many others predict correctly, so the group as a whole tends towards the right answer. This is called the wisdom of the crowd.
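
To get a feel for why a crowd of trees can beat a single tree, here is a small calculation under strong simplifying assumptions: two classes, every tree correct with the same probability, and errors that are fully independent (real trees are only partly decorrelated). The function name `majority_accuracy` and the value 0.6 are illustrative choices.

```python
from math import comb

def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are correct."""
    threshold = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(threshold, n_trees + 1)
    )

# Accuracy of the majority vote as the number of trees grows
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

Under these assumptions the majority becomes more reliable as trees are added, which is the intuition behind the ensemble.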

In [1]:
# Import the necessary modules

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import load_iris

import sklearn.metrics as metrics

#### Iris Dataset
The Iris dataset contains measurements of flowers from three species of the genus Iris: setosa, versicolor, and virginica.

**Problem:** Given some features of a flower, we have to identify which of the three species it belongs to.

**Solution:** This is a classification problem, so we can solve it with a supervised machine learning classification algorithm.

In [2]:
# Load the dataset

iris_data = load_iris()

iris = pd.DataFrame(iris_data.data)

# Shape of the dataset: (samples, features)

print("Dataset Shape: ", iris.shape)

# First five samples

print("Dataset: ", iris.head())

Dataset Shape:  (150, 4)
Dataset:       0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2


In [3]:
# Print the class names (setosa, versicolor, virginica)

print(iris_data.target_names)

# Print the feature names

print(iris_data.feature_names)

['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
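
The DataFrame printed earlier uses 0 to 3 as column headers. As an optional variation, the feature names and class names can be attached to the data to make it easier to read. This is a sketch, not the approach used in the rest of the notebook, and the column name `species` is my own choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()

# Use the feature names as column headers instead of 0..3
labeled = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Map the integer targets (0, 1, 2) to their species names
labeled["species"] = iris_data.target_names[iris_data.target]

print(labeled.head())
print(labeled["species"].value_counts())
```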


In [4]:
X = iris.values[:, 0:4]   # feature matrix: all four measurements

Y = iris_data.target      # integer class labels (0, 1, 2)

We split the dataset into training and test sets. The training set is used to fit the model, and the test set (here 30% of the samples, set by `test_size = 0.3`) is used to check how the model performs on data it has not seen.

In [5]:
# Splitting the dataset into train and test 

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100)
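
A quick way to sanity-check a split is to print the shapes and the class counts of each part. The sketch below also passes `stratify=y`, which is not used in this notebook. It is shown only as an optional variation that keeps the three classes equally represented in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same test_size and random_state as in the notebook, plus stratification (an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y
)

print(X_train.shape, X_test.shape)                 # rows and columns of each part
print(np.bincount(y_train), np.bincount(y_test))   # samples per class in each part
```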

In [6]:
# Define the random forest classifier

clfr = RandomForestClassifier(random_state=100)

# Train the model

clfr.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=100,
                       verbose=0, warm_start=False)
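
Once a forest is fitted, its individual trees and a per-feature importance score can be inspected through the `estimators_` and `feature_importances_` attributes. The sketch below fits on the full dataset for brevity rather than on the training split, so its numbers are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_data = load_iris()

# Fit on all 150 samples (a simplification for this illustration)
forest = RandomForestClassifier(n_estimators=100, random_state=100)
forest.fit(iris_data.data, iris_data.target)

# The fitted forest keeps its individual decision trees in estimators_
print("number of trees:", len(forest.estimators_))

# Impurity-based importance of each feature; the values sum to 1
for name, score in zip(iris_data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```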

In [7]:
# Predict the classes of the test samples

Y_pred = clfr.predict(X_test)

Y_pred

array([2, 0, 2, 0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 0,
       2, 0, 1, 2, 1, 0, 1, 2, 1, 1, 1, 0, 0, 1, 0, 1, 2, 2, 0, 1, 2, 2,
       0])

Now, let's check the accuracy of our model.

In [8]:
print("Accuracy:", metrics.accuracy_score(y_test, Y_pred))

# Confusion matrix: rows are actual classes, columns are predicted classes

cm = metrics.confusion_matrix(y_test, Y_pred)

cm

Accuracy: 0.9555555555555556


array([[16,  0,  0],
       [ 0, 10,  1],
       [ 0,  1, 17]])
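
The accuracy can be recovered from the confusion matrix itself: the diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total. The sketch below copies the matrix from the output above and also derives the recall of each class. The variable names are my own.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[16, 0, 0],
               [0, 10, 1],
               [0, 1, 17]])

# Correct predictions (diagonal) divided by all predictions
accuracy = np.trace(cm) / cm.sum()

# Fraction of each actual class that was predicted correctly
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

Here 43 of the 45 test samples lie on the diagonal, which matches the reported accuracy of about 0.956.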

In [9]:
# Make a prediction on new data

clfr.predict([[3, 5, 4, 2]])        # label 2 corresponds to virginica

array([2])
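
The classifier returns an integer label. To turn it back into a species name, index into the class names in the order printed by `iris_data.target_names` earlier. A minimal sketch with the names and the predicted label written out by hand:

```python
# Class names in index order, as printed by iris_data.target_names earlier
target_names = ["setosa", "versicolor", "virginica"]

predicted_labels = [2]  # the label returned by clfr.predict above

# Look up the species name for each predicted label
predicted_species = [target_names[label] for label in predicted_labels]
print(predicted_species)  # → ['virginica']
```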