# Random Forest
## Overview
- [1. Bagging](#1)
- [2. What is Random Forest?](#2)
- [3. Important Features of Random Forest](#3)
- [4. Important Hyperparameters](#4)
- [5. Pseudo-code](#5)
- [6. Implementation](#6)
- [7. Advantages and Disadvantages](#7)
- [8. References](#8)

<a name='1' ></a>
## 1. Bagging  
**Ensemble learning** is a general meta approach to machine learning that seeks better predictive performance by combining the predictions from multiple models. 

An ensemble model should satisfy two conditions:

- Diversity: The models should differ from one another and make errors that are as independent as possible. They may be trained on different samples or different features.
- Acceptability: Each model should perform reasonably well on its own. One way to check this is to compare it against a random baseline and confirm that it does better.

Although the number of ensembles you could build for a predictive modeling problem is practically unlimited, three methods dominate the field of ensemble learning: Bagging, Boosting, and Stacking. This lecture focuses only on **Bagging**.

### Bagging 
**Bagging**, also known as **Bootstrap Aggregation**, is the ensemble technique used by random forest. For each model, bagging draws a random sample of rows from the original data set with replacement (**row sampling**). The resulting samples are called bootstrap samples, and this sampling step is called **bootstrap**. Each model is then trained **independently** on its own sample and produces its own predictions. The final output combines the results of all models, by majority voting for classification or by averaging for regression. This combining step is known as **aggregation**.

<div style="width:image width px; font-size:80%; text-align:center;"><img src='images/Bagging.png' alt="alternate text" width="width" height="height" style="width:600px;height:350px;" />  </div>


*Note: When sampling is performed without replacement, it is called pasting. In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.*
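
The difference between the two sampling schemes can be seen on row indices alone. The following is a minimal sketch using NumPy; the toy sizes `n_rows` and `sample_size` are illustrative values, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10        # size of a toy data set (illustrative)
sample_size = 7    # rows drawn for each predictor (illustrative)

# Bagging: draw row indices WITH replacement, so a row may appear several times
bagging_idx = rng.choice(n_rows, size=sample_size, replace=True)

# Pasting: draw row indices WITHOUT replacement, so every drawn row is unique
pasting_idx = rng.choice(n_rows, size=sample_size, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

With pasting, a predictor can never see the same row twice, so `sample_size` cannot exceed `n_rows`. Bagging has no such limit.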

<a name='2' ></a>
## 2. What is Random Forest?
*Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the “bagging” method. The general idea of the bagging method is that combining several learning models gives a better overall result than any single model.*

Two key concepts that give it the name random:
- A random sampling of training data set when building trees
- Random subsets of features considered when splitting nodes

<a name='3' ></a>
## 3. Important Features of Random Forest
- **Diversity**: Each tree is built from a different sample of rows and considers only a subset of the features when splitting, so the trees differ from one another.
- **Less affected by the curse of dimensionality**: Since each tree does not consider all the features at once, the feature space it searches is reduced.
- **Parallelization**: Each tree is created independently from different data and attributes. This means that all available CPU cores can be used to build a random forest.
- **Train-Test split**: Each tree is trained on a bootstrap sample, so on average about 37% (roughly one-third) of the rows are not seen by that tree. These out-of-bag rows can be used to estimate performance without setting aside a separate validation set, although a held-out test set is still useful for a final check.
- **Stability**: Stability arises because the result is based on majority voting/averaging.
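
The share of rows that a single tree never sees can be estimated directly. For a bootstrap sample of size $n$ drawn from $n$ rows, the probability that a given row is never picked is $(1 - 1/n)^n$, which tends to $1/e \approx 0.368$ as $n$ grows. The sketch below checks this by simulation; the helper name `oob_fraction` is made up for this illustration.

```python
import numpy as np

def oob_fraction(n_rows: int, seed: int = 0) -> float:
    """Fraction of rows that one bootstrap sample of size n_rows misses."""
    rng = np.random.default_rng(seed)
    drawn = rng.integers(0, n_rows, size=n_rows)  # bootstrap row indices
    return 1.0 - np.unique(drawn).size / n_rows

n = 10_000
print(f"simulated : {oob_fraction(n):.3f}")
print(f"analytical: {(1 - 1 / n) ** n:.3f}")  # approaches 1/e as n grows
```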

<a name='4' ></a>
## 4. Important Hyperparameters
Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.

*The following hyperparameters affect predictive power:*
- **n_estimators**: number of trees the algorithm builds before taking the majority vote or averaging the predictions.
- **criterion**: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain.
- **max_features**: maximum number of features random forest considers when splitting a node.
- **min_samples_leaf**: the minimum number of samples required to be at a leaf node. A split is only considered if it leaves at least this many samples in each branch.
- **min_impurity_decrease**: A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

*The following hyperparameters control speed, reproducibility, and validation:*
- **n_jobs**: tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor; if the value is -1, it uses all available processors.
- **random_state**: controls the randomness of the sampling. The model always produces the same results if it is given a fixed random state, the same hyperparameters, and the same training data.
- **oob_score**: OOB stands for out-of-bag. It is a validation method built into random forest that plays a role similar to cross-validation. With bootstrap sampling, about one-third of the rows are left out when training each tree, and these rows are used to evaluate performance instead. They are called out-of-bag samples.
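
As a rough illustration, the hyperparameters above can be passed to scikit-learn's `RandomForestClassifier`. The synthetic data set and the specific values below are arbitrary choices for this sketch, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for this illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,           # number of trees
    criterion="gini",           # split quality measure
    max_features="sqrt",        # features considered at each split
    min_samples_leaf=2,         # minimum samples in a leaf
    min_impurity_decrease=0.0,  # minimum impurity decrease to split
    n_jobs=-1,                  # use all available processors
    random_state=0,             # reproducible sampling
    oob_score=True,             # evaluate on out-of-bag samples
)
forest.fit(X_demo, y_demo)
print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")
```

`oob_score=True` relies on bootstrap sampling, which is the default (`bootstrap=True`) for this estimator.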

<a name='5' ></a>
## 5. Pseudo-code
Given a training set $X = x_1, ..., x_n$ with responses $Y = y_1, ..., y_n$, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
1. Sample, with replacement, $n$ training examples from $X, Y$; call these $X_b, Y_b$.
2. Train a classification or regression tree $f_b$ on $X_b, Y_b$.

After training, the prediction for an unseen sample $x'$ is obtained by aggregating the predictions of all the individual trees on $x'$:
- For a regression problem: $\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$
- For a classification problem: $\hat{f} = \operatorname{mode}(\hat{Y})$, where $\hat{Y} = \{ f_b(x') \}_{b=1}^{B}$ and each $f_b(x')$ is a categorical (discrete) label.
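
The two aggregation rules can be sketched with NumPy on made-up per-tree predictions. The arrays below are hypothetical outputs of $B = 5$ trees for three unseen samples, and `majority_vote` is a helper written for this example.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per unseen sample
reg_preds = np.array([
    [2.0, 3.5, 1.0],
    [2.2, 3.0, 1.4],
    [1.8, 3.6, 0.9],
    [2.1, 3.4, 1.2],
    [1.9, 3.5, 1.0],
])
clf_preds = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [1, 1, 2],
    [0, 2, 2],
    [0, 1, 2],
])

# Regression: average the B predictions for each sample
reg_hat = reg_preds.mean(axis=0)

def majority_vote(votes):
    """Most frequent non-negative integer label per column (ties go to the smallest label)."""
    n_classes = votes.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)

# Classification: take the mode of the B predicted labels for each sample
clf_hat = majority_vote(clf_preds)

print(reg_hat)
print(clf_hat)  # [0 1 2]
```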

<a name='6' ></a>
## 6. Implementation
The code below implements a Random Forest for a classification problem.

```python
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from scipy import stats
```

```python
# IMPLEMENTATION
class RandomForest():
    """Implement Random Forest classifier from scratch using Decision Tree."""

    def __init__(
        self,
        n_estimators=100,
        criterion='gini',
        max_depth=None,
        min_samples_leaf=1,
        max_features='sqrt',
        min_impurity_decrease=0.0,
        random_state=0
    ):
        """
        Some important parameters in Random Forest.

        Args:
            n_estimators (int): The number of trees in the forest, default 100.
            criterion (str): The function to measure the quality of a split, default gini.
            max_depth (int): The maximum depth of the tree, default None.
            min_samples_leaf (int): The minimum number of samples required to be at a leaf node.
            max_features (str or int): The number of features to consider when looking
            for the best split, default sqrt.
            min_impurity_decrease (float): A node will be split if this split induces a decrease of
            the impurity greater than or equal to this value.
            random_state (int): Controls randomness of the sample, default 0.

        """
        self.n_estimators = n_estimators
        self.criterion = criterion
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.min_impurity_decrease = min_impurity_decrease
        self.random_state = random_state

    def fit(self, X, y):
        """Train one decision tree per random subsample of the rows."""
        self.n_samples, self.n_features = X.shape

        # Loop through all trees in the forest
        self.tree_lst = []
        for i in range(self.n_estimators):
            # Draw a different 70% of the rows for each tree. train_test_split
            # samples WITHOUT replacement, so this is a pasting-style
            # simplification of the bootstrap step described in Section 1.
            X_train, _, y_train, _ = train_test_split(
                X,
                y,
                test_size=0.3,
                random_state=self.random_state + i
            )
            # max_features='sqrt' is resolved by DecisionTreeClassifier itself,
            # which draws a random subset of features at every split.
            tree = DecisionTreeClassifier(
                criterion=self.criterion,
                max_depth=self.max_depth,
                min_samples_leaf=self.min_samples_leaf,
                max_features=self.max_features,
                min_impurity_decrease=self.min_impurity_decrease,
                random_state=self.random_state
            )
            tree.fit(X_train, y_train)
            self.tree_lst.append(tree)

    def predict(self, X_test):
        """Predict labels for X_test by majority vote over all trees."""
        # Array of shape (n_estimators, n_test_samples)
        predict_arr = np.array([tree.predict(X_test) for tree in self.tree_lst])

        # Most frequent label per column; np.squeeze covers SciPy versions
        # that keep the reduced axis in the result.
        predicted_labels = np.squeeze(stats.mode(predict_arr, axis=0)[0])
        return predicted_labels
```

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score

# Make dataset
X, y = datasets.make_blobs(n_samples=300, n_features=10, centers=3, cluster_std=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Initialize and fit model
clf = RandomForest(
    n_estimators=100,
    criterion='gini',
    max_depth=10,
    min_samples_leaf=1,
    max_features='sqrt',
    min_impurity_decrease=0.0,
    random_state=0
)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
print(f'Accuracy on train set: {accuracy_score(y_train, clf.predict(X_train))}')
print(f'Accuracy on test set: {accuracy_score(y_test, y_pred)}')
```

```
Accuracy on train set: 1.0
Accuracy on test set: 0.9111111111111111
```
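
The implementation above gives each tree a 70% subsample of the rows drawn without replacement. For comparison, the sketch below draws a true bootstrap sample ($n$ rows with replacement) for each tree, as described in Section 1. The helper names `fit_bootstrap_trees` and `predict_majority` and the choice of 25 trees are assumptions made for this illustration. It is a variant, not a drop-in replacement, and its accuracy will generally differ from the numbers printed above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

def fit_bootstrap_trees(X, y, n_estimators=25, max_features="sqrt", seed=0):
    """Fit each tree on n rows drawn with replacement from (X, y)."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    trees = []
    for i in range(n_estimators):
        idx = rng.integers(0, n_rows, size=n_rows)  # bootstrap row indices
        tree = DecisionTreeClassifier(max_features=max_features, random_state=seed + i)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, X):
    """Majority vote over trees; assumes non-negative integer class labels."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    n_classes = votes.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)

# Same synthetic data generator as in the example above
X_b, y_b = make_blobs(n_samples=300, n_features=10, centers=3, cluster_std=5, random_state=42)
trees = fit_bootstrap_trees(X_b, y_b)
labels = predict_majority(trees, X_b)
print(f"Training accuracy: {np.mean(labels == y_b):.3f}")
```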


<a name='7' ></a>
## 7. Advantages and Disadvantages
**Advantages of Random Forest algorithm:**
- Can handle large amounts of data and a large number of features.
- Can be used for both classification and regression tasks.
- The algorithm is easy to implement and use, although a forest is harder to interpret than a single decision tree.
- Random Forest is less prone to overfitting than a single decision tree. It builds many trees on different subsets of the data and combines their outputs, which reduces variance and therefore usually improves accuracy. The trees in a random forest are diverse and largely independent of each other.
- It is very stable because the result is a majority vote (or average) over all the trees.
- It is less affected by the curse of dimensionality: since each tree uses only a subset of rows and columns, the feature space it searches is reduced considerably.
- It is parallelizable: since the trees are built independently of each other, they can be trained separately and in parallel.

**Disadvantages of Random Forest algorithm:**
- The algorithm can be slow for real-time predictions because it has multiple decision trees.
- The algorithm may not work well with highly skewed data.
- The algorithm requires more computational resources than a single decision tree.
- It can be less accurate when the data set is small.

**More:** Run time depends on three things: the number of trees ($t$), the depth of each tree ($d$), and the number of rows in each tree ($m$). The run time for training is $O(t \cdot m \log d)$.

<a name='8' ></a>
## 8. References
- [https://en.wikipedia.org/wiki/Random_forest](https://en.wikipedia.org/wiki/Random_forest)
- [https://builtin.com/data-science/random-forest-algorithm](https://builtin.com/data-science/random-forest-algorithm)
- [https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/](https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/)
- [https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/](https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/)
- [https://madhuramiah.medium.com/introduction-to-ensembling-techniques-bagging-1458cfdb150c](https://madhuramiah.medium.com/introduction-to-ensembling-techniques-bagging-1458cfdb150c)
- [https://www.kdnuggets.com/2020/01/random-forest-powerful-ensemble-learning-algorithm.html](https://www.kdnuggets.com/2020/01/random-forest-powerful-ensemble-learning-algorithm.html)