<center><img src=https://quantdare.com/wp-content/uploads/2015/01/Random-forest-vs-simple-tree-1-800x420.jpg width=1000px alt="hilarious random forest visualisation"></center>
<center>don't tell me this isn't hilariously well-fitting to the topic, found it here: quantdare.com</center>

# <center>Random forests⚙️</center>

**What you can expect from this notebook:** Since I did a notebook on decision trees [here](https://www.kaggle.com/code/vincentbrunner/ml-from-scratch-decision-trees#1.-basic-intuition-behind-decision-trees), and decision trees alone aren't a very useful tool in most cases, I thought I'd do something similar with one of the most common ensembling algorithms that is ***based on decision trees***: random forests.

<div class="alert alert-block alert-info">👉If you're just interested in the complete, commented implementation of a random forest regressor using just numpy, the copy module and scikit-learn's decision tree as the base estimator, feel free to click on show hidden code: </div>

In [None]:
#  used for the implementation of the algorithm
import numpy as np
import copy
from sklearn.tree import DecisionTreeRegressor  # base estimator that gets ensembled

class RandomForestRegressor:
    def __init__(self, n_estimators, max_depth=None, min_samples_split=20, max_features=0.5, min_impurity_decrease=0):
        self.n_estimators = n_estimators
        #  with the max_features parameter the proportion of the randomly considered features at every split is determined
        self.base_estimator = DecisionTreeRegressor(max_depth=max_depth, 
                                                    min_samples_split=min_samples_split, max_features=max_features, 
                                                    min_impurity_decrease=min_impurity_decrease)
        self.estimators = None
        
    def fit(self, X, y):
        self.estimators = []
        #  repeat n times:
        for estimator_i in range(self.n_estimators):
            #  bootstrap the dataset
            idx = np.random.randint(low=0, high=len(X), size=len(X))
            X_bs = X[idx, :]
            y_bs = y[idx]
            
            #  fit a fresh, independent copy of the base estimator on the bootstrapped sample
            new_estimator = copy.deepcopy(self.base_estimator)
            new_estimator.fit(X_bs, y_bs)
            
            #  save estimator
            self.estimators.append(new_estimator)
    
    def predict(self, X):
        predictions = np.stack([estimator.predict(X) for estimator in self.estimators], axis=1)
        return predictions.mean(axis=1)

****

# intuition
The main idea behind the random forest algorithm is to **combine multiple decision trees** to make a **better estimator** than the individual decision trees, reducing the variance (but slightly increasing the bias).

<center><img src=https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif width=1000px></center>
<center> image source: blog.tensorflow.org </center>
<br>

Since combining identical decision trees wouldn't make much sense, **randomness** is introduced to the construction of each tree. This is done in the following way:
* training on a random selection of data points (each bootstrap sample contains about 63%, so roughly 2/3, of the distinct points of the original training set) -> this is done by bootstrapping/sampling from the training data with replacement
* considering a random subset of features at each split while fitting the decision tree estimators
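
The "around 2/3" figure for bootstrapping can be checked with a quick simulation. This is only a sketch: the sample size and the seed are arbitrary choices made for illustration, and the expected value uses the standard result that the share of distinct points approaches $1 - 1/e \approx 0.632$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_samples = 10_000              # arbitrary size of the imaginary training set

# draw n_samples row indices with replacement, exactly like a bootstrap sample
idx = rng.integers(low=0, high=n_samples, size=n_samples)

# share of distinct original rows that ended up in the bootstrap sample
unique_fraction = np.unique(idx).size / n_samples

# probability that a given row is drawn at least once
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

print(f"simulated: {unique_fraction:.3f}, expected: {expected_fraction:.3f}")
```

The remaining rows are simply duplicates, so every tree still gets a sample of the original size, just with less distinct information in it.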

To make a prediction, every tree makes its own prediction and these then get **aggregated** into one final prediction.

This leads to a reduction in variance of the model compared to individual decision trees but also to a slight increase of the bias due to fitting the trees on less training data (**a more detailed explanation of the variance-reducing aspect of bagging can be found in the next section**)



***This is the obvious strength of random forests compared to individual decision trees:*** decision trees are high-variance models that are very likely to overfit. By reducing the variance while keeping the bias as low as possible, the performance of the model on test data improves.

****

# Bagging: the ensembling method random forests are based on
**Main goal: reducing variance**<br>
**Combining estimators by aggregating results**

Bagging: ***bootstrap aggregation***<br>

Steps:
* Bootstrap sample (sample from the training set with replacement) -> each sample contains roughly 2/3 (about 63%) of the distinct training points
* Train estimator on bootstrapped sample
* Aggregate predictions of estimators

**Variance reducing aspect of Bagging:**
* Assuming that the predictions of all estimators follow the same underlying probability distribution with variance $\sigma^2$ and are independent of each other
* Taking the prediction from each estimator and aggregating them is like aggregating a sample with sample size n, where n is the number of estimators
* This way the variance of the sample mean (the prediction of the bagging algorithm) can be written as $\large SE^2=(\frac{\sigma}{\sqrt{n}})^2=\frac{\sigma^2}{n}$, the usual formula for the variance of the mean of n independent draws
* **As n, the number of estimators, increases, the variance of the sample mean therefore shrinks and the sample mean gets closer to the true population mean.**
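
A small simulation can make the $\frac{\sigma^2}{n}$ behaviour tangible. It is a sketch under strong assumptions: the "estimators" are just independent normal draws, and `sigma`, `n_estimators` and `n_repeats` are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0          # standard deviation of a single estimator's prediction (assumed)
n_estimators = 25    # number of estimators that get averaged
n_repeats = 50_000   # how many times the whole ensemble is simulated

# each row holds the predictions of n_estimators independent estimators
single_predictions = rng.normal(loc=0.0, scale=sigma, size=(n_repeats, n_estimators))

# the bagged prediction is the mean over the estimators
ensemble_predictions = single_predictions.mean(axis=1)

empirical_variance = ensemble_predictions.var()
theoretical_variance = sigma**2 / n_estimators

print(f"empirical: {empirical_variance:.4f}, sigma^2 / n: {theoretical_variance:.4f}")
```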

**Variance reducing aspect of Bagging taking the correlation of the estimators into account** (***not absolutely necessary***):<br>

*Since the estimators are trained on bootstrapped samples which originate from the same training data, there exists some amount of correlation between them.*
* Therefore the formula above doesn't exactly describe the variance, because it assumes independence
* If every pair of estimators has the same correlation $p$, the variance of the mean becomes $\large SE_{corrected}^2 = p\sigma^2 + \frac{(1-p)\sigma^2}{n}$
* $\large \lim \limits_{p\to1}\left(p\sigma^2 + \frac{(1-p)\sigma^2}{n}\right) = \sigma^2$
* $\large \lim \limits_{p\to0}\left(p\sigma^2 + \frac{(1-p)\sigma^2}{n}\right) = \frac{\sigma^2}{n}$
* $\large \lim \limits_{n\to\infty}\left(p\sigma^2 + \frac{(1-p)\sigma^2}{n}\right) = p\sigma^2$
* This is quite intuitive: as the correlation goes to 0, the variance becomes just the usual formula for independent estimators. As the correlation gets closer to 1, all estimators predict the same thing, averaging doesn't help anymore and the variance **tends toward $\sigma^2$**, the variance of a single estimator. And no matter how many estimators are added, the variance never drops below $p\sigma^2$
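
How much correlation limits the variance reduction can also be simulated. The sketch below assumes the idealised case in which every pair of estimators has the same correlation $p$; for that case the variance of the mean is $p\sigma^2 + \frac{(1-p)\sigma^2}{n}$. The function name `variance_of_mean` and all numbers are hypothetical choices for this illustration.

```python
import numpy as np

def variance_of_mean(p, sigma, n_estimators, n_repeats=50_000, seed=0):
    """Simulate the variance of the average of equally correlated predictions."""
    rng = np.random.default_rng(seed)
    # one component shared by all estimators plus one individual component each:
    # this gives every prediction variance sigma^2 and every pair correlation p
    shared = rng.normal(size=(n_repeats, 1))
    individual = rng.normal(size=(n_repeats, n_estimators))
    predictions = sigma * (np.sqrt(p) * shared + np.sqrt(1 - p) * individual)
    return predictions.mean(axis=1).var()

sigma, n_estimators = 2.0, 25
results = {}
for p in (0.0, 0.5, 1.0):
    simulated = variance_of_mean(p, sigma, n_estimators)
    exact = p * sigma**2 + (1 - p) * sigma**2 / n_estimators
    results[p] = (simulated, exact)
    print(f"p = {p}: simulated = {simulated:.3f}, formula = {exact:.3f}")
```

With $p = 0$ the result matches $\frac{\sigma^2}{n}$, with $p = 1$ averaging brings no improvement over a single estimator.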

**This shows that in order to decrease variance with ensembling, it isn't enough to ensemble a large number of estimators; they also have to be decorrelated as far as possible**<br>
-> that's why bagging bootstraps the training data


****

# the algorithm step by step

As already mentioned in the last section, the random forest algorithm is a **bagging method**.<br>
**To decorrelate the estimators further, additional randomness is introduced into the training of the individual decision trees.**

**quick bagging recap:**
* ***bagging = bootstrap aggregation***
* **bootstrapping**: drawing samples from the training set as if it were the population -> generating "new training sets"
    * then training multiple estimators on different bootstrap samples
* **aggregation**: combining individual data points into one
    * gets used to combine the individual predictions of the estimators
    
The random forest uses that concept and adds an additional element of randomness to each individual tree by only considering a random subset of n features at each split of the decision tree (if you're not familiar with the concept of splitting in decision trees I'd recommend reading through the corresponding section in my notebook ["ml from scratch: 🌳decision trees🌳"](https://www.kaggle.com/code/vincentbrunner/ml-from-scratch-decision-trees#1.-basic-intuition-behind-decision-trees)).
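
To make the feature subsampling concrete, here is a minimal sketch of how the candidate features for two different splits could be drawn. The variable names and the number of features are assumptions for illustration; `max_features` is meant as a proportion, like the argument of the same name used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 10     # total number of features in the (imaginary) dataset
max_features = 0.5  # proportion of features considered at every split

# number of features a single split is allowed to look at
n_considered = max(1, int(max_features * n_features))

# the candidates are drawn without replacement and redrawn for every split
candidates_split_1 = rng.choice(n_features, size=n_considered, replace=False)
candidates_split_2 = rng.choice(n_features, size=n_considered, replace=False)

print("split 1 may use features:", np.sort(candidates_split_1))
print("split 2 may use features:", np.sort(candidates_split_2))
```

Only the drawn columns are searched for the best threshold at that split, which is what keeps the trees from all picking the same dominant feature first.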

So the algorithm step by step looks something like the following (for regression):

**initialise** $\large M\in\mathbb{N}$ <br>
**initialise** $\large n\in\{1, ..., N\}$<br>

* **for** $\large m = 1, ..., M$**:**<br>
    1. **sample $X$ from training set $S$ with replacement ($X \sim S$) -> bootstrap sample**
    2. **fit decision tree $\large h_m(x)$ on bootstrapped sample $X$ considering n random features at each split** 

$\large F(x) = \frac{\sum_{m=1}^Mh_m(x)}{M}$<br>

where M is the number of estimators to be ensembled, n is the number of random features considered at each split and N is the total number of features.<br>

For classification, the predictions are aggregated by majority vote instead of the mean.
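
A majority vote can be sketched with plain numpy. The votes below are made up, the helper name `majority_vote` is hypothetical, and breaking ties in favour of the lowest class index is just a choice made for this sketch.

```python
import numpy as np

# hypothetical class predictions: one row per sample, one column per tree
tree_votes = np.array([
    [0, 1, 1],
    [2, 2, 0],
    [1, 1, 1],
    [0, 0, 2],
])

def majority_vote(votes):
    n_classes = votes.max() + 1
    # count how often each class was predicted for every sample
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    # argmax returns the first maximum, so ties go to the lowest class index
    return counts.argmax(axis=1)

final_prediction = majority_vote(tree_votes)
print(final_prediction)
```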

****

# python implementation
#### libraries used:

In [None]:
#  used for the implementation of the algorithm
import numpy as np # linear algebra 
import copy # deep copies of objects -> estimators

#  estimator to ensemble:
from sklearn.tree import DecisionTreeRegressor

#  used for data handling
import pandas as pd # loading and transforming data
from sklearn.model_selection import train_test_split # splitting data in train/test set

#### preparing the data:
no large-scale feature engineering, just preparing the data to work with the algorithm:
* encode categories into numerical values:
    * I went with label encoding because it generates fewer columns than one-hot encoding and therefore makes the fitting process faster. Also, I often found it to perform slightly better when dealing with tree-based methods
* imputing missing values because the algorithm can't handle NaNs
* **no feature scaling:** trees aren't affected by features on different scales, because a split only depends on the ordering of a feature's values and not on their magnitude
* **no feature selection:** trees perform their own feature selection when splitting, so this is only required when you have large data and want to save time
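
Why scaling doesn't matter for trees can be shown with a single split. The sketch below uses a hypothetical helper `best_split_partition` that does an exhaustive search for the threshold with the lowest squared error on one feature; the data is synthetic and only meant as an illustration.

```python
import numpy as np

def best_split_partition(x, y):
    """Return the boolean 'goes left' mask of the single split with the lowest squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_error, best_threshold = np.inf, None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between identical values
        left, right = y_sorted[:i], y_sorted[i:]
        error = np.square(left - left.mean()).sum() + np.square(right - right.mean()).sum()
        if error < best_error:
            best_error = error
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
    return x <= best_threshold

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x > 4, 3.0, -1.0) + rng.normal(scale=0.5, size=200)

mask_original = best_split_partition(x, y)
mask_rescaled = best_split_partition(x * 1000 + 7, y)  # same feature on a very different scale

print("same partition:", np.array_equal(mask_original, mask_rescaled))
```

The threshold itself changes with the scale, but the samples end up on the same side of the split, so the fitted tree makes the same predictions.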

In [None]:
house_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")

# fill nan values, encode categorical features
for feature in house_data.columns:
    if not pd.api.types.is_numeric_dtype(house_data[feature]):
        #  impute missing values with the mode of the corresponding variable
        #  (has to happen before encoding, because cat.codes would turn nan into -1)
        if house_data[feature].isna().any():
            house_data[feature] = house_data[feature].fillna(house_data[feature].mode().iloc[0])
        #  categorical encoding: turns [a, b, b, c] into [0, 1, 1, 2]
        house_data[feature] = house_data[feature].astype("category").cat.codes
    else:
        if house_data[feature].isna().any():
            #  impute missing values with the mean of the corresponding variable
            house_data[feature] = house_data[feature].fillna(house_data[feature].mean())

In [None]:
#  splitting into train and test data
features = house_data.loc[:, house_data.columns!="SalePrice"].to_numpy()
labels = house_data.loc[:, "SalePrice"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3)
house_data.head()

#### implementing the algorithm:

In [None]:
class RandomForestRegressor:
    def __init__(self, n_estimators, max_depth=None, min_samples_split=20, max_features=0.5, min_impurity_decrease=0):
        self.n_estimators = n_estimators
        #  with the max_features parameter the proportion of the randomly considered features at every split is determined
        self.base_estimator = DecisionTreeRegressor(max_depth=max_depth, 
                                                    min_samples_split=min_samples_split, max_features=max_features, 
                                                    min_impurity_decrease=min_impurity_decrease)
        self.estimators = None
        
    def fit(self, X, y):
        #  initialise empty list to save estimators after fitting
        self.estimators = []
        #  repeat n times (where n is the number of estimators):
        for estimator_i in range(self.n_estimators):
            #  bootstrap the dataset
            idx = np.random.randint(low=0, high=len(X), size=len(X)) # random indexes -> sampled with replacement
            X_bs = X[idx, :]
            y_bs = y[idx]
            
            #  fit a fresh, independent copy of the base estimator on the bootstrapped sample
            new_estimator = copy.deepcopy(self.base_estimator)
            new_estimator.fit(X_bs, y_bs)
            
            #  save estimator
            self.estimators.append(new_estimator)
    
    def predict(self, X):
        """
        * every estimator returns its predictions in the shape (len(X),) -> [a, b, ...]
        * stack the predictions of the estimators column-wise, so each row corresponds to a sample -> [[a1, a2], [b1, b2], ...]
        * average each row to get one final prediction per sample -> [mean(a1, a2), mean(b1, b2), ...]
        """
        predictions = np.stack([estimator.predict(X) for estimator in self.estimators], axis=1) 
        return predictions.mean(axis=1)

#### fit and evaluate the model:

In [None]:
#  fit model
regressor = RandomForestRegressor(200)
regressor.fit(X_train, y_train)

#  make predictions:
predictions = regressor.predict(X_test)

#  calculate rmse (take the mean of the squared errors first, then the root):
rmse = np.sqrt(np.square(y_test - predictions).mean())

#  calculate R2:
r2 = 1 - np.square(y_test - predictions).sum()/np.square(y_test - y_test.mean()).sum()

print(f"testing data: root mean squared error = {rmse}\nR-squared = {r2}")

#  make predictions:
predictions_train = regressor.predict(X_train)

#  calculate rmse (take the mean of the squared errors first, then the root):
rmse_train = np.sqrt(np.square(y_train - predictions_train).mean())

#  calculate R2:
r2_train = 1 - np.square(y_train - predictions_train).sum()/np.square(y_train - y_train.mean()).sum()

print(f"training data: root mean squared error = {rmse_train}\nR-squared = {r2_train}")
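
The two metrics can also be wrapped in small helpers and checked on a toy example. The function names and the numbers are made up for illustration. Note that the order of mean and root matters for the RMSE: taking the root of every squared error before averaging would give the mean absolute error instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    # square the errors, average them, then take the root
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

def r_squared(y_true, y_pred):
    # share of the variance of y_true that is explained by the predictions
    residual_sum = np.square(y_true - y_pred).sum()
    total_sum = np.square(y_true - y_true.mean()).sum()
    return 1 - residual_sum / total_sum

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])  # three perfect predictions, one large miss

toy_rmse = rmse(y_true, y_pred)
toy_mae = np.abs(y_true - y_pred).mean()
toy_r2 = r_squared(y_true, y_pred)

print(f"rmse = {toy_rmse}, mae = {toy_mae}, R-squared = {toy_r2}")
```

The single large miss is punished much harder by the RMSE than by the mean absolute error, and it even pushes R-squared below zero, meaning the predictions are worse than always predicting the mean.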

**That's all for this notebook, have a great day and happy learning!👋**