![logo](https://github.com/donatellacea/DL_tutorials/blob/main/notebooks/figures/1128-191-max.png?raw=true)

# Modeling with Random Forests

In this notebook we will show you how to train a Random Forest Classifier. You will learn how to properly pre-process your dataset and how to tune your Random Forest model to achieve the best performance.

--------

### Setup Colab environment

If you installed the packages and requirements on your own machine, you can skip this section and start from the import section.
Otherwise, you can follow and execute the tutorial in your browser. To start working on the notebook, click on the following button. This will open this page in the Colab environment, where you can execute the code on your own.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HelmholtzAI-Consultants-Munich/Zero2Hero---Introduction-to-XAI/blob/master/xai-FGC/tutorial_FGC_supplement.ipynb)



Now that you are visualizing the notebook in Colab, run the next cell to install the packages we will use.
There are a few steps you should follow in order to set the notebook up properly:

1. Warning: This notebook was not authored by Google. *Click* on 'Run anyway'.
2. When the installation commands are done, there might be a "Restart runtime" button at the end of the output. Please *click* it.

In [None]:
# %pip install palmerpenguins
# %pip install pandas_profiling==3.1.0

By running the next cell you are going to create a folder in your Google Drive. All the files for this tutorial will be uploaded to this folder. After the first execution you might receive some warnings and notifications. Please follow these instructions:
1. Permit this notebook to access your Google Drive files? *Click* on 'Yes', and select your account.
2. Google Drive for desktop wants to access your Google Account. *Click* on 'Allow'.

At this point, a folder has been created and you can navigate it through the left-hand panel in Colab. You might also have received an email informing you about the access to your Google Drive.

In [None]:
# Create a folder in your Google Drive
# from google.colab import drive                                                                          
# drive.mount('/content/drive')

In [None]:
# %cd drive/MyDrive

In [None]:
# Don't run this cell if you already cloned the repo in the first part of the tutorial
# !git clone https://github.com/HelmholtzAI-Consultants-Munich/Zero2Hero---Introduction-to-XAI.git 

In [None]:
# %cd Zero2Hero---Introduction-to-XAI/xai-FGC

### Import

In [1]:
# Load the required packages

import joblib
import numpy as np 
import pandas as pd
from pandas_profiling import ProfileReport

from palmerpenguins import load_penguins

from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## Theory: Introduction to Random Forest

### Decision Trees

Decision Trees are a non-parametric supervised learning method used for classification and regression (here, we focus on classification trees). The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree can be represented by a tree graph with one root node that contains the whole input data set, many internal nodes that represent the splitting points on the predictor variables X, and terminal nodes that are not split further and contain subgroups of Y belonging to a certain class, as shown in the figure below.

<center><img src="figures/decision_tree.png" width="300" /></center>


<font size=1 color='grey'>

**Example of a classification tree:** Classification of a response variable into two classes (red and yellow) in the two-dimensional input feature space with a classification tree. The classification tree makes two splits on feature A and feature B and has three terminal nodes, representing the subgroups of the response variable. 

The first step in constructing a decision tree is to split the root node based on a predictor variable X into two daughter nodes to improve the homogeneity of the response variable Y in each daughter node compared to the root node. Maximizing the homogeneity of the response variable Y is equivalent to minimizing the node impurity in both daughter nodes. The node impurity for a classification problem with K classes can be measured by different impurity indices. The most popular impurity index is the **Gini index**, with values between 0 (a pure node) and 1 - 1/K (all classes equally frequent). Smaller Gini index values indicate a purer node. This splitting step is repeated with the two daughter nodes, which now become internal nodes, to successively improve the homogeneity of the response variable in each daughter node until we reach a predefined stopping criterion. A natural stopping criterion is node purity, where the tree is grown until the terminal nodes are homogeneous, i.e. all members of a terminal node belong to the same class. This, however, often leads to overfitting, because the model becomes too complex and tends to learn the noise in the data as well. An appropriate stopping criterion strikes a balance between models that are too complex and overfit the data and models that are too simple and underfit the data, both of which lead to a high generalization error.
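To make the impurity computation concrete, here is a minimal sketch of the Gini index and of how a candidate split can be scored. The helper names `gini_impurity` and `split_impurity` and the toy labels are assumptions made for illustration; they are not part of scikit-learn or of this tutorial's code.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2, where p_k is the share of class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two daughter nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


# Toy root node with two equally frequent classes (hypothetical labels)
root = ["red"] * 5 + ["yellow"] * 5

# A split that separates the classes perfectly, and one that leaves some mixing
pure_split = split_impurity(["red"] * 5, ["yellow"] * 5)
mixed_split = split_impurity(["red"] * 4 + ["yellow"], ["red"] + ["yellow"] * 4)

print(gini_impurity(root))  # 0.5, the maximum for K = 2 classes
print(pure_split)           # 0.0
print(mixed_split)          # lower than the root, but not zero
```

A tree-growing procedure would evaluate many candidate splits in this way and keep the one with the lowest weighted impurity.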

The constructed decision tree can then be used to predict the class of a new unseen observation: start at the root node, drop the new observation down the left or right daughter node, depending on its value of the predictor variable that was used at that split, repeat until a terminal node is reached. For each new observation that falls into a certain terminal node we will make the same prediction, which is the majority class of the response values y in that terminal node.

<center><img src="./figures/decision_tree_predictions.png" width="300" /></center>

Decision trees are very sensitive to changes in the input data and are prone to overfitting when constructing trees that are too complex. To avoid such problems one can build a model based on an ensemble of decision trees, trained on bootstrapped input data.


### Ensemble Learning

The predictive performance of weak ML models like decision trees can be improved by a technique called ensemble learning, which combines a group of weak predictor models to form a strong ensemble learner. The idea behind ensemble learning is to improve the predictive performance by reducing the variance of the predictor. Common methods for ensemble learning are Bagging (Bootstrap Aggregation), Boosting and Randomization. Here, the focus is on Bagging, a method introduced by Breiman (Breiman, 1996) that can be used to aggregate multiple decision trees into a strong ensemble and that is used in the machine learning algorithm Random Forest.

In Bagging, trees are fully grown and hence have a low bias, but predictions are averaged over multiple trees, which reduces variance. To build a bagged model, B bootstrap samples are drawn from the training set and a decision tree is trained on each bootstrap sample. To obtain the predicted class for a new observation, the majority class across all trained decision trees in the bagged model is calculated. The validation error of the bagged model can be estimated during the training phase through the Out of Bag (OOB) error. For each tree, the OOB data are the training observations that were not drawn into its bootstrap sample, so they can serve as validation data for that tree. To calculate the OOB error, each training observation is classified by majority vote over only those decision trees for which that observation was part of the OOB data. The fraction of observations that are classified incorrectly in this way is the OOB error. It was shown that OOB error estimates are nearly identical to k-fold cross-validation estimates.
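The following minimal sketch illustrates where the OOB data come from. It draws a single bootstrap sample of row indices and counts how many observations were never drawn; the sample size and the seed are arbitrary choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n_samples = 1000
indices = list(range(n_samples))

# A bootstrap sample draws n_samples indices with replacement
bootstrap = [random.choice(indices) for _ in range(n_samples)]

# Observations that were never drawn are "out of bag" for the tree trained on this sample
oob = set(indices) - set(bootstrap)
oob_fraction = len(oob) / n_samples

print(f"OOB fraction in this bootstrap sample: {oob_fraction:.3f}")
print(f"Expected fraction (1 - 1/n)^n:         {(1 - 1 / n_samples) ** n_samples:.3f}")
```

On average roughly a third of the observations (about 1/e, or 36.8%) are out of bag for any given tree, which is why each training observation can be evaluated by a sizeable subset of the ensemble.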

The benefits of Bagging, i.e. variance reduction, are limited by the amount of correlation between predictor models. If decision trees are built on the same set of features, their structures are often highly correlated. To decrease the overall amount of correlation in the ensemble, the predictor models have to be decorrelated. The solution to this problem is implemented in an algorithm called Random Forest.

### Random Forest Algorithm

The Random Forest (RF) algorithm was introduced by Breiman (Breiman, 2001) and extends the Bagging algorithm by building an ensemble of decorrelated decision trees. Decision trees become correlated if only a few features are strong predictors of the response variable, leading to the majority of decision trees having a similar structure (the strong predictor is used as the first split in many trees) and therefore highly correlated predictions. To reduce the correlation between decision trees, RF performs random feature selection at each node prior to the selection of the optimal split. Hence, the reduction in node impurity is only computed on a random subset of predictor variables, which reduces the chance that strong predictors are always used as first splits.

In short, RF creates an ensemble of decision trees by fitting each decision tree to a different bootstrap sample, while selecting at each split a random subset of input features as candidates for splitting. The class of a new unseen observation x is then predicted as the majority class across the predictions for x made by all trees in the ensemble. By averaging the predictions over a large ensemble of decision trees with high variance but low correlation and low bias, RF is able to improve the variance reduction of Bagging and efficiently reduce both components of the generalization error, bias and variance.

<center><img src="./figures/random_forest.png" width="800" /></center>

A RF model has several hyperparameters that have to be tuned during the training process. Two of them can have a major influence on the performance of the RF model: the number of decision trees in the model and the number of predictor variables that are randomly chosen at each split. The generalization error of a RF model converges to a limit as the number of trees in the forest grows. Hence, the number of decision trees should be chosen as large as possible, limited by the available compute time, to improve the predictive power and avoid overfitting of the model. The number of randomly chosen predictor variables controls the amount of correlation between decision trees in the RF model. If we choose a value equal to the number of input features, the RF model reduces to Bagging on unpruned decision trees. The generalization error of a RF model depends on the strength of each individual decision tree (bias) and the correlation between those decision trees (variance). By reducing the number of randomly selected features, we reduce the variance of the model, but at the same time we increase the bias of each individual tree because we might not find the optimal predictor variable for each split. Hence, the number of randomly selected features is a tradeoff between bias and variance in the model, and we can use the OOB error to find the best tradeoff for our model.
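As a rough illustration of this tradeoff, the sketch below compares the OOB error for several values of `max_features` in scikit-learn's `RandomForestClassifier`. It uses synthetic data from `make_classification` rather than the penguins dataset, and the sample size, number of trees and candidate values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration, not the penguins dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

oob_errors = {}
for max_features in [1, 2, 4, 8]:  # 8 = all features, i.e. plain Bagging
    forest = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,  # compute the OOB accuracy during training
        random_state=0,
    )
    forest.fit(X, y)
    oob_errors[max_features] = 1.0 - forest.oob_score_

for max_features, error in oob_errors.items():
    print(f"max_features={max_features}: OOB error = {error:.3f}")
```

Which value gives the lowest OOB error depends on the dataset, so the numbers printed here say nothing about the penguins data.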

**It is time to load and explore the data that will be modeled with our RandomForestClassifier!**

## The Palmer Penguins Dataset: Data loading and exploration

In this course, we will work with the **Palmer penguins dataset**, which contains information on 3 different species of penguins (Adelie, Chinstrap, and Gentoo) that were observed in the Palmer Archipelago near Palmer Station, Antarctica. The dataset consists of a total of 344 penguins, together with their size measurements, clutch observations, and blood isotope ratios. Our goal is to predict the species of Palmer penguins and find out the major differences among them.

<center><img src="./figures/penguins.png" width="500" /></center>

<font size=1> Source:\
https://pypi.org/project/palmerpenguins/#description \
https://allisonhorst.github.io/palmerpenguins/

In [2]:
# Load the data
penguins = load_penguins()

# Inspect the data
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


Before modeling, it is important to add an exploratory data analysis step to get an understanding of the data. This step helps to identify patterns and problems in the data, and to decide which model or algorithm to use in the subsequent steps.

First, we will check how many samples and features the dataset has, whether all values are filled, and how our target variable (species) is distributed. To do so, we will use the package *pandas_profiling* and create a report within this notebook. Please run the following cell to create the report:

In [None]:
profile = ProfileReport(penguins, title="Penguins dataset report", config_file='config_minimal.yml')
profile

By examining the report, try to answer the following questions:

<font color='green'>

#### Question 1: Are there missing values in the dataset? If yes, how do we deal with this?

<font color='grey'>

#### Your Answer: 

Yes, we have missing values in the dataset. There are different options for dealing with this problem, and the chosen strategy depends heavily on the dataset and the context we are in. We can, for example, just omit the cases with missing values (if we do not lose too many samples) or use a missing value imputation method.

<font color='green'>

#### Question 2: Do we need to preprocess some features such that Random Forest can work with them?

<font color='grey'>

#### Your Answer: 

Yes, we need to encode categorical features because RF can't work with string values. These are the features 'island', 'sex', and the target variable 'species'.


## Data preprocessing

### Missing values handling

Based on what we saw in the exploratory analysis above, we need to do some preprocessing steps before we start training the model. First, we need to take care of the missing values. There are different options for dealing with this problem, and the strategy one chooses depends heavily on the dataset and the context we are in.

In this example, we will apply the most common approach and simply omit the cases with missing data and analyse the remaining data. However, be careful with this technique: check how many instances are left after this step, because a much smaller sample could hinder the training process. In that case we would need to think about other ways to solve the problem, such as applying missing data imputation strategies.

In [4]:
# Remove the instances with missing values and check how many we are left with:
print(f"Before omitting the missing values the dataset has {penguins.shape[0]} instances")
penguins.dropna(inplace=True)
print(f"After omitting the missing values the dataset has {penguins.shape[0]} instances")

Before omitting the missing values the dataset has 344 instances
After omitting the missing values the dataset has 333 instances
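If dropping rows had removed too many instances, imputation would be the alternative mentioned above. The sketch below shows one simple variant on a small toy table: the median for a numeric column and the most frequent value for a categorical column. The toy values are made up for illustration, and this is not the strategy used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with the same kind of gaps as the penguins table (values are made up)
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["male", "female", None, "female"],
})

# Count the missing values per column before deciding on a strategy
print(toy.isna().sum())

imputed = toy.copy()
# Numeric column: fill gaps with the median of the observed values
imputed["bill_length_mm"] = imputed["bill_length_mm"].fillna(imputed["bill_length_mm"].median())
# Categorical column: fill gaps with the most frequent observed value
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])

print(imputed)
```

Unlike `dropna`, this keeps all four rows, at the price of introducing values that were never actually measured.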


The new sample size is fully acceptable for the next step. Is our dataset ready to be used for training the model? Well...almost! What did we observe regarding feature transformation? Do we need to encode some of them? Yes, we do! 

### Encoding of categorical variables

Categorical features need to be encoded, i.e. turned into numerical data. This is essential because most machine learning models can only interpret numerical data and not data in a text form. As with many data preprocessing steps, there are multiple strategies one can apply to encode the categorical features. 

Here, we will use a simple **Label encoding** for the categorical features and for the target variable, which will transform the categorical feature values into unique integer values. 


In [5]:
data_penguins = pd.DataFrame(penguins.copy())

# Transform the target variable (Species) and the two categorical features (Sex, Island) with LabelEncoder

le1 = preprocessing.LabelEncoder()
data_penguins.species = le1.fit_transform(data_penguins.species)

# We can check how they are transformed:
print(pd.crosstab(penguins.species, data_penguins.species))
print('-----------------------')


le2 = preprocessing.LabelEncoder()
data_penguins.sex = le2.fit_transform(data_penguins.sex)

# We can check how they are transformed:
print(pd.crosstab(penguins.sex, data_penguins.sex))
print('-----------------------')


le3 = preprocessing.LabelEncoder()
data_penguins.island = le3.fit_transform(data_penguins.island)

# We can check how they are transformed:
print(pd.crosstab(penguins.island, data_penguins.island))
print('-----------------------')

species      0   1    2
species                
Adelie     146   0    0
Chinstrap    0  68    0
Gentoo       0   0  119
-----------------------
sex       0    1
sex             
female  165    0
male      0  168
-----------------------
island       0    1   2
island                 
Biscoe     163    0   0
Dream        0  123   0
Torgersen    0    0  47
-----------------------
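The crosstabs above show the mapping once, but the fitted encoders also store it. As a small sketch on toy labels (not the full dataset), `classes_` lists the original categories in the order of their integer codes, and `inverse_transform` maps integer codes back to the original strings, which is useful when interpreting model predictions later on.

```python
from sklearn import preprocessing

# Toy labels for illustration
species = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(species)

# Classes are sorted alphabetically; the position in classes_ is the integer code
print(list(encoder.classes_))  # ['Adelie', 'Chinstrap', 'Gentoo']
print(list(codes))             # [2, 0, 1, 0]

# Map the integer codes back to the original labels
print(list(encoder.inverse_transform(codes)))
```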


Now our dataset has no missing data anymore and features are transformed in a way suitable for training. Let us save the preprocessed data since we will need it afterwards. Then, we are ready to start to train the Random Forest model! 

In [6]:
data_penguins.to_csv('./data/data_penguins_processed.csv')

## Training the Random Forest Classifier

### Hyperparameters of Random Forest

The hyperparameters of the model are configured up-front and are provided by the caller of the model before the model is trained. They guide the learning process for a specific dataset and hence, they are very important for training a machine learning model. 

Some important hyperparameters for Random Forest models:

- n_estimators = number of trees in the model
- criterion = a function to measure the quality of the split
- max_depth = maximal depth of the tree (the longest path between the root node and the leaf node)
- max_samples = which fraction of the original dataset is given to each tree in the forest
- max_features = maximum number of features to consider when doing a split

The full list of hyperparameters of the Random Forest models can be found in the scikit-learn documentation.

In order to choose the optimal hyperparameters of the model, we will systematically search through different values for the Random Forest hyperparameters and choose the set of hyperparameters that results in the model with the best performance on a given dataset. To do this, we will define a search space as a grid of hyperparameter values and evaluate every position in the grid. This hyperparameter optimization technique is called **grid-search**.

<font color='green'>

#### Question 3: What influence can the hyperparameters *n_estimators* and *max_depth* have on the Random Forest model?

<font color='grey'>

#### Your Answer: 

- *max_depth* choosing a high tree depth can lead to overfitting of the Random Forest model
- *n_estimators* choosing a large number of estimators (decision trees) can improve the predictive power and avoid overfitting of the model

<font color='green'>

#### Question 4: Why can you use the OOB score (oob_score = True) to estimate the generalization score?

<font color='grey'>

#### Your Answer: 

Because the OOB score is calculated from the out-of-bag samples. Those samples are left out of the training of the respective decision tree and can be used as a validation set.

Now that we have learned about the hyperparameters of Random Forest and had a look at the choices we have for the Random Forest algorithm, it is time to define the grid of hyperparameters we want to evaluate. The grid-search technique that we will use in this example searches through every combination of the hyperparameters you define. Hence, the run time can increase very quickly, which should be taken into account when training the model. For the sake of example, we will define a rather small grid of hyperparameters and store it as a dictionary object.

In [7]:
hyper_grid_classifier = {'n_estimators': [100, 1000], 
            'max_depth': [2, 5, 10], 
            'max_samples': [0.8],
            'criterion': ['gini', 'entropy'],
            'max_features': ['sqrt','log2']
}
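To see how quickly the search space grows, the following sketch enumerates all combinations of the grid defined above with `itertools.product`. The variable name `grid` and the explicit enumeration are only for illustration; `GridSearchCV` performs this expansion internally.

```python
from itertools import product

# Same values as in hyper_grid_classifier above
grid = {
    'n_estimators': [100, 1000],
    'max_depth': [2, 5, 10],
    'max_samples': [0.8],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2'],
}

# Every combination of one value per hyperparameter is one candidate model
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 5  # each candidate is fitted once per cross-validation fold

print(len(candidates))            # 24
print(len(candidates) * n_folds)  # 120
```

Adding a single extra value to one hyperparameter multiplies the number of candidates, so the grid should be extended with care.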

Feel free to change the grid based on your acquired knowledge and research on Random Forest hyperparameters! Just keep an eye on the computation time for now.

Now we will start the training process. First, we define an instance of the RandomForestClassifier. Then, we run GridSearchCV with 5-fold cross-validation, using the grid we defined. The model with the best hyperparameters is stored as the `best_estimator_` attribute of the GridSearchCV instance.

If you run the cell, it is going to take around 2 minutes. Otherwise, feel free to load the pre-trained model that we prepared for you by uncommenting and running the following cell:

In [8]:
# rf = joblib.load('./models/random_forest_penguins.joblib')

In [9]:
# A Random Forest instance from sklearn requires a separate input of feature matrix and target values. 
# Hence, we will first separate the target and feature columns. 
X_penguins = data_penguins.loc[:, data_penguins.columns != 'species']
y_penguins = data_penguins.species

# Define a classifier. We set the oob_score = True, as OOB is a good approximation of the test set score
classifier = RandomForestClassifier(oob_score=True, random_state=42, n_jobs=1)

# Define a grid search with 5-fold CV and fit 
gridsearch_classifier = GridSearchCV(classifier, hyper_grid_classifier, cv=5, verbose=1)
gridsearch_classifier.fit(X_penguins, y_penguins)

# Take the best estimator
rf = gridsearch_classifier.best_estimator_

Fitting 5 folds for each of 24 candidates, totalling 120 fits


In [10]:
# Check the results
print('Parameters of best prediction model:')
print(gridsearch_classifier.best_params_)
print('OOB accuracy of prediction model:')
print(rf.oob_score_)

Parameters of best prediction model:
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 'sqrt', 'max_samples': 0.8, 'n_estimators': 100}
OOB accuracy of prediction model:
0.984984984984985
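Note that the grid search selects the best candidate by its mean cross-validated accuracy, while `oob_score_` is computed afterwards on the best estimator refitted on the full data. To look at the cross-validation scores of all candidates, `cv_results_` can be turned into a table. The sketch below does this on synthetic data with a deliberately tiny grid; the data, the grid and the name `search` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid, chosen only to keep the sketch fast
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {'max_depth': [2, 5], 'max_features': ['sqrt', 'log2']},
    cv=5,
)
search.fit(X, y)

# One row per candidate, with the mean and spread of its scores across the folds
results = pd.DataFrame(search.cv_results_)
summary = results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

The same two lines that build `results` and `summary` should work on `gridsearch_classifier` from the cell above.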


If you ran the training cell, please save your trained model for future use:

In [11]:
# Save the model with joblib
filename_model = './models/random_forest_penguins.joblib'
joblib.dump(rf, filename_model)

Great, you have now trained your Random Forest model! It achieved a high OOB accuracy of about 98%!