# An Introduction to the Random Forest Algorithm 

Author: Fatemeh (Fatima) Bagheri

Date Created: November 1, 2022, Last Modified: December 19, 2022

Notebook 1/2 of the DSECOP Module: An Introduction to the Random Forest algorithm.

## First, what is it?

Let's start with a question: how do you know whether an exoplanet is a rocky planet or a Jovian (Jupiter-like) planet? In other words, how can we classify exoplanets by type? There are many methods for classifying data; one of the most robust is the **Random Forest** algorithm. According to Wikipedia [https://en.wikipedia.org/wiki/Random_forest], the random forest method is defined as follows:

"*Random forests or random decision forests is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.*"

This definition contains a lot of jargon, which makes it challenging to understand, especially for a beginner. In this lecture, we will explain each term in the definition and try to implement the method to classify exoplanets using Python.

Let's look again at the Wikipedia definition; the first interesting phrase is *random decision forests*. Forest means several trees (kinda!), so it basically means several random *decision trees*. Thus, for the first step, our task is to find out the meaning of a **decision tree**.



## Decision Tree

Suppose we have a sample of flags, like below, and want to know which flag is for which country by drawing a decision tree. 
![RF0-2.png](attachment:RF0-2.png)

To find out the country of the flag, we can ask a few questions: 

* are the stripes in the flag vertical or horizontal?
* how many different colors does it have?
* does it have a symbol?

Based on these questions, we can make our decision tree:
![RandomForest1.png](attachment:RandomForest1.png)

The figure shows an example of a decision tree. The top level of the tree is level 0, which has a single node containing all the data; we call it the *root node*. It is followed by level 1, with two nodes, and level 2, with four nodes, and so on; these are called *interior nodes*. The last level has 6 nodes, which we call *leaf nodes*. The tree in the figure has a depth of 3.

In this example, we have split the flags into groups with similar *features*, for instance, a group of flags with horizontal stripes. A decision tree is therefore basically a flowchart in which each node splits a group of data according to some feature variable. The goal of a decision tree is to split your data into groups such that every element in a group belongs to the same category. A decision tree is a type of non-parametric supervised learning algorithm that is used in both classification and regression problems.
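To make the flowchart idea concrete, here is a minimal sketch of a hand-written decision tree in Python. The four flags and their feature values are illustrative stand-ins chosen for this example, not necessarily the flags in the figure, and the function name `classify_flag` is hypothetical. Only two of the three questions are needed to separate these four flags.

```python
# Illustrative flags described by two of the questions above.
# These are stand-ins for the example, not the flags shown in the figure.
flags = {
    "France":  {"stripes": "vertical",   "n_colors": 3},
    "Nigeria": {"stripes": "vertical",   "n_colors": 2},
    "Germany": {"stripes": "horizontal", "n_colors": 3},
    "Austria": {"stripes": "horizontal", "n_colors": 2},
}


def classify_flag(flag):
    """Walk a hand-written decision tree from the root node to a leaf node."""
    # Root node (level 0): split on the direction of the stripes.
    if flag["stripes"] == "vertical":
        # Interior node (level 1): split on the number of colors.
        if flag["n_colors"] == 3:
            return "France"   # leaf node
        return "Nigeria"      # leaf node
    # Interior node (level 1) for flags with horizontal stripes.
    if flag["n_colors"] == 3:
        return "Germany"      # leaf node
    return "Austria"          # leaf node


for country, features in flags.items():
    print(country, "->", classify_flag(features))
```

Here the questions were chosen by hand. A decision tree algorithm instead learns which feature to split on at each node from the training data.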

## Ensemble Learning Method 

Now let's go back again to the Wikipedia definition mentioned above. The second interesting phrase in the definition is *ensemble learning method*. So, the random forest method is an ensemble or group of decision trees. But why do we need more than one tree? The answer is rooted in the structure of the decision trees. Consider the example of flags; we can also classify the flags with a different decision tree like:
![RF2.png](attachment:RF2.png)

The point is that having more than one decision tree generally improves the accuracy of the model; *two heads are better than one!*
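A small simulation can illustrate why a group tends to do better than an individual. The sketch below is not a random forest: it assumes each "tree" is simply a voter that is right 60% of the time, independently of the others (both the 60% figure and the independence are assumptions made for illustration). Real trees trained on related data are correlated, so the gain in practice is smaller.

```python
import random


def majority_is_correct(n_voters, p_correct, rng):
    """Return True if more than half of n_voters independent voters are right."""
    n_right = sum(rng.random() < p_correct for _ in range(n_voters))
    return n_right > n_voters / 2


rng = random.Random(0)   # fixed seed so the run is repeatable
n_trials = 2000
p_correct = 0.6          # assumed accuracy of each individual voter

# Fraction of trials in which a single voter is right.
single_acc = sum(
    majority_is_correct(1, p_correct, rng) for _ in range(n_trials)
) / n_trials

# Fraction of trials in which the majority of 25 voters is right.
ensemble_acc = sum(
    majority_is_correct(25, p_correct, rng) for _ in range(n_trials)
) / n_trials

print(f"one voter: {single_acc:.2f}")
print(f"25 voters: {ensemble_acc:.2f}")
```

Under these assumptions, the majority of 25 voters is right noticeably more often than any single voter.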

## Regression or Classification

Let's return once more to the Wikipedia definition. As mentioned earlier, one of the goals of the random forest algorithm is to classify the data into groups with similar features, but it can also be used for regression problems. A classification problem is one where the output is discrete, such as binary classification, where the answer is yes or no (0 or 1). In contrast, a regression problem is one where the output is a continuous value. In the random forest algorithm, you can use decision trees either to classify data or to predict a continuous output quantity.


## Vote!

Last but not least is the voting concept in the random forest method! *The output of the random forest is the class selected by most trees*. This means that in a classification problem, the result is the class chosen by the majority of the decision trees, while in a regression problem, the output is the average of the values predicted by the individual trees.
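The two ways of combining the trees' answers can be sketched in a few lines of Python. The predictions below are made-up values standing in for the outputs of five trees, and the function names `majority_vote` and `average_prediction` are hypothetical helpers for this example.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Classification: return the class predicted by the most trees."""
    # If two classes are tied, Counter returns the one that appeared first.
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Regression: return the average of the trees' predicted values."""
    return mean(predictions)


# Made-up outputs of five trees for a single exoplanet.
tree_classes = ["rocky", "Jovian", "rocky", "rocky", "Jovian"]
tree_values = [1.0, 1.5, 2.0, 1.5, 4.0]

print(majority_vote(tree_classes))      # rocky (3 votes against 2)
print(average_prediction(tree_values))  # 2.0
```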

## Training and Test Sets

At this point, we mostly understand the random forest algorithm. But still, there are some concepts that we should know before implementing the random forest method for our problem of classification of exoplanets.

The random forest method is a **supervised** algorithm, which means that we first train the model and then use it to make predictions. So, we start with a dataset called the **training set**. For this dataset, we *know* the output; for example, in our case, we know the type of each exoplanet in the training set. Using this dataset, the algorithm learns the relationship between the features and the known output, and then applies that relationship to classify a completely new dataset, which we call the **test set**. For more information on training and test sets, see the Introduction to Deep Learning module.
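In practice, a common way to obtain the two sets is to hold part of a labeled dataset aside so that it can play the role of the new data. The following is a minimal sketch of that idea using only the standard library. The planet radii and labels are invented for illustration, and `split_train_test` is a hypothetical helper, not a library function.

```python
import random

# Invented labeled data: (radius in Earth radii, known planet type).
planets = [
    (1.1, "rocky"), (0.8, "rocky"), (1.4, "rocky"), (0.6, "rocky"), (1.7, "rocky"),
    (11.2, "Jovian"), (9.5, "Jovian"), (12.8, "Jovian"), (10.1, "Jovian"), (13.4, "Jovian"),
]


def split_train_test(data, test_fraction, seed):
    """Shuffle a copy of data and split it into (training set, test set)."""
    shuffled = data[:]                     # copy, so the original order is kept
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = split_train_test(planets, test_fraction=0.3, seed=42)
print(len(train_set), "training examples,", len(test_set), "test examples")
```

The model only ever sees `train_set` during training; `test_set` is kept aside until the trained model is used to make predictions.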

## Bootstrapping

As we said earlier, the random forest is an ensemble of decision trees, and each decision tree must be trained on a training set. There are two ways to use a training set in the random forest method: 1) you can use the same training set for all decision trees (which is not recommended!); or 2) you can draw a **bootstrap sample** of the training dataset for each decision tree. A bootstrap sample is a sample of the training dataset in which a data point may appear more than once; this is called sampling with replacement. Bootstrap sampling makes the trees differ from one another, so their predictions and prediction errors are less correlated.
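Sampling with replacement can be sketched with the standard library's `random.choices`. In this example the training set is just the integers 0 to 9 standing in for ten data points, and `bootstrap_sample` is a hypothetical helper name. Each sample has the same size as the training set, which is a common convention that is assumed here.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(1)          # fixed seed so the run is repeatable
training_set = list(range(10))  # stand-in for ten training data points

# Draw one bootstrap sample per tree, here for three trees.
for tree_index in range(3):
    sample = bootstrap_sample(training_set, rng)
    left_out = sorted(set(training_set) - set(sample))
    print(f"tree {tree_index}: sample={sorted(sample)} left out={left_out}")
```

Because points are drawn with replacement, a sample typically contains some points more than once and leaves others out, so each tree sees a slightly different version of the training set.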

