# 8. Random Forest

Random forest (also called random decision forest) is a **supervised machine learning algorithm** used for both **classification** and **regression** tasks. It is an **extension of decision tree algorithms** that addresses their limitations by **combining multiple trees** to create a more robust and accurate predictive model.

Random forest is known for its ability to handle complex relationships, reduce overfitting, and provide insight into the importance of features.

## Overview
1. How it works
    * 1.1. What is a decision tree?
    * 1.2. What are ensemble methods?
    * 1.3. What is Bagging and Boosting?
    * 1.4. Steps in Random Forest
    * 1.5. Advantages
    * 1.6. Hyperparameters
2. Assumptions
3. Common pitfalls
4. Implementation examples
    * 4.1 Classification
    * 4.2 Regression

## 1. How it works
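Random forest builds an ensemble of decision trees. Each tree is trained on a different bootstrap sample of the training data (rows drawn at random with replacement) and considers only a random subset of the features when choosing each split.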

**Ensemble methods** are machine learning techniques that **combine the predictions of multiple individual models** to produce a more robust overall prediction. The idea behind ensemble methods is to use the diversity of different models to improve the overall accuracy, stability, and generalization of the predictive model.

<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN2KPoea9rFZo4nb0SZKrBrEUjNv-xaqB7gF6Htl5lY5AtOmKH1yFalD9Y6XHNNgtUYqsJCPUr-7a4MJIvdcubXogxerrskVqKfQGhKSpUyrnroLhEi6P5vMXqYE22J3_dnLRuWiBv5Nw/s0/Random+Forest+03.gif" width="800px">

### 1.1. **What is a decision tree?**
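A decision tree is a model that predicts a target by asking a sequence of simple questions about the features, for example "is `age` below 30?". Each internal node tests one feature against a threshold, each branch corresponds to an outcome of that test, and each leaf holds a prediction (a class for classification, a number for regression). Decision trees are popular in machine learning, data mining, and statistics because the learned rules are clear and intuitive to read. A single deep tree, however, tends to overfit its training data, which is the limitation that random forest addresses.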
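
As a minimal sketch of the idea, the snippet below fits a shallow tree on scikit-learn's bundled Iris dataset and prints the learned rules as nested threshold tests. The dataset and the depth limit of 2 are illustrative assumptions, not choices made elsewhere in this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative dataset (assumption): Iris ships with scikit-learn, no download needed
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# A depth-2 tree asks at most two questions before predicting a class
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(X_iris, y_iris)

# export_text renders the fitted tree as human-readable threshold rules
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printed output is one test of a feature against a threshold, and each `class:` line is a leaf.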

### 1.2. **What are ensemble methods?**
Ensemble learning models work just like a **group of diverse experts teaming up to make decisions**. In ensemble learning, several models, either of the same type or of different types, are combined to enhance predictive performance.

Some popular ensemble models include **XGBoost**, AdaBoost, LightGBM, **Random Forest**, Bagging, and Voting.
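
To make the "group of experts" picture concrete, here is a small sketch of a voting ensemble that combines three different model types. The dataset (Iris) and the three member models are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (assumption): Iris, split into train and test parts
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Three different "experts"; with voting="hard" the majority class label wins
voter = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)

ensemble_accuracy = voter.score(X_test, y_test)
print(f"Voting ensemble accuracy: {ensemble_accuracy:.2f}")
```

Random forest follows the same principle, except that all of its members are decision trees and the diversity comes from random sampling of rows and features.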


### 1.3. **What is Bagging and Boosting?**
* **Bagging** (bootstrap aggregating) is an ensemble method in which **multiple models are trained independently on different bootstrap samples** of the training data (random samples drawn with replacement). The final prediction is made by **averaging the predictions** of the individual models for regression problems and by taking the **majority vote** for classification problems.

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20210707140912/Bagging.png" width="800px">

* **Boosting** is an ensemble method in which **multiple weak models are trained sequentially**, and **each model tries to correct the errors made by the previous models**. In AdaBoost, for example, each model is trained on a reweighted version of the dataset in which the instances misclassified by the previous models are given more weight, and the final prediction is made by **weighted voting**. Gradient boosting methods such as XGBoost and LightGBM instead fit each new model to the remaining errors (residuals) of the current ensemble.

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20210707140911/Boosting.png" width="800px">
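
The two strategies can be compared side by side with scikit-learn's `BaggingClassifier` and `AdaBoostClassifier`. This is only a sketch: the synthetic dataset, the number of estimators, and the tree depths are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 500 samples, 10 features, 5 of them informative
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Bagging: 50 full-depth trees, each fit independently on its own bootstrap sample.
# The base model is passed positionally because its keyword name differs
# between scikit-learn versions (`estimator` vs. the older `base_estimator`).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)

# Boosting (AdaBoost): up to 50 depth-1 trees ("stumps") fit one after another,
# with misclassified samples given more weight before the next stump is trained
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42)
boosting.fit(X_train, y_train)

bagging_accuracy = bagging.score(X_test, y_test)
boosting_accuracy = boosting.score(X_test, y_test)
print(f"Bagging accuracy:  {bagging_accuracy:.2f}")
print(f"Boosting accuracy: {boosting_accuracy:.2f}")
```

Which of the two scores higher depends on the data; the point of the sketch is the difference in how the members are trained, not a ranking of the methods.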

### 1.4. **Steps in Random Forest**
1. Choose the number N of decision trees that you want to build.
2. Draw a random sample of K data points from the training set (in practice, a bootstrap sample drawn with replacement).
3. Build a decision tree on the selected data points, considering only a random subset of the features at each split.
4. Repeat steps 2 and 3 until N trees have been built.
5. For a new data point, collect the prediction of each decision tree. For classification, assign the category that wins the majority vote; for regression, average the trees' predictions.
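
The procedure can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions (synthetic binary data, 25 trees, hard majority voting), not a reproduction of scikit-learn's `RandomForestClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, for illustration only)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # odd number, so a binary majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt": each split considers only a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Stack per-tree predictions into an array of shape (n_trees, n_test_samples)
all_preds = np.stack([t.predict(X_test) for t in trees])
# Labels are 0/1, so class 1 wins when more than half of the trees predict 1
forest_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

forest_accuracy = (forest_pred == y_test).mean()
single_accuracy = trees[0].score(X_test, y_test)
print(f"Single tree accuracy: {single_accuracy:.2f}")
print(f"Hand-built forest accuracy: {forest_accuracy:.2f}")
```

One difference worth knowing: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes, so its output can differ slightly from this sketch.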

### 1.5. **Advantages**
Random forest offers several **advantages over individual decision trees**:
* **Reduced overfitting:** By averaging predictions from multiple trees, random forest mitigates the risk of overfitting and provides better generalization.

* **Robustness**: Random forest is less sensitive to noisy data and outliers compared to a single decision tree.

* **Non-linearity handling:** It can capture complex nonlinear relationships between features and the target variable.

* **Feature importance:** Random forest quantifies the importance of each feature, aiding in feature selection and interpretation.

Random forest **calculates feature importance** based on how much each feature contributes to the predictive performance of the ensemble. Two measures are common. **Impurity-based importance**, which scikit-learn exposes as the `feature_importances_` attribute, averages how much the splits on a feature reduce impurity across all trees. **Permutation importance** measures the decrease in a chosen metric when the values of that feature are randomly permuted while the other features are kept unchanged.
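
The sketch below computes feature importance in two ways with scikit-learn: the impurity-based `feature_importances_` attribute of a fitted forest, and `permutation_importance` from `sklearn.inspection`, which shuffles one feature at a time on held-out data. The synthetic dataset is an assumption; it is generated so that only the first three columns carry signal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2;
# columns 3-5 are pure noise (illustrative assumption)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importance: computed from the training data, sums to 1
impurity_importance = model.feature_importances_

# Permutation importance: mean drop in test accuracy when one column is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

for i in range(X.shape[1]):
    print(f"feature {i}: impurity={impurity_importance[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Both measures should rank the informative columns above the noise columns here, although the exact numbers differ because they answer slightly different questions.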

### 1.6. **Hyperparameters**
Random forests have several **hyperparameters** that allow you to customize and fine-tune the behavior of the ensemble algorithm:

* **n_estimators**: The number of decision trees in the ensemble (forest). Increasing the number of trees generally improves performance up to a point of diminishing returns. Adding more trees does not by itself cause overfitting, but it increases training time and memory use.

* **max_depth**: The maximum depth of each decision tree. Limits the number of splits. Deeper trees can capture more complex relationships but are more prone to overfitting.

* **min_samples_split**: The minimum number of samples required to split a node further. It prevents nodes with very few samples from being split, which can reduce the fitting of noise.

* and [other hyperparameters](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) like `min_samples_leaf`, `max_features`, `criterion`, etc.
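
One common way to choose these values is a cross-validated grid search. The following is a minimal sketch using `GridSearchCV`; the synthetic dataset and the small grid of candidate values are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (assumption) so the search finishes quickly
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)

# Candidate values (illustrative assumption): 2 x 2 x 2 = 8 combinations
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "min_samples_split": [2, 10],
}

# Each combination is scored with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

After fitting, `search.best_estimator_` holds a forest refit on all of the data with the best combination found.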

## 2. Assumptions
Random forest is a powerful ensemble learning algorithm that combines multiple decision trees to make predictions. 

Unlike some other machine learning algorithms, random forest has fewer assumptions.

However, several assumptions that other models rely on do not apply to decision trees, and therefore do not apply to random forest either:
* **Independence of features**: Decision trees do not assume that features are independent of each other. Correlated features do not break the model, and because each split in a random forest considers only a random subset of features, the trees end up relying on different members of a correlated group.

* **Linear relationships**: Decision trees do not assume that the relationships between features and the target variable are linear. They split the feature space into regions with a constant prediction in each, so random forest, being an ensemble of decision trees, can approximate both linear and nonlinear relationships in the data.

* **Homoscedasticity**: Decision trees do not make explicit assumptions about the homoscedasticity (constant variance) of errors. Similarly, random forest, being a combination of decision trees, is not directly affected by this assumption.

* **Normality of residuals**: Decision trees do not rely on the assumption of normality of residuals, and the random forest algorithm inherits this flexibility. However, if you’re using random forest as part of a broader analysis that assumes normality (for example, hypothesis testing), you should consider this aspect in your overall approach.

* **Feature scaling**: Random forest is insensitive to the scale of features, because each split only compares a single feature against a threshold. It doesn’t require features to be standardized or normalized, unlike distance-based algorithms such as k-nearest neighbors or K-means clustering.

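* **Multicollinearity**: Random forest can handle multicollinearity (high correlation between features) because it selects a random subset of features at each split. This reduces the impact of correlated features on the model’s predictive performance. Feature importance scores, however, can be split among the correlated features, so they should be interpreted with care.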
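
The point about feature scaling can be checked empirically. The sketch below trains two forests with the same random seed, one on raw features with very different scales and one on standardized features, and measures how often their predictions agree. The synthetic data and the scale factors are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
# Put the columns on very different scales (illustrative assumption)
X = X * np.array([1, 10, 100, 1000, 0.5, 5])

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Standardize using statistics from the training rows only
scaler = StandardScaler().fit(X_train)

# Same seed for both forests, so the only difference is the feature scaling
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(scaler.transform(X_test))

# Fraction of test points on which the two forests make the same prediction
agreement = np.mean(pred_raw == pred_scaled)
print(f"Prediction agreement: {agreement:.2%}")
```

Because standardization preserves the ordering of values within each feature, the two forests choose equivalent splits, and the agreement is expected to be at or very close to 100%.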

## 3. Common pitfalls
While random forest is a powerful algorithm, it’s important to be aware of potential pitfalls:

* **Diminishing returns from too many trees:** Adding trees does not make a random forest overfit, but beyond a certain point an excessive number of trees adds computational cost without improving accuracy.

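* **Bias toward dominant classes:** In imbalanced datasets, random forest might favor the majority class, because each tree is trained mostly on majority-class samples and the combined vote is then dominated by that class. Options such as `class_weight="balanced"` in scikit-learn, or resampling the training data, can help.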

* **Computation and memory:** Training a large random forest can be computationally expensive and memory-intensive.

* **Feature selection:** While random forest provides feature importance, it might not always identify the optimal subset of features for a specific problem.
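
To see how forest size trades off against accuracy, one option is to track the out-of-bag (OOB) score, which evaluates each sample using only the trees that did not see it during training. The sketch below is a rough illustration; the synthetic dataset and the list of forest sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)

oob_scores = {}
for n_trees in [25, 50, 100, 200]:
    # oob_score=True asks the forest to compute accuracy on out-of-bag samples
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=42)
    forest.fit(X, y)
    oob_scores[n_trees] = forest.oob_score_
    print(f"{n_trees:>3} trees -> OOB accuracy {forest.oob_score_:.3f}")
```

If the OOB accuracy stops improving as trees are added, the extra trees mostly cost training time and memory.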

## 4. Implementation examples
Let’s see how to implement random forest for classification and regression tasks using Python’s scikit-learn library.

In [1]:
# Import necessary libraries
import os

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score

from sklearn.datasets import fetch_california_housing

import warnings
warnings.filterwarnings('ignore')

# Main data directory
mainpath="../data/"

### 4.1 Classification

In [2]:
filepath = "titanic/titanic3.csv"
fullpath = os.path.join(mainpath, filepath)
data = pd.read_csv(fullpath)  # Load the data

data.head() # Show the first 5 rows of the data

|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [4]:
# Drop rows with a missing target ('survived') or a missing 'fare'
data = data.dropna(subset=['survived', 'fare'])

# Select relevant features and target variable
# .copy() makes X an independent DataFrame, so the assignments below
# modify X itself and do not trigger pandas' SettingWithCopyWarning
X = data[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']].copy()
y = data['survived']

# Convert the categorical 'sex' column to numeric codes
X['sex'] = X['sex'].map({'female': 0, 'male': 1})

# Fill missing 'age' values with the median age
# (assigning the result back is reliable; inplace=True on a selection is not)
X['age'] = X['age'].fillna(X['age'].median())

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the results
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_rep)

Accuracy: 0.80

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.86      0.84       156
           1       0.78      0.72      0.75       106

    accuracy                           0.80       262
   macro avg       0.80      0.79      0.79       262
weighted avg       0.80      0.80      0.80       262



### 4.2 Regression

In [6]:
# Load the California Housing dataset
california_housing = fetch_california_housing()
california_data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
california_data['MEDV'] = california_housing.target

# Select relevant features and target variable
X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the regressor
rf_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results
print(f"Mean Squared Error: {mse:.2f}")  # lower is better; expressed in squared units of the target
print(f"R-squared Score: {r2:.2f}")  # 1.0 is a perfect fit; can be negative if the model is worse than predicting the mean

Mean Squared Error: 0.26
R-squared Score: 0.81
