# Random Forest 

Random Forest is an **ensemble learning method** that can be used for both **regression** and **classification** tasks.

It combines *multiple decision trees* to create a robust model, improving accuracy and reducing overfitting by *averaging predictions*.

### Quick note on ensemble models

Ensemble learning is a machine learning technique that combines the predictions from multiple models (often called "weak learners" or "base models") to produce a more accurate and robust prediction than any single model could achieve on its own.

There are three main types of ensemble learning techniques:

1. Bagging (Bootstrap Aggregating): reduces variance by training multiple versions of the same model on *different bootstrap samples* of the training data (created by random sampling with replacement). The models' predictions are then averaged (for regression) or voted on (for classification) to produce a final prediction. Example: Random Forest.

2. Boosting: models are trained *sequentially*, primarily to reduce bias (and often variance as well). Each new model focuses on correcting the errors of the previous ones.
Examples: AdaBoost (Adaptive Boosting) and gradient boosting implementations such as GBM, XGBoost, LightGBM, and CatBoost.

3. Stacking (Stacked Generalization): combines predictions from multiple different models using a meta-model (a higher-level model) to make the final prediction. Stacking leverages the strengths of *diverse models*.
Example: a meta-model (e.g., logistic regression) takes the outputs of base models like Random Forest, Gradient Boosting, and an SVM as inputs and makes the final decision.
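
To make the three families concrete, here is a minimal sketch using scikit-learn's built-in estimators. The synthetic dataset, the choice of base models, and all hyperparameters are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed for illustration only)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # Bagging: the same base model trained on bootstrap samples, predictions averaged
    "bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0),
    # Boosting: shallow trees added sequentially, each fitted to the remaining errors
    "boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
    # Stacking: a meta-model (Ridge) combines the outputs of different base models
    "stacking": StackingRegressor(
        estimators=[
            ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
            ("svr", SVR()),
        ],
        final_estimator=Ridge(),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

The scores depend entirely on the toy data, so they should not be read as a ranking of the three approaches.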

### How does Random Forest work?

1) Random Sampling and Bagging: The algorithm builds multiple decision trees using different random subsets (or "bags") of the training data, a technique called "bagging" (Bootstrap Aggregating). Each tree in the forest is trained on its own bootstrap sample, drawn from the training data with replacement, which introduces variability among the trees.

2) Feature Randomization: When splitting each node in a tree, Random Forest chooses a random subset of features rather than using all available features. This helps to create diverse trees, reducing the chances of overfitting.

3) Aggregation of Predictions: For regression, Random Forest takes the average of the predictions from all trees; for classification, it uses majority voting among the trees. This aggregation stabilizes the prediction, making it more robust than the prediction of a single decision tree.
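
The three steps above can be sketched by hand with plain decision trees. This is an illustrative approximation under stated assumptions (toy data, 25 trees, `max_features="sqrt"`), not scikit-learn's actual `RandomForestRegressor` implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): a noisy non-linear function of 5 features
X = rng.random((200, 5))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

n_trees = 25
trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample, drawn with replacement, same size as the data
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: aggregate by averaging the per-tree predictions (regression case)
X_new = rng.random((3, 5))
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape: (n_trees, 3)
forest_pred = per_tree.mean(axis=0)
print(forest_pred)
```

For classification, the last step would replace the mean with a majority vote over the trees' predicted classes.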

### Pros and Cons

Random Forest is well suited for working with non-linear data, as it is built on decision trees that can capture complex patterns. Since a random subset of features is considered at each split, it performs well with high-dimensional data. Additionally, due to ensemble learning, it helps reduce overfitting compared to individual decision trees.

However, due to its complexity, it is less interpretable than a single decision tree. It is also more computationally intensive, as it requires training multiple decision trees, making it slower for large datasets.
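
One way to see the overfitting claim in practice is to compare a single fully grown tree with a forest on the same split. The sketch below assumes scikit-learn's `make_friedman1` synthetic non-linear dataset and arbitrary sample sizes; the exact numbers will differ on other data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Non-linear synthetic regression problem with added noise (assumed for illustration)
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Store (train R^2, test R^2) for each model
results = {}
for name, model in [("single tree", tree), ("random forest", forest)]:
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {results[name][0]:.3f}, test R^2 = {results[name][1]:.3f}")
```

An unpruned tree typically fits the training data almost perfectly, so the gap between its train and test scores is the overfitting that averaging many trees is meant to reduce.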

Let's create a Random Forest model and apply it to perform a regression task.

In [1]:
# Import necessary libraries

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

### Generate synthetic data

In [2]:
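# Generate data: 100 samples with 5 features each.
# Note: the target y is uniform noise drawn independently of X, and no seed is set,
# so the data (and the resulting error) will differ from run to run.

X, y = np.random.rand(100, 5), np.random.rand(100)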

### Split data for training and testing

In [3]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Model training

In [4]:
# Initialize and train the model
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=42)
rf.fit(X_train, y_train)

### Make predictions and evaluate the model's performance

In [5]:
# Predict and evaluate
y_pred = rf.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 0.12028083604757842
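
Because `y` in this example is drawn independently of `X`, there is no signal for the model to learn, so an MSE on its own is hard to interpret. A rough sketch of one way to add context is to compare against a baseline that always predicts the training mean. The fixed seed and the use of `DummyRegressor` are assumptions added here, so the numbers will not match the output above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same shapes as in the cells above, but seeded (an assumption) for repeatability
rng = np.random.default_rng(42)
X, y = rng.random((100, 5)), rng.random(100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

# Baseline: ignore X and always predict the mean of y_train
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

mse_rf = mean_squared_error(y_test, rf.predict(X_test))
mse_baseline = mean_squared_error(y_test, baseline.predict(X_test))
print(f"Random Forest MSE: {mse_rf:.4f}")
print(f"Mean baseline MSE: {mse_baseline:.4f}")
```

A uniform target on [0, 1] has variance 1/12 (about 0.083), which is roughly the error a constant predictor can reach here. A forest that lands near or above that level is fitting noise, which is expected when the target is unrelated to the features.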
