# Random Forest from Scratch 🌲🌲🌲

## Overview 📈  
In this project, I'm implementing a custom **Random Forest** algorithm from scratch using Python and NumPy. Random Forest is an ensemble learning technique that combines multiple decision trees to improve classification or regression performance by reducing overfitting and increasing accuracy.

### Key Concepts:
- **Random Forest**: A supervised machine learning algorithm that builds multiple decision trees and combines their predictions through majority voting (for classification) or averaging (for regression).
  
- **Decision Trees**: A fundamental component of Random Forests. Each tree splits the data into subsets based on feature values to make predictions.
  
- **Bootstrap Aggregating (Bagging)**: A technique used to create multiple training datasets by randomly sampling with replacement from the original data to train each decision tree in the forest.
  
- **Majority Voting and Averaging**: In classification, the forest predicts the class chosen by the most trees. In regression, it averages the predictions from all the trees.

---

## Objective 🎯  
The goal of this project is to:  
1. Implement a custom **Random Forest** model using decision trees as base learners.
   
2. Use **bagging** to train multiple decision trees on different subsets of the data.
   
3. Combine the predictions from multiple trees using **majority voting** (classification) or **averaging** (regression).
   
4. Understand how **random feature selection** during tree construction makes the trees less correlated, which is what lets the ensemble outperform a single tree.

---

## Random Forest Explanation 🧠  

### Bootstrapping and Bagging  
To train each decision tree in the forest, we create a different training set using **bootstrapping**: sampling with replacement from the original dataset, so each tree sees some rows several times and others not at all. Bagging (Bootstrap Aggregating) then combines the trees trained on these samples, which reduces the variance of the model and makes it less prone to overfitting than a single tree.
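
As a minimal sketch of the sampling step described above, one bootstrap dataset can be drawn with NumPy as follows. The helper name `bootstrap_sample` and the toy arrays are illustrative assumptions, not part of the project code.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement (hypothetical helper for illustration)."""
    n_samples = X.shape[0]
    indices = rng.integers(0, n_samples, size=n_samples)  # indices may repeat
    return X[indices], y[indices], indices

rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable
X = np.arange(20).reshape(10, 2)  # made-up data: 10 rows, 2 features
y = np.array([0, 1] * 5)

X_boot, y_boot, indices = bootstrap_sample(X, y, rng)
n_unique = len(np.unique(indices))
print(f"{n_unique} of {len(X)} original rows appear in this bootstrap sample")
```

For large datasets, roughly 63% of the original rows end up in each bootstrap sample on average. The rest are left out, which is why every tree sees a slightly different dataset.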

### Decision Trees  
Each decision tree is trained on its own sample of the data and considers only a subset of the features. The trees are built by recursively splitting the data into subsets based on feature values, choosing at each node the split that minimizes a criterion such as **Gini impurity** (for classification) or **mean squared error** (for regression).
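
To make the Gini criterion concrete, here is a small worked example. It is a sketch only: the function names `gini` and `weighted_gini` and the label arrays are chosen for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    counts = np.bincount(labels)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

parent = np.array([0, 0, 0, 1, 1, 1])
print(gini(parent))                           # 0.5 for a 50/50 class mix
print(weighted_gini(parent[:3], parent[3:]))  # 0.0, both children are pure

mixed = weighted_gini(np.array([0, 0, 1]), np.array([0, 1, 1]))
print(round(mixed, 3))                        # 0.444, children are still mixed
```

A tree prefers the split with the lowest weighted impurity, so here it would pick the split that produces two pure children.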

### Random Feature Selection  
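To keep the trees diverse, Random Forests consider only a random subset of the features at each split. A common choice for classification, and the one used in the code below, is about the square root of the number of features. This prevents any single strong feature from dominating every tree, so the trees make less correlated errors and the combined prediction generalizes better.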
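
The per-split feature sampling can be sketched as follows. The helper `sample_feature_indices` is a hypothetical name, and the feature count of 16 is an arbitrary value chosen for the demonstration.

```python
import numpy as np

def sample_feature_indices(n_features, rng):
    """Pick about sqrt(n_features) distinct feature columns (at least one)."""
    k = max(1, int(np.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)

rng = np.random.default_rng(42)
n_features = 16  # arbitrary value for the demonstration

# Each split draws its own candidate set, so different splits see different features
for split_number in range(3):
    chosen = sample_feature_indices(n_features, rng)
    print(f"split {split_number}: candidate features {sorted(chosen.tolist())}")
```

With 16 features, each split evaluates only 4 candidates, so a feature that would win every split in a plain decision tree is often absent from the candidate set.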

---

### Combining Tree Predictions  
Once we have multiple trees in the forest, we combine their predictions to make the final prediction:
- **Classification**: We use **majority voting** where the most common class among all the trees is chosen as the final prediction.
  
- **Regression**: We use **averaging** where the predictions of all trees are averaged to make the final prediction.
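
Both combination rules fit in a few lines of NumPy. The prediction arrays below are made up for illustration: rows are trees and columns are samples.

```python
import numpy as np

# Class predictions from 3 trees for 4 samples (made-up values)
class_votes = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
# Majority vote: count the votes in each column and keep the most frequent class
majority = np.array([np.bincount(column).argmax() for column in class_votes.T])
print(majority.tolist())  # [0, 1, 1, 0]

# Numeric predictions from 3 trees for 2 samples (made-up values)
value_predictions = np.array([
    [2.0, 10.0],
    [3.0, 12.0],
    [4.0, 14.0],
])
# Averaging: take the mean over the tree axis
average = value_predictions.mean(axis=0)
print(average.tolist())  # [3.0, 12.0]
```

Note that `argmax` breaks ties in favor of the smallest class index, so using an odd number of trees avoids ties in binary classification.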

---

## Implementation 🛠️

Below is the code for implementing a **Random Forest** from scratch. It covers classification only and is organized around three tasks:  
1. **Train Multiple Trees**: Build several decision trees using different bootstrapped datasets.
   
2. **Predict Using Majority Voting**: Combine the class predictions from all trees. Averaging for regression is not implemented here.
   
3. **Handle Random Feature Selection**: At each node split, the `best_split` helper selects a random subset of features to consider, ensuring diversity in the trees.
________

In [2]:
import numpy as np

# Gini Impurity function: Measures how "impure" a split is
def gini_impurity(y):
    class_counts = np.bincount(y)  # Count the occurrences of each class
    probabilities = class_counts / len(y)  # Calculate the probability for each class
    return 1 - np.sum(probabilities ** 2)  # Gini formula: 1 - sum(p^2) for each class

# Best Split function: Finds the best feature and threshold to split on
def best_split(X, y):
    best_gini = float('inf')  # Start with a very large number for the best Gini score
    best = (None, None, None, None)  # Returned unchanged if no valid split exists
    
    n_samples, n_features = X.shape  # Get the number of samples and features
    
    # Randomly select a subset of features (about sqrt(n_features), at least one)
    n_selected = max(1, int(np.sqrt(n_features)))
    feature_indices = np.random.choice(n_features, size=n_selected, replace=False)
    
    # Loop through each selected feature
    for feature_index in feature_indices:
        # Sort the data by this feature
        sorted_indices = np.argsort(X[:, feature_index])
        sorted_X, sorted_y = X[sorted_indices], y[sorted_indices]  # Sort X and y accordingly
        
        # Loop through possible splits (between every pair of sorted values)
        for i in range(1, n_samples):
            # Skip ties: "<= threshold" cannot separate two equal feature values
            if sorted_X[i - 1, feature_index] == sorted_X[i, feature_index]:
                continue
            
            left_y = sorted_y[:i]  # Left side of the split
            right_y = sorted_y[i:]  # Right side of the split
            
            # Calculate the weighted Gini impurity for this split
            gini_left = gini_impurity(left_y)
            gini_right = gini_impurity(right_y)
            weighted_gini = (len(left_y) * gini_left + len(right_y) * gini_right) / n_samples
            
            # If this split is better, save it
            if weighted_gini < best_gini:
                best_gini = weighted_gini
                best = (feature_index, sorted_X[i - 1, feature_index], left_y, right_y)
    
    # Best feature index, threshold, and label splits; all None if nothing was found
    return best

# Decision Tree Classifier: A simple implementation of a decision tree
class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # Maximum depth to control overfitting
        self.tree = None  # This will store the tree structure

    def fit(self, X, y):
        # Train the decision tree using the data (X, y)
        self.tree = self._build_tree(X, y)

    def _build_tree(self, X, y, depth=0):
        # If all samples belong to the same class, return that class
        if len(set(y)) == 1:
            return {"label": y[0]}  # Single class label
        
        # If maximum depth is reached, return the most frequent class
        if self.max_depth and depth >= self.max_depth:
            return {"label": np.bincount(y).argmax()}  # Majority class label
        
        # Find the best feature and threshold to split the data
        split = best_split(X, y)
        
        # If no good split is found, return the most frequent class
        if split is None or split[0] is None:
            return {"label": np.bincount(y).argmax()}
        feature_index, threshold = split[0], split[1]
        
        # Partition X and y with the same mask so rows and labels stay aligned
        left_mask = X[:, feature_index] <= threshold
        if left_mask.all() or not left_mask.any():
            return {"label": np.bincount(y).argmax()}  # Degenerate split: make a leaf
        
        # Recursively build the left and right subtrees
        left_tree = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_tree = self._build_tree(X[~left_mask], y[~left_mask], depth + 1)
        
        # Return the tree structure
        return {
            "feature_index": feature_index,
            "threshold": threshold,
            "left": left_tree,
            "right": right_tree
        }
    
    def predict(self, X):
        # Predict the class for each sample
        return [self._predict_sample(sample, self.tree) for sample in X]
    
    def _predict_sample(self, sample, tree):
        # If we reach a leaf node, return the label
        if "label" in tree:
            return tree["label"]
        
        # Otherwise, check which side of the tree the sample goes to
        feature_value = sample[tree["feature_index"]]
        if feature_value <= tree["threshold"]:
            return self._predict_sample(sample, tree["left"])
        else:
            return self._predict_sample(sample, tree["right"])

# Random Forest Classifier: An ensemble of Decision Trees
class RandomForest:
    def __init__(self, n_trees=3, max_depth=None):
        self.n_trees = n_trees  # Number of trees in the forest
        self.max_depth = max_depth  # Maximum depth of each tree
        self.trees = []  # This will store all the decision trees

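    def fit(self, X, y):
        # Train each decision tree on its own bootstrap sample of the dataset
        self.trees = []  # Reset so that calling fit again does not keep old trees
        n_samples = X.shape[0]
        for _ in range(self.n_trees):
            # Sample row indices with replacement (bagging)
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            tree = DecisionTree(max_depth=self.max_depth)
            tree.fit(X[indices], y[indices])  # Train the tree on the bootstrap sample
            self.trees.append(tree)  # Add the tree to the forest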
    
    def predict(self, X):
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        
        # For each sample, take the majority vote across all trees
        predictions = [np.bincount(predictions).argmax() for predictions in tree_predictions.T]
        
        # Convert predictions to Python native ints for cleaner output
        return [int(pred) for pred in predictions]

# Example usage:
# Example dataset (X: features, y: labels)
X = np.array([[2, 3], [10, 15], [3, 4], [9, 14], [1, 1], [7, 10]])
y = np.array([0, 1, 0, 1, 0, 1])

# Train a Random Forest
rf = RandomForest(n_trees=3, max_depth=3)
rf.fit(X, y)

# Make predictions
predictions = rf.predict(X)
print(f"Predictions: {predictions}")

Predictions: [0, 1, 0, 1, 0, 1]


## When to Use Random Forest 🌲🌲🌲

Random Forest is a powerful model that works well in many situations. Here are some cases where it’s especially useful:

- **High Variability in Data**: Random Forest suits noisy, highly variable datasets because it trains many decision trees on random samples of the data and combines them, which makes it more reliable and less likely to overfit than a single tree.
  > - **Overfitting Prevention**: Because it combines the predictions of multiple trees, the model is less likely to become too specific to the training data.
  
  <img src="./figures/overfit.png" alt="gradient" width="900" height="400"/>

- **Complex Data Patterns**: Random Forest can handle both simple and complex relationships between features, making it good for data that is not straightforward.

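- **Handling Missing Data**: Some Random Forest implementations can tolerate missing values, for example by imputing them before or during training. Bootstrapping by itself does not fill in gaps, and the from-scratch code in this project does not handle missing values explicitly.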

- **Feature Importance**: Random Forest tells you which features (columns) in your data are the most important for making predictions.

- **Classification and Regression**: Random Forest can be used for both types of tasks: classifying things into categories and predicting continuous values.
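
One model-agnostic way to estimate feature importance is permutation importance: shuffle one column at a time and measure how much accuracy drops. This is a sketch of that technique, not the impurity-based importance built into many libraries. The helper `permutation_importance`, the stand-in model `toy_predict`, and the random data are all assumptions made for illustration.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=10):
    """Mean drop in accuracy after shuffling each column (illustrative helper)."""
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling breaks the link between column j and the labels
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops.append(baseline - np.mean(predict(X_shuffled) == y))
        importances[j] = np.mean(drops)
    return importances

# Stand-in for a trained model: it only looks at column 0, so the expected
# ranking is obvious (an assumption made for this sketch)
def toy_predict(X):
    return (X[:, 0] > 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

importances = permutation_importance(toy_predict, X, y, rng)
print(importances)  # column 0 dominates; columns 1 and 2 score exactly 0
```

To try this with the forest from this project, a wrapper such as `lambda X: np.array(rf.predict(X))` would be needed, because `RandomForest.predict` returns a Python list.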

---

## Pros of Random Forest ✅

- **Better Accuracy**: By combining the results of many trees, Random Forest usually gives more accurate predictions than a single decision tree.

- **Stability**: Combining many trees dampens the effect of noisy samples and outliers, so predictions change less when the training data changes.

- **Works Well with Lots of Features**: You don’t need to worry much about too many features in your data because Random Forest can handle them well.

- **Feature Importance**: It helps you figure out which features are the most important for making predictions, giving you insights into your data.

- **Works for Many Types of Problems**: Random Forest works for both predicting categories (classification) and continuous values (regression), making it very flexible.

---

## Cons of Random Forest ❌

- **Complexity**: Random Forest can be slow and require a lot of memory, especially for big datasets.

- **Hard to Explain**: While individual decision trees are easy to understand, the combined forest of trees can be hard to explain, which can be a problem in some situations.

- **Slower Predictions**: Since it uses many trees to make a decision, predictions can take longer compared to simpler models.

- **Bias with Imbalanced Data**: Random Forest might favor the more common class in data that is unbalanced (e.g., too many positive or negative examples).

- **Not Ideal for Small Datasets**: For small datasets, simpler models like decision trees or logistic regression might work better than Random Forest.
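
One possible mitigation for the class-imbalance issue listed above is to draw a balanced bootstrap sample for each tree, taking the same number of rows from every class. This is a sketch of that idea and is not part of the project code. The helper `balanced_bootstrap_indices` and the 95/5 label split are made up for illustration.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Sample the same number of rows from every class, with replacement."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # size of the rarest class
    picks = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(picks)

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)  # made-up labels: 95% class 0, 5% class 1

indices = balanced_bootstrap_indices(y, rng)
print(np.bincount(y[indices]).tolist())  # [5, 5]
```

Each tree then trains on an even class mix, at the cost of seeing far fewer majority-class rows per tree.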

---

## Conclusion 🎯🌲🌲🌲

Random Forest is a flexible and strong tool that works well for many types of data, especially when dealing with a lot of features, complexity, or noise. While it can be slow and hard to explain, it is great for improving accuracy and handles both classification and regression tasks. Use it carefully, and it can make your machine learning projects more powerful.