<div style="text-align:left;">
  <a href="https://code213.tech/" target="_blank">
    <img src="../images/code213.PNG" alt="QWorld">
  </a>
  <p><em>prepared by Latreche Sara</em></p>
</div>

# Random Forest From Scratch: A Beginner-Friendly Guide

In this notebook, we will explore the **Random Forest** algorithm, a powerful ensemble learning method used for classification and regression tasks.

We will cover:
- The intuition behind Random Forests
- Key concepts such as decision trees and impurity measures (Gini impurity)
- How trees are built and combined
- Step-by-step mathematical definitions
- A blueprint for implementing Random Forest from scratch

The first part of this notebook contains **only theoretical explanations and formulas** in markdown cells, as a conceptual guide before coding. The code cells that follow provide exercise skeletons and a reference implementation.

## What is a Decision Tree?

A decision tree is a flowchart-like tree structure where:

- Each internal node represents a **decision** based on a feature and threshold.
- Each branch represents the outcome of the decision.
- Each leaf node represents a **class label** (for classification) or a value (for regression).

### Intuition:

A decision tree partitions the data step-by-step, trying to group samples with the same label together, so that the final leaves are as "pure" as possible.
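
To make the flowchart picture concrete, here is a minimal sketch of a tiny decision tree written by hand as nested conditions. The function name, feature names, thresholds, and class labels are all made up for illustration; a real tree learns its features and thresholds from data.

```python
def predict_fruit(weight_g, colour_score):
    """Hand-written two-level decision tree (all thresholds are invented)."""
    if weight_g <= 150:            # internal node: feature "weight", threshold 150
        if colour_score <= 0.5:    # internal node: feature "colour", threshold 0.5
            return "lime"          # leaf node: class label
        return "lemon"             # leaf node: class label
    return "grapefruit"            # leaf node: class label


print(predict_fruit(100, 0.3))  # lime
print(predict_fruit(100, 0.9))  # lemon
print(predict_fruit(300, 0.9))  # grapefruit
```

Each `if` plays the role of an internal node, each branch of the `if` is an outcome of the decision, and each `return` is a leaf.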

## Measuring Node Purity: Gini Impurity

To build a good tree, we need to decide **where to split** the data. This requires a measure of how "impure" a node is.

The **Gini impurity** quantifies this impurity for classification problems.

### Definition:

$$
Gini(D) = 1 - \sum_{i=1}^{k} p_i^2
$$

Where:

- $D$ is the current dataset (node),
- $k$ is the number of classes,
- $p_i$ is the proportion of class $i$ samples in $D$.

### Interpretation:

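- $Gini = 0$ means the node is pure (all samples belong to the same class).
- The higher the Gini, the more mixed the classes are. The maximum, $1 - 1/k$, is reached when all $k$ classes are equally represented (0.5 for two classes).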
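
A small numerical check can help build intuition for the formula. The sketch below uses a hypothetical helper named `gini_from_labels` and made-up label lists. It assumes, as a convention, that an empty node has impurity 0.

```python
from collections import Counter


def gini_from_labels(labels):
    """Gini impurity of a list of class labels (illustrative helper)."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure_node = ["A"] * 10                  # one class only
mixed_node = ["A"] * 6 + ["B"] * 4      # 60% / 40%
balanced_node = ["A"] * 5 + ["B"] * 5   # 50% / 50%

print(gini_from_labels(pure_node))             # 0.0
print(round(gini_from_labels(mixed_node), 2))  # 0.48
print(gini_from_labels(balanced_node))         # 0.5
```

For the mixed node, the calculation is $1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$, which is already close to the two-class maximum of 0.5.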

## Dataset Splitting and Weighted Gini

When we split a node, it divides data into two subsets:

- $D_{\text{left}}$: samples where feature $\leq$ threshold  
- $D_{\text{right}}$: samples where feature $>$ threshold

We want to find the split that **minimizes the weighted average impurity** of the two subsets:

$$
Gini_{\text{split}} = \frac{|D_{\text{left}}|}{|D|} \cdot Gini(D_{\text{left}}) + \frac{|D_{\text{right}}|}{|D|} \cdot Gini(D_{\text{right}})
$$

The goal is to find the **feature and threshold** that yields the lowest $Gini_{\text{split}}$.
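
The weighted impurity can be tried out on a toy dataset. The following is a minimal sketch, assuming a single numeric feature and made-up data. The helper names `gini` and `weighted_gini` are illustrative and are not the functions implemented later in the notebook.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    return 1.0 - np.sum(probs ** 2)


def weighted_gini(feature_values, labels, threshold):
    """Weighted Gini impurity of the split `feature <= threshold`."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy data (invented): one feature, two classes
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

for t in [1.0, 3.0, 5.0]:
    print(f"threshold {t}: weighted Gini = {weighted_gini(x, y, t):.3f}")
```

On this data the threshold 3.0 separates the two classes perfectly and gives a weighted Gini of 0, while thresholds 1.0 and 5.0 each leave one mixed subset and give 0.4.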

## Building a Decision Tree — When to Stop?

During tree construction, we stop splitting when:

- All samples in a node belong to the same class (pure node),
- Maximum tree depth is reached,
- The node contains fewer than a minimum number of samples,
- Or when no split improves impurity.

At these points, the node becomes a **leaf node**, assigned the most common class label.
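
The first three stopping rules can be sketched as a small predicate. The names `should_stop` and `leaf_prediction` and the default values are assumptions made for this illustration. The fourth rule (no split improves impurity) is typically detected while searching for a split, so it is not included here.

```python
from collections import Counter


def should_stop(labels, depth, max_depth=5, min_samples=2):
    """Return True when a node should become a leaf (illustrative rule set)."""
    if len(set(labels)) <= 1:       # pure node
        return True
    if depth >= max_depth:          # maximum depth reached
        return True
    if len(labels) < min_samples:   # fewer samples than the minimum
        return True
    return False


def leaf_prediction(labels):
    """Most common class label among the samples in the node."""
    return Counter(labels).most_common(1)[0][0]


print(should_stop([1, 1, 1], depth=0))     # True  (pure)
print(should_stop([0, 1, 1, 0], depth=5))  # True  (depth limit)
print(should_stop([0, 1, 1, 0], depth=2))  # False (keep splitting)
print(leaf_prediction([0, 1, 1, 1, 0]))    # 1
```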

## Intuition Behind Random Forest

Random Forest combines many decision trees to improve prediction and reduce overfitting.

Key ideas:
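- **Bootstrap sampling**: Each tree is trained on a bootstrap sample, that is, $N$ examples drawn **with replacement** from the $N$ training examples, so some examples appear several times and others not at all.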
- **Random feature selection**: At each split, only a random subset of features is considered.
- This randomness **decorrelates** trees, making the ensemble more robust.

The final prediction is obtained by **aggregating** the predictions of all trees: by majority vote for classification, or by averaging for regression.
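
The two sources of randomness can be illustrated with NumPy alone. This is a sketch under stated assumptions: the dataset size, the number of features, and the seed are arbitrary, and the square-root rule for the subset size is one common choice rather than a requirement.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

n_samples, n_features = 1000, 9  # made-up dataset dimensions

# Bootstrap sample: draw n_samples row indices with replacement
boot_idx = rng.choice(n_samples, size=n_samples, replace=True)
unique_fraction = len(np.unique(boot_idx)) / n_samples
print(f"Fraction of distinct rows in the bootstrap sample: {unique_fraction:.3f}")

# Random feature subset: pick m distinct features for one split
m = max(1, int(np.sqrt(n_features)))
feature_subset = rng.choice(n_features, size=m, replace=False)
print(f"Features considered at this split: {sorted(feature_subset.tolist())}")
```

For large $N$, a bootstrap sample contains on average about 63% of the distinct training rows ($1 - 1/e \approx 0.632$). Each tree therefore sees a noticeably different dataset, which is part of what decorrelates the trees.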

## Random Forest Algorithm — Step-by-Step

Given training data $(X, y)$:

1. For $t = 1, \ldots, T$ (number of trees):
    - Draw a **bootstrap sample** $D_t$ by sampling $N$ examples from $(X, y)$ **with replacement**.
    - Build a decision tree $h_t$ using $D_t$, where:
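        - At each split, select a random subset of $m$ features (a common choice for classification is $m \approx \sqrt{p}$, where $p$ is the total number of features).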
        - Choose the best split among these features to minimize impurity (Gini).
        - Grow the tree fully or until stopping criteria are met.
2. To predict a new sample $x$, aggregate predictions from all trees:
    $$
    \hat{y} = \text{majority vote} \big( h_1(x), h_2(x), \ldots, h_T(x) \big)
    $$
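
The aggregation step can be sketched independently of how the trees are built. In the example below, the per-tree predictions are hypothetical values written by hand, and `majority_vote` is an illustrative helper name.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of T = 5 trees for three samples (one row per tree)
tree_predictions = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]

# Transpose so that each tuple gathers the votes for one sample
votes_per_sample = list(zip(*tree_predictions))
final = [majority_vote(votes) for votes in votes_per_sample]
print(final)  # [0, 1, 0]
```

With an even number of trees a tie is possible. `Counter.most_common` then returns the label it encountered first, so using an odd number of trees is a simple way to avoid ties in binary classification.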


## Summary and Next Steps

- Decision Trees split data to form pure leaves using impurity measures like Gini impurity.
- Random Forest builds an ensemble of decision trees trained on random subsets of data and features.
- Aggregating many diverse trees improves generalization and reduces overfitting.

In the next notebook cells, we will implement:

- Gini impurity calculation,
- Dataset splitting,
- Decision tree building,
- Random forest ensemble training,
- And prediction functions.

This conceptual foundation will make coding these algorithms easier to understand!

In [None]:
# gini impurity function

def gini_impurity(y):
    # Step 1: Check if the list y is empty
   
    
    # Step 2: Calculate the class probabilities using Counter
    # Method: Use Counter(y).values() to get counts of each class
    # Convert counts to probabilities by dividing by total samples len(y)
    
    # Step 3: Calculate Gini impurity: 1 - sum of squared probabilities
    pass  # placeholder so the cell is valid Python; replace with your implementation


In [None]:
# dataset splitting function
def split_dataset(X, y, feature, threshold):
    # Step 1: Create a boolean mask where feature values <= threshold (left split)
    
    # Step 2: Create complement mask for right split
    
    # Step 3: Use masks to index X and y, returning left and right splits
    pass  # placeholder so the cell is valid Python; replace with your implementation


In [None]:
def best_split(X, y, feature_subset):
    # Step 1: Initialize best split variables
    # Variables: best_feature, best_threshold, best_gini, best_split
    best_feature, best_threshold, best_gini, best_split = None, None, float('inf'), None
    
    # Step 2: Loop over each feature in the subset
    # Loop variable: feature
    for feature in feature_subset:
        # Step 3: Find unique threshold candidates from feature values
        # Method: np.unique, Variable: thresholds
        
        # Step 4: For each threshold, split dataset and compute weighted Gini impurity
        # Loop variable: threshold
        for threshold in thresholds:
            # Method: split_dataset, Variables: X_left, y_left, X_right, y_right
            
            # Step 5: Skip if one side is empty (invalid split)
            # Variables: y_left, y_right (length check)
            
            
            # Step 6: Calculate weighted average Gini impurity for this split
            # Method: gini_impurity, Variable: gini
            
            # Step 7: Update best split if current Gini is lower
            # Variables updated: best_gini, best_feature, best_threshold, best_split
            if gini < best_gini:
                pass  # placeholder: update best_gini, best_feature, best_threshold, best_split

    # Step 8: Return best feature, threshold, and split subsets
    return best_feature, best_threshold, best_split


In [1]:
import numpy as np
from collections import Counter
import random

# Gini impurity function
def gini_impurity(y):
    if len(y) == 0:
        return 0
    probs = np.array(list(Counter(y).values())) / len(y)
    return 1 - np.sum(probs ** 2)

# Function to split dataset based on feature and threshold
def split_dataset(X, y, feature, threshold):
    left_idx = X[:, feature] <= threshold
    right_idx = ~left_idx
    return X[left_idx], y[left_idx], X[right_idx], y[right_idx]

# Function to find the best split
def best_split(X, y, feature_subset):
    best_feature, best_threshold, best_gini, best_split = None, None, float('inf'), None
    for feature in feature_subset:
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            X_left, y_left, X_right, y_right = split_dataset(X, y, feature, threshold)
            if len(y_left) == 0 or len(y_right) == 0:
                continue
            gini = (len(y_left) * gini_impurity(y_left) + len(y_right) * gini_impurity(y_right)) / len(y)
            if gini < best_gini:
                best_gini = gini
                best_feature = feature
                best_threshold = threshold
                best_split = (X_left, y_left, X_right, y_right)
    return best_feature, best_threshold, best_split

# Tree node class
class TreeNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, prediction=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.prediction = prediction

# Function to build the tree
def build_tree(X, y, depth=0, max_depth=5, min_samples=2):
    # Stop when the node has fewer than min_samples samples, is too deep, or is already pure
    if len(y) < min_samples or depth >= max_depth or gini_impurity(y) == 0:
        prediction = Counter(y).most_common(1)[0][0]
        return TreeNode(prediction=prediction)

    feature_subset = random.sample(range(X.shape[1]), int(np.sqrt(X.shape[1])) or 1)
    feature, threshold, split = best_split(X, y, feature_subset)

    if feature is None:
        prediction = Counter(y).most_common(1)[0][0]
        return TreeNode(prediction=prediction)

    X_left, y_left, X_right, y_right = split
    left_node = build_tree(X_left, y_left, depth + 1, max_depth, min_samples)
    right_node = build_tree(X_right, y_right, depth + 1, max_depth, min_samples)
    return TreeNode(feature, threshold, left_node, right_node)

# Prediction function for a tree
def predict_tree(node, x):
    if node.feature is None:
        return node.prediction
    if x[node.feature] <= node.threshold:
        return predict_tree(node.left, x)
    else:
        return predict_tree(node.right, x)

# Random forest class
class RandomForest:
    def __init__(self, n_trees=10, max_depth=5, min_samples=2):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples = min_samples
        self.trees = []

    def fit(self, X, y):
        self.trees = []  # reset so that calling fit again does not accumulate old trees
        n_samples = X.shape[0]
        for _ in range(self.n_trees):
            idxs = np.random.choice(n_samples, n_samples, replace=True)
            X_sample, y_sample = X[idxs], y[idxs]
            tree = build_tree(X_sample, y_sample, 0, self.max_depth, self.min_samples)
            self.trees.append(tree)

    def predict(self, X):
        all_preds = np.array([[predict_tree(tree, x) for x in X] for tree in self.trees])
        return [Counter(all_preds[:, i]).most_common(1)[0][0] for i in range(X.shape[0])]

    def print_trees(self):
        for i, tree in enumerate(self.trees):
            print(f"Tree {i+1}:")
            print_tree(tree)
            print("\n")

# Helper function to print the tree

def print_tree(node, indent=""):
    if node.feature is None:
        print(indent + f"Predict: {node.prediction}")
    else:
        print(indent + f"Feature {node.feature} <= {node.threshold}")
        print_tree(node.left, indent + "  ")
        print(indent + f"Feature {node.feature} > {node.threshold}")
        print_tree(node.right, indent + "  ")

# Example usage
X_train = np.vstack([
    np.column_stack((np.random.randn(50) + 1, np.random.randn(50) + 1)),
    np.column_stack((np.random.randn(50) - 1, np.random.randn(50) - 1))
])
y_train = np.array([0]*50 + [1]*50)

rf = RandomForest(n_trees=10, max_depth=5, min_samples=2)
rf.fit(X_train, y_train)

print(f"Number of trees in the forest: {len(rf.trees)}")
rf.print_trees()


Number of trees in the forest: 10
Tree 1:
Feature 0 <= 0.6480914957655985
  Feature 0 <= -0.5637579631780603
    Feature 0 <= -0.9631613385037517
      Predict: 1
    Feature 0 > -0.9631613385037517
      Feature 1 <= -0.011701951630812002
        Predict: 1
      Feature 1 > -0.011701951630812002
        Predict: 0
  Feature 0 > -0.5637579631780603
    Feature 0 <= 0.6187306455587391
      Feature 0 <= 0.5050595390095056
        Feature 0 <= -0.24379204862887738
          Predict: 0
        Feature 0 > -0.24379204862887738
          Predict: 1
      Feature 0 > 0.5050595390095056
        Predict: 0
    Feature 0 > 0.6187306455587391
      Predict: 1
Feature 0 > 0.6480914957655985
  Feature 0 <= 1.4798734979244375
    Predict: 0
  Feature 0 > 1.4798734979244375
    Feature 0 <= 1.484900192274098
      Predict: 1
    Feature 0 > 1.484900192274098
      Predict: 0


Tree 2:
Feature 1 <= -0.15647560181783404
  Predict: 1
Feature 1 > -0.15647560181783404
  Feature 1 <= 0.674808475189397
  

## Conclusion

Random Forest is a powerful ensemble learning method that combines multiple decision trees to improve classification or regression performance. By using bootstrap sampling and random feature selection at each split, Random Forest reduces overfitting and increases model robustness. 

Key takeaways:
- **Ensemble of trees:** Combines many individually unstable (high-variance) trees into a more stable and more accurate predictor.
- **Randomness:** Injected via bootstrap samples and feature subsets, enhancing generalization.
- **Gini impurity:** Measures node impurity and guides splits for better class separation.
- **Majority voting:** Aggregates predictions from all trees for final output.

Understanding these fundamentals enables you to implement and tune Random Forest models effectively from scratch.
