<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Machine%20Learning%20Interview%20Prep%20Questions/Supervised%20Learning%20Algorithms/Random%20Forest/random_forest_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest from Scratch (No ML Libraries)

In this notebook, we’ll:

- Understand how Random Forest works
- Combine multiple decision trees (the base learners of the ensemble)
- Train on different random subsets of the data
- Make predictions using **majority voting**
- Implement everything using only NumPy and the Python standard library

## What is Random Forest?

Random Forest is an **ensemble method** that builds multiple decision trees and combines their predictions.

Key ideas:
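- **Bootstrap sampling**: each tree trains on a random sample of the training data, drawn with replacement and the same size as the original
- **Random feature selection**: each split considers only a random subset of the features, which makes the trees less correlated (not implemented in this notebook, because the toy dataset has a single feature)
- **Majority voting**: for classification, each tree votes and the most common label wins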
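
Random feature selection is the one idea above that this notebook's 1D data cannot exercise. As a rough sketch only, assuming a hypothetical helper `random_feature_subset`, made-up data `X_demo`, and the common heuristic of trying about √d of the d features, choosing the candidate columns for a split might look like this:

```python
import numpy as np

def random_feature_subset(n_features, rng, max_features=None):
    """Hypothetical helper: pick the feature indices one split may consider."""
    if max_features is None:
        # Common heuristic for classification: about sqrt(n_features)
        max_features = max(1, int(np.sqrt(n_features)))
    # Sample without replacement so no feature is listed twice
    return rng.choice(n_features, size=max_features, replace=False)

rng = np.random.default_rng(0)          # seeded only to make the sketch repeatable
X_demo = rng.normal(size=(6, 9))        # made-up data: 6 samples, 9 features
feature_idx = random_feature_subset(X_demo.shape[1], rng)
X_candidates = X_demo[:, feature_idx]   # columns a split search would look at
print(feature_idx, X_candidates.shape)
```

A split search would then evaluate thresholds only on `X_candidates`, drawing a fresh subset at every split.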

Benefits:
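- Overfits less than a single tree, because combining many trees reduces variance
- More stable: a small change in the training data affects the ensemble less than it affects any one tree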
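
One standard way to see where the variance reduction comes from is to look at the average of $B$ tree outputs $T_b(x)$, assuming each has variance $\sigma^2$ and every pair has correlation $\rho$ (these symbols are introduced here for illustration and do not appear elsewhere in the notebook):

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding trees shrinks the second term, while making trees less correlated (through bootstrap sampling and random feature selection) shrinks the first. The formula describes averaging; majority voting behaves similarly in spirit but is not covered exactly by it.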


## Imports + Toy Dataset


In [None]:
import numpy as np

# Simple 1D binary classification data
X = np.array([
    [2.7], [1.3], [3.0], [1.0], [3.2], [4.1],
    [1.1], [1.8], [3.5], [3.7]
])
y = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1])


print(X)
print(y)

[[2.7]
 [1.3]
 [3. ]
 [1. ]
 [3.2]
 [4.1]
 [1.1]
 [1.8]
 [3.5]
 [3.7]]
[0 0 1 0 1 1 0 0 1 1]


## Gini + Split + Tree (Reuse from Decision Tree)

In [None]:
def gini(y):
    # Gini impurity: 1 minus the sum of squared class proportions
    classes = np.unique(y)
    impurity = 1
    for c in classes:
        p = np.sum(y == c) / len(y)
        impurity -= p ** 2
    return impurity

def split_dataset(X, y, threshold):
    # Samples whose feature value is below the threshold go left, the rest go right
    left_idx = X[:, 0] < threshold
    right_idx = ~left_idx
    return X[left_idx], y[left_idx], X[right_idx], y[right_idx]

def best_split(X, y):
    # Try every observed feature value as a threshold and keep the one
    # with the lowest size-weighted Gini impurity
    best_gini = 1
    best_threshold = None
    thresholds = np.unique(X[:, 0])
    for t in thresholds:
        _, y_left, _, y_right = split_dataset(X, y, t)
        if len(y_left) == 0 or len(y_right) == 0:
            continue  # a split that leaves one side empty separates nothing
        g = (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / len(y)
        if g < best_gini:
            best_gini = g
            best_threshold = t
    return best_threshold, best_gini

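def build_tree(X, y, depth=0, max_depth=2):
    # Stop at the depth limit or when the node is pure; the leaf predicts the majority label.
    # np.bincount assumes non-negative integer labels (ties go to the smaller label).
    if depth >= max_depth or len(set(y)) == 1:
        return {'leaf': True, 'class': int(np.bincount(y).argmax())}
    threshold, _ = best_split(X, y)
    if threshold is None:
        # No valid split (all feature values identical): fall back to a leaf
        return {'leaf': True, 'class': int(np.bincount(y).argmax())}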
    X_left, y_left, X_right, y_right = split_dataset(X, y, threshold)
    return {
        'leaf': False,
        'threshold': threshold,
        'left': build_tree(X_left, y_left, depth + 1, max_depth),
        'right': build_tree(X_right, y_right, depth + 1, max_depth)
    }

def predict_tree(tree, x):
    if tree['leaf']:
        return tree['class']
    if x[0] < tree['threshold']:
        return predict_tree(tree['left'], x)
    else:
        return predict_tree(tree['right'], x)
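
To make the split criterion concrete, here is a small self-contained check on the toy data. It is a sketch rather than the notebook's own code: `gini_counts` and `weighted_gini` are hypothetical names for a vectorized variant of the same Gini definition, and the features are flattened to a 1D array for brevity.

```python
import numpy as np

def gini_counts(labels):
    """Vectorized Gini impurity (hypothetical name): 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, labels, threshold):
    """Size-weighted impurity of the two groups produced by `x < threshold`."""
    left = labels[x < threshold]
    right = labels[x >= threshold]
    return (len(left) * gini_counts(left) + len(right) * gini_counts(right)) / len(labels)

# Same toy data as above, flattened to 1D
x = np.array([2.7, 1.3, 3.0, 1.0, 3.2, 4.1, 1.1, 1.8, 3.5, 3.7])
labels = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1])

print(gini_counts(labels))            # five 0s and five 1s: 1 - 0.25 - 0.25 = 0.5
print(weighted_gini(x, labels, 3.0))  # both sides are pure: 0.0
print(weighted_gini(x, labels, 2.7))  # right side holds one 0 and five 1s: about 0.167
```

On the full dataset the threshold 3.0 separates the classes perfectly, which is why a tree trained on all ten points needs only one split.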

## Build Random Forest

In [None]:
def bootstrap_sample(X, y):
    n_samples = X.shape[0]
    idx = np.random.choice(n_samples, n_samples, replace=True)
    return X[idx], y[idx]

def build_forest(X, y, n_trees=5, max_depth=2):
    forest = []
    for _ in range(n_trees):
        X_sample, y_sample = bootstrap_sample(X, y)
        tree = build_tree(X_sample, y_sample, max_depth=max_depth)
        forest.append(tree)
    return forest
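
Because a bootstrap sample is drawn with replacement, each tree sees only part of the data: some rows appear several times and others not at all. The sketch below illustrates this with an assumed seed and a made-up sample size, neither of which comes from the notebook. For large n, the fraction of distinct rows approaches 1 - 1/e, roughly 63%.

```python
import numpy as np

rng = np.random.default_rng(42)   # assumed seed, only for repeatability
n_samples = 10_000                # made-up size; large so the fraction is stable

# One bootstrap draw: n_samples row indices sampled with replacement
idx = rng.choice(n_samples, size=n_samples, replace=True)

in_bag_fraction = np.unique(idx).size / n_samples
expected = 1 - (1 - 1 / n_samples) ** n_samples   # approaches 1 - 1/e, about 0.632

# Rows that were never drawn are "out-of-bag" for this tree
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[idx] = False

print(f"in-bag: {in_bag_fraction:.3f}, expected: {expected:.3f}, out-of-bag rows: {oob_mask.sum()}")
```

The out-of-bag rows are data the tree never trained on, so they can serve as a built-in validation set for that tree.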

## Predict Using Majority Voting

In [None]:
from collections import Counter

def predict_forest(forest, x):
    predictions = [predict_tree(tree, x) for tree in forest]
    most_common = Counter(predictions).most_common(1)[0][0]
    return most_common
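
Two details of voting are worth seeing in isolation: the vote shares can be read as a rough confidence, and an even number of trees can tie. The sketch below uses a hypothetical helper `vote_summary` and made-up vote lists. `Counter.most_common` orders equal counts by first appearance, so a tie goes to whichever label was voted first.

```python
from collections import Counter

def vote_summary(votes):
    """Hypothetical helper: return the winning label and each label's share of the votes."""
    counts = Counter(votes)
    winner = counts.most_common(1)[0][0]
    shares = {label: n / len(votes) for label, n in counts.items()}
    return winner, shares

print(vote_summary([1, 0, 1, 1, 0]))   # clear majority for class 1
print(vote_summary([1, 0, 0, 1]))      # 2-2 tie: the label seen first (1) wins
```

Using an odd number of trees, as in the `n_trees=5` default above, avoids ties in binary classification.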

## Train and Test the Forest

In [None]:
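# Bootstrap sampling is random, so the trees (and occasionally the predictions)
# can differ between runs. Call np.random.seed(...) first for reproducible results.
forest = build_forest(X, y, n_trees=5, max_depth=2)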

# Test predictions
X_test = np.array([[1.5], [3.8], [2.0]])
for x in X_test:
    print(f"Input {x[0]} → Predicted Class: {predict_forest(forest, x)}")


Input 1.5 → Predicted Class: 0
Input 3.8 → Predicted Class: 1
Input 2.0 → Predicted Class: 0


## Summary

- Implemented Random Forest using multiple Decision Trees
- Used bootstrap sampling for training diversity
- Combined predictions with majority voting
- Built everything using pure NumPy and Python

Random Forest is a strong general-purpose ML model, though an ensemble of many trees is harder to interpret than a single decision tree. Understanding how it works is valuable for interviews and real-world ML work.
