RANDOM FOREST !!!

Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. Instead of relying on a single decision tree (which can be sensitive to noise or overfit the training data), a random forest trains many trees, each on a different bootstrap sample of the training data (rows drawn at random with replacement). This resampling step is known as bagging (Bootstrap Aggregating). On top of bagging, a random forest usually also restricts each split to a random subset of the features, which makes the trees less correlated with one another. For classification tasks, each tree votes for a class label, and the majority vote becomes the final prediction. For regression, the final output is the average of all tree predictions. Random forests are fairly robust to noise and do not need feature scaling, because each split depends only on the ordering of a feature's values. Missing values generally still have to be imputed or dropped first, depending on the implementation. Because of the randomness in data and feature selection, random forests usually generalize better than individual decision trees and often achieve higher accuracy in real-world tasks.
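To make the sampling and aggregation steps concrete, here is a minimal sketch using NumPy. The helper names (`bootstrap_indices`, `majority_vote`, `average_prediction`) are made up for illustration and are not part of the scratch implementation; the sketch assumes class labels are non-negative integers.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return rng.choice(n_samples, size=n_samples, replace=True)

def majority_vote(tree_preds):
    """tree_preds has shape (n_trees, n_samples) and holds integer class labels."""
    tree_preds = np.asarray(tree_preds)
    # For each column (one sample), count the labels and keep the most frequent one.
    return np.array([np.bincount(col).argmax() for col in tree_preds.T])

def average_prediction(tree_preds):
    """Regression counterpart: mean of the per-tree outputs for each sample."""
    return np.asarray(tree_preds, dtype=float).mean(axis=0)

rng = np.random.default_rng(0)
idx = bootstrap_indices(8, rng)
# Rows that were never drawn are "out-of-bag" for this tree.
oob = np.setdiff1d(np.arange(8), idx)

# Three hypothetical trees voting on three samples.
votes = [[0, 1, 1],
         [0, 1, 0],
         [1, 1, 0]]
print(majority_vote(votes))                           # [0 1 0]
print(average_prediction([[1.0, 2.0], [3.0, 4.0]]))   # [2. 3.]
```

Because sampling is done with replacement, some rows appear more than once in a bootstrap sample and others not at all, which is what makes each tree see a slightly different dataset.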

For the scratch implementation, we are going to reuse the code from decision trees and build the Random Forest classifier on top of it.

Let's Gooo....

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from collections import Counter




def gini(y):
    classes = np.unique(y)
    impurity = 1.0
    for cls in classes:
        p = np.sum(y == cls) / len(y)
        impurity -= p ** 2
    return impurity

def best_split(X, y):
    best_feature, best_threshold, best_gini = None, None, 1.0
    n_samples, n_features = X.shape

    for feature in range(n_features):
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]

            # Skip thresholds that leave one side empty
            if len(right) == 0 or len(left) == 0:
                continue

            # Impurity of the split: child impurities weighted by child size
            gini_left = gini(left)
            gini_right = gini(right)
            weighted_gini = (len(left) * gini_left + len(right) * gini_right) / len(y)

            if weighted_gini < best_gini:
                best_gini = weighted_gini
                best_feature = feature
                best_threshold = threshold

    return best_feature, best_threshold

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *,value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

def build_tree(X,y, depth = 0, max_depth = 5):
    if len(set(y)) == 1 or depth >= max_depth:
        most_common = np.bincount(y).argmax()
        return Node(value = most_common)

    feature, threshold = best_split(X, y)
    if feature is None:
        most_common = np.bincount(y).argmax()
        return Node(value = most_common)

    left_indices = X[:, feature] <= threshold
    right_indices = X[:,feature] > threshold

    # Pass the caller's max_depth down so the same depth limit applies to every subtree
    left_subtree = build_tree(X[left_indices], y[left_indices], depth + 1, max_depth=max_depth)
    right_subtree = build_tree(X[right_indices], y[right_indices], depth + 1, max_depth=max_depth)

    return Node(feature, threshold, left_subtree, right_subtree)

def predict(sample, tree):
    if tree.value is not None:
        return tree.value
    if sample[tree.feature] <= tree.threshold:
        return predict(sample, tree.left)
    else:
        return predict(sample, tree.right)

def predict_all(X, tree):
    return np.array([predict(sample, tree) for sample in X])

class RandomForestClassifierScratch:
    def __init__(self, n_estimators=10, max_depth=5, min_samples_split=2):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        # Stored for completeness; build_tree currently ignores it
        self.min_samples_split = min_samples_split
        self.trees = []

    def fit(self, X, y):
        self.trees = []
        n_samples = X.shape[0]

        for _ in range(self.n_estimators):
            # Bootstrap sampling (with replacement)
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            X_sample = X[indices]
            y_sample = y[indices]

            tree = build_tree(X_sample, y_sample, max_depth=self.max_depth)
            self.trees.append(tree)

    def predict(self, X):
        # Get predictions from each tree
        tree_preds = np.array([predict_all(X, tree) for tree in self.trees])
        # Majority vote
        majority_votes = np.apply_along_axis(lambda x: Counter(x).most_common(1)[0][0], axis=0, arr=tree_preds)
        return majority_votes


df = pd.read_csv("../data/Titanic-Dataset.csv")

# I kept the same code, just added the Random Forest classifier
# and changed the dataset from Iris to Titanic.

# Keep only the columns we use and drop rows with missing values
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch']].dropna()
# Encode Sex as 0/1 so every feature is numeric
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch']]
y = df['Survived'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


rf = RandomForestClassifierScratch(n_estimators=100, max_depth=5)
rf.fit(X_train, y_train)

train_preds = rf.predict(X_train)
test_preds = rf.predict(X_test)

train_acc = np.mean(train_preds == y_train)
test_acc = np.mean(test_preds == y_test)

print("Training Accuracy:", round(train_acc * 100, 2), "%")
print("Test Accuracy:", round(test_acc * 100, 2), "%")


Training Accuracy: 83.54 %
Test Accuracy: 76.92 %


In this analysis, both the Decision Tree and Random Forest classifiers were applied to the Titanic dataset. The Decision Tree achieved a training accuracy of approximately 81.37% and a test accuracy of 76.22%, while the Random Forest reached 83.54% and 76.92% respectively. That is a gain of about 2.2 points on the training set but only 0.7 points on the test set, which is small enough that it could come down to one or two passengers, so it should not be read as a clear win. Several things probably limit the gain. The dataset is small and has only six features, so there is little room for the trees to differ. The scratch forest also bootstraps the rows only: every tree still searches all six features at every split, so the trees are strongly correlated and voting across them removes less variance than it would with per-split feature sampling. Finally, the bootstrap sampling is not seeded, so the exact numbers will vary from run to run. Still, the Random Forest does at least as well as the single tree here, which is consistent with its use on more complex or noisy real-world datasets.
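The scratch forest above bootstraps rows but considers every feature at every split. One possible way to add per-split feature sampling is sketched below. The function name `best_split_random_features` and its `max_features` and `rng` parameters are hypothetical and not part of the notebook's code; a common choice for `max_features` in classification is roughly the square root of the number of features.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_random_features(X, y, max_features, rng):
    """Exhaustive Gini split search restricted to a random subset of
    `max_features` columns (hypothetical helper, not in the notebook)."""
    n_samples, n_features = X.shape
    # Pick the candidate columns for this split, without replacement.
    k = min(max_features, n_features)
    candidates = rng.choice(n_features, size=k, replace=False)
    best_feature, best_threshold, best_gini = None, None, 1.0

    for feature in candidates:
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if weighted < best_gini:
                best_feature, best_threshold, best_gini = int(feature), threshold, weighted
    return best_feature, best_threshold

# Toy data: column 0 separates the classes perfectly, column 1 does not.
X_demo = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 5.0], [4.0, 3.0]])
y_demo = np.array([0, 0, 1, 1])
rng = np.random.default_rng(0)
# With max_features equal to the number of columns, every feature is searched.
feature, threshold = best_split_random_features(X_demo, y_demo, max_features=2, rng=rng)
print(feature, threshold)
```

With a smaller `max_features`, different trees are pushed toward different features near the root, which is what decorrelates them. Wiring this into `build_tree` would also mean passing the random generator through the recursion.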

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import fetch_openml

print("Downloading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Random Forest...")
rf = RandomForestClassifier(n_estimators=100, max_depth=20, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

print(f"\n Train Accuracy: {train_acc * 100:.2f}%")
print(f" Test Accuracy:  {test_acc * 100:.2f}%")

print("\nClassification Report:\n", classification_report(y_test, y_test_pred))


Downloading MNIST dataset...
Training Random Forest...

 Train Accuracy: 99.94%
 Test Accuracy:  96.52%

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.98      1343
           1       0.98      0.98      0.98      1600
           2       0.95      0.97      0.96      1380
           3       0.96      0.95      0.95      1433
           4       0.96      0.96      0.96      1295
           5       0.97      0.96      0.96      1273
           6       0.98      0.98      0.98      1396
           7       0.97      0.96      0.97      1503
           8       0.95      0.95      0.95      1357
           9       0.94      0.95      0.95      1420

    accuracy                           0.97     14000
   macro avg       0.97      0.96      0.96     14000
weighted avg       0.97      0.97      0.97     14000



The Random Forest reached much higher accuracy on the MNIST dataset than on the Titanic dataset: 99.94% training accuracy and 96.52% test accuracy, compared with 83.54% and 76.92% on Titanic. The two results are not directly comparable, since they come from different tasks and different implementations (scikit-learn's RandomForestClassifier with max_depth=20 here, the scratch classifier with max_depth=5 on Titanic), but the contrast is still instructive. On Titanic, the improvement over a single decision tree was modest, most likely because of the limited and noisy features. MNIST, in contrast, is a large, clean, high-dimensional image dataset of handwritten digits, with 70,000 samples and 784 pixel features. With that many features and samples, the trees in the forest can pick up on different visual patterns, and the ensemble generalizes well across the dataset. The gap of about 3.4 points between training and test accuracy shows that the forest still fits the training set more closely than unseen data, and no single decision tree was trained on MNIST here, so the size of the ensemble's advantage on this dataset was not measured. Overall, this comparison suggests that while Random Forest offers stability and slight gains on small, simple datasets, it tends to be most useful on rich, high-dimensional data where individual decision trees capture diverse aspects of the problem space.
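A rough way to see why combining many diverse trees helps is the idealized majority-vote calculation below. It assumes a binary task and fully independent voters that are each correct with the same probability `p`. Real trees in a forest are correlated, so this is an optimistic upper bound on the effect and not a model of the results above. The function name `majority_correct_prob` is made up for this sketch.

```python
from math import comb

def majority_correct_prob(n_voters, p):
    """Probability that a strict majority of n_voters independent voters,
    each correct with probability p, picks the right answer (binary task)."""
    needed = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p**k * (1 - p)**(n_voters - k)
               for k in range(needed, n_voters + 1))

# Odd ensemble sizes avoid ties.
for n in (1, 3, 11, 101):
    print(n, round(majority_correct_prob(n, 0.7), 4))
```

For example, three independent voters at 70% accuracy give a majority that is right 78.4% of the time, and the probability keeps rising as voters are added. The more correlated the trees are, the smaller this benefit becomes, which is why diversity among trees matters.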