<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Machine%20Learning%20Interview%20Prep%20Questions/Decision%20Tree/decision_tree_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Tree from Scratch (NumPy Only)

In this notebook, we'll:

- Understand how Decision Trees work
- Implement a basic binary classification tree using NumPy
- Split data using Gini Impurity
- Train and test the tree on simple data

## What Is a Decision Tree?

A decision tree splits data based on feature values to predict outcomes.

Each internal node splits the data based on a condition (e.g., `x < 2.5`), and each leaf node makes a final prediction (class 0 or 1).

We’ll use **Gini Impurity** to decide the best split.

$$
Gini = 1 - \sum_{i=1}^{n} p_i^2
$$

where $p_i$ is the proportion of samples in a node that belong to class $i$, and $n$ is the number of classes.
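
As a worked example, assume a hypothetical node holding four samples, three of class 0 and one of class 1 (these counts are illustrative and not taken from the dataset used later):

$$
Gini = 1 - \left(0.75^2 + 0.25^2\right) = 1 - 0.625 = 0.375
$$

A pure node (all samples in one class) scores 0, while an evenly mixed two-class node scores 0.5, the largest value possible with two classes.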

## Imports and Dataset

In [1]:
import numpy as np

# Simple binary classification dataset
X = np.array([
    [2.7], [1.3], [3.0], [1.0], [3.2], [4.1],
    [1.1], [1.8], [3.5], [3.7]
])
y = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1])
print(X)
print(y)

[[2.7]
 [1.3]
 [3. ]
 [1. ]
 [3.2]
 [4.1]
 [1.1]
 [1.8]
 [3.5]
 [3.7]]
[0 0 1 0 1 1 0 0 1 1]


## Gini Impurity Function

In [2]:
def gini(y):
    # Gini impurity: 1 minus the sum of squared class proportions
    classes = np.unique(y)
    impurity = 1
    for c in classes:
        p = np.sum(y == c) / len(y)  # proportion of class c in this node
        impurity -= p ** 2
    return impurity


## Split Function

In [3]:
def split_dataset(X, y, threshold):
    left_idx = X[:, 0] < threshold
    right_idx = ~left_idx
    return X[left_idx], y[left_idx], X[right_idx], y[right_idx]
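
To make the return values concrete, here is a small usage sketch. It repeats `split_dataset` and the dataset so the snippet runs on its own, and the threshold `3.0` is chosen by hand purely for illustration.

```python
import numpy as np

def split_dataset(X, y, threshold):
    # Boolean mask: True where the single feature is below the threshold
    left_idx = X[:, 0] < threshold
    right_idx = ~left_idx
    return X[left_idx], y[left_idx], X[right_idx], y[right_idx]

X = np.array([[2.7], [1.3], [3.0], [1.0], [3.2], [4.1],
              [1.1], [1.8], [3.5], [3.7]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1])

# Hand-picked threshold, for illustration only
X_left, y_left, X_right, y_right = split_dataset(X, y, 3.0)
print("Left labels: ", y_left)
print("Right labels:", y_right)
```

Because the comparison is a strict `<`, a sample whose value equals the threshold (here `3.0`) goes to the right branch.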

## Find Best Split

In [4]:
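def best_split(X, y):
    best_gini = 1  # upper bound; any valid split scores lower
    best_threshold = None

    # Candidate thresholds are the observed values of the single feature
    thresholds = np.unique(X[:, 0])
    for t in thresholds:
        _, y_left, _, y_right = split_dataset(X, y, t)
        # Skip splits that leave one side empty
        if len(y_left) == 0 or len(y_right) == 0:
            continue
        # Weighted average of the two child impurities
        g = (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / len(y)
        if g < best_gini:
            best_gini = g
            best_threshold = t

    return best_threshold, best_gini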
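
To see what the search compares, the sketch below scores every candidate threshold on the notebook's dataset. It is self-contained, so it repeats the data and uses a vectorized `gini` that should be equivalent to the loop version above; `weighted_gini` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def gini(y):
    # 1 minus the sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(X, y, t):
    # Hypothetical helper: size-weighted impurity of the two children
    left = y[X[:, 0] < t]
    right = y[X[:, 0] >= t]
    if len(left) == 0 or len(right) == 0:
        return None  # degenerate split, skipped by the search
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

X = np.array([[2.7], [1.3], [3.0], [1.0], [3.2], [4.1],
              [1.1], [1.8], [3.5], [3.7]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1])

# Score each observed feature value as a candidate threshold
scores = {float(t): weighted_gini(X, y, t) for t in np.unique(X[:, 0])}
for t, g in scores.items():
    print(t, "skipped" if g is None else round(g, 3))

# Keep only valid splits and pick the lowest weighted impurity
valid = {t: g for t, g in scores.items() if g is not None}
best_t = min(valid, key=valid.get)
print("Best threshold:", best_t)
```

The smallest feature value is always skipped, since nothing is strictly below it and the left side would be empty.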

## Build Tree Recursively (Depth = 1 or 2)

In [5]:
def build_tree(X, y, depth=0, max_depth=2):
    # Stop at the depth limit or when the node is pure.
    # round(mean(y)) is the majority class for 0/1 labels only.
    if depth >= max_depth or len(set(y)) == 1:
        return {'leaf': True, 'class': int(np.round(np.mean(y)))}

    threshold, _ = best_split(X, y)
    # No valid split exists (e.g., all feature values are identical)
    if threshold is None:
        return {'leaf': True, 'class': int(np.round(np.mean(y)))}

    X_left, y_left, X_right, y_right = split_dataset(X, y, threshold)

    # Internal node: store the threshold and recurse into both halves
    return {
        'leaf': False,
        'threshold': threshold,
        'left': build_tree(X_left, y_left, depth + 1, max_depth),
        'right': build_tree(X_right, y_right, depth + 1, max_depth)
    }

## Prediction Function

In [6]:
def predict(tree, x):
    if tree['leaf']:
        return tree['class']
    if x[0] < tree['threshold']:
        return predict(tree['left'], x)
    else:
        return predict(tree['right'], x)

## Train the Tree

In [7]:
tree = build_tree(X, y, max_depth=2)
print("Trained Tree:", tree)

Trained Tree: {'leaf': False, 'threshold': np.float64(3.0), 'left': {'leaf': True, 'class': 0}, 'right': {'leaf': True, 'class': 1}}
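
The nested dictionary can be hard to read as the tree grows. One possible way to display it is sketched below; `format_tree` is a hypothetical helper (not part of the notebook), and `example_tree` is written by hand to mirror the structure printed above.

```python
def format_tree(tree, indent=0):
    # Hypothetical helper: render the nested dict as indented if/else rules
    pad = "  " * indent
    if tree['leaf']:
        return [f"{pad}predict class {tree['class']}"]
    lines = [f"{pad}if x < {tree['threshold']}:"]
    lines += format_tree(tree['left'], indent + 1)
    lines.append(f"{pad}else:")
    lines += format_tree(tree['right'], indent + 1)
    return lines

# Hand-written copy of the trained tree, with a plain float threshold
example_tree = {
    'leaf': False,
    'threshold': 3.0,
    'left': {'leaf': True, 'class': 0},
    'right': {'leaf': True, 'class': 1},
}

print("\n".join(format_tree(example_tree)))
```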


## Test Predictions

In [8]:
X_test = np.array([[1.5], [3.8], [2.0]])
for x in X_test:
    print(f"Input {x[0]} → Predicted Class: {predict(tree, x)}")

Input 1.5 → Predicted Class: 0
Input 3.8 → Predicted Class: 1
Input 2.0 → Predicted Class: 0
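
Beyond spot checks, one simple sanity test is to measure accuracy on the training data. The sketch below is self-contained: it repeats `predict` and the dataset, uses a hand-written `stump` that mirrors the trained tree, and introduces a hypothetical `predict_batch` helper.

```python
import numpy as np

def predict(tree, x):
    # Walk down the tree until a leaf is reached
    if tree['leaf']:
        return tree['class']
    if x[0] < tree['threshold']:
        return predict(tree['left'], x)
    return predict(tree['right'], x)

def predict_batch(tree, X):
    # Hypothetical helper: apply predict to every row of X
    return np.array([predict(tree, x) for x in X])

# Hand-written copy of the tree trained earlier in this notebook
stump = {
    'leaf': False,
    'threshold': 3.0,
    'left': {'leaf': True, 'class': 0},
    'right': {'leaf': True, 'class': 1},
}

X = np.array([[2.7], [1.3], [3.0], [1.0], [3.2], [4.1],
              [1.1], [1.8], [3.5], [3.7]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1])

y_pred = predict_batch(stump, X)
accuracy = np.mean(y_pred == y)
print(f"Training accuracy: {accuracy:.2f}")
```

Training accuracy only shows that the tree fits the data it has seen; a held-out set would be needed to estimate how well it generalizes.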


## Summary

- Built a simple binary decision tree from scratch using NumPy
- Used Gini Impurity to choose the best splits
- Trained recursively with limited depth
- Predicted class labels from a single numeric feature (the code only splits on column 0)

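This is a simplified version of the full algorithm: it splits on a single feature, handles only 0/1 labels, and does no pruning. That makes it a good starting point for understanding how trees make decisions.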
