
## What Is Isolation Forest?
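**Isolation Forest** is an unsupervised anomaly detection algorithm based on the idea that **anomalies** are easier to isolate than normal points. It builds an ensemble of random decision trees (called isolation trees), each of which repeatedly picks a random feature and a random split value, and it measures how many splits are needed to isolate each point:

* Anomalies are isolated after fewer splits, so they have **shorter path lengths**
* Normal points sit in dense regions and need more splits to be isolated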
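
The intuition can be checked with a small one-dimensional experiment before building any trees. The sketch below is only an illustration and is not part of the forest built later: `splits_to_isolate` is a hypothetical helper that keeps splitting at a random value and follows the side containing the target until the target is alone. The cluster size, the outlier value of 10.0, and the number of trials are arbitrary assumptions.

```python
import random

def splits_to_isolate(values, target, rng):
    """Count random splits until `target` is the only value left."""
    current = list(values)
    splits = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates remain, no further split is possible
            break
        split = rng.uniform(lo, hi)
        # Follow the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        splits += 1
    return splits

rng = random.Random(0)
cluster = [rng.gauss(0, 1) for _ in range(200)]  # dense "normal" region
inlier = cluster[0]
outlier = 10.0                                   # far from the cluster
values = cluster + [outlier]

trials = 500
avg_inlier = sum(splits_to_isolate(values, inlier, rng) for _ in range(trials)) / trials
avg_outlier = sum(splits_to_isolate(values, outlier, rng) for _ in range(trials)) / trials
print(f"average splits: inlier={avg_inlier:.2f}, outlier={avg_outlier:.2f}")
```

On average the outlier should need noticeably fewer splits than the inlier, which is the signal the forest turns into a score.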


## 1. Generate sample data


In [4]:
import random
import math

def generate_data(n_normal=100, n_anomalies=5):
    # Normal data centered at (0, 0)
    normal_data = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n_normal)]
    # Anomalies far away from the center
    anomalies = [[random.uniform(6, 8), random.uniform(6, 8)] for _ in range(n_anomalies)]
    return normal_data + anomalies

data = generate_data()
data

[[0.080691771632297, 0.29850432662008763],
 [0.9101455502669769, -0.9602858166432511],
 [-0.5791475171197923, -0.13677859262336092],
 [-0.09517996171911973, 0.27575797886486053],
 [-1.5564823783085016, -1.8613331484673319],
 [0.08945372569862192, 1.0801699575296329],
 [-0.24653072140917842, 0.7370249379969764],
 [-0.6280216151791731, -0.2976004045279304],
 [0.020662664892868433, 1.1251468696737177],
 [0.3272219038958852, -0.7978578928844358],
 [0.33775449779964895, 0.9149401861966111],
 [-0.38563564456838983, -0.9440700325249547],
 [1.0949950764325418, -0.9633396730237124],
 [-0.3508232695318873, 0.640161081335892],
 [-1.083781077367708, 0.417268102993136],
 [0.8320579154830998, 0.6125424966039748],
 [0.8578003211660923, -1.3442050012439228],
 [1.643490523805833, -0.6281383208566036],
 [-0.30745413830458446, 2.922909269110105],
 [-0.3652757474377666, -0.3540810361670781],
 [0.7682789467421207, 0.014813485764960128],
 [-1.0179529167220702, -0.8031188138538901],
 [-0.7597268265882922, -0

## 2. Harmonic number approximation for average path length

In [5]:
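def c(n):
    # Average path length of an unsuccessful BST search over n points
    if n <= 1:
        return 0
    if n == 2:
        return 1  # exact value; ln(n - 1) + gamma is a poor estimate of H(1)
    # 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) ~ ln(i) + Euler's constant
    return 2 * (math.log(n - 1) + 0.5772156649) - (2 * (n - 1) / n)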
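
To see how good the logarithmic approximation of the harmonic number is, the following sketch compares it with the exact sum. The names `harmonic_exact`, `c_exact`, and `c_approx` are hypothetical and exist only for this comparison; `c_approx` uses the plain `ln(n - 1) + gamma` formula with no special case for small `n`.

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic_exact(i):
    # H(i) = 1 + 1/2 + ... + 1/i
    return sum(1.0 / k for k in range(1, i + 1))

def c_exact(n):
    if n <= 1:
        return 0.0
    return 2 * harmonic_exact(n - 1) - 2 * (n - 1) / n

def c_approx(n):
    if n <= 1:
        return 0.0
    return 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n

for n in (2, 8, 64, 256):
    print(f"n={n:3d}  exact={c_exact(n):.4f}  approx={c_approx(n):.4f}")
```

The two values agree closely for sample sizes such as 64 or 256, but diverge for very small `n`: at `n = 2` the exact value is 1, while the approximation gives a much smaller number.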

## 3. Build isolation tree recursively

In [6]:
def build_tree(data, height_limit, current_height=0):
    if current_height >= height_limit or len(data) <= 1:
        return {'size': len(data)}  # Leaf node

    dim = random.randint(0, len(data[0]) - 1)  # Pick a random feature
    values = [row[dim] for row in data]
    split_value = random.uniform(min(values), max(values))

    left = [row for row in data if row[dim] < split_value]
    right = [row for row in data if row[dim] >= split_value]

    return {
        'split_attr': dim,
        'split_value': split_value,
        'left': build_tree(left, height_limit, current_height + 1),
        'right': build_tree(right, height_limit, current_height + 1)
    }

## 4. Calculate path length of a data point in a tree

In [7]:
def path_length(x, node, current_height=0):
    if 'size' in node:
        return current_height + c(node['size'])  # Reached a leaf

    split_attr = node['split_attr']
    split_value = node['split_value']

    if x[split_attr] < split_value:
        return path_length(x, node['left'], current_height + 1)
    else:
        return path_length(x, node['right'], current_height + 1)

## 5. Build multiple isolation trees (the forest)

In [8]:
def fit_forest(data, n_trees=100, sample_size=64):
    trees = []
    height_limit = math.ceil(math.log2(sample_size))

    for _ in range(n_trees):
        sample = random.sample(data, sample_size)
        tree = build_tree(sample, height_limit)
        trees.append(tree)

    return trees

## 6. Compute anomaly score for a data point

In [9]:
def anomaly_score(x, trees, sample_size):
    total_path_length = 0
    for tree in trees:
        total_path_length += path_length(x, tree)

    avg_path_length = total_path_length / len(trees)
    score = 2 ** (-avg_path_length / c(sample_size))
    return score  # Closer to 1 = anomaly
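
The score formula is easier to interpret with a few numbers plugged in. The sketch below evaluates `2 ** (-avg_path_length / c(n))` directly, without any trees; `score_from_avg_path` is a hypothetical helper, and the path lengths fed to it are made-up inputs chosen relative to `c(64)`.

```python
import math

def c(n):
    # Copy of the normalizer from step 2 (only used here with n = 64)
    if n <= 1:
        return 0
    return 2 * (math.log(n - 1) + 0.5772156649) - (2 * (n - 1) / n)

def score_from_avg_path(avg_path_length, sample_size):
    # Hypothetical helper: the scoring formula applied to a given average path length
    return 2 ** (-avg_path_length / c(sample_size))

n = 64
for h in (1.0, c(n) / 2, c(n), 2 * c(n)):
    print(f"avg path {h:6.2f} -> score {score_from_avg_path(h, n):.3f}")
```

A point whose average path length equals `c(n)` scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0. That is why a threshold somewhat above 0.5 is a natural starting point for flagging anomalies.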

## 7. Predict anomaly labels for dataset

In [10]:
def predict(data, trees, sample_size, threshold=0.6):
    predictions = []
    scores = []

    for x in data:
        score = anomaly_score(x, trees, sample_size)
        scores.append(score)
        predictions.append(1 if score > threshold else 0)  # 1 = anomaly

    return predictions, scores
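
A fixed threshold such as 0.6 may not suit every dataset. One alternative, sketched here as an assumption rather than as part of the original notebook, is to derive the threshold from the fraction of points expected to be anomalous, similar in spirit to the `contamination` parameter of scikit-learn's `IsolationForest`. The helper `threshold_from_contamination` and the example scores are hypothetical.

```python
import math

def threshold_from_contamination(scores, contamination=0.05):
    """Return a threshold so that roughly the top `contamination` fraction
    of scores is strictly above it (ties can change the exact count)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < contamination < 1:
        raise ValueError("contamination must be between 0 and 1")
    ranked = sorted(scores, reverse=True)
    k = min(math.ceil(contamination * len(ranked)), len(ranked) - 1)
    # Scores strictly greater than the (k + 1)-th largest value get flagged
    return ranked[k]

# Made-up scores for illustration only
example_scores = [0.41, 0.39, 0.72, 0.43, 0.56, 0.70, 0.40, 0.42, 0.44, 0.38]
threshold = threshold_from_contamination(example_scores, contamination=0.2)
flagged = [s for s in example_scores if s > threshold]
print(threshold, flagged)
```

The returned value can be passed as the `threshold` argument of `predict`, which uses the same strict `score > threshold` comparison.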

## 8. Run it all

In [12]:
n_trees = 100
sample_size = 64

trees = fit_forest(data, n_trees, sample_size)
predictions, scores = predict(data, trees, sample_size, threshold=0.6)
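
Because `generate_data` appends the anomalies after the normal points, the true labels of the synthetic data are known, so the predictions can be checked. The sketch below shows one way to do that with precision and recall; `precision_recall` is a hypothetical helper, and the labels in the example are made up so the block runs on its own.

```python
def precision_recall(true_labels, predicted_labels):
    """Precision and recall for the anomaly class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal points flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # anomalies missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up example: 8 normal points followed by 2 anomalies
true_labels = [0] * 8 + [1] * 2
example_predictions = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
precision, recall = precision_recall(true_labels, example_predictions)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With the default arguments of `generate_data`, the true labels would be `[0] * 100 + [1] * 5`, which can be compared against `predictions` in the same way.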

## 9. Display results

In [13]:
for i, (point, score, label) in enumerate(zip(data, scores, predictions)):
    status = "Anomaly 🚨" if label == 1 else "Normal ✅"
    print(f"Point {i}: {point} | Score: {score:.4f} | {status}")

Point 0: [0.080691771632297, 0.29850432662008763] | Score: 0.3914 | Normal ✅
Point 1: [0.9101455502669769, -0.9602858166432511] | Score: 0.4332 | Normal ✅
Point 2: [-0.5791475171197923, -0.13677859262336092] | Score: 0.4130 | Normal ✅
Point 3: [-0.09517996171911973, 0.27575797886486053] | Score: 0.3929 | Normal ✅
Point 4: [-1.5564823783085016, -1.8613331484673319] | Score: 0.5615 | Normal ✅
Point 5: [0.08945372569862192, 1.0801699575296329] | Score: 0.4257 | Normal ✅
Point 6: [-0.24653072140917842, 0.7370249379969764] | Score: 0.4081 | Normal ✅
Point 7: [-0.6280216151791731, -0.2976004045279304] | Score: 0.4134 | Normal ✅
Point 8: [0.020662664892868433, 1.1251468696737177] | Score: 0.4335 | Normal ✅
Point 9: [0.3272219038958852, -0.7978578928844358] | Score: 0.4127 | Normal ✅
Point 10: [0.33775449779964895, 0.9149401861966111] | Score: 0.4231 | Normal ✅
Point 11: [-0.38563564456838983, -0.9440700325249547] | Score: 0.4190 | Normal ✅
Point 12: [1.0949950764325418, -0.9633396730237124] |

## Summary

| Step | What it does                                                   |
|:---- |:-------------------------------------------------------------- |
| 1    | Generates synthetic normal and anomalous data                  |
| 2    | Computes the expected path length `c(n)` used to normalize scores |
| 3    | Recursively builds one isolation tree                          |
| 4    | Measures the path length of a point in a tree                  |
| 5    | Builds multiple trees on random subsamples to form the forest  |
| 6    | Computes the anomaly score for a point                         |
| 7    | Labels each point as anomaly or normal based on a threshold    |
| 8-9  | Runs the pipeline and prints the results                       |


