# **Clustering: Isolation Forest**  


Manual Example on a Small Dataset

**Goal**: Detect anomalies by isolating observations in a tree-based fashion.

The idea is that anomalies are few and different, so the Isolation Forest algorithm isolates anomalies instead of profiling normal data points. It does this by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. The number of splits required to isolate a point (its path length) determines the anomaly score: the fewer splits needed, the more likely the point is an anomaly.  

Dataset (Augmented)

| Point | Coordinates                     |
| ----- | ------------------------------- |
| A     | (1, 1)                          |
| B     | (2, 1)                          |
| C     | (4, 3)                          |
| D     | (5, 4)                          |
| E     | (3, 4)                          |
| F     | (10, 10) ⛳️ (potential anomaly) |

* **Step 0: 💡 Concept of Isolation Forest**  

The Isolation Forest algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum of that feature.  

Anomalies are more likely to be isolated faster (i.e., in fewer splits) because they are rare and distant from the majority of the data.  
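
A single random split of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: `POINTS` holds the small dataset from the table above, and the helper name `random_split`, the feature encoding (0 for x, 1 for y), and the fixed seed are assumptions made for the example.

```python
import random

POINTS = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4), "E": (3, 4), "F": (10, 10)}

def random_split(names, rng):
    """Pick a random feature and threshold, then partition `names` into left/right."""
    feature = rng.choice([0, 1])                    # 0 = x, 1 = y
    values = [POINTS[n][feature] for n in names]
    split = rng.uniform(min(values), max(values))   # threshold between min and max
    left = [n for n in names if POINTS[n][feature] <= split]
    right = [n for n in names if POINTS[n][feature] > split]
    return feature, split, left, right

rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable
feature, split, left, right = random_split(list(POINTS), rng)
print("feature:", "xy"[feature], "split:", round(split, 2))
print("left:", left, "right:", right)
```

Because F lies far from the other points on both axes, most thresholds drawn between 1 and 10 put F alone on the right-hand side.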

* **Step 1: 🏗️ Build Isolation Trees (Intuitively)**  

Let’s simulate how the Isolation Forest algorithm builds trees.  

🔧 Each tree is built by:  

1. Randomly picking a feature (e.g., x or y),  
2. Then randomly choosing a split value between the min and max of that feature,  
3. Dividing the data into left/right (like in a decision tree),  
4. Repeating the process recursively until every point is isolated in its own "box".

💡 An anomaly tends to be far away from other points, so it gets isolated faster, i.e., in fewer splits.  

🧪 Example: Simulating One Tree  

Let’s build one tree with a few random splits.  

Initial Points:  

A (1,1), B (2,1), C (4,3), D (5,4), E (3,4), F (10,10)  

✅ **Step-by-step simulation:**  

1. Randomly choose x-axis, split at $x = 6$  
    * All points with $x \le 6 \rightarrow$ left side  
    * Point F $(x = 10)$ goes to the right $\rightarrow$ F is immediately isolated!  
        🟢 Path length for F $= 1$  
2. Now look at the left group: A, B, C, D, E  
    Randomly choose y-axis, split at $y = 2$  
    * Points A (1,1) and B (2,1) go left  
    * Others go right  
3. Let’s isolate A and B:  
    Randomly choose x-axis, split at $x = 1.5$ $\rightarrow$  
    * A (x=1) goes left $\rightarrow$ A is isolated (Path length = 3)  
    * B (x=2) goes right $\rightarrow$ B is isolated (Path length = 3)  
4. The same applies to C, D, and E on the other side.  
    After further splits they are also isolated, but it takes more steps than it did for F.  
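
The walkthrough above can be replayed deterministically. In this sketch the three splits are hard-coded as a small nested tuple (an assumed encoding, with feature 0 for x and 1 for y), and the hypothetical helper `path_length` counts how many splits a point passes before it reaches a leaf. The C, D, E branch is marked `"unexpanded"` because the walkthrough does not spell out its splits.

```python
POINTS = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4), "E": (3, 4), "F": (10, 10)}

# Each internal node is (feature, threshold, left_subtree, right_subtree).
# Feature 0 is x, feature 1 is y; values <= threshold go left.
TREE = (
    0, 6,                              # split 1: x = 6
    (1, 2,                             # split 2: y = 2 (group A, B, C, D, E)
        (0, 1.5, "leaf", "leaf"),      # split 3: x = 1.5 separates A and B
        "unexpanded"),                 # C, D, E: splits not given in the text
    "leaf",                            # F is alone on the right
)

def path_length(point, node, depth=0):
    """Count the splits a point passes through; None if the branch is unexpanded."""
    if node == "leaf":
        return depth
    if node == "unexpanded":
        return None
    feature, threshold, left, right = node
    child = left if point[feature] <= threshold else right
    return path_length(point, child, depth + 1)

for name in ("F", "A", "B", "C"):
    print(name, path_length(POINTS[name], TREE))
# prints: F 1, A 3, B 3, C None (one per line)
```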

This process is repeated many times (e.g., 100 trees), with different random splits each time.  
For each point, we record how many splits were needed to isolate it in each tree.  

Then, in **Step 2**, we compute the average path length for each point across all trees.  
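
To make the "many trees" idea concrete, the following sketch grows a number of random trees over the six points and averages each point's path length. It is a simplified illustration under stated assumptions: the function names `path_lengths` and `average_path_lengths`, the seed, and the rule that a draw which separates nothing is simply repeated without being counted are all choices made for this example. Refinements used by full implementations are left out.

```python
import random
from statistics import mean

POINTS = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4), "E": (3, 4), "F": (10, 10)}

def path_lengths(names, rng, depth=0):
    """Build one random tree over `names`; return {name: splits needed to isolate it}."""
    if len(names) == 1:
        return {names[0]: depth}
    feature = rng.choice([0, 1])                       # 0 = x, 1 = y
    values = [POINTS[n][feature] for n in names]
    split = rng.uniform(min(values), max(values))
    left = [n for n in names if POINTS[n][feature] <= split]
    right = [n for n in names if POINTS[n][feature] > split]
    if not left or not right:                          # nothing separated: draw again
        return path_lengths(names, rng, depth)
    result = path_lengths(left, rng, depth + 1)
    result.update(path_lengths(right, rng, depth + 1))
    return result

def average_path_lengths(n_trees=100, seed=0):
    """Average each point's path length over `n_trees` independent random trees."""
    rng = random.Random(seed)
    trees = [path_lengths(list(POINTS), rng) for _ in range(n_trees)]
    return {name: mean(tree[name] for tree in trees) for name in POINTS}

averages = average_path_lengths()
for name, value in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

The exact averages depend on the seed and the number of trees, so they will not match any illustrative figures exactly, but F should come out with the shortest average path.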

* **Step 2: 🧮 Average Path Lengths**  

Let’s say after building many trees, we get:  

| Point | Avg. Path Length           |
| ----- | -------------------------- |
| A     | 3.5                        |
| B     | 3.4                        |
| C     | 3.2                        |
| D     | 3.3                        |
| E     | 3.1                        |
| F     | **1.2** ⛳️ (very few splits) |

* **Step 3: 📉 Compute Anomaly Score**  

Now we convert these path lengths into a score between 0 and 1 that tells us how "anomalous" a point is.  
Here’s the formula:

$$s(x,n) = 2^{-\dfrac{E(h(x))}{c(n)}}$$  

Where:  
* $E(h(x))$ is the average path length for point $x$ (from Step 2)  
* $c(n) = 2H(n-1) - \dfrac{2(n-1)}{n}$, with $H(i) \approx \ln(i) + 0.5772$ (Euler's constant), is the average path length of an unsuccessful search in a binary search tree with $n$ points. We use this to normalize, so scores are comparable across sample sizes.  

For our dataset with $6$ points:  
$$c(6) \approx 2(\ln 5 + 0.5772) - \dfrac{2 \cdot 5}{6} \approx 2.71$$  

Now compute scores:  

| Point | Avg. Path Length | Score                        | Interpretation       |
| ----- | ---------------- | ---------------------------- | -------------------- |
| A     | 3.5              | $2^{-3.5/2.71} \approx 0.41$ | Normal               |
| B     | 3.4              | $\approx 0.42$               | Normal               |
| C     | 3.2              | $\approx 0.44$               | Normal               |
| D     | 3.3              | $\approx 0.43$               | Normal               |
| E     | 3.1              | $\approx 0.45$               | Normal               |
| F     | 1.2              | $2^{-1.2/2.71} \approx 0.74$ | 🚨 Potential anomaly |

✅ Interpretation: The higher the score (closer to 1), the more likely the point is an anomaly. A threshold of around 0.5 is a common starting point.  
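
The scoring step can be checked numerically with a short sketch. It assumes the normalization $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(i) \approx \ln(i) + 0.5772$, as given in the Isolation Forest paper, and it reuses the illustrative average path lengths from Step 2. The names `c`, `anomaly_score`, and `AVG_PATH` are hypothetical helpers for this example, and the 0.5 cut-off is the rule of thumb mentioned above rather than a fixed part of the method.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Normalization term: average path length of an unsuccessful BST search (n > 2)."""
    if n <= 2:
        raise ValueError("this sketch only covers n > 2")
    return 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); values near 1 suggest an anomaly."""
    return 2 ** (-avg_path_length / c(n))

# Illustrative average path lengths from Step 2.
AVG_PATH = {"A": 3.5, "B": 3.4, "C": 3.2, "D": 3.3, "E": 3.1, "F": 1.2}

scores = {name: anomaly_score(h, len(AVG_PATH)) for name, h in AVG_PATH.items()}
flagged = [name for name, s in scores.items() if s > 0.5]  # rule-of-thumb threshold

print({name: round(s, 2) for name, s in scores.items()})
print("flagged:", flagged)
```

Only F ends up above the 0.5 threshold in this run; the other five points stay below it.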

✅ So point F stands out as a potential anomaly!  

* **Step 4: ✅ Summary**  

* Isolation Forest isolates outliers using random splits.
* Outliers like F are isolated in fewer steps, leading to higher anomaly scores.
* The score is based on average path length, normalized by the sample size.  

✅ Why Isolation Forest Works Well Here  

* No need for distance or density estimates (unlike k-NN, LOF).
* Scales well to large, high-dimensional datasets.
* Effective on small datasets with distinct outliers (like F = (10,10)).  