Absolutely. Let’s enter the realm of anomaly detection, where the unusual is what we want. Here's your UTHU-style breakdown of:

---

## 🧩 **Isolation Forest Algorithm**  
🌲 *Random Partitioning, Path Length, and Scoring Outliers via Isolation Depth*

---

## **1. Conceptual Foundation**

### 🎯 Purpose & Relevance

Most ML models try to **predict common patterns** — but in anomaly detection, we want the opposite:  
Find the **weird**, **rare**, and **unexpected**.

**Isolation Forest** is built for this. It doesn't cluster or model data — it literally tries to **isolate** each point using random decision trees. Outliers get isolated **faster** than normal points.

> **Analogy**:  
> Imagine finding a celebrity in a crowd. You don’t need a whole biography — just a few quick questions (“Is he 7 feet tall?” “Is she wearing a red cape?”).  
> **Weird traits = short path to isolation.**

---

### 🧠 Key Terminology

| Term | Feynman Explanation |
|------|---------------------|
| **Isolation Forest** | A collection of trees that randomly split data to isolate points |
| **Path Length** | How many splits it takes to isolate a point |
| **Anomaly Score** | Shorter path → more likely to be an outlier |
| **Random Subsampling** | Each tree is built on a small random subset of the data; at every node a random feature and a random split value are chosen |
| **Tree Ensemble** | Using many trees to get a reliable average path length |

---

### 💼 Use Cases

- Financial fraud detection (rare transactions)
- Intrusion detection in networks
- Industrial monitoring (e.g., sensor drift)
- Outlier rejection before training supervised models

```plaintext
    Have unlabeled data?
            ↓
   Want to flag rare points?
            ↓
     → Use Isolation Forest
            ↓
     Fast isolation → likely outlier
```

---

## **2. Mathematical Deep Dive** 🧮

### 📐 Core Equation

Let \( h(x) \) be the average path length needed to isolate point \( x \) across the trees, and \( c(n) \) be the average path length of an unsuccessful search in a binary search tree over \( n \) samples, which normalizes \( h(x) \):

- **Anomaly Score**:
  $$
  s(x, n) = 2^{-\frac{h(x)}{c(n)}}
  $$

Where:
- \( s(x, n) \to 1 \) → strong anomaly (isolated quickly)  
- \( s(x, n) \) well below 0.5 → normal instance  
- \( s(x, n) \approx 0.5 \) for every point → the sample has no distinct anomalies
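
As a rough sketch of how the score is computed: the normalizer in the original Isolation Forest paper is \( c(n) = 2H(n-1) - \frac{2(n-1)}{n} \), with the harmonic number approximated as \( H(i) \approx \ln(i) + 0.5772 \). The function names below are illustrative and not part of any library.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n: int) -> float:
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-h(x) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

n = 256
print(round(c(n), 2))            # roughly 10.24 for a 256-point subsample
print(anomaly_score(c(n), n))    # 0.5: a path of exactly average length
print(anomaly_score(2.0, n))     # short path, score well above 0.5
```

A point whose average path equals \( c(n) \) lands exactly on 0.5, which is why 0.5 acts as the "nothing special" reference level.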

---

### 🧲 Math Intuition

Random splits isolate extreme values fast:
- Outliers → isolated in **few splits** → **short path**  
- Normal points → take **more splits** → **longer path**
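
The effect is easy to reproduce without any library. The sketch below is a simplified one-dimensional stand-in for a single isolation tree (not scikit-learn's implementation): it keeps splitting the interval that still contains the target until the target is alone, and counts the splits.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits until `target` is the only point left."""
    depth = 0
    current = list(values)
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining points are identical, cannot split further
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

rng = random.Random(0)
values = [rng.gauss(0, 1) for _ in range(200)] + [8.0]  # 8.0 is a planted outlier
typical = sorted(values)[len(values) // 2]               # a point in the dense middle

def mean_depth(target, trials=300):
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

outlier_depth = mean_depth(8.0)
typical_depth = mean_depth(typical)
print(outlier_depth, typical_depth)  # the outlier needs far fewer splits on average
```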

This is different from density methods (like LOF) — Isolation Forest doesn't estimate distance or density, just *splittability*.

---

### ⚠️ Assumptions & Constraints

| Assumes...                            | Pitfalls                             |
|--------------------------------------|--------------------------------------|
| Outliers are few and separable       | Fails if anomalies cluster together  |
| Features can be split meaningfully   | Poor performance on flat/noisy data  |
| Anomalies show up along individual features | Axis-parallel splits can miss anomalies that only appear in combinations of correlated features |

---

## **3. Critical Analysis** 🔍

| Feature               | Isolation Forest                 | Other Methods (LOF, One-Class SVM)  |
|-----------------------|----------------------------------|--------------------------------------|
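| Speed                 | Fast (random trees on small subsamples) | Usually slower (distance or kernel based) |
| Interpretability      | Medium (path length logic)       | Low to medium                        |
| Scaling to Big Data   | Good (scoring cost grows roughly linearly with sample count) | Weaker (neighbor search or kernel matrices) |
| Non-Euclidean Spaces  | Needs numeric, orderable features but no distance metric | Depend on the chosen metric or kernel |
| Handles High Dimensions| Often workable, though many irrelevant features dilute the signal | Harder (distances lose contrast in high dimensions) |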

---

### 🧬 Ethical Lens

- **False positives** can occur in underrepresented groups if their behavior appears “unusual” to the model
- Always validate flagged anomalies with **human review**, especially in finance or health contexts

---

### 🔬 Research Updates (Post-2020)

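- **SCiForest**: Isolation Forest variant with a split-selection criterion, aimed at clustered anomalies (published in 2010, so it predates this list's cutoff)  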
- **Hybrid models**: Combine Isolation Forest with LSTM for time-series anomaly detection  
- **Explainable Anomaly Detection**: Feature attribution for outlier score (e.g., SHAP + iForest)

---

## **4. Interactive Elements** 🎯

### ✅ Concept Check

**Q: Why do outliers tend to have shorter path lengths in Isolation Forests?**

A. They are always smaller in value  
B. They get split later in trees  
C. They’re far from dense clusters, so isolated faster  
D. They have higher reconstruction error

✅ **Correct Answer: C**  
**Explanation**: Outliers lie on the fringes and get split off quickly — so they have short paths in many trees.

---

### 🧪 Code Debug Challenge

```python
# Buggy: incorrect scoring method
scores = model.predict(X_test)  # returns labels, not scores
```

**Fix:**

```python
scores = model.decision_function(X_test)  # Higher = more normal
anomaly_score = -scores  # Invert for anomaly interpretation
```
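
To make the difference between the three scoring methods concrete, here is a small sketch on made-up data (the planted outlier and all variable names are illustrative). It relies on two documented scikit-learn behaviors: `decision_function` is `score_samples` shifted by the fitted `offset_`, and `predict` flags rows whose decision value is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 ordinary points plus one planted outlier in the last row
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6.0, 6.0]]])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)

labels = model.predict(X)              # +1 = inlier, -1 = outlier
decision = model.decision_function(X)  # negative = outlier
raw = model.score_samples(X)           # lower = more abnormal

# decision_function is score_samples shifted by the fitted offset_
assert np.allclose(decision, raw - model.offset_)
# predict is the sign of decision_function
assert np.array_equal(labels == -1, decision < 0)

print(labels[-1], decision[-1])  # the planted outlier is labeled -1
```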

---

## **5. Glossary**

| Term | Meaning |
|------|--------|
| **Isolation Forest** | Ensemble of trees built to isolate points |
| **Path Length** | Number of splits needed to isolate a data point |
| **Anomaly Score** | Normalized value showing how quickly a point was isolated |
| **Decision Function** | scikit-learn method returning shifted scores: negative for outliers, positive for inliers |
| **Random Partitioning** | Splitting data randomly instead of optimizing like in decision trees |

---

## **6. Practical Considerations** ⚙️

### 🔧 Hyperparameters

| Param             | Description                          | Tip                              |
|------------------|--------------------------------------|----------------------------------|
| `n_estimators`   | Number of trees                      | 100–200 usually enough           |
| `max_samples`    | Size of subsample per tree           | Default `'auto'` uses `min(256, n_samples)` |
| `contamination`  | Expected fraction of outliers (sets the label threshold) | Default `'auto'`; set manually if known (e.g., 0.01) |

```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
```
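
One detail worth seeing in code: `contamination` only moves the threshold used by `predict`; it does not change how the trees are built or the raw scores. A minimal check on synthetic data (variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))

# Same seed, different contamination
strict = IsolationForest(contamination=0.01, random_state=42).fit(X)
loose = IsolationForest(contamination=0.10, random_state=42).fit(X)

# Raw scores are identical; only the cutoff (offset_) differs
same_scores = np.allclose(strict.score_samples(X), loose.score_samples(X))
n_strict = int((strict.predict(X) == -1).sum())
n_loose = int((loose.predict(X) == -1).sum())
print(same_scores, n_strict, n_loose)
```

In practice this means you can fit once, inspect the score distribution, and pick a cutoff afterwards instead of refitting for each candidate contamination value.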

---

### 📏 Evaluation Metrics

- **Precision, Recall, F1** (if labeled anomalies available)  
- **ROC-AUC** (binary classification)  
- **Visual threshold tuning** on score distribution

```python
from sklearn.metrics import roc_auc_score
# y_true: 1 = anomaly, 0 = normal; negate so that higher = more anomalous
roc_auc_score(y_true, -model.decision_function(X_test))
```

---

### ⚙️ Production Tips

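- Feature scaling is **optional** for the trees themselves (each split thresholds a single feature, so per-feature rescaling barely changes the partitions); keep `StandardScaler` when the same features also feed distance-based detectors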
- Visualize **score distributions** to tune decision threshold
- Combine with **domain rules** or **time-based filtering** in deployment

---

## **7. Full Python Code Cell** 🐍

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset with outliers (seeded so the plot is reproducible)
rng = np.random.default_rng(42)
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)
X_outliers = rng.uniform(low=-6, high=6, size=(20, 2))
X_combined = np.vstack([X, X_outliers])

# Normalize
X_scaled = StandardScaler().fit_transform(X_combined)

# Fit Isolation Forest
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X_scaled)

# Get anomaly scores
scores = -model.decision_function(X_scaled)
labels = model.predict(X_scaled)

# Plot results
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=scores, cmap='coolwarm', s=30, edgecolors='k')
plt.colorbar(label="Anomaly Score (Higher = More Outlier)")
plt.title("Anomaly Detection with Isolation Forest")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
```

---

All done: Isolation Forest, explained and deployed 🔬  
Next: Want to compare it with One-Class SVM or LOF? Or jump into use cases and real-data examples like **fraud detection**?

Perfect. Time to **compare Isolation Forest** with its two most notable anomaly detection cousins:  
🔍 **One-Class SVM** and **Local Outlier Factor (LOF)**.

---

## 🧩 **Comparison with Other Methods**  
🎯 *Isolation Forest vs One-Class SVM vs LOF*  
(UTHU-style structured breakdown)

---

## **1. Conceptual Foundation**

### 🎯 Purpose & Relevance

Isolation Forest is **fast and tree-based**, but it’s not the only way to find anomalies.

Let’s contrast it with:
- **One-Class SVM (OCSVM)** — a kernel-based method that learns a decision boundary  
- **Local Outlier Factor (LOF)** — a density-based method that compares how crowded your neighborhood is

> **Analogy**:  
> - Isolation Forest: “How fast can I isolate you with random questions?”  
> - One-Class SVM: “Do you fall inside or outside the normal bubble?”  
> - LOF: “Are your neighbors unusually far away?”

---

### 🧠 Key Terminology

| Term              | Feynman-style Explanation |
|-------------------|---------------------------|
| **One-Class SVM** | Learns a boundary around “normal” — flags anything outside it |
| **LOF**           | Compares how far you are from your neighbors vs how far they are from theirs |
| **Kernel Method** | Measures similarity as if data were mapped to a higher-dimensional space where it is easier to separate |
| **Density Score** | LOF score indicating isolation in sparse areas |
| **Decision Boundary** | SVM's invisible line separating normal from abnormal |

---

### 💼 Use Cases

| Task Type                     | Best Method        |
|-------------------------------|--------------------|
| Huge datasets (millions)      | ✅ Isolation Forest |
| Small, complex feature sets   | ✅ One-Class SVM    |
| Detecting local group outliers| ✅ LOF              |
| Interpretability needed       | ✅ LOF (neighborhood logic) |

```plaintext
     Need to detect outliers?
             ↓
    High-dimensional, big data    → Isolation Forest
    Small dataset, tight clusters → One-Class SVM
    Neighborhood-based outliers   → LOF
```

---

## **2. Mathematical Deep Dive** 🧮

### 📐 Key Equations

#### **One-Class SVM**:
Finds a function \( f(x) \) such that:
$$
f(x) > 0 \Rightarrow \text{inlier}, \quad f(x) < 0 \Rightarrow \text{outlier}
$$

Uses kernel trick \( K(x, x') \) to define separation boundary.

#### **LOF**:
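Outlier score is the **average ratio** of each neighbor's local density to the point's own density (here `density` means local reachability density over the \( k \) nearest neighbors \( N_k(x) \)). Values near 1 indicate an inlier; values well above 1 indicate an outlier: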
$$
\text{LOF}(x) = \frac{\sum_{i \in N_k(x)} \frac{\text{density}(i)}{\text{density}(x)}}{|N_k(x)|}
$$
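
The formula can be traced step by step in plain NumPy. This is a compact sketch for small datasets, assuming Euclidean distance and ignoring distance ties; `lof_scores` is an illustrative name, and scikit-learn's `LocalOutlierFactor` is the practical choice for real data.

```python
import numpy as np

def lof_scores(X, k=5):
    """Local Outlier Factor for every row of X (brute force, O(n^2) memory)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # pairwise distances
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]          # N_k(x): k nearest neighbors
    d_to_nbrs = np.take_along_axis(dist, nbrs, axis=1)
    k_dist = d_to_nbrs[:, -1]                       # distance to the k-th neighbor
    # reach_dist(x, o) = max(k_dist(o), d(x, o)) for each neighbor o of x
    reach = np.maximum(k_dist[nbrs], d_to_nbrs)
    lrd = 1.0 / reach.mean(axis=1)                  # local reachability density
    return lrd[nbrs].mean(axis=1) / lrd             # mean of density(i) / density(x)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])  # last row is far from the cluster
scores = lof_scores(X, k=5)
print(np.median(scores[:-1]), scores[-1])  # cluster points sit near 1, the far point is much larger
```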

---

### 🧲 Math Intuition

| Model           | “How it thinks” |
|----------------|------------------|
| **Isolation Forest** | Outliers split off quickly → low path length |
| **OCSVM**            | Separate the data from the origin in kernel space (with an RBF kernel this acts like a tight enclosing boundary) → everything outside is weird |
| **LOF**              | Compare local density → if your area is empty, you’re weird |

---

### ⚠️ Assumptions & Constraints

| Method            | Works Best When...                 | Pitfalls                             |
|-------------------|------------------------------------|--------------------------------------|
| Isolation Forest  | High-dimensional, unlabeled, large data | Fails with correlated, subtle anomalies |
| One-Class SVM     | Few features, smooth boundaries    | Doesn’t scale, fails in high dims     |
| LOF               | Local context is meaningful        | Sensitive to `k`, hard to parallelize |

---

## **3. Critical Analysis** 🔍

| Feature               | Isolation Forest       | One-Class SVM       | LOF                   |
|------------------------|------------------------|----------------------|------------------------|
| Speed & Scalability    | ✅ Fast, parallelizable | ❌ Slow, scales poorly| ❌ Medium              |
| Handles High Dim Data  | ✅ Yes                 | ❌ Poorly             | ❌ Poorly              |
| Local Outliers         | ❌ Not ideal           | ❌ Weak               | ✅ Excellent           |
| Interpretable Logic    | Medium                | Low                  | ✅ High (neighborhood) |
| Hyperparam Sensitivity | ✅ Low                 | ❌ High (nu, gamma)   | ❌ High (k neighbors)  |

---

### 🧬 Ethical Lens

- LOF may **miss global anomalies** that look normal locally (e.g., a small, tight group of bad points), and OCSVM results can shift noticeably with `nu` and `gamma`  
- Isolation Forest might **flag legitimate minority behaviors** as anomalies  
- Always validate anomaly results with **domain experts or post-hoc interpretable tools**

---

### 🔬 Research Updates (Post-2020)

- **Deep SVDD**: A deep-network version of Support Vector Data Description, a close relative of One-Class SVM (introduced in 2018, with follow-up work since)  
- **LOF + Embeddings**: Works better when applied to **latent space** from an autoencoder  
- **Isolation Forest + SHAP**: Used to explain why a point was marked anomalous  

---

## **4. Interactive Elements** 🎯

### ✅ Concept Check

**Q: Which method is best for detecting anomalies within dense clusters?**

A. Isolation Forest  
B. One-Class SVM  
C. Local Outlier Factor  
D. PCA

✅ **Correct Answer: C**  
**Explanation**: LOF compares local densities, making it ideal for spotting points that are unusual *within* a cluster.

---

### 🧪 Code Debug

```python
# Buggy: using One-Class SVM without scaling
from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf').fit(X)
```

**Fix:**

```python
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
model = OneClassSVM(kernel='rbf').fit(X_scaled)
```
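
A common way to make sure the scaler is always applied, at fit time and at prediction time, is to wrap both steps in a pipeline. The sketch below uses made-up data with two features on very different scales; `gamma="scale"` and `nu=0.05` are starting values, not recommendations.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Illustrative data: second feature is about 1000x larger than the first
X_train = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
X_new = np.array([[0.0, 0.0],    # typical point
                  [8.0, 0.0]])   # 8 standard deviations out on the small-scale feature

detector = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"),
)
detector.fit(X_train)

pred = detector.predict(X_new)  # +1 = inlier, -1 = outlier
print(pred)
```

Without scaling, the RBF kernel would be dominated by the large-scale feature and the deviation in the first feature would be hard to see.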

---

## **5. Glossary**

| Term | Definition |
|------|------------|
| **One-Class SVM** | Finds boundary around “normal” data using support vectors |
| **LOF (Local Outlier Factor)** | Flags points with low local density |
| **Isolation Forest** | Detects anomalies by random splitting |
| **Kernel Trick** | Computes similarities as if data were mapped into a higher-dimensional space, without building that mapping explicitly |
| **Contamination** | Assumed fraction of anomalies in data |

---

## **6. Practical Considerations** ⚙️

### 🔧 Hyperparameters

| Model | Key Params         | Heuristics                                |
|--------|--------------------|-------------------------------------------|
| iForest | `n_estimators`, `contamination` | Use 100 trees, set contamination if known |
| OCSVM  | `nu`, `gamma`       | Set `nu` near the expected outlier fraction (e.g., `0.05`); start from `gamma='scale'` and tune |
| LOF    | `n_neighbors`       | `k=20` is common; tune based on data shape |

---

### 📏 Evaluation Metrics

- **Precision/Recall** for flagged outliers  
- **ROC AUC** if true labels are available  
- Use **score distributions** to set thresholds

```python
from sklearn.metrics import roc_auc_score
# lof: a fitted LocalOutlierFactor; y_true: 1 = anomaly, 0 = normal (for the training rows)
roc_auc_score(y_true, -lof.negative_outlier_factor_)
```
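
Note that `negative_outlier_factor_` only covers the rows LOF was fitted on. To score unseen rows, scikit-learn's `LocalOutlierFactor` has to be created with `novelty=True`. A minimal sketch on synthetic data (all variable names are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))
X_new = np.array([[0.1, -0.2],   # close to the training cloud
                  [7.0, 7.0]])   # far away from it

# Outlier-detection mode (default): scores exist only for the training rows
lof = LocalOutlierFactor(n_neighbors=20).fit(X_train)
train_scores = -lof.negative_outlier_factor_     # higher = more abnormal

# Novelty mode: fit on (mostly clean) data, then score unseen rows
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
new_labels = lof_novelty.predict(X_new)          # +1 = inlier, -1 = outlier
new_scores = -lof_novelty.score_samples(X_new)   # higher = more abnormal
print(new_labels, new_scores)
```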

---

## **7. Full Python Code Cell** 🐍  
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)
X_outliers = np.random.default_rng(42).uniform(low=-6, high=6, size=(20, 2))  # seeded for reproducibility
X_combined = np.vstack([X, X_outliers])
X_scaled = StandardScaler().fit_transform(X_combined)

# Fit models (random_state fixed so the Isolation Forest result is reproducible)
models = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=42).fit(X_scaled),
    "One-Class SVM": OneClassSVM(nu=0.05, kernel='rbf').fit(X_scaled),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20)
}

# Get predictions
labels = {
    "Isolation Forest": models["Isolation Forest"].predict(X_scaled),
    "One-Class SVM": models["One-Class SVM"].predict(X_scaled),
    "Local Outlier Factor": models["Local Outlier Factor"].fit_predict(X_scaled)
}

# Plot comparisons
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, (name, lbls) in enumerate(labels.items()):
    axes[i].scatter(X_scaled[:, 0], X_scaled[:, 1], c=lbls, cmap='coolwarm', s=30, edgecolors='k')
    axes[i].set_title(f"{name}")
    axes[i].grid(True)

plt.suptitle("Anomaly Detection: IF vs OCSVM vs LOF", fontsize=14)
plt.tight_layout()
plt.show()
```

---

✅ You now have a comparative framework to choose the **right anomaly detector** for the job.  
Next in queue: Want to roll into **real-world use cases** (fraud/system anomalies), or explore interpretability tools like **SHAP for outliers**?

Let’s land this anomaly detection plane with the **why-it-matters** piece:  
🛡️ **Real-World Use Cases of Isolation Forest and Anomaly Detection**

---

## 🧩 **Use Cases**  
🔍 *Fraud Detection, System Monitoring, and Beyond*  
(*UTHU-structured summary*)

---

## **1. Conceptual Foundation**

### 🎯 Purpose & Relevance

Anomalies may be **rare**, but they’re often **the most important events** in your data:
- A single fraudulent transaction
- One failing machine
- A strange login from a foreign country

Isolation Forest and its anomaly detection cousins exist **specifically to find these needles in haystacks** — and do it **without needing labels**.

> **Analogy**:  
> Imagine a security guard trained not on every criminal ever, but simply on spotting anyone who acts just a bit too… different.  
> That's what unsupervised anomaly detection does.

---

### 🧠 Key Terminology

| Term | Explanation |
|------|-------------|
| **Fraud Detection** | Spotting financial or behavioral events that deviate from norms |
| **System Monitoring** | Tracking sensors, servers, or users for unusual activity |
| **Concept Drift** | When normal behavior changes over time (e.g., seasonality) |
| **Outlier** | A data point that doesn’t conform to the expected pattern |
| **Online Detection** | Real-time anomaly flagging during data stream processing |

---

### 💼 Primary Use Cases

---

### 🏦 1. **Fraud Detection**

**Why it fits**:
- Fraud is usually **rare**, **subtle**, and **evolving**
- Isolation Forest works well with **high-dimensional**, unlabeled transaction logs

**Common Targets**:
- Credit card fraud (small % of transactions are fake)
- Insurance fraud (anomalous claims)
- E-commerce bots (abnormal browsing + purchases)

```plaintext
Transaction Logs → Feature Vectors → Isolation Forest → Outlier Scores → Flag frauds
```

**Note**: Labels are often delayed (i.e., fraud confirmed days later), so unsupervised detection is key.

---

### 🛠️ 2. **System Monitoring**

**Why it fits**:
- Systems are mostly stable — anomalies often indicate **failures**, **attacks**, or **deviations**
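- Isolation Forest scores new points quickly, which suits near-real-time monitoring (the scikit-learn implementation has no incremental update, so refit periodically)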

**Common Targets**:
- Server CPU/memory spikes  
- Sensor drift or hardware degradation  
- Network intrusions (e.g., abnormal IP behavior)

```plaintext
Sensor Logs or API Events → Sliding Windows → Anomaly Scores → Alert or Auto-response
```

---

### 👤 3. **User Behavior Analytics (UBA)**

- Find **insider threats** (employees accessing strange files)
- Detect **credential theft** via login patterns
- Spot **outlier usage** in SaaS products (unusual click paths)

---

### 🚨 4. **Security & Intrusion Detection**

- Real-time detection of **network anomalies**  
- Complement to rule-based firewalls (more adaptive)
- Works in **zero-day attacks** where no labeled data exists

---

### 🧪 5. **Preprocessing for Supervised Models**

- Clean training sets by removing **extreme feature-space outliers**
- Filter outliers from regression datasets
- Surface unusual samples (which may include mislabeled ones) for review

---

## **2. Mathematical Deep Dive** 🧮

The **core insight** for all these use cases:
$$
s(x) = 2^{-\frac{h(x)}{c(n)}} \quad\Rightarrow\quad \text{lower } h(x) \text{ gives a higher anomaly score}
$$

You can:
- Set thresholds (e.g., top 1% of scores)
- Use scores for **ranking**, **visualization**, or **alerting**
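
Both options come down to a few lines of NumPy. In this sketch the scores are random stand-ins for real anomaly scores (higher = more abnormal), and `flag_top_fraction` is an illustrative helper, not a library function.

```python
import numpy as np

def flag_top_fraction(scores, fraction=0.01):
    """Return (mask, threshold) flagging the highest-scoring `fraction` of points."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.quantile(scores, 1.0 - fraction)
    return scores > threshold, threshold

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)               # stand-in for anomaly scores

mask, threshold = flag_top_fraction(scores, fraction=0.01)
ranked = np.argsort(-scores)[:5]             # indices of the five most anomalous points

print(int(mask.sum()), threshold)            # 10 of 1000 points are flagged
print(ranked)
```

A fixed fraction keeps the alert volume predictable; a fixed score threshold keeps the meaning of an alert stable. Which one fits depends on how the alerts are consumed.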

---

## **3. Critical Analysis** 🔍

| Use Case              | Why Isolation Forest Works       | Caveats / Challenges             |
|------------------------|----------------------------------|----------------------------------|
| Fraud Detection        | Handles high-dim, sparse data    | Fraud adapts — retrain often     |
| System Monitoring      | Works in real-time, fast scoring | Static model may miss drift      |
| Insider Threats        | No need for labeled “bad users”  | False positives if behavior varies |
| Sensor Outlier Cleaning| Captures rare, irregular spikes | Edge case anomalies may be lost  |

---

### 🧬 Ethical Lens

- False positives can **lock out legit users** (security) or **delay payouts** (insurance)
- Models may flag behavior from **minority groups** as abnormal simply due to underrepresentation
- Always pair anomaly detection with:
  - Human review  
  - Clear audit trail  
  - Fairness checks

---

### 🔬 Research Updates (Post-2020)

- **Anomaly explanation**: Using SHAP or LIME to explain outlier scores  
- **Drift-aware detectors**: Combine Isolation Forest with concept drift monitoring  
- **Hybrid pipelines**: Use anomaly scores as inputs to supervised fraud classifiers

---

## **4. Interactive Elements** 🎯

### ✅ Concept Check

**Q: Why is Isolation Forest effective in fraud detection tasks?**

A. It uses labels to detect rare cases  
B. It models every user behavior directly  
C. It isolates rare behaviors quickly without labels  
D. It requires deep neural networks

✅ **Correct Answer: C**  
**Explanation**: Isolation Forest doesn’t need labels and is built to quickly split off unusual data — perfect for rare-event detection like fraud.

---

## **5. Glossary**

| Term | Definition |
|------|------------|
| **Anomaly Detection** | Finding rare, unusual, or suspicious data points |
| **Fraud** | Intentionally deceptive behavior in finance, insurance, or systems |
| **System Monitoring** | Watching for performance or behavioral failures |
| **Concept Drift** | When “normal” behavior changes over time |
| **Outlier Score** | Value indicating how “weird” a point is, based on model logic |

---

## **6. Practical Considerations** ⚙️

### 🔧 Hyperparameters for Deployment

- `contamination`: Based on expected fraud/anomaly rate (e.g., `0.01`)
- Use **retraining** on rolling windows (daily/weekly)
- Always log **score + timestamp + context** for forensic review

---

### 📏 Evaluation Metrics

- **Precision / Recall** (for fraud or incident tickets)  
- **Time to detect** (real-time system monitoring)  
- **Manual review accuracy** (if human audit included)

---

### ⚙️ Production Tips

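- scikit-learn's `IsolationForest` has no `partial_fit`; for **streams**, score incoming mini-batches with an already fitted model  
- For dynamic data, **refit on rolling windows** (`warm_start=True` only adds trees, it does not update existing ones)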
- Flag outliers, but **don’t auto-punish** — route to analyst workflow
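
A rolling-window setup might look like the sketch below. It assumes data arrives as equally shaped NumPy batches and simply refits on the most recent `window` batches before scoring each new one; `rolling_scores` is a hypothetical helper, and refitting on every batch is a simplification you would likely relax in production.

```python
from collections import deque

import numpy as np
from sklearn.ensemble import IsolationForest

def rolling_scores(batches, window=5, **model_kwargs):
    """Score each batch with a model fitted on the previous `window` batches."""
    history = deque(maxlen=window)
    for batch in batches:
        if history:
            model = IsolationForest(**model_kwargs).fit(np.vstack(history))
            yield -model.score_samples(batch)   # higher = more abnormal
        else:
            yield None                          # nothing to fit on yet
        history.append(batch)

rng = np.random.default_rng(0)
batches = [rng.normal(size=(200, 3)) for _ in range(4)]
batches[-1][0] = [9.0, 9.0, 9.0]                # inject a spike into the last batch

results = list(rolling_scores(batches, window=3, n_estimators=50, random_state=0))
print(results[0], int(np.argmax(results[-1])))  # first batch unscored; spike ranks highest
```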

---

All done — you’ve now covered the **full loop** of Isolation Forest: theory → math → code → comparison → use cases.

Next move: want to wrap this with a mini project / case study (e.g., synthetic fraud dataset), or move into **Manifold Learning** with t-SNE + UMAP?

Let’s close the Isolation Forest chapter with a full **hands-on walkthrough**:  
💸 **Detecting Rare Transactions in Financial Data** — an end-to-end example using Isolation Forest.

---

## 🧩 **Example – Detecting Rare Transactions in Financial Data**  
(*UTHU-structured summary*)

---

## **1. Conceptual Foundation**

### 🎯 Purpose & Relevance

In real-world financial data:
- Fraudulent transactions are **extremely rare** (e.g., <1%)  
- We often don’t have labels in advance  
- Outliers often come from new fraud patterns, which supervised models trained on past fraud tend to miss

Isolation Forest is **ideal** here:
- Doesn’t need labels  
- Fast on large transaction logs  
- Finds anomalies by **splitting off the weird**

> **Analogy**:  
> Picture thousands of credit card swipes.  
> Most are normal.  
> But one person just bought 6 Rolexes at 2AM from Tokyo.  
> Isolation Forest raises its hand and says, “That’s... different.”

---

### 🧠 Key Terminology

| Term | Feynman-style Explanation |
|------|---------------------------|
| **Transaction Vector** | All features of a transaction (amount, time, device, etc.) |
| **Anomaly Score** | Measure of how quickly a transaction was isolated |
| **Contamination Rate** | Fraction of data expected to be fraud |
| **Threshold** | Score cutoff to decide fraud or not |
| **False Positive** | Legit transaction wrongly flagged as fraud |

---

### 💼 Typical Features in Transaction Data

- `transaction_amount`
- `transaction_hour`
- `merchant_type` (encoded)
- `location_distance` from home
- `device_trust_score`
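
A minimal sketch of turning such fields into a numeric matrix, assuming `merchant_type` arrives as a string category. The three rows are made-up values for illustration only.

```python
import pandas as pd

# Hypothetical raw transactions; column names follow the list above
raw = pd.DataFrame({
    "transaction_amount": [42.5, 18.0, 950.0],
    "transaction_hour": [14, 9, 3],
    "merchant_type": ["grocery", "fuel", "jewelry"],
    "location_distance": [2.1, 7.4, 8200.0],
    "device_trust_score": [0.95, 0.88, 0.20],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
features = pd.get_dummies(raw, columns=["merchant_type"], dtype=float)
print(features.columns.tolist())
```

One-hot columns give the trees binary features to split on; for very high-cardinality categories a frequency or target-free ordinal encoding may be more practical.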

---

## **2. Mathematical Deep Dive** 🧮

### 📐 Core Logic

Isolation Forest estimates anomaly score using:
$$
s(x) = 2^{-\frac{h(x)}{c(n)}}
$$

Where:
- \( h(x) \): Average path length in trees
- \( c(n) \): Expected path length in random tree
- \( s(x) \to 1 \) → Very anomalous transaction

---

### 🧲 Math Intuition

- **High transaction amount + untrusted device + foreign country** = isolated quickly in trees  
- **Low score** = looks like other transactions  
- **High score** = odd, rare, suspect

---

### ⚠️ Assumptions & Constraints

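- Assumes fraud cases are scattered rather than clustered; in that setting Isolation Forest is usually a better fit than LOF  
- Scaling is optional for the trees, but keep it if the same features also feed distance-based models  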
- Will have **false positives** — use with analyst workflow

---

## **3. Practical Considerations** ⚙️

### 🔧 Hyperparameters

```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(
    n_estimators=100,
    contamination=0.01,  # Assume ~1% fraud
    random_state=42
)
```

### 📏 Evaluation Metrics

- If labels available: **precision, recall, ROC AUC**
- If not: sample flagged transactions for **manual review**
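
When analysts can only review a fixed number of alerts, precision among the top-k scored transactions is a convenient summary. A small sketch with hand-made labels and scores (`precision_at_k` is an illustrative helper):

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of true frauds among the k highest-scoring transactions."""
    top_k = np.argsort(-np.asarray(scores))[:k]
    return float(np.asarray(y_true)[top_k].mean())

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])                  # 1 = confirmed fraud
scores = np.array([0.1, 0.4, 0.9, 0.2, 0.3, 0.8, 0.05, 0.15])

print(precision_at_k(y_true, scores, k=2))  # 0.5: one of the top two is fraud
```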

---

## **4. Critical Analysis** 🔍

| Benefit                        | Risk                                |
|-------------------------------|--------------------------------------|
| Fast, scalable to big data    | Can flag legitimate large purchases |
| No labels required            | May miss subtle frauds in dense regions |
| Easy to plug into workflows   | Needs regular retraining as fraud evolves |

---

## **5. Glossary**

| Term | Definition |
|------|------------|
| **Fraud Score** | Anomaly score used to rank transactions; a relative measure, not a calibrated probability of fraud |
| **Manual Review** | Analyst checks flagged transactions |
| **Path Length** | How many splits to isolate a point in a tree |
| **False Alarm** | Legit transaction marked as fraud |
| **Rolling Retrain** | Model is updated regularly to track new fraud patterns |

---

## **6. Full Python Code Cell** 🐍

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Simulated transaction dataset
np.random.seed(42)
n_legit = 1000
n_fraud = 15

# Legit transactions
legit = pd.DataFrame({
    'amount': np.random.normal(50, 10, n_legit),
    'hour': np.random.normal(13, 3, n_legit),
    'distance': np.random.normal(5, 2, n_legit),
    'device_trust': np.random.normal(0.9, 0.05, n_legit)
})

# Fraudulent transactions (outliers)
fraud = pd.DataFrame({
    'amount': np.random.normal(300, 50, n_fraud),
    'hour': np.random.normal(3, 1, n_fraud),
    'distance': np.random.normal(50, 10, n_fraud),
    'device_trust': np.random.normal(0.3, 0.1, n_fraud)
})

# Combine and scale
data = pd.concat([legit, fraud], ignore_index=True)
labels = np.array([0]*n_legit + [1]*n_fraud)  # 1 = fraud
X_scaled = StandardScaler().fit_transform(data)

# Fit Isolation Forest
model = IsolationForest(contamination=0.015, random_state=42)
model.fit(X_scaled)
scores = -model.decision_function(X_scaled)
preds = model.predict(X_scaled)  # -1 = anomaly, 1 = normal

# Plot results
plt.figure(figsize=(10, 5))
plt.hist(scores[:n_legit], bins=50, alpha=0.6, label='Legit')
plt.hist(scores[n_legit:], bins=10, alpha=0.9, label='Fraud')
# scores = -decision_function, so the model's own cutoff sits at 0
plt.axvline(0.0, color='red', linestyle='--', label='Threshold (decision_function = 0)')
plt.xlabel("Anomaly Score")
plt.ylabel("Number of Transactions")
plt.title("Anomaly Scores: Legit vs Fraud")
plt.legend()
plt.grid(True)
plt.show()

# Print performance
from sklearn.metrics import classification_report
mapped_preds = np.where(preds == -1, 1, 0)  # Map to 1 = fraud
print(classification_report(labels, mapped_preds, target_names=["Legit", "Fraud"]))
```

---

✅ You’ve just deployed an **unsupervised fraud detection model** — from feature simulation to anomaly scoring, thresholding, and evaluation.

Ready to head into the next UTHU chapter:  
🧠 **Manifold Learning with UMAP + t-SNE**?