# Real-World Use Case: Credit Card Fraud (Unsupervised)

## 1. The Problem
We have millions of transactions but no reliable fraud labels, because new types of fraud are invented every day. We want to flag transactions that look *suspiciously different* from legitimate ones.

## 2. Why Isolation Forest?
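It assumes anomalies are "few and different": a point that sits far from the bulk of the data can be separated from everything else with only a few random splits, so a short average path through the trees signals an anomaly. Each tree is built on a small random subsample, which keeps training efficient on large datasets, and the method works well when the "bad" class is extremely rare (< 0.1%).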
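
To make the "few and different" idea concrete, here is a minimal sketch on toy data. The dataset (500 Gaussian inliers plus one hand-placed outlier at `(6, 6)`) and the names `X_toy` and `forest` are assumptions chosen for illustration only. It compares the score of the outlier with the scores of the inliers using scikit-learn's `score_samples`, where lower values mean "more anomalous".

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): 500 inliers near the origin
# plus one obvious outlier placed far away by hand.
inliers = rng.normal(0, 1, size=(500, 2))
outlier = np.array([[6.0, 6.0]])
X_toy = np.vstack([inliers, outlier])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_toy)

# score_samples returns the negated anomaly score: lower = more anomalous.
scores = forest.score_samples(X_toy)

print(f"median inlier score: {np.median(scores[:-1]):.3f}")
print(f"outlier score:       {scores[-1]:.3f}")
```

The outlier should score noticeably lower than a typical inlier, because it tends to be isolated after very few splits.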

## 3. Data Simulation
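We simulate two PCA-style features, V1 and V2, which is typical of anonymized credit card data where the original fields are replaced by PCA components (V1, V2, ...).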

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# 1. Generate Data
np.random.seed(42)
n_normal = 2000
n_fraud = 20

# Normal transactions (Dense cluster)
X_normal = np.random.normal(0, 1, (n_normal, 2))

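# Fraud transactions (Scattered, far from center)
# Oversample candidates, keep only the ones far from the origin,
# then take exactly n_fraud of them so the count matches n_fraud.
candidates = np.random.uniform(low=-4, high=4, size=(n_fraud * 10, 2))
X_fraud = candidates[np.linalg.norm(candidates, axis=1) > 2.5][:n_fraud]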

X = np.concatenate([X_normal, X_fraud])

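# 2. Scale features
# Isolation Forest splits one feature at a time, so it is insensitive to
# per-feature scaling; the step is kept as part of a standard pipeline.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)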

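# 3. Train Isolation Forest
# contamination is the expected fraction of anomalies and sets the decision
# threshold. Here we use 0.01 as an estimate; 'auto' is the alternative.
clf = IsolationForest(contamination=0.01, random_state=42)
# fit_predict returns -1 for anomalies and 1 for inliers
y_pred = clf.fit_predict(X_scaled)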

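# 4. Visualize
plt.figure(figsize=(10, 6))
# y_pred is -1 for anomalies and 1 for inliers, so (y_pred + 1) // 2 maps
# anomalies to index 0 (orange) and inliers to index 1 (blue).
colors = np.array(['#ff7f00', '#377eb8'])
plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])
plt.title("Fraud Detection (Orange = Detected Anomaly)")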
plt.show()

# How many transactions did we flag?
# Without labels we cannot say how many of these are real fraud;
# in unsupervised settings, we often just hand the 'Orange' points to an analyst.
n_outliers = (y_pred == -1).sum()
print(f"Flagged {n_outliers} transactions for review.")
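
Because this data is simulated, we happen to know which points were generated as fraud, so we can check how well the flags line up with them. The sketch below is one way to do that, not part of the original pipeline: it rebuilds a similar dataset with its own random generator (so the exact points differ from the cell above), and the names `y_true`, `y_flag`, and `top_k` are introduced here for illustration. It reports precision and recall, then ranks transactions by anomaly score so an analyst could review the most suspicious ones first.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Rebuild a dataset like the one above, this time keeping ground-truth labels.
X_normal = rng.normal(0, 1, size=(2000, 2))
candidates = rng.uniform(-4, 4, size=(200, 2))
X_fraud = candidates[np.linalg.norm(candidates, axis=1) > 2.5][:20]

X = np.vstack([X_normal, X_fraud])
y_true = np.r_[np.zeros(len(X_normal), dtype=int),
               np.ones(len(X_fraud), dtype=int)]  # 1 = simulated fraud

X_scaled = StandardScaler().fit_transform(X)
clf = IsolationForest(contamination=0.01, random_state=42).fit(X_scaled)

# Convert sklearn's convention (-1 = anomaly, 1 = inlier) to 1 = flagged.
y_flag = (clf.predict(X_scaled) == -1).astype(int)

precision = precision_score(y_true, y_flag)
recall = recall_score(y_true, y_flag)
print(f"Flagged: {y_flag.sum()}, precision: {precision:.2f}, recall: {recall:.2f}")

# Rank by anomaly score (lower = more anomalous) and take the 10 most
# suspicious transactions as a review queue.
scores = clf.score_samples(X_scaled)
top_k = np.argsort(scores)[:10]
print("Indices to review first:", top_k)
```

On real transaction data no such labels exist up front, so this kind of check is only possible on simulations or on the subset of cases that analysts have already confirmed.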