# Anomaly Detection Basics
**CONFIDENTIAL - ZeTheta Algorithms Pvt Ltd**

## Learning Objectives
1. Understand what anomalies are
2. Implement a basic Isolation Forest model
3. Visualize anomalies
4. Evaluate detection performance

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

print("✓ Libraries imported successfully")

## Step 1: Generate Sample Transaction Data

In [None]:
# Create sample dataset
np.random.seed(42)

# Normal transactions
normal_transactions = np.random.randn(1000, 2) * 20 + 100

# Anomalous transactions (outliers)
anomalies = np.random.randn(50, 2) * 50 + 200

# Combine
X = np.vstack([normal_transactions, anomalies])

print(f"Total transactions: {len(X)}")
print(f"Normal: {len(normal_transactions)}")
print(f"Anomalies: {len(anomalies)}")

## Step 2: Train Isolation Forest

In [None]:
# Create and train model
clf = IsolationForest(
    contamination=0.05,  # Expected proportion of outliers
    random_state=42
)

# Fit model
clf.fit(X)

# Predict (-1 for anomalies, 1 for normal)
predictions = clf.predict(X)

print("✓ Model trained")
print(f"Detected anomalies: {(predictions == -1).sum()}")

## Step 3: Visualize Results

In [None]:
# Plot
plt.figure(figsize=(10, 6))

# Normal points
normal_mask = predictions == 1
plt.scatter(X[normal_mask, 0], X[normal_mask, 1], 
           c='blue', label='Normal', alpha=0.5)

# Anomalies
anomaly_mask = predictions == -1
plt.scatter(X[anomaly_mask, 0], X[anomaly_mask, 1], 
           c='red', label='Anomaly', alpha=0.8, marker='x', s=100)

plt.xlabel('Feature 1 (e.g., Transaction Amount)')
plt.ylabel('Feature 2 (e.g., Transaction Frequency)')
plt.title('Anomaly Detection with Isolation Forest')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("✓ Visualization complete")
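
Because the sample data is synthetic, the true labels are known: the first 1000 rows are normal and the last 50 are anomalies. That makes it possible to score the detector. The sketch below rebuilds the Step 1 data, then computes precision, recall, and a confusion matrix with `sklearn.metrics`. The `y_true` array and the choice of `-1` as the positive class are assumptions made for this illustration. Real fraud data rarely comes with complete labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Rebuild the same synthetic data as in Step 1
np.random.seed(42)
normal_transactions = np.random.randn(1000, 2) * 20 + 100
anomalies = np.random.randn(50, 2) * 50 + 200
X = np.vstack([normal_transactions, anomalies])

# Ground truth in the same convention as predict(): 1 = normal, -1 = anomaly
# (assumption: rows keep the order in which they were stacked)
y_true = np.concatenate([
    np.ones(len(normal_transactions), dtype=int),
    -np.ones(len(anomalies), dtype=int),
])

clf = IsolationForest(contamination=0.05, random_state=42)
predictions = clf.fit_predict(X)

# Treat the anomaly label (-1) as the positive class
precision = precision_score(y_true, predictions, pos_label=-1)
recall = recall_score(y_true, predictions, pos_label=-1)

# Rows are true labels, columns are predicted labels, ordered [-1, 1]
cm = confusion_matrix(y_true, predictions, labels=[-1, 1])

print(f"Precision (anomaly class): {precision:.3f}")
print(f"Recall (anomaly class): {recall:.3f}")
print("Confusion matrix (rows=true, cols=predicted, order=[-1, 1]):")
print(cm)
```

The two anomaly clusters overlap, so some generated anomalies fall inside the normal range and will be missed. Recall below 1.0 is expected here.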

## Key Takeaways

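1. **Isolation Forest** isolates points with random splits; anomalies need fewer splits to separate, so they get shorter average path lengths
2. **Contamination parameter** sets the expected proportion of anomalies, which determines the score threshold used by `predict`
3. **No labels needed** - training is unsupervised
4. **Fast** - each tree is built on a small random subsample, so training scales well to large, multi-feature fraud datasets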
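
The first two takeaways can be checked directly on the fitted model. In scikit-learn, `score_samples` returns an anomaly score where lower means more anomalous, `offset_` is the threshold derived from `contamination`, and `decision_function` is the score minus that offset. The sketch below rebuilds the Step 1 data and shows that `predict` flags exactly the points whose decision value is negative. The variable names `rebuilt` and `flagged_fraction` are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Rebuild the same synthetic data as in Step 1
np.random.seed(42)
normal_transactions = np.random.randn(1000, 2) * 20 + 100
anomalies = np.random.randn(50, 2) * 50 + 200
X = np.vstack([normal_transactions, anomalies])

clf = IsolationForest(contamination=0.05, random_state=42)
clf.fit(X)

# Raw anomaly scores: lower values mean "easier to isolate"
scores = clf.score_samples(X)

# Scores shifted by the contamination-based threshold
decision = clf.decision_function(X)
predictions = clf.predict(X)

# Reconstruct predict() by hand: negative decision value means anomaly
rebuilt = np.where(decision < 0, -1, 1)
flagged_fraction = (predictions == -1).mean()

print(f"Mean score, generated normal rows:  {scores[:1000].mean():.3f}")
print(f"Mean score, generated anomaly rows: {scores[1000:].mean():.3f}")
print(f"Threshold (offset_): {clf.offset_:.3f}")
print(f"Fraction flagged: {flagged_fraction:.3f}")
```

The flagged fraction stays close to the `contamination` value no matter how many true anomalies the data contains, so the parameter should reflect a realistic estimate.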

## Next Steps
- Try different contamination values
- Test on real fraud datasets
- Compare with other algorithms (LOF, One-Class SVM)
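
As a starting point for the first and third items, here is a minimal sketch that sweeps a few contamination values and then runs Local Outlier Factor (LOF) on the same synthetic data. The contamination values, `n_neighbors=20`, and the `flag_counts` dictionary are choices made for this illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Rebuild the same synthetic data as in Step 1
np.random.seed(42)
normal_transactions = np.random.randn(1000, 2) * 20 + 100
anomalies = np.random.randn(50, 2) * 50 + 200
X = np.vstack([normal_transactions, anomalies])

# Sweep contamination: the forest is identical (same random_state),
# only the score threshold moves
flag_counts = {}
for contamination in (0.01, 0.05, 0.10):
    iso = IsolationForest(contamination=contamination, random_state=42)
    iso_pred = iso.fit_predict(X)
    flag_counts[contamination] = int((iso_pred == -1).sum())
    print(f"contamination={contamination:.2f} -> "
          f"{flag_counts[contamination]} flagged")

# LOF scores each point by its local density relative to its neighbors
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_pred = lof.fit_predict(X)

# Compare which of the generated anomalies (last 50 rows) each method flags
iso_pred = IsolationForest(contamination=0.05, random_state=42).fit_predict(X)
iso_hits = int((iso_pred[1000:] == -1).sum())
lof_hits = int((lof_pred[1000:] == -1).sum())
print(f"Generated anomalies flagged by Isolation Forest: {iso_hits} of 50")
print(f"Generated anomalies flagged by LOF: {lof_hits} of 50")
```

One-Class SVM can be slotted into the same comparison through `sklearn.svm.OneClassSVM`, which follows the same `-1` / `1` prediction convention.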