# Deep Learning for IAM Anomaly Detection

This notebook demonstrates advanced deep learning techniques for Identity and Access Management (IAM) anomaly detection:

1. **LSTM Networks** - For sequential access pattern analysis
2. **Transformer Models** - For attention-based anomaly detection
3. **Performance Comparison** - Against a traditional Isolation Forest baseline
4. **Visualization** - Attention mechanisms and temporal patterns

## When to Use Deep Learning?

- **LSTM**: Multi-step attack patterns, temporal sequences
- **Transformer**: Feature importance, context-aware detection
- **Traditional ML**: Quick baseline, limited data, interpretability

In [None]:
import sys
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add parent directory to path
sys.path.append('..')

from src.data.generators import IAMDataGenerator
from src.data.preprocessors import IAMDataPreprocessor
from src.models.anomaly_detector import AnomalyDetector
from src.models.lstm_detector import LSTMDetector
from src.models.transformer_detector import TransformerDetector

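# Seed NumPy's global RNG. This does not seed the deep learning framework
# behind the detectors, so training runs may still vary slightly.
np.random.seed(42)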

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Libraries imported successfully!")

## 1. Generate Synthetic IAM Data

Create realistic access logs with both normal and anomalous patterns.

In [None]:
print("Generating synthetic IAM data...")

generator = IAMDataGenerator()
df = generator.generate_complete_dataset(
    num_users=100,
    normal_events_per_user=100,
    anomaly_percentage=0.10
)

print(f"\nGenerated {len(df)} access events")
print(f"Anomaly rate: {df['is_anomaly'].mean():.2%}")
print(f"\nColumns: {list(df.columns)}")

df.head()

## 2. Preprocess Data

Convert raw access logs into ML-ready features.
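
As a rough illustration of what "ML-ready" can mean here, the sketch below turns two raw log fields into numeric features. The column names (`timestamp`, `action`) and the chosen features are assumptions made for this example; the actual `IAMDataPreprocessor` may compute something different.

```python
import numpy as np
import pandas as pd

# Hypothetical raw access log; column names are assumptions, not the generator's schema
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-06 03:15:00",  # Saturday, off-hours
        "2024-01-08 09:30:00",  # Monday, business hours
        "2024-01-08 14:00:00",
    ]),
    "action": ["delete", "read", "write"],
})

features = pd.DataFrame(index=raw.index)

# Cyclic encoding keeps 23:00 and 00:00 close together in feature space
hour = raw["timestamp"].dt.hour + raw["timestamp"].dt.minute / 60.0
features["hour_sin"] = np.sin(2 * np.pi * hour / 24)
features["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# pandas numbers Monday as 0, so Saturday is 5 and Sunday is 6
features["is_weekend"] = (raw["timestamp"].dt.dayofweek >= 5).astype(int)

# One-hot encode the categorical action
features = features.join(pd.get_dummies(raw["action"], prefix="action", dtype=int))

print(features.round(3))
```

The sine/cosine pair is used instead of the raw hour so that a model does not treat midnight and 23:00 as far apart.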

In [None]:
print("Preprocessing data...")

preprocessor = IAMDataPreprocessor()
df_processed = preprocessor.preprocess_for_training(df)

# Get features
feature_cols = preprocessor.get_feature_columns()
X = df_processed[feature_cols].values
y = df_processed['is_anomaly'].values

print(f"\nFeature matrix shape: {X.shape}")
print(f"Features ({len(feature_cols)}): {feature_cols[:10]}...")
print(f"\nClass distribution:")
print(f"  Normal: {(y == 0).sum()} ({(y == 0).mean():.1%})")
print(f"  Anomaly: {(y == 1).sum()} ({(y == 1).mean():.1%})")

## 3. Train/Test Split

Split data temporally to simulate real-world deployment.
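
Slicing by row position is only a temporal split if the rows are already in chronological order. A minimal sketch of enforcing that before splitting, assuming a hypothetical `timestamp` column (the real column name in this dataset may differ):

```python
import pandas as pd

# Hypothetical event log, deliberately out of order; `timestamp` is an assumed column name
events = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-03", "2024-01-01", "2024-01-05", "2024-01-02", "2024-01-04"]
    ),
    "is_anomaly": [0, 0, 1, 0, 1],
})

# Sort chronologically so that slicing by position is also slicing by time
events = events.sort_values("timestamp", kind="stable").reset_index(drop=True)

split_idx = int(0.8 * len(events))
train, test = events.iloc[:split_idx], events.iloc[split_idx:]

# Every training event precedes every test event, so no future data leaks into training
assert train["timestamp"].max() <= test["timestamp"].min()
print(f"train: {len(train)} events, test: {len(test)} events")
```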

In [None]:
# Temporal split: the first 80% of rows train the models, the last 20% test them.
# This is only a true temporal split if df_processed is in chronological order.
split_idx = int(0.8 * len(X))

X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTest set anomaly rate: {y_test.mean():.2%}")

## 4. Baseline: Isolation Forest

Train traditional ML model as baseline for comparison.
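
For readers unfamiliar with the algorithm, here is a small stand-alone sketch using scikit-learn's `IsolationForest` directly on synthetic 2-D data. It is not the project's `AnomalyDetector` wrapper, whose internals are not shown in this notebook; it only illustrates the scikit-learn conventions (`-1` for anomalies, lower `score_samples` for more anomalous points) that the cells below appear to rely on.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data: a tight normal cluster plus a few scattered, far-away outliers
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_outliers = rng.uniform(low=6.0, high=12.0, size=(10, 2)) * rng.choice([-1.0, 1.0], size=(10, 2))
X_demo = np.vstack([X_normal, X_outliers])

forest = IsolationForest(contamination=0.05, random_state=42)
forest.fit(X_demo)

# scikit-learn convention: predict() returns -1 for anomalies and +1 for inliers
labels = forest.predict(X_demo)
is_anomaly = (labels == -1).astype(int)

# score_samples() is lower for more anomalous points; negate it so higher means worse
anomaly_score = -forest.score_samples(X_demo)

print(f"Flagged {is_anomaly.sum()} of {len(X_demo)} points")
print(f"Mean score, normal:   {anomaly_score[:200].mean():.3f}")
print(f"Mean score, outliers: {anomaly_score[200:].mean():.3f}")
```

`contamination` sets the fraction of training points the model will flag, so it acts as a threshold on the score rather than something learned from labels.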

In [None]:
print("Training Isolation Forest (baseline)...")

baseline = AnomalyDetector(
    algorithm='isolation_forest',
    contamination=0.10
)

baseline.train(X_train, feature_names=feature_cols)

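# Evaluate: predict() is assumed to follow the scikit-learn convention of
# returning -1 for anomalies, which is mapped to 1 to match y_test
y_pred_baseline = (baseline.predict(X_test) == -1).astype(int)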

from sklearn.metrics import classification_report, confusion_matrix

print("\nBaseline Performance (Isolation Forest):")
print(classification_report(y_test, y_pred_baseline, target_names=['Normal', 'Anomaly']))

## 5. LSTM Detector for Sequential Patterns

Train LSTM to detect multi-step attack patterns in access sequences.
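
An LSTM consumes windows of consecutive events rather than single rows. The sketch below shows one common way to build such windows with NumPy and why the labels lose their first `sequence_length - 1` entries. It assumes each window is labeled by its last event; how `LSTMDetector` builds its sequences internally is not shown in this notebook, and `make_sequences` is a hypothetical helper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_sequences(X, y, sequence_length):
    """Build overlapping windows and label each with its last event's label."""
    # sliding_window_view puts the window axis last: (n_windows, n_features, sequence_length)
    windows = sliding_window_view(X, window_shape=sequence_length, axis=0)
    # Reorder to the (n_windows, sequence_length, n_features) layout recurrent layers expect
    X_seq = windows.transpose(0, 2, 1)
    # The first full window ends at index sequence_length - 1
    y_seq = y[sequence_length - 1:]
    return X_seq, y_seq

X_demo = np.arange(24, dtype=float).reshape(12, 2)  # 12 events, 2 features
y_demo = np.array([0] * 11 + [1])

X_seq, y_seq = make_sequences(X_demo, y_demo, sequence_length=10)
print(X_seq.shape, y_seq.shape)  # (3, 10, 2) (3,)
```

With 12 events and a window of 10 there are only 3 windows, which is why the evaluation cell trims `y_test` before scoring.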

In [None]:
print("Training LSTM Detector...\n")

lstm = LSTMDetector(
    sequence_length=10,
    n_features=X_train.shape[1],
    lstm_units=[64, 32],
    dropout_rate=0.2,
    threshold=0.5
)

lstm.train(
    X_train,
    y_train,
    feature_names=feature_cols,
    validation_split=0.2,
    epochs=30,
    batch_size=32,
    verbose=1
)

print("\nLSTM training complete!")

### LSTM Training History

In [None]:
history_lstm = lstm.get_training_history()

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# One panel per metric, each showing training vs. validation curves
metrics = ['loss', 'accuracy', 'precision', 'recall']
for ax, metric in zip(axes.flat, metrics):
    ax.plot(history_lstm[metric], label='Training')
    ax.plot(history_lstm[f'val_{metric}'], label='Validation')
    ax.set_title(f'LSTM {metric.capitalize()} Over Epochs')
    ax.set_xlabel('Epoch')
    ax.set_ylabel(metric.capitalize())
    ax.legend()
    ax.grid(True)

plt.tight_layout()
plt.show()

### LSTM Evaluation

In [None]:
print("Evaluating LSTM on test set...")

# Get predictions
y_pred_lstm = lstm.predict_classes(X_test)

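# The LSTM needs a full window before its first prediction, so it returns
# sequence_length - 1 fewer predictions than there are test events.
# Drop the same number of leading labels to keep them aligned.
y_test_lstm = y_test[lstm.sequence_length - 1:]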

print("\nLSTM Performance:")
print(classification_report(y_test_lstm, y_pred_lstm, target_names=['Normal', 'Anomaly']))

# Confusion matrix
cm_lstm = confusion_matrix(y_test_lstm, y_pred_lstm)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_lstm, annot=True, fmt='d', cmap='Blues')
plt.title('LSTM Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

## 6. Transformer Detector for Feature Analysis

Train Transformer with attention mechanism for interpretable anomaly detection.
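
The core operation behind the attention mechanism is scaled dot-product attention. The NumPy sketch below is a generic single-head version without learned projections or masking; it is not the architecture of `TransformerDetector`, which is not shown in this notebook. The token count and embedding size are arbitrary values chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for single-head attention without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8  # e.g. 5 feature tokens with 8-dimensional embeddings
tokens = rng.normal(size=(n_tokens, d_model))

# Self-attention: queries, keys and values all come from the same tokens
# (a real layer would first apply learned projections, omitted here)
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)

print(weights.shape)        # (5, 5)
print(weights.sum(axis=1))  # each row sums to 1
```

Row `i` of `weights` says how much token `i` draws on every other token, which is the quantity attention-based importance plots are usually derived from.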

In [None]:
print("Training Transformer Detector...\n")

transformer = TransformerDetector(
    n_features=X_train.shape[1],
    embed_dim=32,
    num_heads=4,
    ff_dim=64,
    dropout_rate=0.1,
    threshold=0.5
)

transformer.train(
    X_train,
    y_train,
    feature_names=feature_cols,
    validation_split=0.2,
    epochs=30,
    batch_size=32,
    verbose=1
)

print("\nTransformer training complete!")

### Transformer Training History

In [None]:
history_transformer = transformer.get_training_history()

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# One panel per metric, each showing training vs. validation curves
metrics = ['loss', 'accuracy', 'precision', 'recall']
for ax, metric in zip(axes.flat, metrics):
    ax.plot(history_transformer[metric], label='Training')
    ax.plot(history_transformer[f'val_{metric}'], label='Validation')
    ax.set_title(f'Transformer {metric.capitalize()} Over Epochs')
    ax.set_xlabel('Epoch')
    ax.set_ylabel(metric.capitalize())
    ax.legend()
    ax.grid(True)

plt.tight_layout()
plt.show()

### Transformer Evaluation

In [None]:
print("Evaluating Transformer on test set...")

# Get predictions
y_pred_transformer = transformer.predict_classes(X_test)

print("\nTransformer Performance:")
print(classification_report(y_test, y_pred_transformer, target_names=['Normal', 'Anomaly']))

# Confusion matrix
cm_transformer = confusion_matrix(y_test, y_pred_transformer)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_transformer, annot=True, fmt='d', cmap='Greens')
plt.title('Transformer Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

## 7. Model Comparison

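Compare all three approaches: Isolation Forest, LSTM, and Transformer. Note that the LSTM is scored on `sequence_length - 1` fewer events than the other two models, because it needs a full window before its first prediction. Its metrics are therefore computed on a slightly different subset of the test set.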
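
One way to make the comparison like-for-like is to score every model only on the events for which a sequence model can produce a prediction. The sketch below does this on toy arrays and computes precision, recall and F1 by hand; the arrays and the `binary_metrics` helper are illustrative and not part of the project code.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive (anomaly) class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

sequence_length = 3
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
pred_pointwise = np.array([1, 0, 1, 1, 0, 0, 0, 0])  # one prediction per event
pred_sequence = np.array([0, 1, 0, 1, 0, 0])         # one prediction per window

# Drop the first sequence_length - 1 events so both models are scored on the same ones
offset = sequence_length - 1
y_aligned = y_true[offset:]

print(binary_metrics(y_aligned, pred_pointwise[offset:]))  # (0.5, 0.5, 0.5)
print(binary_metrics(y_aligned, pred_sequence))            # (1.0, 1.0, 1.0)
```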

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Calculate metrics
models = {
    'Isolation Forest': {
        'y_true': y_test,
        'y_pred': y_pred_baseline
    },
    'LSTM': {
        'y_true': y_test_lstm,
        'y_pred': y_pred_lstm
    },
    'Transformer': {
        'y_true': y_test,
        'y_pred': y_pred_transformer
    }
}

comparison = []
for name, data in models.items():
    comparison.append({
        'Model': name,
        'Accuracy': accuracy_score(data['y_true'], data['y_pred']),
        'Precision': precision_score(data['y_true'], data['y_pred']),
        'Recall': recall_score(data['y_true'], data['y_pred']),
        'F1-Score': f1_score(data['y_true'], data['y_pred'])
    })

df_comparison = pd.DataFrame(comparison)
print("\nModel Comparison:")
print(df_comparison.to_string(index=False))

# Visualize comparison
df_comparison_plot = df_comparison.set_index('Model')
df_comparison_plot.plot(kind='bar', figsize=(12, 6), rot=0)
plt.title('Model Performance Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 8. Attention Visualization (Transformer)

Visualize which features the Transformer pays attention to.

In [None]:
print("Analyzing feature importance via Transformer attention...")

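# Pick the first anomalous sample in the test set
anomaly_indices = np.where(y_test == 1)[0]
if len(anomaly_indices) == 0:
    raise ValueError("No anomalous samples in the test set to analyze")
sample_idx = anomaly_indices[0]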

# Analyze sample
sample = X_test[sample_idx]
result = transformer.analyze_event(sample)

print(f"\nSample Analysis:")
print(f"  Anomaly: {result['is_anomaly']}")
print(f"  Probability: {result['anomaly_probability']:.4f}")
print(f"  Risk Level: {result['risk_level']}")

if 'top_features' in result:
    print(f"\n  Top Contributing Features:")
    for feat in result['top_features']:
        print(f"    - {feat['feature']}: {feat['value']:.3f} (importance: {feat['importance']:.3f})")
    
    # Visualize
    features = [f['feature'] for f in result['top_features']]
    importances = [f['importance'] for f in result['top_features']]
    
    plt.figure(figsize=(10, 6))
    plt.barh(features, importances, color='steelblue')
    plt.xlabel('Importance Score')
    plt.title('Top Features Contributing to Anomaly Detection')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

## 9. Multi-Step Attack Pattern Detection (LSTM)

Demonstrate LSTM's ability to detect attack sequences.
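
To give a feel for what a "suspicious sequence" could look like, the sketch below flags runs of consecutive events whose anomaly score stays above a threshold. This is only one plausible heuristic and an assumption on our part; `find_suspicious_runs` is a hypothetical helper, and the logic inside `lstm.detect_attack_patterns` is not shown in this notebook.

```python
def find_suspicious_runs(scores, threshold=0.5, min_length=3):
    """Return (start, end) index pairs for runs of consecutive scores above threshold."""
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence
    if start is not None and len(scores) - start >= min_length:
        runs.append((start, len(scores) - 1))
    return runs

scores = [0.1, 0.9, 0.8, 0.7, 0.2, 0.9, 0.9, 0.1]
print(find_suspicious_runs(scores))                # [(1, 3)]
print(find_suspicious_runs(scores, min_length=2))  # [(1, 3), (5, 6)]
```

In practice such a check would be run per user, so that events from different accounts are not merged into one run.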

In [None]:
print("Detecting multi-step attack patterns with LSTM...")

# Get attack patterns
user_ids = df_processed['user_id'].values[split_idx:]
attack_patterns = lstm.detect_attack_patterns(X_test, user_ids)

print(f"\nDetected {len(attack_patterns)} suspicious sequences")
print("\nTop 5 Attack Patterns:")
print(attack_patterns.head().to_string(index=False))

# Visualize attack pattern distribution
if len(attack_patterns) > 0:
    pattern_counts = attack_patterns['pattern_type'].value_counts()
    
    plt.figure(figsize=(10, 6))
    pattern_counts.plot(kind='barh', color='coral')
    plt.title('Detected Attack Pattern Types', fontsize=14, fontweight='bold')
    plt.xlabel('Count')
    plt.ylabel('Pattern Type')
    plt.tight_layout()
    plt.show()

## 10. Real-Time Anomaly Scoring

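Compare how each model's anomaly scores are distributed for normal and anomalous events. The LSTM and Transformer output probabilities, while the Isolation Forest outputs an unbounded score that is rescaled to [0, 1] for plotting.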
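
Min-max scaling maps raw scores to [0, 1] but divides by zero when all scores are identical. A minimal guarded version, written as a hypothetical `min_max_scale` helper:

```python
import numpy as np

def min_max_scale(scores):
    """Rescale scores to [0, 1]; return zeros if all scores are identical."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

raw_scores = np.array([0.35, 0.40, 0.55, 0.75])
print(min_max_scale(raw_scores))
print(min_max_scale(np.array([0.5, 0.5, 0.5])))
```

Because the minimum and maximum come from the data being scaled, the result is useful for plotting but is not a calibrated probability, and scaled values from different datasets are not directly comparable.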

In [None]:
print("Generating anomaly probability distributions...")

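# Get anomaly scores. The baseline's score_samples is negated so that higher
# means more anomalous; unlike the deep models' outputs it is not a probability.
prob_baseline = -baseline.score_samples(X_test)
# np.ravel flattens a possible (n, 1) model output to shape (n,)
prob_lstm = np.ravel(lstm.predict(X_test))
prob_transformer = np.ravel(transformer.predict(X_test))

# Min-max scale baseline scores to [0, 1] for plotting only
prob_baseline = (prob_baseline - prob_baseline.min()) / (prob_baseline.max() - prob_baseline.min())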

# Plot distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Isolation Forest
axes[0].hist(prob_baseline[y_test == 0], bins=30, alpha=0.6, label='Normal', color='green')
axes[0].hist(prob_baseline[y_test == 1], bins=30, alpha=0.6, label='Anomaly', color='red')
axes[0].set_title('Isolation Forest')
axes[0].set_xlabel('Anomaly Score')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# LSTM
axes[1].hist(prob_lstm[y_test_lstm == 0], bins=30, alpha=0.6, label='Normal', color='green')
axes[1].hist(prob_lstm[y_test_lstm == 1], bins=30, alpha=0.6, label='Anomaly', color='red')
axes[1].set_title('LSTM')
axes[1].set_xlabel('Anomaly Probability')
axes[1].set_ylabel('Frequency')
axes[1].legend()

# Transformer
axes[2].hist(prob_transformer[y_test == 0], bins=30, alpha=0.6, label='Normal', color='green')
axes[2].hist(prob_transformer[y_test == 1], bins=30, alpha=0.6, label='Anomaly', color='red')
axes[2].set_title('Transformer')
axes[2].set_xlabel('Anomaly Probability')
axes[2].set_ylabel('Frequency')
axes[2].legend()

plt.tight_layout()
plt.show()

## Summary and Recommendations

### Key Findings:

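1. **Deep Learning Advantages**:
   - LSTM is designed to capture temporal attack patterns that span several events
   - Transformer attention weights give a per-feature view of what influenced a prediction
   - Both target complex patterns that a point-wise model such as Isolation Forest can miss; check the comparison in Section 7 for the measured difference on your data

2. **Trade-offs**:
   - Training time: Isolation Forest typically trains in seconds, the deep learning models in minutes
   - Data requirements: Deep learning needs more samples, and both deep models here are trained with labels (`y_train`), while the Isolation Forest is fit on features alone
   - Interpretability: Transformer > Isolation Forest > LSTM, as a rough ordering for the models in this notebook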

3. **Use Case Recommendations**:
   - **Quick Baseline**: Isolation Forest
   - **Multi-step Attacks**: LSTM
   - **Feature Analysis**: Transformer
   - **Production**: Ensemble of all three

### Next Steps:

1. Deploy models via FastAPI
2. Implement ensemble voting
3. Add real-time streaming
4. Integrate with SIEM
5. Continuous retraining pipeline
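
As a starting point for the ensemble voting step, here is a minimal majority-vote sketch over binary predictions. The prediction arrays are made up for the example and `majority_vote` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def majority_vote(*predictions):
    """Flag an event when more than half of the models flag it."""
    stacked = np.vstack(predictions)  # shape: (n_models, n_events)
    votes = stacked.sum(axis=0)
    return (votes * 2 > stacked.shape[0]).astype(int)

# Hypothetical binary predictions from three models on the same five events
pred_forest = np.array([1, 0, 0, 1, 0])
pred_lstm = np.array([1, 1, 0, 0, 0])
pred_transformer = np.array([1, 0, 0, 1, 1])

print(majority_vote(pred_forest, pred_lstm, pred_transformer))  # [1 0 0 1 0]
```

All inputs must refer to the same events. Since the LSTM produces `sequence_length - 1` fewer predictions, the other models' outputs would need to be trimmed by that amount before voting.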

## Save Models for Production

In [None]:
import os

# Create model directory
os.makedirs('../models/trained/deep_learning', exist_ok=True)

# Save LSTM
lstm.save('../models/trained/deep_learning/lstm')

# Save Transformer
transformer.save('../models/trained/deep_learning/transformer')

# Save baseline for comparison
baseline.save('../models/trained/anomaly_detector_baseline.joblib')

print("All models saved successfully!")