# Autoencoder for Anomaly Detection in CERN Data

This notebook walks through the complete workflow for training an autoencoder on Standard Model (SM) data and using it to detect Beyond the Standard Model (BSM) anomalies.

## Overview

- **Training**: The autoencoder learns to reconstruct SM (background) data only
- **Detection**: Events with high reconstruction error are flagged as candidate BSM (signal) anomalies
- **Application**: Search for bulk graviton signals in particle physics data
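
The idea in the bullets above can be sketched without a neural network. The example below is a minimal illustration, not the model used in this notebook: it swaps the autoencoder for PCA (a linear encoder/decoder fit with NumPy's SVD), fits it on toy "SM" events only, and scores events by mean squared reconstruction error. The toy data, the helper names `fit_linear_autoencoder` and `reconstruction_mse`, and all numbers are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "SM" events: 8 features that lie close to a 2-dimensional subspace.
n_features, n_latent = 8, 2
mixing = rng.normal(size=(n_latent, n_features))
sm_train = rng.normal(size=(2000, n_latent)) @ mixing + 0.05 * rng.normal(size=(2000, n_features))
sm_test = rng.normal(size=(500, n_latent)) @ mixing + 0.05 * rng.normal(size=(500, n_features))
# Toy "BSM" events: not confined to that subspace.
bsm_test = rng.normal(size=(500, n_features))


def fit_linear_autoencoder(x, latent_dim):
    """Fit PCA on x; return the mean and the top principal directions."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:latent_dim]


def reconstruction_mse(x, mean, components):
    """Encode, decode, and return the per-event mean squared error."""
    latent = (x - mean) @ components.T
    reconstructed = latent @ components + mean
    return np.mean((x - reconstructed) ** 2, axis=1)


# Fit on SM only, then score SM and BSM test events.
mean, components = fit_linear_autoencoder(sm_train, n_latent)
sm_error = reconstruction_mse(sm_test, mean, components)
bsm_error = reconstruction_mse(bsm_test, mean, components)

print(f"Median SM error:  {np.median(sm_error):.4f}")
print(f"Median BSM error: {np.median(bsm_error):.4f}")
```

Because the model only ever sees SM-like structure, events that do not share that structure reconstruct poorly, which is the property the rest of the notebook relies on.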

## 1. Setup and Imports

In [None]:
import os
import sys
import yaml
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add src to path
sys.path.append('../src')

from model import create_autoencoder, reconstruction_error
from data_utils import DataProcessor, generate_synthetic_data
from train import train_autoencoder
from evaluate import evaluate_anomaly_detection

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Setup complete!")

## 2. Load Configuration
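
The cell in this section reads five keys from `config.yaml`. The file itself is not shown in this notebook, so the sketch below only checks that a loaded config dictionary contains those keys before they are used. The helper name `check_config` and the example values are hypothetical; only the key names come from the cell that follows.

```python
REQUIRED_KEYS = {
    "model": ("input_dim", "encoding_dims", "latent_dim"),
    "training": ("batch_size", "epochs"),
}


def check_config(config):
    """Return a list of 'section.key' names missing from config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical values, for illustration only ('epochs' is left out on purpose).
example_config = {
    "model": {"input_dim": 20, "encoding_dims": [16, 8], "latent_dim": 4},
    "training": {"batch_size": 64},
}

print(check_config(example_config))  # ['training.epochs']
```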

In [None]:
# Load configuration
with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded:")
print(f"Input dimension: {config['model']['input_dim']}")
print(f"Encoding dimensions: {config['model']['encoding_dims']}")
print(f"Latent dimension: {config['model']['latent_dim']}")
print(f"Batch size: {config['training']['batch_size']}")
print(f"Epochs: {config['training']['epochs']}")

## 3. Generate/Load Data

For demonstration purposes, we'll generate synthetic data. In practice, replace this with your actual CERN data.
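
The implementation of `generate_synthetic_data` lives in the `data_utils` module and is not shown here. As a rough idea of what such a generator could do, the hypothetical stand-in below draws SM events from a standard normal distribution and BSM events from a shifted, wider one. The function name `make_toy_events`, the shift, and the scale are assumptions, and the real function may interpret `anomaly_ratio` differently.

```python
import numpy as np


def make_toy_events(n_samples, n_features, anomaly_ratio=0.1, random_state=None):
    """Hypothetical stand-in: return (sm, bsm) arrays of toy events."""
    rng = np.random.default_rng(random_state)
    n_bsm = int(round(n_samples * anomaly_ratio))
    n_sm = n_samples - n_bsm
    sm = rng.normal(loc=0.0, scale=1.0, size=(n_sm, n_features))
    # Assumed signal model: shifted mean and larger spread.
    bsm = rng.normal(loc=2.0, scale=1.5, size=(n_bsm, n_features))
    return sm, bsm


sm_toy, bsm_toy = make_toy_events(10000, 20, anomaly_ratio=0.1, random_state=42)
print(sm_toy.shape, bsm_toy.shape)  # (9000, 20) (1000, 20)
```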

In [None]:
# Generate synthetic data
sm_data, bsm_data = generate_synthetic_data(
    n_samples=10000,
    n_features=config['model']['input_dim'],
    anomaly_ratio=0.1,
    random_state=42
)

print(f"SM data shape: {sm_data.shape}")
print(f"BSM data shape: {bsm_data.shape}")

# Visualize data distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].hist(sm_data[:, 0], bins=50, alpha=0.7, label='SM', color='blue')
axes[0].hist(bsm_data[:, 0], bins=50, alpha=0.7, label='BSM', color='red')
axes[0].set_xlabel('Feature 0')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Feature Distribution: SM vs BSM')
axes[0].legend()

axes[1].hist(np.mean(sm_data, axis=1), bins=50, alpha=0.7, label='SM', color='blue')
axes[1].hist(np.mean(bsm_data, axis=1), bins=50, alpha=0.7, label='BSM', color='red')
axes[1].set_xlabel('Mean of all features')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Mean Feature Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

## 4. Train Autoencoder

Train the autoencoder on SM data only (normal/background data).
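
Training on SM data only also means that any preprocessing should be fit on SM training events only, so that no validation or BSM information leaks into the model. How `train_autoencoder` and `DataProcessor` handle this is not shown here; the sketch below is one plain NumPy way to do it, with a hypothetical helper name (`split_and_standardize`) and an assumed 80/20 split.

```python
import numpy as np


def split_and_standardize(sm, val_fraction=0.2, random_state=0):
    """Shuffle SM events, split them, and standardize with training statistics only."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(sm))
    n_val = int(len(sm) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]

    mean = sm[train_idx].mean(axis=0)
    std = sm[train_idx].std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant features

    def scale(x):
        return (x - mean) / std

    return scale(sm[train_idx]), scale(sm[val_idx]), scale


rng = np.random.default_rng(1)
sm_events = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # toy SM events
x_train, x_val, scale = split_and_standardize(sm_events)

# The same `scale` function would later be applied to BSM or test events.
print(x_train.shape, x_val.shape)  # (800, 4) (200, 4)
```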

In [None]:
# Train model
model, history = train_autoencoder(config, use_synthetic=True)

print("\nTraining completed!")

## 5. Visualize Training Progress

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Loss
axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss (MSE)')
axes[0].set_title('Training Progress: Loss')
axes[0].legend()
axes[0].grid(True)

# MAE (assumes the model was compiled with 'mae' as a metric)
axes[1].plot(history.history['mae'], label='Training MAE')
axes[1].plot(history.history['val_mae'], label='Validation MAE')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MAE')
axes[1].set_title('Training Progress: MAE')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

## 6. Evaluate Anomaly Detection

Test the model's ability to detect BSM anomalies.
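
The internals of `evaluate_anomaly_detection` are not shown here. A common way to turn reconstruction errors into a decision, sketched below under that caveat, is to set the threshold at a high percentile of the SM errors (which fixes the false positive rate) and to summarize the separation with the ROC AUC. The error arrays are toy numbers, and `threshold_at_percentile` and `roc_auc` are hypothetical helper names.

```python
import numpy as np


def threshold_at_percentile(sm_errors, percentile=95.0):
    """Threshold such that roughly (100 - percentile)% of SM events are flagged."""
    return np.percentile(sm_errors, percentile)


def roc_auc(sm_errors, bsm_errors):
    """Probability that a random BSM event scores above a random SM event (ties count half)."""
    sm_sorted = np.sort(sm_errors)
    below = np.searchsorted(sm_sorted, bsm_errors, side="left")
    at_or_below = np.searchsorted(sm_sorted, bsm_errors, side="right")
    wins = below + 0.5 * (at_or_below - below)
    return wins.sum() / (len(sm_errors) * len(bsm_errors))


rng = np.random.default_rng(0)
sm_errors = rng.gamma(shape=2.0, scale=0.05, size=5000)  # toy: small errors
bsm_errors = rng.gamma(shape=2.0, scale=0.20, size=500)  # toy: larger errors

threshold = threshold_at_percentile(sm_errors, 95.0)
false_positive_rate = np.mean(sm_errors > threshold)
signal_efficiency = np.mean(bsm_errors > threshold)

print(f"threshold={threshold:.3f}")
print(f"false positive rate={false_positive_rate:.3f}")
print(f"signal efficiency={signal_efficiency:.3f}")
print(f"ROC AUC={roc_auc(sm_errors, bsm_errors):.3f}")
```

Fixing the threshold from SM events alone keeps the procedure independent of any particular signal model, which matters when the BSM signal is not known in advance.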

In [None]:
# Evaluate anomaly detection
metrics = evaluate_anomaly_detection(config, use_synthetic=True)

print("\nEvaluation completed!")

## 7. Visualize Latent Space

Explore the learned latent representations.
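
The cell below plots only the first two latent coordinates. When `latent_dim` is larger than 2, structure in the remaining dimensions is hidden. One option, sketched here as an assumption rather than as part of the original workflow, is to project the latent codes onto their first two principal components before plotting. `project_to_2d` is a hypothetical helper, and the arrays are random stand-ins for `sm_encoded` and `bsm_encoded`.

```python
import numpy as np


def project_to_2d(reference, *others):
    """Fit PCA on `reference` and project it and every array in `others` to 2D."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    axes = vt[:2]
    return [(x - mean) @ axes.T for x in (reference, *others)]


rng = np.random.default_rng(0)
sm_codes = rng.normal(size=(1000, 8))           # stand-in for sm_encoded
bsm_codes = rng.normal(loc=1.0, size=(100, 8))  # stand-in for bsm_encoded

sm_2d, bsm_2d = project_to_2d(sm_codes, bsm_codes)
print(sm_2d.shape, bsm_2d.shape)  # (1000, 2) (100, 2)
```

Fitting the projection on the SM codes alone keeps the plot axes defined by the background, so BSM points are shown relative to it.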

In [None]:
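# Prepare data
# Note: this fits a new scaler on the first 1000 SM events from section 3.
# If train_autoencoder fitted its own scaler, these inputs may be scaled
# differently from the data the model was trained on.
data_processor = DataProcessor(config)
sm_sample = data_processor.preprocess_data(sm_data[:1000], fit_scaler=True)
# Reuse the scaler fitted on SM data; never refit on BSM events
bsm_sample = data_processor.preprocess_data(bsm_data[:100], fit_scaler=False)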

# Encode data
sm_encoded = model.encode(sm_sample).numpy()
bsm_encoded = model.encode(bsm_sample).numpy()

# Visualize (first 2 dimensions)
plt.figure(figsize=(10, 8))
plt.scatter(sm_encoded[:, 0], sm_encoded[:, 1], 
           alpha=0.5, label='SM', c='blue', s=30)
plt.scatter(bsm_encoded[:, 0], bsm_encoded[:, 1], 
           alpha=0.7, label='BSM', c='red', s=30)
plt.xlabel('Latent Dimension 1')
plt.ylabel('Latent Dimension 2')
plt.title('Latent Space Representation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 8. Example Reconstructions

Visualize actual vs reconstructed samples.

In [None]:
# Select samples
sm_samples = sm_sample[:5]
bsm_samples = bsm_sample[:5]

# Get reconstructions
sm_reconstructed = model.predict(sm_samples)
bsm_reconstructed = model.predict(bsm_samples)

# Plot
fig, axes = plt.subplots(2, 5, figsize=(20, 8))

for i in range(5):
    # SM samples
    axes[0, i].plot(sm_samples[i], label='Original', alpha=0.7)
    axes[0, i].plot(sm_reconstructed[i], label='Reconstructed', alpha=0.7)
    axes[0, i].set_title(f'SM Sample {i+1}')
    axes[0, i].legend()
    axes[0, i].grid(True, alpha=0.3)
    
    # BSM samples
    axes[1, i].plot(bsm_samples[i], label='Original', alpha=0.7)
    axes[1, i].plot(bsm_reconstructed[i], label='Reconstructed', alpha=0.7)
    axes[1, i].set_title(f'BSM Sample {i+1}')
    axes[1, i].legend()
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Summary

If training works as intended, the autoencoder:
1. Learns to reconstruct SM (background) data with low error
2. Produces high reconstruction error for BSM (signal) data
3. Enables anomaly detection by thresholding reconstruction error

### Next Steps
- Replace synthetic data with actual CERN particle physics data
- Tune hyperparameters for optimal performance
- Experiment with different architectures (VAE, convolutional layers, etc.)
- Apply to bulk graviton signal detection