# **Variational AutoEncoder for Anomaly Detection**

## Premise
The Variational Autoencoder (VAE) is a generative model that learns to encode data into a lower-dimensional latent space and then decode it back to reconstruct the original data. For anomaly detection, we exploit the fact that a VAE trained on mostly normal data learns the normal data distribution, so anomalies are reconstructed poorly and receive higher reconstruction errors.

## Core Concepts with Simple Example
Let's use a simple 2D point dataset to illustrate each step.

Example Dataset:
```python
# Normal data: points clustered within about 1.5 units of the origin
normal_points = [
    (1.1, 0.8), (0.9, 1.2), (-0.8, 1.1), (-1.1, -0.9),
    (0.8, -1.1), (0.1, 0.2), (-0.2, -0.1)
]
# Anomaly: Point far from the pattern
anomaly = (4.0, 4.0)
```

## Derivation of the Loss Function Used in a Variational AutoEncoder

### 1. Bayes' Theorem:
$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$

*Example Interpretation:*
- x: Our 2D point (1.1, 0.8)
- z: Lower-dimensional latent representation (e.g., a single number)
- p(z|x): Probability of latent code given our point
- p(x|z): Probability of reconstructing our point from the latent code
- p(z): Prior beliefs about latent codes (standard normal distribution)
- p(x): Total probability of observing the point

### 2. Log of Marginal Likelihood:
$\log p(x) = \log \int p(x|z)p(z) \, dz$

*Example:*
For our point (1.1, 0.8), we'd need to:
1. Consider all possible latent codes z
2. For each z, multiply:
   - Probability of reconstructing (1.1, 0.8) from z
   - Probability of that z occurring
3. Sum all these products (integral)
4. Take the log
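
The four steps above can be approximated numerically by replacing the integral with an average over samples drawn from the prior. The sketch below is only an illustration: `toy_decoder` is a hypothetical linear decoder and the likelihood variance is assumed to be 1, neither of which comes from a trained model.

```python
import numpy as np

def toy_decoder(z):
    """Hypothetical decoder f(z): maps scalar latent codes to 2D points."""
    return np.stack([1.0 * z, 0.8 * z], axis=-1)

def log_gaussian(x, mean, var):
    """Log density of an isotropic Gaussian N(x; mean, var * I)."""
    n = x.shape[-1]
    sq_err = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * sq_err / var - 0.5 * n * np.log(2 * np.pi * var)

def log_marginal_likelihood(x, n_samples=100_000, var=1.0, seed=0):
    """Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x|z)]."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)             # steps 1-2: z drawn from the prior
    log_px_given_z = log_gaussian(x, toy_decoder(z), var)
    m = log_px_given_z.max()                       # steps 3-4: stable log of the average
    return m + np.log(np.mean(np.exp(log_px_given_z - m)))

x = np.array([1.1, 0.8])
anomaly = np.array([4.0, 4.0])
print(log_marginal_likelihood(x), log_marginal_likelihood(anomaly))
```

Under this toy decoder the anomaly receives a much lower log-likelihood than the normal point. In a real VAE the decoder is a neural network and this brute-force average becomes impractical, which is what motivates the variational approximation that follows.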

### 3. Variational Approximation ($q(z|x)$):
Introduce a simpler distribution $q(z|x)$ to approximate the posterior $p(z|x)$:
$\log p(x) = \log \int p(x|z)p(z) \frac{q(z|x)}{q(z|x)} \, dz$

*Example:*
Instead of computing the exact posterior, we use a neural network (encoder) to predict the parameters of $q(z|x)$:
- Mean (μ) and variance (σ²) of a Gaussian distribution over the latent code z for our point (1.1, 0.8)
- e.g., μ = 0.5, σ² = 0.1

### 4. Jensen's Inequality (Lower Bound):
$$\begin{aligned}
\log p(x) &= \log \mathbb{E}_{q(z|x)}\left[\frac{p(x|z)p(z)}{q(z|x)}\right] \\
&\geq \int q(z|x) \log \frac{p(x|z)p(z)}{q(z|x)} \, dz \\
&= \mathbb{E}_{q(z|x)} [ \log p(x|z) + \log p(z) - \log q(z|x) ]
\end{aligned}$$

*Example:*
For our point, this means:
1. Sample latent codes from our predicted distribution (μ = 0.5, σ² = 0.1)
2. For each sample:
   - Reconstruct the point
   - Calculate the log reconstruction probability, log p(x|z)
   - Add the log prior probability, log p(z)
   - Subtract the log encoding probability, log q(z|x)
3. Average these values
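
The sampling procedure above can be sketched as a Monte Carlo estimate of the lower bound. This is a minimal illustration under stated assumptions: `toy_decoder` is a hypothetical linear decoder, the likelihood variance is set to 1, and the encoder outputs (μ = 0.5, σ² = 0.1) are the example values used in this section rather than the output of a trained network.

```python
import numpy as np

def toy_decoder(z):
    """Hypothetical decoder f(z): maps scalar latent codes to 2D points."""
    return np.stack([1.0 * z, 0.8 * z], axis=-1)

def log_normal(v, mean, var):
    """Elementwise log density of a univariate Gaussian N(v; mean, var)."""
    return -0.5 * (v - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def elbo_estimate(x, mu_q, var_q, lik_var=1.0, n_samples=100_000, seed=0):
    """Average of log p(x|z) + log p(z) - log q(z|x) over samples z ~ q(z|x)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_q, np.sqrt(var_q), size=n_samples)             # step 1
    log_lik = log_normal(x, toy_decoder(z), lik_var).sum(axis=-1)    # reconstruct and score
    log_prior = log_normal(z, 0.0, 1.0)                              # add prior term
    log_q = log_normal(z, mu_q, var_q)                               # subtract encoding term
    return np.mean(log_lik + log_prior - log_q)                      # step 3

x = np.array([1.1, 0.8])
elbo = elbo_estimate(x, mu_q=0.5, var_q=0.1)
print(elbo)  # about -3.01 for this toy setup
```

For this toy linear-Gaussian model the exact value of log p(x) is about -2.67, so the estimate sits below it, as Jensen's inequality requires.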

### 5. Simplification with Gaussian Assumptions:
- **Prior**: $p(z) = \mathcal{N}(z; 0, I)$
- **Variational Posterior**: $q(z|x) = \mathcal{N}(z; \mu_q(x), \text{diag}(\sigma_q^2(x)))$
- **Likelihood**: $p(x|z) = \mathcal{N}(x; f(z), \sigma^2I)$

*Example:*
For (1.1, 0.8):
- Prior: Assume latent codes follow standard normal
- Encoder predicts: μ = 0.5, σ² = 0.1
- Decoder predicts: reconstruction = (1.0, 0.9)

### 6. Reconstruction Error (Gaussian Likelihood):
$$\mathbb{E}_{q(z|x)} [ \log p(x|z) ] = \mathbb{E}_{q(z|x)}\left[-\frac{1}{2\sigma^2}\|x - f(z)\|^2 - \frac{n}{2}\log(2\pi\sigma^2)\right]$$

*Example:*
- Original point: (1.1, 0.8)
- Reconstructed point: (1.0, 0.9)
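- Squared error: ‖x - f(z)‖² = (1.1-1.0)² + (0.8-0.9)² = 0.02 (Euclidean distance: √0.02 ≈ 0.141)
- With σ² = 1 and n = 2: log p(x|z) = -0.02/2 - log(2π) ≈ -1.848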
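
As a quick numerical check of the Gaussian log-likelihood formula, the sketch below evaluates it for the example point and its reconstruction. The likelihood variance `var=1.0` is an assumption made for illustration, and `gaussian_log_likelihood` is a helper name introduced here.

```python
import numpy as np

def gaussian_log_likelihood(x, x_hat, var=1.0):
    """log N(x; x_hat, var * I) for an n-dimensional point x."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    n = x.size
    sq_err = np.sum((x - x_hat) ** 2)
    return -sq_err / (2 * var) - 0.5 * n * np.log(2 * np.pi * var)

x = [1.1, 0.8]
x_hat = [1.0, 0.9]
sq_err = np.sum((np.array(x) - np.array(x_hat)) ** 2)
print(sq_err)                             # squared error, about 0.02
print(np.sqrt(sq_err))                    # Euclidean distance, about 0.141
print(gaussian_log_likelihood(x, x_hat))  # about -1.848
```

Note that the formula uses the squared error, not the Euclidean distance, and that the second term is a constant that does not depend on the reconstruction.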

### 7. KL Divergence:
$$D_{KL}(q(z|x) \| p(z)) = \frac{1}{2}\sum_{i=1}^{d} (\sigma_{q,i}^2 + \mu_{q,i}^2 - \log \sigma_{q,i}^2 - 1)$$

*Example:*
For our predicted distribution (μ = 0.5, σ² = 0.1):
KL = 0.5 * (0.1 + 0.5² - ln(0.1) - 1) = 0.5 * 1.653 ≈ 0.826
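
The closed-form KL term is easy to evaluate directly. The helper below (the name `kl_to_standard_normal` is introduced here for illustration) uses the natural logarithm, as the formula requires, and sums over latent dimensions.

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over latent dimensions."""
    mu = np.atleast_1d(mu).astype(float)
    var = np.atleast_1d(var).astype(float)
    return 0.5 * np.sum(var + mu ** 2 - np.log(var) - 1.0)

print(kl_to_standard_normal(0.5, 0.1))  # about 0.826
print(kl_to_standard_normal(0.0, 1.0))  # 0.0, since q equals the prior
```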

### 8. Reparameterization Trick:
$$z = \mu_q(x) + \sigma_q(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

*Example:*
Instead of sampling directly from N(0.5, 0.1):
1. Sample ε ~ N(0, 1), e.g., ε = 0.3
2. Compute: z = 0.5 + √0.1 * 0.3 = 0.595
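
The two steps above can be sketched in NumPy. The example reproduces the single-sample calculation and then checks empirically that reparameterized samples have the intended mean and variance; the sample size and seed are arbitrary choices.

```python
import numpy as np

def reparameterize(mu, var, eps):
    """z = mu + sigma * eps, with sigma = sqrt(var) and eps ~ N(0, 1)."""
    return mu + np.sqrt(var) * eps

print(reparameterize(0.5, 0.1, 0.3))  # about 0.595

rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)
z = reparameterize(0.5, 0.1, eps)
print(z.mean(), z.var())  # close to 0.5 and 0.1
```

Because the randomness lives entirely in ε, z is a deterministic and differentiable function of μ and σ, which is what lets gradients flow back into the encoder during training.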

### 9. Complete ELBO Loss:
$$\begin{aligned}
\text{ELBO} &= \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[-\frac{1}{2\sigma^2}\|x - f(\mu_q(x) + \sigma_q(x) \odot \epsilon)\|^2\right] \\
&- \frac{1}{2}\sum_{i=1}^{d} (\sigma_{q,i}^2 + \mu_{q,i}^2 - \log \sigma_{q,i}^2 - 1) - \frac{n}{2}\log(2\pi\sigma^2)
\end{aligned}$$

*Example:*
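Combining, with σ² = 1 and the constant term $-\frac{n}{2}\log(2\pi\sigma^2)$ dropped:
- Reconstruction term: -0.5 * ‖x - f(z)‖² = -0.5 * ((1.1-1.0)² + (0.8-0.9)²) = -0.01
- KL divergence: 0.5 * (0.1 + 0.5² - ln(0.1) - 1) ≈ 0.826

ELBO ≈ -0.01 - 0.826 = -0.836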

### 10. Final Simplified ELBO Loss with $\beta$ (β-VAE):
$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[-\frac{1}{2\sigma^2}\|x - f(\mu_q(x) + \sigma_q(x) \odot \epsilon)\|^2\right] - \beta D_{KL}(q(z|x)\|p(z))$$

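*Example with β = 0.5:*
Taking σ² = 1, the reconstruction term is -0.5 * ((1.1-1.0)² + (0.8-0.9)²) = -0.01 and the KL divergence is 0.5 * (0.1 + 0.5² - ln(0.1) - 1) ≈ 0.826, so:
$$\mathcal{L}_{\beta\text{-VAE}} = -0.01 - 0.5 \times 0.826 = -0.423$$

For anomaly detection:
- Normal point (1.1, 0.8): objective = -0.423
- Anomaly point (4.0, 4.0): objective ≈ -5.234 (illustrative value; the reconstruction error is much larger)

This objective is maximized during training, so the loss that is minimized is its negative. The anomaly's much lower objective (equivalently, much higher loss) indicates that it doesn't fit the learned normal data distribution, allowing us to detect it as an anomaly.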
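
To make the comparison concrete, the sketch below computes the negative β-VAE objective as an anomaly score, using a single latent sample and dropping the constant term. It assumes σ² = 1, and the encoder and decoder outputs for the anomaly are hypothetical values chosen only for illustration, not taken from a trained model.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu_q, var_q, beta=0.5, lik_var=1.0):
    """Negative beta-VAE objective (constant dropped); higher means more anomalous."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    mu_q = np.atleast_1d(mu_q).astype(float)
    var_q = np.atleast_1d(var_q).astype(float)
    recon = np.sum((x - x_hat) ** 2) / (2 * lik_var)             # reconstruction term
    kl = 0.5 * np.sum(var_q + mu_q ** 2 - np.log(var_q) - 1.0)   # KL to N(0, I)
    return recon + beta * kl

# Example values from this section for the normal point
normal_loss = beta_vae_loss([1.1, 0.8], [1.0, 0.9], mu_q=0.5, var_q=0.1)

# Hypothetical encoder/decoder outputs for the anomaly (assumed for illustration)
anomaly_loss = beta_vae_loss([4.0, 4.0], [1.5, 1.5], mu_q=2.0, var_q=0.1)

print(normal_loss)                  # about 0.423
print(anomaly_loss > normal_loss)   # True
```

In practice a detector flags points whose score exceeds a threshold chosen from the training scores.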

```python
# Notebook shell command: install the PyOD version used below
!pip3 install -q pyod==2.0.2
```

```python
import matplotlib.pyplot as plt
from pyod.models.vae import VAE
from pyod.models.auto_encoder import AutoEncoder
from pyod.utils.data import generate_data
from sklearn.metrics import (
    balanced_accuracy_score, f1_score
)

plt.style.use('dark_background')

# Generate synthetic data: 2D points with 10% outliers
contamination = 0.1
n_train = 1000
n_test = 100
n_features = 2

X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train, n_test=n_test,
    n_features=n_features,
    contamination=contamination, random_state=1
)

# Train the VAE model (beta weights the KL term)
clf_name_vae = 'VAE'
vae_clf = VAE(epoch_num=30,
              contamination=contamination,
              beta=1.0)
vae_clf.fit(X_train)
```



```python
# Train the AE model
clf_name_ae = 'AE'
ae_clf = AutoEncoder(epoch_num=30,
                     contamination=contamination)
ae_clf.fit(X_train)

# Binary predictions (0 = inlier, 1 = outlier) on the test set for VAE
y_test_pred_vae = vae_clf.predict(X_test)

# Binary predictions on the test set for AE
y_test_pred_ae = ae_clf.predict(X_test)

def compute_metrics(y_true, y_pred):
    """Return (balanced accuracy, F1 score) for binary anomaly labels."""
    balanced_acc = balanced_accuracy_score(
        y_true, y_pred
    )
    f1 = f1_score(y_true, y_pred)
    return balanced_acc, f1
```



```python
def visualize_detailed_results(X, y_true, y_pred, model_name, dataset_name, ax):
    """Scatter-plot the points on `ax`, colored by confusion-matrix category."""
    # Compute metrics
    balanced_acc, f1 = compute_metrics(y_true, y_pred)

    # Plot points with different categories
    ax.scatter(X[(y_true == 1) & (y_pred == 1), 0], X[(y_true == 1) & (y_pred == 1), 1],
               c='red', marker='x', label='True Positive (Anomaly)')
    ax.scatter(X[(y_true == 0) & (y_pred == 0), 0], X[(y_true == 0) & (y_pred == 0), 1],
               c='green', marker='+', label='True Negative (Non-Anomaly)')
    ax.scatter(X[(y_true == 0) & (y_pred == 1), 0], X[(y_true == 0) & (y_pred == 1), 1],
               c='orange', marker='*', label='False Positive (Non-Anomaly)')
    ax.scatter(X[(y_true == 1) & (y_pred == 0), 0], X[(y_true == 1) & (y_pred == 0), 1],
               c='blue', marker='^', label='False Negative (Anomaly)')

    # Title with metrics
    ax.set_title(f"{model_name} - {dataset_name}\nBalanced Acc: {balanced_acc:.2f}, F1: {f1:.2f}")
    ax.set_xlabel("Feature 1")
    ax.set_ylabel("Feature 2")
    ax.legend(loc='upper left')

# Create subplots: top row = test data, bottom row = training data
fig, axes = plt.subplots(2, 2, figsize=(9, 9), dpi=300)

# Visualize results for VAE on test data
visualize_detailed_results(X_test, y_test, y_test_pred_vae, "VAE", "Test Data", axes[0, 0])

# Visualize results for AE on test data
visualize_detailed_results(X_test, y_test, y_test_pred_ae, "AE", "Test Data", axes[0, 1])

# Visualize results for VAE on training data
visualize_detailed_results(X_train, y_train, vae_clf.labels_, "VAE", "Training Data", axes[1, 0])

# Visualize results for AE on training data
visualize_detailed_results(X_train, y_train, ae_clf.labels_, "AE", "Training Data", axes[1, 1])

plt.tight_layout()
plt.show()
```
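
The binary labels plotted above come from thresholding continuous outlier scores. The sketch below illustrates one common way to do this, a contamination-based percentile threshold on the training scores, using plain NumPy and made-up scores. It is an assumption about the general approach, not a reproduction of PyOD's internals, and `labels_from_scores` is a hypothetical helper.

```python
import numpy as np

def labels_from_scores(train_scores, test_scores, contamination=0.1):
    """Flag test points whose score exceeds the (1 - contamination) training percentile."""
    threshold = np.percentile(train_scores, 100 * (1 - contamination))
    return (test_scores > threshold).astype(int), threshold

# Made-up scores: 900 inliers with low scores, 100 outliers with high scores
rng = np.random.default_rng(0)
train_scores = np.concatenate([
    rng.normal(1.0, 0.2, 900),
    rng.normal(5.0, 0.5, 100),
])
test_scores = np.array([0.9, 1.2, 4.8])

labels, threshold = labels_from_scores(train_scores, test_scores)
print(labels)     # [0 0 1]
print(threshold)
```

With `contamination = 0.1`, roughly the top 10% of training scores fall above the threshold, which is why the choice of contamination directly affects the false positive and false negative counts in the plots.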


| **Aspect**                   | **Autoencoder (AE)**                                      | **Variational Autoencoder (VAE)**                          |
|------------------------------|-----------------------------------------------------------|------------------------------------------------------------|
| **Latent Space**              | Deterministic representation (fixed point for each input) | Probabilistic representation (distribution over latent variables) |
| **Latent Space Structure**    | No specific regularization; can be scattered and unstructured | Regularized to match a predefined prior (typically Gaussian), resulting in a smooth, continuous space |
| **Objective**                 | Minimize reconstruction error (e.g., MSE)                 | Minimize both reconstruction error and KL divergence to enforce latent space structure |
| **Encoder Output**            | Direct mapping to a single point in latent space          | Outputs parameters of a distribution (mean and variance) for each latent variable |
| **Generative Capability**     | Limited generative ability; may not generalize well for new data | Strong generative capability due to regularized latent space |
| **Latent Variable Interpolation** | Less smooth interpolation between latent variables        | Smooth interpolation due to the continuous nature of the latent space |
| **KL Divergence**             | Not used in the loss function                             | KL divergence term in the loss function regularizes the latent space |
| **Reconstruction**            | Reconstructs the input deterministically                  | Reconstructs the input probabilistically, sampling from the learned latent distribution |
| **Use Cases**                 | Dimensionality reduction, reconstruction tasks, and reconstruction-based anomaly detection (as in the comparison above) | Generative modeling, data generation, and anomaly detection |
| **Regularization**            | None                                                     | Explicit regularization to ensure the latent space follows a known distribution (e.g., Gaussian) |