# VAE for Anomaly Detection: Project Documentation

## 1. Project Overview
This project implements a **Variational Autoencoder (VAE)** to detect anomalies in high-dimensional synthetic data. The core principle is that a VAE trained predominantly on normal data yields a **low reconstruction error** for normal samples but a **high reconstruction error** for anomalies, because it has not learned to encode and decode anomalous patterns.

## 2. Data Generation & Preprocessing

The project generates a synthetic dataset designed to simulate a realistic anomaly detection scenario.

*   **Total Samples**: 5,000 samples with 20 features.
*   **Normal Data**: Generated using a Gaussian mixture model (`make_blobs` with 3 centers) to simulate multi-modal normal behavior.
*   **Anomalies**: Injected at a **contamination ratio of ~3%**. These anomalies are drawn from a distinct Gaussian distribution with a shifted mean (shifted by +4.0), so they form a coherent "attack" or "fault" cluster rather than random noise.
*   **Preprocessing**: All data is normalized to the [0, 1] range using `MinMaxScaler`. This keeps features on a common scale for neural network training and matches the range of the decoder's Sigmoid output.
*   **Train/Test Split**: Stratified split (80/20), meaning both training and testing sets contain a small fraction of anomalies. This corresponds to a realistic "unsupervised" or "semi-supervised" setting where the training data is not perfectly clean.
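
The data pipeline described above could be sketched as follows. This is a minimal illustration, not the project's actual code: the seed, the unit variance of the anomaly cluster, the choice to shift the anomaly mean relative to the mean of the normal data, and fitting the scaler on the training split only are all assumptions the text does not specify.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

SEED = 42             # assumed; the documentation does not state a seed
N_TOTAL = 5000
N_FEATURES = 20
CONTAMINATION = 0.03  # ~3% anomalies

rng = np.random.default_rng(SEED)
n_anom = int(N_TOTAL * CONTAMINATION)
n_norm = N_TOTAL - n_anom

# Normal data: three Gaussian blobs (multi-modal normal behavior).
X_norm, _ = make_blobs(
    n_samples=n_norm, n_features=N_FEATURES, centers=3, random_state=SEED
)

# Anomalies: one Gaussian cluster whose mean is shifted by +4.0.
# Assumption: the shift is applied to the mean of the normal data,
# and the cluster has unit variance.
X_anom = rng.normal(
    loc=X_norm.mean(axis=0) + 4.0, scale=1.0, size=(n_anom, N_FEATURES)
)

X = np.vstack([X_norm, X_anom])
y = np.concatenate([np.zeros(n_norm, dtype=int), np.ones(n_anom, dtype=int)])

# Stratified 80/20 split: both splits keep roughly 3% anomalies.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Assumption: the scaler is fitted on the training split only, so test
# values may fall slightly outside [0, 1].
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Fitting the scaler on the training split avoids leaking test statistics into preprocessing; fitting on the full dataset instead would guarantee that every value lies in [0, 1].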

## 3. VAE Architecture

The model uses a standard VAE architecture implemented in PyTorch:

### Encoder
The encoder compresses the high-dimensional input ($x$) into a lower-dimensional latent space ($z$).
*   **Layers**: Input $\rightarrow$ Hidden (ReLU) $\rightarrow$ Hidden (ReLU) $\rightarrow$ Latent Parameters ($\mu$, $\log \sigma^2$).
*   **Output**: Two vectors representing the mean ($\mu$) and log-variance ($\log \sigma^2$) of the latent distribution.
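
A minimal PyTorch sketch of such an encoder is shown below. The class name `Encoder`, the hidden width of 64, and the default latent size are illustrative assumptions; the documentation only specifies two ReLU hidden layers and two output heads.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    """Maps an input x to the mean and log-variance of q(z | x)."""

    def __init__(self, input_dim=20, hidden_dim=64, latent_dim=10):
        super().__init__()
        # hidden_dim=64 is an assumed value, not taken from the project.
        self.body = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two separate linear heads: one for mu, one for log(sigma^2).
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.body(x)
        return self.fc_mu(h), self.fc_logvar(h)


# Example: a batch of 8 samples with 20 features each.
mu, logvar = Encoder()(torch.rand(8, 20))
```

Predicting the log-variance rather than the variance lets the head output any real number while the implied variance stays positive.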

### Reparameterization Trick
To allow backpropagation through the stochastic sampling process, the model uses the reparameterization trick:
$$ z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad \sigma = \exp\left(\tfrac{1}{2} \log \sigma^2\right) $$
This allows the network to learn the parameters of the distribution while maintaining differentiability.
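
In code, the trick amounts to a few lines. The sketch below assumes the encoder outputs a log-variance tensor named `logvar`; the function name `reparameterize` is illustrative.

```python
import torch


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)     # noise is sampled outside the graph
    return mu + std * eps


# Gradients reach mu and logvar even though z is random.
mu = torch.zeros(4, 10, requires_grad=True)
logvar = torch.zeros(4, 10, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
```

Because the randomness lives entirely in `eps`, the sample `z` is a deterministic, differentiable function of `mu` and `logvar`.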

### Decoder
The decoder attempts to reconstruct the original input from the latent vector ($z$).
*   **Layers**: Latent $\rightarrow$ Hidden (ReLU) $\rightarrow$ Hidden (ReLU) $\rightarrow$ Output (Sigmoid).
*   **Output Strategy**: Uses `Sigmoid` activation to ensure outputs are in [0, 1], matching the scaled input data.
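
A matching decoder could look like the sketch below. As with any illustration of this architecture, the class name `Decoder` and the hidden width of 64 are assumptions rather than values taken from the project.

```python
import torch
from torch import nn


class Decoder(nn.Module):
    """Maps a latent vector z back to a reconstruction in [0, 1]."""

    def __init__(self, latent_dim=10, hidden_dim=64, output_dim=20):
        super().__init__()
        # hidden_dim=64 is an assumed value, not taken from the project.
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid(),  # bounds outputs to match MinMax-scaled inputs
        )

    def forward(self, z):
        return self.net(z)


# Example: decode a batch of 8 latent vectors.
x_hat = Decoder()(torch.randn(8, 10))
```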

## 4. Training Methodology

The model is trained to minimize a composite **VAE Loss Function**:

$$ \mathcal{L} = \mathcal{L}_{recon} + \beta \cdot \mathcal{L}_{KL} $$

1.  **Reconstruction Loss ($\mathcal{L}_{recon}$)**: Measures how well the VAE reconstructs the input. Calculated using **MSE (Mean Squared Error)**.
2.  **KL Divergence ($\mathcal{L}_{KL}$)**: Regularizes the latent space by pushing the learned distribution toward a standard Normal distribution $\mathcal{N}(0, I)$.
3.  **Beta ($\beta$) Scaling**: A hyperparameter that weights the importance of the KL divergence.
    *   **High $\beta$**: Enforces a smoother, more regular latent space but tends to produce blurrier reconstructions; in the extreme, the decoder can ignore the latent code entirely (posterior collapse).
    *   **Low $\beta$**: Prioritizes reconstruction accuracy, potentially leading to overfitting but a sharper separation between normal samples and anomalies.
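
The composite loss could be written as in the sketch below. It uses the closed-form KL divergence between a diagonal Gaussian and a standard Normal. The reduction (squared error summed over features, then averaged over the batch) is an assumption, since the documentation only says that MSE is used; the function name `vae_loss` is illustrative.

```python
import torch
import torch.nn.functional as F


def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Return (total, recon, kl), each averaged over the batch."""
    batch_size = x.size(0)
    # Squared error summed over features, averaged over the batch.
    # Assumption: the project may use a different reduction.
    recon = F.mse_loss(x_hat, x, reduction="sum") / batch_size
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch_size
    return recon + beta * kl, recon, kl


# Sanity check: a perfect reconstruction with a standard-Normal posterior
# gives zero loss.
x = torch.rand(4, 20)
total, recon, kl = vae_loss(
    x, x.clone(), torch.zeros(4, 10), torch.zeros(4, 10), beta=0.1
)
```

The relative scale of the two terms depends on the reduction, so a given $\beta$ value is only comparable across runs that use the same convention.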

## 5. Anomaly Detection Logic

Once trained, the VAE is used as a scorer:
1.  **Inference**: Pass a sample $x$ through the VAE to get reconstructed $\hat{x}$.
2.  **Scoring**: Calculate Reconstruction Error: $\text{Error} = \sum_i (x_i - \hat{x}_i)^2$.
3.  **Decision**:
    *   **Normal**: Error $<$ Threshold
    *   **Anomaly**: Error $>$ Threshold
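
The scoring and decision steps can be illustrated with plain NumPy. In this sketch the reconstructions `x_hat` are hand-written stand-ins for the VAE's output, and the threshold of 0.5 is an arbitrary illustrative value; both function names are hypothetical.

```python
import numpy as np


def reconstruction_error(x, x_hat):
    """Per-sample sum of squared differences over the feature axis."""
    return np.sum((x - x_hat) ** 2, axis=1)


def classify(errors, threshold):
    """Return 1 (anomaly) where the error exceeds the threshold, else 0."""
    return (errors > threshold).astype(int)


# Two toy samples: the first is reconstructed well, the second poorly.
x = np.array([[0.2, 0.4], [0.9, 0.1]])
x_hat = np.array([[0.2, 0.5], [0.1, 0.9]])  # stand-in for VAE output

errors = reconstruction_error(x, x_hat)
labels = classify(errors, threshold=0.5)
```

Here the first sample has an error of about 0.01 and is labeled normal, while the second has an error of about 1.28 and is labeled an anomaly.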

## 6. Optimization & Evaluation Results

The implementation performs a Grid Search to find the best configuration:
*   **Hyperparameters Swept**: $\beta \in \{0.1, 1.0, 5.0\}$ and Latent Dimension $\in \{2, 5, 10\}$.
*   **Best Configuration Found**: **$\beta=0.1$, Latent Dim=10**.
    *   Since the goal is anomaly detection rather than generation, a lower $\beta$ (0.1) often works better: it lets the model fit the structure of the normal data more tightly (slight "overfitting"), which sharpens the contrast with anomalies.
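
The sweep over the nine configurations could be organized as in the sketch below. The callable `train_and_score` is hypothetical: it stands for whatever routine trains a VAE with the given settings and returns a higher-is-better score. The documentation does not say which metric drives the selection, so that choice is left to the caller.

```python
from itertools import product

BETAS = [0.1, 1.0, 5.0]
LATENT_DIMS = [2, 5, 10]


def grid_search(train_and_score, betas=BETAS, latent_dims=LATENT_DIMS):
    """Evaluate every (beta, latent_dim) pair and return the best one.

    train_and_score is a hypothetical callable taking keyword arguments
    beta and latent_dim and returning a score where higher is better.
    """
    results = []
    for beta, latent_dim in product(betas, latent_dims):
        score = train_and_score(beta=beta, latent_dim=latent_dim)
        results.append({"beta": beta, "latent_dim": latent_dim, "score": score})
    best = max(results, key=lambda r: r["score"])
    return best, results
```

Keeping the full `results` list alongside `best` makes it easy to inspect how sensitive the score is to each hyperparameter.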

### Metrics Used
*   **AUC-ROC**: Measures the ability to distinguish between classes across all thresholds. (Achieved ~0.99 with best model).
*   **AUC-PR (Average Precision)**: Critical for imbalanced datasets (like anomaly detection). (Achieved ~0.90 with best model).
*   **F1-Score**: The harmonic mean of precision and recall.
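
These three metrics are available in scikit-learn. The sketch below computes them from reconstruction-error scores on a tiny hand-made example; the function name `evaluate`, the toy scores, and the threshold are illustrative, not project results.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score


def evaluate(y_true, scores, threshold):
    """Compute threshold-free (AUC) and thresholded (F1) metrics."""
    y_pred = (scores > threshold).astype(int)
    return {
        "auc_roc": roc_auc_score(y_true, scores),
        "auc_pr": average_precision_score(y_true, scores),
        "f1": f1_score(y_true, y_pred),
    }


# Toy example with perfectly separated scores (1 = anomaly).
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
metrics = evaluate(y_true, scores, threshold=0.5)
```

AUC-ROC and AUC-PR are computed from the raw scores, so they do not depend on the threshold; only the F1-score does.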

### Final Threshold Selection
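The optimal threshold is selected by maximizing the F1-score on the test set. Because the threshold is tuned on the same data used for evaluation, the resulting F1-score is an optimistic estimate.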
*   **Visual Validation**: Histograms show a clear separation between the reconstruction error distributions of Normal vs. Anomaly samples.
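
One way to implement the F1-maximizing threshold search is with scikit-learn's `precision_recall_curve`, as sketched below. The function name `best_f1_threshold` and the toy data are illustrative. Note that scikit-learn treats a sample as positive when its score is greater than or equal to the threshold, which differs slightly from the strict inequality used in the decision rule.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, scores):
    """Return the candidate threshold with the highest F1 and that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no associated threshold.
    p, r = precision[:-1], recall[:-1]
    # Clip the denominator to avoid division by zero when p + r == 0.
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    i = int(np.argmax(f1))
    return thresholds[i], f1[i]


# Toy example with perfectly separated scores (1 = anomaly).
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
threshold, f1 = best_f1_threshold(y_true, scores)
```

In this toy case the search selects the lowest anomaly score as the threshold, which separates the two classes exactly.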