# A Simple Autoencoder for Anomaly Detection

Anomaly detection is the task of finding anomalous data elements in a dataset. An anomaly is a data element that is an outlier with respect to the rest of the dataset.

We are going to train an autoencoder on the MNIST dataset, which contains only images of handwritten digits, and then look for anomalies within it (i.e., images within MNIST that differ in some way from the rest of the dataset).

Even though MNIST is a labeled dataset, we are going to disregard the labels for educational purposes and treat it as an unlabeled dataset.

In [1]:
import torch
import numpy as np
from torchvision import datasets
import torchvision.transforms as transforms
import multiprocessing
from tqdm import tqdm
from helpers import anomaly_detection_display
import pandas as pd
import sys
from pathlib import Path

# Ensure repeatability (NumPy and PyTorch keep separate random states)
np.random.seed(10)
torch.manual_seed(10)


project_root_path = str(Path.cwd().parent)

if project_root_path not in sys.path:
    sys.path.insert(0, project_root_path)
    print(f"Added to sys.path: {project_root_path}")

from src.data import get_data_loaders


Added to sys.path: /Users/chaklader/Documents/Education/Udacity/Deep_Learning/Projects/2_landmark-classification-cnn


In [2]:
# This will get data loaders for the MNIST dataset for the train, validation
# and test dataset
data_loaders = get_data_loaders(batch_size=32)

Reusing cached mean and std
Dataset mean: tensor([0.4638, 0.4725, 0.4687]), std: tensor([0.2699, 0.2706, 0.3018])


### Visualize the Data

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
    
# obtain one batch of training images
dataiter = iter(data_loaders['train'])
images, labels = next(dataiter)
images = images.numpy()

# get one image from the batch
img = np.squeeze(images[0])

fig, sub = plt.subplots(figsize = (2,2)) 
sub.imshow(img, cmap='gray')
_ = sub.axis("off")

TypeError: Invalid shape (3, 224, 224) for image data


---

#### Linear Autoencoder

We'll train an autoencoder with these images by flattening them into vectors of length 784. The images from this dataset are already normalized such that the values are between 0 and 1. 

Here you will build a simple autoencoder. 

The encoder and decoder should be made of simple Multi-Layer Perceptrons. The units that connect the encoder and decoder will be the _compressed representation_ (also called _embedding_).

Since the images are normalized between 0 and 1, you will need to use a **sigmoid activation on the output layer** to get values that match this input value range.
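
To make the role of the sigmoid concrete, here is a minimal sketch in plain Python (no PyTorch) showing that it squashes any real-valued output into the open interval (0, 1). The `logits` values are made up for illustration and do not come from the model in this notebook.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw outputs of the last linear layer for five pixels
logits = [-6.0, -1.0, 0.0, 1.0, 6.0]
pixels = [sigmoid(z) for z in logits]

for z, p in zip(logits, pixels):
    print(f"raw output {z:+.1f} -> pixel value {p:.4f}")
```

Large negative outputs end up close to 0 (black) and large positive outputs close to 1 (white), which is the same range as the normalized input pixels.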

For this exercise you are going to use a dimension for the embeddings of 32.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

"""
Dense Autoencoder (dense_autoencoder_solution.ipynb):
    A simple autoencoder using fully connected layers.

    Architecture:
        - Encoder: 28*28 → 256 → encoding_dim
        - Decoder: encoding_dim → 256 → 28*28
        - Activation: ReLU + BatchNorm1d (hidden), Sigmoid (output)
        - Input: Flattened 28x28 images (784D)
        - Output: Reconstructed 28x28 images (0-1 range)

    Use Case: Basic autoencoder for flattened image data
    Strengths: Simple, good for understanding autoencoder concepts
    Weaknesses: Loses spatial information, less effective for images

CNN Autoencoder (cnn_autoencoder.ipynb):
    A convolutional autoencoder that preserves spatial information.

    Architecture:
        - Encoder: 
            Conv2d(1,16,3) → MaxPool(2) → 
            Conv2d(16,4,3) → MaxPool(2)  # 4x7x7 bottleneck
        - Decoder:
            ConvTranspose2d(4,16,2,stride=2) → 
            ConvTranspose2d(16,1,2,stride=2)
        - Activation: ReLU (hidden), Sigmoid (output)
        - Input: 1x28x28 images
        - Output: Reconstructed 1x28x28 images (0-1 range)

    Use Case: Image data where spatial relationships matter
    Strengths: Preserves spatial information, better feature extraction
    Weaknesses: More complex; layer output shapes must be tracked carefully

Key Differences:
    1. Dimensionality:
       - Dense: Flattens to 1D (loses spatial info)
       - CNN: Maintains 2D structure (preserves spatial info)
    
    2. Layer Types:
       - Dense: Fully connected (Linear) layers
       - CNN: Convolutional and transposed convolutional layers
    
    3. Bottleneck:
       - Dense: encoding_dim (configurable)
       - CNN: Fixed 4x7x7 feature maps
    
    4. Performance:
       - Dense: Faster training, less memory
       - CNN: Better reconstruction quality for images

A Convolutional Denoising Autoencoder for image denoising.

Architecture:
    Encoder (Downsampling Path):
        1. Conv2d(1,32,3) + ReLU + BatchNorm + MaxPool(2)
            - Input: 1x28x28
            - Output: 32x14x14
        2. Conv2d(32,16,3) + ReLU + BatchNorm + MaxPool(2)
            - Output: 16x7x7
        3. Conv2d(16,8,3) + ReLU + BatchNorm + MaxPool(2)
            - Output: 8x3x3 (bottleneck)
    
    Decoder (Upsampling Path):
        1. ConvTranspose2d(8,8,3,stride=2) + ReLU + BatchNorm
            - Output: 8x7x7
        2. ConvTranspose2d(8,16,2,stride=2) + ReLU + BatchNorm
            - Output: 16x14x14
        3. ConvTranspose2d(16,32,2,stride=2) + ReLU + BatchNorm
            - Output: 32x28x28
        4. Conv2d(32,1,3) + Sigmoid
            - Output: 1x28x28 (reconstructed image)

Key Features:
    - Uses max pooling for downsampling and strided transposed convolutions for upsampling
    - Batch Normalization after each conv/transposed conv
    - ReLU activations for non-linearity
    - Sigmoid activation in final layer for pixel values in [0,1]
    - Skip connections could be added between encoder/decoder

Input/Output:
    - Input: Noisy image (1x28x28)
    - Output: Denoised image (1x28x28)

Note: The final output dimensions might need adjustment based on 
input size due to the transposed convolution operations.
"""
# define the NN architecture
class Autoencoder(nn.Module):
    
    def __init__(self, encoding_dim):
        super(Autoencoder, self).__init__()
        ## encoder ##
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, encoding_dim),
            nn.ReLU(),
            nn.BatchNorm1d(encoding_dim),

        )
        
        ## decoder ##
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, 28*28),
            nn.Sigmoid()
        )
        
        self.auto_encoder = nn.Sequential(
            nn.Flatten(),
            self.encoder,
            self.decoder
        )

    def forward(self, x):
        # Flatten the input, encode it, then decode it. The decoder ends with
        # a sigmoid, so the output values lie in (0, 1)
        reconstructed = self.auto_encoder(x)

        # Reshape the output as an image
        # remember that the shape should be (batch_size, channel_count, height, width)
        return reconstructed.reshape((x.shape[0], 1, 28, 28))
    
# initialize the NN
encoding_dim = 32
model = Autoencoder(encoding_dim)

---
## Loss Function

As explained in the lesson, we can use the Mean Squared Error loss, which is called `MSELoss` in PyTorch:

In [None]:
# specify loss function
criterion = nn.MSELoss()
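
For reference, the quantity computed by `MSELoss` with its default `reduction='mean'` can be written as follows, where $x_i$ are the input pixel values, $\hat{x}_i$ the reconstructed ones, and $N$ the total number of elements being compared (for a batch of $B$ flattened 28x28 images, $N = 784\,B$). The symbols are introduced here for illustration and do not appear in the notebook:

$$
\mathrm{MSE}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - \hat{x}_i\right)^2
$$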

## Training

The training loop is similar to a normal training loop; however, this is an unsupervised task, so we do not need labels. The MNIST dataset does provide labels, of course, so we will just disregard them.

For this simple autoencoder we do not need the GPU, so we will train on the CPU.

In [None]:
# specify optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=5e-2)

In [None]:
# number of epochs to train the model
n_epochs = 50

for epoch in range(1, n_epochs + 1):
    # monitor training loss
    train_loss = 0.0
    
    model.train()
        
    ###################
    # train the model #
    ###################
    for data in tqdm(desc="Training", total=len(data_loaders['train']), iterable=data_loaders['train']):
        # we disregard the labels. We use the Python convention of calling
        # an unused variable "_"
        images, _ = data

        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        outputs = model(images)
        # calculate the loss
        loss = criterion(outputs.flatten(), images.flatten())
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss
        train_loss += loss.item() * images.size(0)
    
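    # Validation: switch to eval mode so that BatchNorm uses its running
    # statistics instead of per-batch statistics
    model.eval()
    val_loss = 0.0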
    with torch.no_grad():
        for data in tqdm(desc="Validating", total=len(data_loaders['valid']), iterable=data_loaders['valid']):
            # _ stands in for labels, here
            images, _ = data

            # forward pass: compute predicted outputs by passing inputs to the model
            outputs = model(images)
            # calculate the loss
            loss = criterion(outputs.flatten(), images.flatten())
            
            # update running validation loss
            val_loss += loss.item() * images.size(0)
    
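    # print avg training statistics
    # The running sums are weighted by batch size, so divide by the number of
    # samples (the length of each loader's sampler), not the number of batches
    train_loss /= len(data_loaders['train'].sampler)
    val_loss /= len(data_loaders['valid'].sampler)
    print("Epoch: {} \tTraining Loss: {:.6f}\tValid Loss: {:.6f}".format(epoch, train_loss, val_loss))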
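
The loop accumulates `loss.item() * images.size(0)`, which is a sum over samples rather than over batches. The sketch below, in plain Python with made-up batch losses and sizes, shows why that sum should be divided by the number of samples to obtain a per-sample average, and how far off the result is when it is divided by the number of batches instead.

```python
# Hypothetical per-batch results: (mean loss of the batch, batch size).
# The last batch is smaller, as happens when the dataset size is not a
# multiple of the batch size.
batches = [(0.040, 32), (0.036, 32), (0.050, 8)]

running_loss = 0.0
n_samples = 0
for batch_mean_loss, batch_size in batches:
    # Same accumulation as in the training loop: mean loss times batch size
    running_loss += batch_mean_loss * batch_size
    n_samples += batch_size

per_sample_loss = running_loss / n_samples           # average over samples
divided_by_batches = running_loss / len(batches)     # inflated by roughly the batch size

print(f"per-sample average:      {per_sample_loss:.4f}")
print(f"sum / number of batches: {divided_by_batches:.4f}")
```

The second number is larger by roughly a factor of the batch size, so it is still useful for comparing epochs with each other but not for comparing against a per-pixel MSE value.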

## Finding Anomalies
Now that our autoencoder is trained we can use it to find anomalies. Let's consider the test set. We loop over all the batches in the test set and we record the value of the loss for each example separately. The examples with the highest reconstruction loss are our anomalies. 

If the reconstruction loss is high, our trained autoencoder could not reconstruct those examples well. What the autoencoder learned about our dataset during training is not enough to describe them, which means they are different from what the autoencoder saw during training, i.e., they are anomalies (or at least they are the most uncharacteristic examples).

Let's have a look:

In [None]:
# Since this dataset is small we collect all the losses as well as
# the image and its reconstruction in a dictionary. In case of a
# larger dataset you might have to save on disk
# (won't fit in memory)
losses = {}

# We need the loss by example (not by batch)
loss_no_reduction = nn.MSELoss(reduction='none')

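idx = 0

# Use eval mode so that BatchNorm relies on its running statistics and each
# example's loss does not depend on the other examples in its batch
model.eval()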

with torch.no_grad():
    for data in tqdm(desc="Testing", total=len(data_loaders['test']),
                     iterable=data_loaders['test']):

        images, _ = data

        # forward pass: compute predicted outputs by passing inputs to the model
        outputs = model(images)

        # calculate the per-pixel loss (no reduction)
        loss = loss_no_reduction(outputs, images)

        # Average over channel, height and width to get one loss per example
        for i, example_loss in enumerate(loss.mean(dim=[1, 2, 3])):
            losses[idx + i] = {
                'loss': example_loss.item(),
                'image': images[i].numpy(),
                'reconstructed': outputs[i].numpy()
            }

        idx += loss.shape[0]

# Let's save our results in a pandas DataFrame
df = pd.DataFrame(losses).T
df.head()
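
The key step in the cell above is `reduction='none'` followed by `loss.mean(dim=[1, 2, 3])`: the squared errors keep the shape `(batch, channels, height, width)` and are then averaged over everything except the batch dimension. The following plain-Python sketch reproduces that reduction on nested lists; the tiny 2x2 "images" and their error values are invented for illustration.

```python
# Hypothetical squared errors for a batch of 2 tiny "images" with shape
# (batch=2, channels=1, height=2, width=2), stored as nested lists
squared_errors = [
    [[[0.00, 0.01], [0.01, 0.02]]],   # example 0: small errors everywhere
    [[[0.25, 0.36], [0.16, 0.49]]],   # example 1: large errors everywhere
]


def per_example_mean(batch):
    """Average over channels, height and width; keep the batch dimension."""
    means = []
    for example in batch:
        values = [v for channel in example for row in channel for v in row]
        means.append(sum(values) / len(values))
    return means


per_example = per_example_mean(squared_errors)
print(per_example)  # one loss value per example
```

A single batch-level average would hide the fact that the second example is reconstructed much worse than the first, which is exactly the information needed to rank examples.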

Let's now display the histogram of the loss. The elements on the right (with the higher loss) are the most uncharacteristic examples. Feel free to look into `helpers.py` to see how these plots are made:

In [None]:
from helpers import anomaly_detection_display

anomaly_detection_display(df)

Each of the bottom panels has the input in the first row and the reconstruction in the second row. 

Let's look at the first of the two panels. The most difficult numbers to reconstruct (the "anomalies" with the highest loss) are indeed pretty particular: they have some noise (like the vertical lines in some of the numbers), or are just not standard ways of drawing the respective numbers. As a result, the reconstructed images (second row) do not match the inputs very well and the loss is high. These are anomalies.

The second panel instead shows numbers taken from the peak of the distribution, which indeed look much more standard. The autoencoder can reconstruct them much better, which results in a lower loss.

In summary, the reconstruction loss can be used as a score for how atypical a certain example is: a low loss means a typical example; a high loss means an atypical example, an anomaly.
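
One possible way to turn this score into a yes/no decision is to flag a fixed fraction of the examples with the highest loss. The notebook does not prescribe a threshold, so the function name `flag_anomalies`, the `fraction` parameter, and the loss values below are all assumptions made for this sketch.

```python
def flag_anomalies(losses, fraction=0.05):
    """Flag roughly the given fraction of examples with the highest loss.

    Returns the loss threshold and one boolean flag per example.
    """
    n_flagged = max(1, round(len(losses) * fraction))
    # The threshold is the n_flagged-th largest loss
    threshold = sorted(losses, reverse=True)[n_flagged - 1]
    return threshold, [loss >= threshold for loss in losses]


# Hypothetical per-example reconstruction losses
example_losses = [0.010, 0.012, 0.011, 0.090, 0.013,
                  0.009, 0.014, 0.011, 0.012, 0.150]

threshold, flags = flag_anomalies(example_losses, fraction=0.2)
anomalous_indices = [i for i, flagged in enumerate(flags) if flagged]
print(threshold, anomalous_indices)
```

If several examples share exactly the threshold loss, slightly more than the requested fraction is flagged. The right fraction depends on the application and is usually chosen by inspecting the loss histogram.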