# **Metrics**

This notebook gives a brief theoretical and practical introduction to three metrics: Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM), and the Fréchet Inception Distance (FID), the last of which was designed for evaluating GANs. All three are useful for comparing original and enhanced images and for assessing super-resolution quality.

Author:  
@jwpr-dpr  
22-04-2025

## **Introduction**

Super-resolution (SR) is the task of reconstructing a high-resolution (HR) image from a low-resolution (LR) input. Among the most powerful and popular models for this task are SRGANs — Super-Resolution Generative Adversarial Networks — which generate photorealistic high-resolution images by combining a content loss with an adversarial loss. While GAN-based models significantly improve the perceptual quality of images, they introduce a new challenge: how to evaluate the quality of the generated outputs effectively.

Traditionally, super-resolution algorithms have been evaluated using pixel-wise similarity metrics like PSNR and SSIM, which compare the generated image to a known ground truth HR image. However, these metrics often fail to align with human visual perception. For example, images with high PSNR values can look overly smooth or blurry, while perceptually sharper images produced by SRGANs might score lower on PSNR despite looking more realistic.

The goal of SR is not only to create an image that is numerically similar to the original HR image, but also to create one that looks natural and visually pleasing to a human observer. This dual objective leads to a conflict:

* Pixel-wise metrics (e.g., PSNR, SSIM) reward faithful reproduction of pixel values, favoring blurrier outputs from traditional interpolation or CNN-based SR methods.

* Perceptual metrics (e.g., FID, LPIPS) reward photorealism, encouraging GANs to "hallucinate" plausible textures and details, which may not exist in the original image but look realistic.

Therefore, to fairly assess the performance of SRGANs, it's crucial to look beyond just PSNR.



## **Overview of metrics**

When evaluating the performance of super-resolution models—especially those using generative adversarial networks like SRGAN—it is essential to apply multiple metrics that capture different aspects of image quality. Each metric has its own assumptions, strengths, and weaknesses.

In this section, we’ll introduce the three key metrics we’ll use to evaluate SRGAN performance: PSNR, SSIM, and FID.

### PSNR (Peak Signal-to-Noise Ratio)
PSNR quantifies the pixel-level difference between the SR image and the HR ground truth. It is based on the Mean Squared Error (MSE) and is expressed in decibels (dB). Higher PSNR generally indicates better fidelity, but not necessarily better visual quality.

$$
\mathrm{PSNR}(I,\hat{I}) = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}(I,\hat{I})} \right)
$$

Where  
$\mathrm{MAX}_I$: the maximum possible pixel value (e.g., 255 for 8-bit images)  
$I$: the ground truth HR image  
$\hat{I}$: the predicted SR image  
$\mathrm{MSE}(I,\hat{I})$: the mean squared error between the two images  

Strengths: Simple, fast, widely used

Weaknesses: Penalizes perceptual deviations even if they look better to the human eye
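
As a quick illustration of the PSNR definition above, here is a minimal NumPy sketch. The function name `psnr` and the toy images are assumptions made for this example; the evaluation code later in the notebook relies on `skimage.metrics.peak_signal_noise_ratio` instead.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """PSNR in dB between two arrays of the same shape (illustrative helper)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no noise at all
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example: a constant error of 0.1 on every pixel of a [0, 1] image
hr = np.full((8, 8, 3), 0.5)
sr = hr + 0.1
# MSE = 0.01, so PSNR = 10 * log10(1 / 0.01) = 20 dB
print(f"{psnr(hr, sr):.2f} dB")
```

Note that the result depends on `max_value`: the same images stored as 8-bit integers would need `max_value=255`.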

### SSIM (Structural Similarity Index)
SSIM evaluates image similarity by comparing luminance, contrast, and structure. It considers perceptual aspects of vision, making it better than PSNR for many tasks involving textures and edges.

Strengths: Correlates better with perceived visual quality

Weaknesses: Still limited to local patch-wise comparisons
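
To make the luminance, contrast and structure comparison more concrete, the sketch below computes a simplified SSIM over one global window of a grayscale image. This is an assumption-laden simplification: the usual SSIM averages the same expression over many local windows, and the name `global_ssim` is hypothetical. The constants `k1=0.01` and `k2=0.03` are the commonly used defaults.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM for two grayscale images (simplified sketch)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Small constants that stabilise the ratios when means/variances are near zero
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    # Luminance term compares means; the second term combines contrast and structure
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return float(luminance * contrast_structure)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
# Crude blur: average each pixel with its shifted neighbours
blurred = (img + np.roll(img, 1, axis=0) + np.roll(img, 1, axis=1)) / 3.0

print(global_ssim(img, img))      # identical images give the maximum score
print(global_ssim(img, blurred))  # blurring lowers the score
```

For real evaluations, `skimage.metrics.structural_similarity` (used later in this notebook) applies the windowed version.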

### FID (Fréchet Inception Distance)
FID is a distribution-based metric that compares the statistics of features extracted by a deep network (usually InceptionV3) between real HR images and generated SR images. It is widely used to evaluate GAN outputs and correlates well with human judgment.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

With  
$\mu_r$, $\Sigma_r$: mean and covariance of real image features  
$\mu_g$, $\Sigma_g$: mean and covariance of generated image features

Strengths: Sensitive to visual realism and diversity

Weaknesses: Requires a pre-trained model; sensitive to input preprocessing
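
The FID formula can be sanity-checked on tiny hand-made statistics before any network is involved. The sketch below is one possible way to do that with NumPy only; it swaps the explicit matrix square root for an equivalent trace computation (the sum of the square roots of the eigenvalues of the covariance product). The function name `frechet_distance` and the toy means and covariances are assumptions for illustration.

```python
import numpy as np

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between two Gaussians given their means and covariances."""
    diff = mu_r - mu_g
    # Tr((Sigma_r Sigma_g)^(1/2)) is the sum of the square roots of the
    # eigenvalues of Sigma_r @ Sigma_g; for covariance matrices these are real
    # and non-negative up to numerical noise, hence the .real and the clip.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

sigma = 2.0 * np.eye(4)                     # same covariance for both sets
mu_real = np.zeros(4)
mu_gen = np.array([3.0, 0.0, 0.0, 0.0])     # mean shifted by 3 along one axis

print(frechet_distance(mu_real, sigma, mu_real, sigma))  # 0.0: identical distributions
print(frechet_distance(mu_real, sigma, mu_gen, sigma))   # 9.0: squared mean shift only
```

When the covariances match, the trace term vanishes and only the squared distance between the means remains, which is what the second call shows.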

|Metric | Use Case | Best For | Fails At|
|---|----|----|----|
|PSNR | Basic image reconstruction | Measuring pixel-wise fidelity | Perceptual quality|
|SSIM | Structure-aware tasks | Structural similarity | Complex textures|
|FID | GAN-based models | Perceptual realism | Exact fidelity to original|

## **How to create an evaluation function including all the metrics**

In [8]:
import numpy as np
from PIL import Image  # used by evaluate_srgan for the FID preprocessing
from torchvision.models.inception import inception_v3
from torchvision.transforms import Resize, Normalize, ToTensor, Compose
from torchvision.transforms.functional import to_pil_image
from torch.nn.functional import adaptive_avg_pool2d
from skimage.metrics import peak_signal_noise_ratio 
from skimage.metrics import structural_similarity 
from scipy.linalg import sqrtm
import torch

FID (Fréchet Inception Distance) is a statistical comparison of two feature distributions extracted from images — typically high-level activations from a pretrained InceptionV3 network. The quality of the FID score depends heavily on how you prepare the images before feeding them into InceptionV3.

The network expects a certain input size and normalization, and deviating from that breaks the assumptions behind FID’s calculation. 

|Step | Why It Matters|
|---|----|
|Resize to 299x299 | InceptionV3 input requirement; ensures comparability with ImageNet features.|
|Convert to tensor | Switches from HWC to CHW and rescales to [0, 1].|
|Normalize | Brings range to [-1, 1], matching training-time normalization of the network.|

If you skip or change these steps:

* The Inception model will behave unpredictably.
* Your feature distributions will be invalid, and so will your FID.
* Your FID may look very low (good) or very high (bad), but won’t mean anything.
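
To see what the "Convert to tensor" and "Normalize" steps do numerically, here is a NumPy-only sketch that mimics them on a synthetic image. It is an illustration under assumptions: the image is made up, the resize step is skipped by creating the array at 299x299 directly, and the real pipeline in this notebook uses the torchvision transforms rather than this code.

```python
import numpy as np

# Hypothetical 8-bit RGB image in HWC layout (stand-in for an already resized image)
img_uint8 = np.zeros((299, 299, 3), dtype=np.uint8)
img_uint8[:, :150, :] = 255  # left part white, right part black

# "Convert to tensor": HWC -> CHW and rescale 0..255 to [0, 1]
chw = img_uint8.transpose(2, 0, 1).astype(np.float32) / 255.0

# "Normalize" with mean 0.5 and std 0.5 per channel: (x - 0.5) / 0.5 maps [0, 1] to [-1, 1]
normalized = (chw - 0.5) / 0.5

print(chw.shape, normalized.min(), normalized.max())
```

Feeding the network values in [0, 1] or [0, 255] instead would shift every activation, which is why the resulting FID would no longer be comparable.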

In [9]:
def preprocess_for_fid(imgs, device):
    # Normalize to [-1, 1] and resize to 299x299 (required by InceptionV3)
    transform = Compose([
        Resize((299, 299)),
        ToTensor(),
        Normalize([0.5]*3, [0.5]*3)  # from [0,1] to [-1,1]
    ])
    imgs_t = torch.stack([transform(to_pil_image((img*255).astype(np.uint8))) for img in imgs])
    return imgs_t.to(device)

def calculate_fid(features1, features2):
    mu1, sigma1 = np.mean(features1, axis=0), np.cov(features1, rowvar=False)
    mu2, sigma2 = np.mean(features2, axis=0), np.cov(features2, rowvar=False)
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    fid = diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)
    return float(fid)

In [16]:
def evaluate_srgan(sr_images, hr_images, device="cuda"):
    """
    Evaluates SRGAN results using PSNR, SSIM, and FID.

    Args:
        sr_images (List[np.ndarray]): Super-resolved images (H, W, C) in [0, 1]
        hr_images (List[np.ndarray]): High-res ground truth images (H, W, C) in [0, 1]
        device (str): "cuda" or "cpu"

    Returns:
        dict: {"PSNR": float, "SSIM": float, "FID": float}
    """

    # Transforms for InceptionV3: resize to 299x299 and normalize
    inception_transform = Compose([
        Resize((299, 299), interpolation=Image.BICUBIC),
        ToTensor(),  # Converts [H, W, C] to [C, H, W] in [0, 1]
        Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # For [-1, 1] range
    ])

    # InceptionV3 model for FID (in eval mode only the main output is returned)
    # `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
    inception = inception_v3(weights="IMAGENET1K_V1", transform_input=False)
    inception.fc = torch.nn.Identity()  # Output 2048-D pooled features
    inception.eval().to(device)

    psnr_scores, ssim_scores = [], []
    sr_tensors, hr_tensors = [], []

    for sr, hr in zip(sr_images, hr_images):
        # PSNR & SSIM (on original resolution)
        psnr = peak_signal_noise_ratio(hr, sr, data_range=1.0)
        # channel_axis=-1 replaces the deprecated multichannel=True (scikit-image >= 0.19)
        ssim = structural_similarity(hr, sr, win_size=3, channel_axis=-1, data_range=1.0)
        psnr_scores.append(psnr)
        ssim_scores.append(ssim)

        # FID preprocessing
        sr_img = Image.fromarray((sr * 255).astype(np.uint8))
        hr_img = Image.fromarray((hr * 255).astype(np.uint8))
        sr_tensor = inception_transform(sr_img)
        hr_tensor = inception_transform(hr_img)
        sr_tensors.append(sr_tensor)
        hr_tensors.append(hr_tensor)

    # Stack tensors and send to device
    sr_batch = torch.stack(sr_tensors).to(device)
    hr_batch = torch.stack(hr_tensors).to(device)

    with torch.no_grad():
        features_sr = inception(sr_batch).cpu().numpy()
        features_hr = inception(hr_batch).cpu().numpy()

    # Compute FID
    fid_score = calculate_fid(features_hr, features_sr)

    return {
        "PSNR": np.mean(psnr_scores),
        "SSIM": np.mean(ssim_scores),
        "FID": fid_score
    }
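
The function above sends every image through InceptionV3 in a single batch, which may exhaust memory on larger image sets. One possible workaround is to extract features in chunks and concatenate them, sketched below in a framework-agnostic way. The names `extract_features_in_batches`, `extract_fn` and `dummy_extractor` are hypothetical; in practice `extract_fn` would wrap the Inception forward pass and return a NumPy array.

```python
import numpy as np

def extract_features_in_batches(images, extract_fn, batch_size=64):
    """Apply extract_fn to fixed-size chunks of images and stack the results."""
    chunks = []
    for start in range(0, len(images), batch_size):
        # Stack one chunk of (C, H, W) arrays into an (N, C, H, W) batch
        batch = np.stack(images[start:start + batch_size])
        chunks.append(np.asarray(extract_fn(batch)))
    return np.concatenate(chunks, axis=0)

def dummy_extractor(batch):
    # Stand-in for the network: per-channel mean, giving an (N, 3) feature matrix
    return batch.mean(axis=(2, 3))

images = [np.full((3, 4, 4), i, dtype=np.float32) for i in range(10)]
features = extract_features_in_batches(images, dummy_extractor, batch_size=4)
print(features.shape)  # (10, 3): one feature row per image, order preserved
```

Because FID only needs the full feature matrices, chunking changes memory use but not the resulting score.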

The following routine loads images from the CIFAR-10 dataset and simulates super-resolved outputs by bicubically downscaling and then upscaling each image. Loading images from your own validation set would look very different from this structure.

In [11]:
import torchvision.transforms as T
from torchvision.datasets import CIFAR10, CIFAR100
from torch.utils.data import DataLoader
import numpy as np
from PIL import Image
import torch

def load_cifar_lr_hr_pairs(
    dataset="CIFAR10",
    root="../data",
    split="test",
    upscale_factor=4,
    max_images=None
):
    """
    Loads HR images from CIFAR and generates corresponding SR (upscaled LR) images.

    Args:
        dataset (str): "CIFAR10" (default) or "CIFAR100".
        root (str): Directory where the dataset will be downloaded/stored.
        split (str): "train" or "test".
        upscale_factor (int): Factor to downscale and then upscale for LR simulation.
        max_images (int): If set, limits number of images to load.

    Returns:
        sr_images (List[np.ndarray]): Upscaled LR images in [0, 1], shape (H, W, 3)
        hr_images (List[np.ndarray]): Original CIFAR images in [0, 1], shape (H, W, 3)
    """
    # Pick the dataset class requested by name (the argument was previously ignored)
    dataset_classes = {"CIFAR10": CIFAR10, "CIFAR100": CIFAR100}
    if dataset not in dataset_classes:
        raise ValueError(f"Unsupported dataset: {dataset!r}")
    DatasetClass = dataset_classes[dataset]
    cifar_data = DatasetClass(
        root=root,
        train=(split == "train"),
        download=True,
        transform=T.ToTensor()  # Get tensor in [0, 1]
    )

    # Downscale and upscale transforms
    downscale = T.Resize(32 // upscale_factor, interpolation=T.InterpolationMode.BICUBIC)
    upscale = T.Resize(32, interpolation=T.InterpolationMode.BICUBIC)

    sr_images, hr_images = [], []
    for i, (img_tensor, _) in enumerate(cifar_data):
        if max_images is not None and i >= max_images:
            break
        img_pil = T.ToPILImage()(img_tensor)

        # HR image (original)
        hr = np.asarray(img_pil).astype(np.float32) / 255.0

        # SR image (down-up simulation)
        lr = downscale(img_pil)
        sr = upscale(lr)
        sr = np.asarray(sr).astype(np.float32) / 255.0

        hr_images.append(hr)
        sr_images.append(sr)

    return sr_images, hr_images


In [17]:

sr_images, hr_images = load_cifar_lr_hr_pairs(
    dataset="CIFAR10", split="test", upscale_factor=4, max_images=1000)

results = evaluate_srgan(sr_images, hr_images, device="cpu")
print("Evaluation Results:")
for metric, value in results.items():
    print(f"{metric}: {value:.4f}")

Evaluation Results:
PSNR: 21.1966
SSIM: 0.6843
FID: 184.9726


## **Interpreting Results**

1. PSNR: 21.20 dB
PSNR (Peak Signal-to-Noise Ratio) measures the pixel-level similarity between your SR (super-resolved) images and HR (high-resolution ground truth) images.

Higher is better

Typical range:

* >30 dB: Excellent (close to HR)
* 25–30 dB: Good
* 20–25 dB: Acceptable
* <20 dB: Poor

Interpretation:
A PSNR of 21.20 dB is in the acceptable range. It means the SR images resemble the HR images somewhat, but with noticeable differences at the pixel level — possibly artifacts or blurring.
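
To get a feel for what 21.20 dB means in pixel terms, the PSNR formula can be inverted to recover the root-mean-square error. The sketch below does this for the value printed above; the helper name `rmse_from_psnr` is an assumption. Since the reported PSNR is an average over images, the result describes a single image at that PSNR rather than the exact error of the whole set.

```python
import math

def rmse_from_psnr(psnr_db, max_value=1.0):
    """Invert PSNR = 10 * log10(MAX^2 / MSE) to get the RMSE."""
    mse = max_value ** 2 / (10.0 ** (psnr_db / 10.0))
    return math.sqrt(mse)

rmse = rmse_from_psnr(21.1966)
print(f"RMSE on [0, 1] scale: {rmse:.4f}")
print(f"RMSE in 8-bit levels: {rmse * 255:.1f}")
```

The error comes out at roughly 0.087 on the [0, 1] scale, or about 22 intensity levels out of 255, which is consistent with visibly blurred but recognisable images.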

2. SSIM: 0.6843
SSIM (Structural Similarity Index) evaluates the perceptual similarity between SR and HR images based on structure, luminance, and contrast.

Closer to 1 is better

Typical range:

* >0.90: Very high perceptual similarity
* 0.75–0.90: Good
* 0.60–0.75: Moderate
* <0.60: Low

Interpretation:
An SSIM of 0.6843 suggests a moderate level of perceptual similarity. The general structure is preserved, but the finer details (textures, edges) may be degraded or different from the original.

3. FID: 184.97
FID (Fréchet Inception Distance) measures the distributional difference between the features of SR and HR images (using a pretrained model like InceptionV3).

Lower is better

Typical range (varies by dataset):

* <10: Excellent (often achieved by the best models on large datasets)
* 10–50: Good
* 50–100: Fair
* >100: Poor

Interpretation:
An FID of 184.97 is quite high, indicating that the SR images are far from matching the feature distribution of the real HR images.

These mediocre-to-poor FID results are expected. FID focuses on perceptual quality, and we are evaluating very low-resolution images (32×32 CIFAR images upscaled to 299×299 for InceptionV3), so it is not going to score well. On the other hand, the SR images here are degraded copies of the originals, which is why the metrics that measure fidelity to the originals (PSNR and SSIM) perform much better. The important thing is that we now have a function that receives a batch of images and computes all three metrics successfully, so we can evaluate our models adequately.