# Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (ESPCN)

---

## A Reproducibility Effort

In this blog post, we document our effort to reproduce the results reported in Table 1 of the paper [Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network](https://arxiv.org/abs/1609.05158) by Shi et al. Building on the code by Jeffrey Yeo (yjn870) [[GitHub repo](https://github.com/yjn870/ESPCN-pytorch)], we made some improvements and additionally reproduced the paper's 4K and video experiments, which were not part of Yeo's work. We present our reproducibility effort as follows.

<img src="thumbnails/DL_Teaser.png" alt="Drawing" style="width: 400px;"/>

## Image Super-Resolution

In deep learning, especially in computer vision and related research, many researchers have focused on the *ill-posed* image **super-resolution** (SR) problem, which involves upscaling a low-resolution image into a high-resolution space. The technique can be used to restore image quality and to enhance general image processing. It has been widely applied in fields like facial recognition, medical imaging and satellite imaging, and this broad range of use cases has made it one of the most popular topics in computer vision.

## Why do we need *ESPCN*?
 
In previous SR models like [SRCNN](https://arxiv.org/abs/1501.00092) and [TNRD](https://arxiv.org/pdf/1508.02848), the super-resolution operation was carried out in the high-resolution (HR) space. The sub-optimality and added computational complexity of this approach motivated the development of the **Efficient Sub-Pixel Convolutional Neural Network** (ESPCN). Upscaling the low-resolution (LR) image before the enhancement step is the main cause of the increased computational complexity, and in convolutional networks this complexity severely affects the speed of the implementation. Moreover, the traditional interpolation methods previously used for this upscaling do not bring in the additional information needed to solve this ill-posed problem.

Notably, in the proposed ESPCN, feature maps are extracted in the LR space and the upscaling from LR to HR is performed only in the final layer of the network. Super-resolving HR data from LR feature maps in this way greatly increases the efficiency of the SR model, as most of the computation is done in the smaller LR space. What's more, ESPCN uses no explicit interpolation filter, so the network implicitly learns the processing necessary for super-resolution. It is thus able to learn a better mapping from the low-resolution image to the high-resolution image than a single fixed filter would provide.
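To make the efficiency argument concrete, here is a rough back-of-the-envelope sketch that counts multiply-accumulate operations (MACs) for the same stack of convolutions run on an LR image and on its HR counterpart. The layer sizes are taken from the `ESPCN` class shown later in this post; the 100x100 image size and the helper names are our own assumptions, and the comparison only illustrates the scaling with the upscaling factor, not the actual cost of SRCNN or TNRD.

```python
def conv_macs(height, width, c_in, c_out, k):
    """MACs of one k x k convolution with 'same' padding on a height x width map."""
    return height * width * c_in * c_out * k * k


def network_macs(height, width, layers):
    """Total MACs of a stack of convolutions applied at a fixed spatial size."""
    return sum(conv_macs(height, width, c_in, c_out, k) for c_in, c_out, k in layers)


scale = 3
# (in_channels, out_channels, kernel_size), mirroring the ESPCN class in this post
layers = [(1, 64, 5), (64, 32, 3), (32, scale ** 2, 3)]

lr_h, lr_w = 100, 100  # hypothetical LR image size
macs_lr = network_macs(lr_h, lr_w, layers)
macs_hr = network_macs(lr_h * scale, lr_w * scale, layers)

# Every layer sees scale**2 times more pixels in HR space.
print(macs_hr / macs_lr)
```

The ratio equals `scale ** 2`, independent of the layer configuration, because each convolution's cost is proportional to the number of pixels it is applied to.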

![ESPCN with 2 CNN layers and 1 Sub-Pixel Convolution Layer](thumbnails/fig1.png)

##### ESPCN with 2 CNN layers and 1 Sub-Pixel Convolution Layer
In the proposed architecture, an *L*-layer convolutional neural network is first applied directly to the LR image, after which a *sub-pixel convolutional layer* upscales the LR feature maps to generate the super-resolved image. The paper also discusses the *deconvolution layer*, a more general form of the interpolation filter that can capture more information than a fixed filter. For brevity, the mathematical treatment is omitted here and can be found on the poster.
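The rearrangement performed by the sub-pixel layer (periodic shuffling) can be sketched in a few lines of NumPy. This is a minimal illustration written for this post, not the paper's or PyTorch's implementation; the function name `periodic_shuffle` is our own, and we assume the channel layout in which output pixel `(h*r + i, w*r + j)` is read from channel `i*r + j`, which to our understanding matches `nn.PixelShuffle`.

```python
import numpy as np


def periodic_shuffle(x, r):
    """Rearrange (N, C*r*r, H, W) feature maps into an (N, C, H*r, W*r) image."""
    n, c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(n, c, r, r, h, w)       # split channels into (c, i, j)
    x = x.transpose(0, 1, 4, 2, 5, 3)     # reorder axes to (n, c, h, i, w, j)
    return x.reshape(n, c, h * r, w * r)  # row index h*r + i, column index w*r + j


# Four 1x1 feature maps become a single 2x2 output tile.
tiny = np.arange(4).reshape(1, 4, 1, 1)
tile = periodic_shuffle(tiny, r=2)

# With the settings used in this post (scale 3, 17x17 LR patches):
features = np.zeros((1, 9, 17, 17), dtype=np.float32)
upscaled = periodic_shuffle(features, r=3)
print(tile[0, 0], upscaled.shape)
```

No arithmetic is involved in this step; the last convolution produces `r * r` values per LR pixel and the shuffle only places them at the right HR positions.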

As part of our reproducibility effort, we verify whether the proposed ESPCN actually outperforms previous super-resolution models, through the following experiments:

- We validate the proposed approach using images and videos from publicly available benchmark datasets.
- As an enhancement we propose to use **GELU** as an activation function for the ESPCN model.
- We also develop a **video Super-Resolution** pipeline from scratch (not a part of the previous code)
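For readers unfamiliar with GELU, the sketch below evaluates its exact form, `x * Phi(x)` with `Phi` the standard normal CDF, next to ReLU using only the standard library. It is meant as an illustration of the activation itself; in a PyTorch model one would presumably swap `nn.ReLU()` for `nn.GELU()`, which is an assumption about how the variant was wired in rather than something shown in this post.

```python
import math


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(value, relu(value), round(gelu(value), 4))
```

Unlike ReLU, GELU is smooth and lets small negative values through, which is the property that motivated trying it here.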

## Experiment Setup

### Dataset Description

The image datasets used for evaluation and ablation studies are publicly available benchmark datasets.
1. Training data: 
    - The Timofte dataset, which contains 91 images.
2. Three evaluation datasets:
    - Set5 Images
    - Set14 Images
    - BSD500 Images
    
Our experimental results are presented in the following sections.

As for the video experiments, the authors of the paper use the publicly available **Xiph** dataset. As video super-resolution was not part of the reproducibility task, we decided to train our model on the publicly available **4K Images** dataset ([link](https://www.kaggle.com/evgeniumakov/images4k)). We present some frames from the video SR results obtained with this model.
 
According to the paper, the authors ran their experiments on a **K2 GPU**, while we used the **Tesla K80 GPU** available through **Google Colab** (*in a bid to obtain standardized results*).

### The experiment involves two important steps
1. Generating low-resolution images from the given data (HR space -> LR space).
    - We first downscale the images by a factor of 3 (the `scale` parameter) using bicubic resampling,
    - followed by extraction of 17x17 pixel sub-images from the LR images, paired with the matching sub-images of the original HR images.
2. The model then applies a periodic-shuffling operation to the features computed from these low-res sub-images, and the result is compared against the HR sub-images to train and evaluate the model.

*Additionally, we work only with the **Y-channel (luminance)**, since human vision is more sensitive to changes in luminance than to changes in colour.*
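The data-preparation code below relies on a `convert_rgb_to_y` helper that is not reproduced in this post. As a rough idea of what such a helper does, here is a sketch using the common ITU-R BT.601 "studio range" luma formula; the function name `rgb_to_y` and the exact coefficients are our assumptions and may differ from the helper actually used.

```python
import numpy as np


def rgb_to_y(img):
    """Luminance of an (H, W, 3) RGB image with values in [0, 255] (BT.601, assumed)."""
    img = np.asarray(img, dtype=np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (64.738 * r + 129.057 * g + 25.064 * b) / 256.0


black = np.zeros((1, 1, 3))
white = np.full((1, 1, 3), 255.0)
print(rgb_to_y(black)[0, 0], rgb_to_y(white)[0, 0])
```

With this convention black maps to 16 and white to roughly 234, so the Y channel does not span the full 0 to 255 range.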

#### Data Preparation

In [None]:
import argparse
import glob

import h5py
import numpy as np
import PIL.Image as pil_image

from utils import convert_rgb_to_y


def prepare(args):
    h5_file = h5py.File(args.output_path, 'w')

    lr_patches = []
    hr_patches = []

    for image_path in sorted(glob.glob('{}/*'.format(args.images_dir))):
        hr = pil_image.open(image_path).convert('RGB')
        # Make the HR size a multiple of the scale factor, then downscale bicubically.
        hr_width = (hr.width // args.scale) * args.scale
        hr_height = (hr.height // args.scale) * args.scale
        hr = hr.resize((hr_width, hr_height), resample=pil_image.BICUBIC)
        lr = hr.resize((hr_width // args.scale, hr_height // args.scale), resample=pil_image.BICUBIC)
        hr = np.array(hr).astype(np.float32)
        lr = np.array(lr).astype(np.float32)
        # Keep only the luminance (Y) channel.
        hr = convert_rgb_to_y(hr)
        lr = convert_rgb_to_y(lr)

        # Slide a patch_size x patch_size window over the LR image and cut the matching HR patch.
        for i in range(0, lr.shape[0] - args.patch_size + 1, args.stride):
            for j in range(0, lr.shape[1] - args.patch_size + 1, args.stride):
                lr_patches.append(lr[i:i + args.patch_size, j:j + args.patch_size])
                hr_patches.append(hr[i * args.scale:i * args.scale + args.patch_size * args.scale, j * args.scale:j * args.scale + args.patch_size * args.scale])

    lr_patches = np.array(lr_patches)
    hr_patches = np.array(hr_patches)

    h5_file.create_dataset('lr', data=lr_patches)
    h5_file.create_dataset('hr', data=hr_patches)

    h5_file.close()


# The input parameters for data preparation.
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--images-dir', type=str, required=True)
    parser.add_argument('--output-path', type=str, required=True)
    parser.add_argument('--scale', type=int, default=3)
    parser.add_argument('--patch-size', type=int, default=17)
    parser.add_argument('--stride', type=int, default=13)
    parser.add_argument('--eval', action='store_true')  # not used by prepare() as shown here
    args = parser.parse_args()

    prepare(args)
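To get a feel for how many training pairs the two patch-extraction loops in `prepare` produce, the window arithmetic can be reproduced on its own. The helper name `count_patches` and the 300x300 example image are assumptions made for illustration; the defaults mirror the `--patch-size` and `--stride` arguments above.

```python
def count_patches(lr_height, lr_width, patch_size=17, stride=13):
    """Number of LR patches cut from one image by the sliding-window loops."""
    rows = len(range(0, lr_height - patch_size + 1, stride))
    cols = len(range(0, lr_width - patch_size + 1, stride))
    return rows * cols


scale = 3
hr_height, hr_width = 300, 300  # hypothetical HR image size
n_patches = count_patches(hr_height // scale, hr_width // scale)
hr_patch_size = 17 * scale      # side length of each matching HR patch

print(n_patches, hr_patch_size)
```

Because the stride (13) is smaller than the patch size (17), neighbouring patches overlap by 4 pixels, and images smaller than 17x17 in LR space contribute no patches at all.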

## Running the experiment
### Network framework

In [3]:
'''
This is the design of our ESPCN network, including the weight
initialization and the forward method
'''
import math
from torch import nn

'''
Based on the explanation given in the paper,
Number of layers (l) = 3 -> 2 CNN + 1 Sub-pixel
Kernel Input is of the form (f_i, n_i) where (5,64) -> (3, 32) -> 3 
''' 

class ESPCN(nn.Module):
    def __init__(self, scale_factor, num_channels=1):
        super(ESPCN, self).__init__()
        self.first_part = nn.Sequential(
            nn.Conv2d(num_channels, 64, kernel_size=5, padding=5//2),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=3//2),
            nn.ReLU(),
        )
        self.last_part = nn.Sequential(
            nn.Conv2d(32, num_channels * (scale_factor ** 2), kernel_size=3, padding=3 // 2),
            nn.PixelShuffle(scale_factor)
        )

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                if m.in_channels == 32:
                    nn.init.normal_(m.weight.data, mean=0.0, std=0.001)
                    nn.init.zeros_(m.bias.data)
                else:
                    nn.init.normal_(m.weight.data, mean=0.0, std=math.sqrt(2/(m.out_channels*m.weight.data[0][0].numel())))
                    nn.init.zeros_(m.bias.data)

    def forward(self, x):
        x = self.first_part(x)
        x = self.last_part(x)
        return x
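As a sanity check on the size of this network, the parameter count and the standard deviations used in `_initialize_weights` can be worked out by hand. The sketch below only redoes that arithmetic for `scale_factor=3` and `num_channels=1`; the helper names are our own and nothing here touches the actual model.

```python
import math

scale = 3
# (in_channels, out_channels, kernel_size) of the three Conv2d layers above
layers = [(1, 64, 5), (64, 32, 3), (32, 1 * scale ** 2, 3)]


def conv_params(c_in, c_out, k):
    """Weights plus biases of one Conv2d layer."""
    return c_in * c_out * k * k + c_out


def init_std(c_in, c_out, k):
    """Standard deviation chosen by _initialize_weights for a given layer."""
    if c_in == 32:  # the last convolution gets a small fixed std
        return 0.001
    return math.sqrt(2 / (c_out * k * k))


param_counts = [conv_params(*layer) for layer in layers]
total_params = sum(param_counts)
stds = [init_std(*layer) for layer in layers]

print(param_counts, total_params)
print([round(s, 4) for s in stds])
```

The whole model has a little under 23 thousand parameters, most of them in the second convolution.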

Our chosen hyperparameters are as follows:

| Hyper-parameters | Value |
| :--- | ------ |
| `Scale` | **3** |
| `learning rate` | **1e-3** |
| `batch-size` | **16** |
| `number of epochs` | **200** |
| `number of workers` | **8** |

### Image super resolution
#### Training
The code for training the ESPCN model is as follows.

For training, we used the _91-image dataset_ as well as the _4K image dataset_.

In [None]:
import argparse
import os
import copy

import torch
from torch import nn
import torch.optim as optim
import torch.backends.cudnn as cudnn
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm

from models import ESPCN
from datasets import TrainDataset, EvalDataset
from utils import AverageMeter, calc_psnr


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-file', type=str, required=True)
    parser.add_argument('--eval-file', type=str, required=True)
    parser.add_argument('--outputs-dir', type=str, required=True)
    parser.add_argument('--weights-file', type=str)
    parser.add_argument('--scale', type=int, default=3)
    parser.add_argument('--lr', type=float, default=1e-3)
    parser.add_argument('--batch-size', type=int, default=16)
    parser.add_argument('--num-epochs', type=int, default=200)
    parser.add_argument('--num-workers', type=int, default=8)
    parser.add_argument('--seed', type=int, default=123)
    args = parser.parse_args()

    args.outputs_dir = os.path.join(args.outputs_dir, 'x{}'.format(args.scale))

    if not os.path.exists(args.outputs_dir):
        os.makedirs(args.outputs_dir)

    cudnn.benchmark = True
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    torch.manual_seed(args.seed)

    model = ESPCN(scale_factor=args.scale).to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam([
        {'params': model.first_part.parameters()},
        {'params': model.last_part.parameters(), 'lr': args.lr * 0.1}
    ], lr=args.lr)

    train_dataset = TrainDataset(args.train_file)
    train_dataloader = DataLoader(dataset=train_dataset,
                                  batch_size=args.batch_size,
                                  shuffle=True,
                                  num_workers=args.num_workers,
                                  pin_memory=True)
    eval_dataset = EvalDataset(args.eval_file)
    eval_dataloader = DataLoader(dataset=eval_dataset, batch_size=1)

    best_weights = copy.deepcopy(model.state_dict())
    best_epoch = 0
    best_psnr = 0.0

    for epoch in range(args.num_epochs):
        for param_group in optimizer.param_groups:
            param_group['lr'] = args.lr * (0.1 ** (epoch // int(args.num_epochs * 0.8)))

        model.train()
        epoch_losses = AverageMeter()

        with tqdm(total=(len(train_dataset) - len(train_dataset) % args.batch_size), ncols=80) as t:
            t.set_description('epoch: {}/{}'.format(epoch, args.num_epochs - 1))

            for data in train_dataloader:
                inputs, labels = data

                inputs = inputs.to(device)
                labels = labels.to(device)

                preds = model(inputs)

                loss = criterion(preds, labels)

                epoch_losses.update(loss.item(), len(inputs))

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                t.set_postfix(loss='{:.6f}'.format(epoch_losses.avg))
                t.update(len(inputs))

        torch.save(model.state_dict(), os.path.join(args.outputs_dir, 'epoch_{}.pth'.format(epoch)))

        model.eval()
        epoch_psnr = AverageMeter()

        for data in eval_dataloader:
            inputs, labels = data

            inputs = inputs.to(device)
            labels = labels.to(device)

            with torch.no_grad():
                preds = model(inputs).clamp(0.0, 1.0)

            epoch_psnr.update(calc_psnr(preds, labels), len(inputs))

        print('eval psnr: {:.2f}'.format(epoch_psnr.avg))

        if epoch_psnr.avg > best_psnr:
            best_epoch = epoch
            best_psnr = epoch_psnr.avg
            best_weights = copy.deepcopy(model.state_dict())

    print('best epoch: {}, psnr: {:.2f}'.format(best_epoch, best_psnr))
    torch.save(best_weights, os.path.join(args.outputs_dir, 'best.pth'))
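The learning-rate schedule buried in the epoch loop above is easier to see in isolation. The sketch below re-evaluates the same expression for the default settings; `lr_at_epoch` is a name we introduce for illustration.

```python
def lr_at_epoch(epoch, base_lr=1e-3, num_epochs=200):
    """Step decay: divide the learning rate by 10 after 80% of the epochs."""
    return base_lr * (0.1 ** (epoch // int(num_epochs * 0.8)))


for epoch in (0, 159, 160, 199):
    print(epoch, lr_at_epoch(epoch))
```

With 200 epochs the rate stays at `1e-3` for epochs 0 to 159 and drops to about `1e-4` for the last 40. Note that, as the loop is written, this value is assigned to every parameter group at the start of each epoch, so the separate `args.lr * 0.1` passed for `last_part` when constructing the optimizer is overwritten before the first update.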

Building on the open-source code, we made some improvements to the network ourselves. By changing the network structure, we achieved better results than before.

#### Testing
The code for testing the ESPCN model is as follows.

For testing, we used the _Set5, Set14 and BSD500_ datasets as well as the _4K image dataset_.

In [None]:
import argparse

import torch
import torch.backends.cudnn as cudnn
import numpy as np
import PIL.Image as pil_image

from models import ESPCN
from utils import convert_ycbcr_to_rgb, preprocess, calc_psnr

def test(args):

    cudnn.benchmark = True
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    model = ESPCN(scale_factor=args.scale).to(device)

    # Copy the saved weights into the model, failing loudly on unexpected keys.
    state_dict = model.state_dict()
    for n, p in torch.load(args.weights_file, map_location=lambda storage, loc: storage).items():
        if n in state_dict.keys():
            state_dict[n].copy_(p)
        else:
            raise KeyError(n)

    model.eval()

    image = pil_image.open(args.image_file).convert('RGB')

    # Make the image size a multiple of the scale factor.
    image_width = (image.width // args.scale) * args.scale
    image_height = (image.height // args.scale) * args.scale

    # Build the HR reference, the LR input and the bicubic baseline.
    hr = image.resize((image_width, image_height), resample=pil_image.BICUBIC)
    lr = hr.resize((hr.width // args.scale, hr.height // args.scale), resample=pil_image.BICUBIC)
    bicubic = lr.resize((lr.width * args.scale, lr.height * args.scale), resample=pil_image.BICUBIC)
    bicubic.save(args.image_file.replace('.', '_bicubic_x{}.'.format(args.scale)))

    lr, _ = preprocess(lr, device)
    hr, _ = preprocess(hr, device)
    _, ycbcr = preprocess(bicubic, device)

    with torch.no_grad():
        preds = model(lr).clamp(0.0, 1.0)

    psnr = calc_psnr(hr, preds)
    print('PSNR: {:.2f}'.format(psnr))

    preds = preds.mul(255.0).cpu().numpy().squeeze(0).squeeze(0)

    # Combine the super-resolved Y channel with the bicubic Cb and Cr channels.
    output = np.array([preds, ycbcr[..., 1], ycbcr[..., 2]]).transpose([1, 2, 0])
    output = np.clip(convert_ycbcr_to_rgb(output), 0.0, 255.0).astype(np.uint8)
    output = pil_image.fromarray(output)
    output.save(args.image_file.replace('.', '_espcn_x{}.'.format(args.scale)))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights-file', type=str, required=True)
    parser.add_argument('--image-file', type=str, required=True)
    parser.add_argument('--scale', type=int, default=3)
    args = parser.parse_args()

    test(args)

#### Image SR reproducibility results 

Using the training and test sets described above, we obtain the following PSNR values (in dB).

|_Dataset_ | _Scale_ | _relu_ | _tanh_ | _gelu_ | _paper_ |
|----------|---------|--------|--------|--------|---------|
| ` Set5`  |   3   |**33.13**|32.88|32.99|33.00|
| ` Set14`  |   3   |**29.49**|29.33|29.40|29.42|
| ` BSD500`  |   3   |**28.87**|28.69|28.77|28.62|
| ` 4K(test)`  |   3   |43.61| |46.25| |

* With ReLU, **higher PSNR** values were achieved on all three benchmark datasets compared to the results reported in the paper. The tanh and GELU variants fall slightly below the paper on Set5 and Set14 and exceed it on BSD500.

* Additionally, it should be noted that the model from the paper is trained on ImageNet (~50,000 images), whereas our experimental results were obtained using the significantly smaller 91-image dataset.
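The PSNR values in the table come from the `calc_psnr` helper, whose source is not shown in this post. For reference, the standard definition for images scaled to `[0, 1]` can be sketched as follows; the function name `psnr` and the `max_val` parameter are our own, and the real helper may differ in details such as border cropping.

```python
import numpy as np


def psnr(img1, img2, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = np.asarray(img1, dtype=np.float64) - np.asarray(img2, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)


reference = np.zeros((4, 4))
noisy = np.full((4, 4), 0.1)  # constant error of 0.1 gives an MSE of 0.01
print(psnr(reference, noisy))
```

Because the scale is logarithmic, a gain of 0.1 dB corresponds to roughly a 2% reduction in mean squared error, which helps put the differences between the columns of the table into perspective.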

We also compare the results of ESPCN with bicubic interpolation, and ESPCN clearly produces better results.

---
![Comparison between ESPCN and bicubic](thumbnails/result.jpg)

---


#### Hyperparameter tuning

In order to understand the effects and eventually tune the hyperparameters for our model, we carried out several ablation studies as follows.

<table><tr><td><img src="thumbnails/activations.png" alt="Drawing" style="width: 400px;"/></td><td><img src="thumbnails/psnrbatch.jpeg" alt="Drawing" style="width: 400px;"/></td></tr></table>
<table><tr><td><img src="thumbnails/psnrlearningrate.jpeg" alt="Drawing" style="width: 400px;"/></td><td><img src="thumbnails/psnrscale.jpeg" alt="Drawing" style="width: 400px;"/></td></tr></table>

From the experiments, we posit the following results:
* We observe optimal performance for `batch-size = 4`.
* The performance (PSNR value) drops significantly for `learning rate > 0.08`.
* `scale = 3` gives optimal performance i.e. the highest PSNR value (but it seems to be quite data dependent).
* Observed performance of activation functions based on PSNR scores: `ReLU > GELU > Tanh`.

---

## Video super resolution

Building on the image super-resolution model, we built a pipeline for video super-resolution.

* The video pipeline takes a 1080p video file as input.
* It breaks the video down into frames (images) by sampling at 30 frames per second.
* It super-resolves each frame to 4K using the trained model and recombines the frames into a video.
* Potential enhancement: live video conversion.
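The steps above can be sketched roughly as follows. This is not our actual pipeline code: the function names are hypothetical, `nearest_upscale` is a stand-in for running the trained ESPCN model on a frame, and the video I/O assumes OpenCV (`cv2`) is installed, which is why it is imported only inside `convert_video`.

```python
import numpy as np


def super_resolve_frames(frames, upscale_fn):
    """Lazily apply upscale_fn to every frame of an iterable."""
    for frame in frames:
        yield upscale_fn(frame)


def nearest_upscale(frame, scale=4):
    """Placeholder for the trained model: nearest-neighbour upscaling."""
    return np.repeat(np.repeat(frame, scale, axis=0), scale, axis=1)


def convert_video(in_path, out_path, upscale_fn):
    """Read a video, upscale each frame and write the result (assumes OpenCV)."""
    import cv2  # imported here so the frame logic above works without OpenCV

    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back to 30 fps if unknown

    def read_frames():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            yield frame

    writer = None
    for out in super_resolve_frames(read_frames(), upscale_fn):
        if writer is None:  # the output size is only known after the first frame
            height, width = out.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
            writer = cv2.VideoWriter(out_path, fourcc, fps, (width, height))
        writer.write(out)

    cap.release()
    if writer is not None:
        writer.release()


# The frame-wise part can be tried without any video file:
dummy_frames = [np.zeros((2, 3, 3), dtype=np.uint8) for _ in range(2)]
results = list(super_resolve_frames(dummy_frames, nearest_upscale))
```

Keeping the per-frame step as a generator means frames are processed one at a time instead of being held in memory, which would also be the natural starting point for the live conversion mentioned above.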

The hyperparameter settings for video SR are as follows:

| Hyper-parameters | Value |
| :--- | ------ |
| `Scale` | **4** |
| `learning rate` | **1e-3** |
| `batch-size` | **16** |
| `number of epochs` | **200** |
| `number of workers` | **8** |

#### Results
We present the images below as the result of video super-resolution. We conclude that the model does a pretty good job of sharpening the images.

<img src="thumbnails/superResVs1080P.png" alt="Drawing" style="width: 400px;"/>
<img src="thumbnails/heyenaVs.png" alt="Drawing" style="width: 400px;"/>
<img src="thumbnails/elephant.png" alt="Drawing" style="width: 400px;"/>

---