# Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (ESPCN)


## A Reproducibility Effort

In this blog post, we document our effort to reproduce the results reported in Table 1 of the paper [Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network](https://arxiv.org/abs/1609.05158) by Shi et al. Building on the code by Jeffrey Yeo (yjn870) [[GitHub repo](https://github.com/yjn870/ESPCN-pytorch)], we made several improvements and additionally reproduced the paper's 4K and video experiments, which were not part of Yeo's work. We present our reproducibility effort below.

## Image Super-Resolution

In deep learning, especially in computer vision, many researchers have focused on the *ill-posed* image **super-resolution** (SR) problem, which involves upscaling a low-resolution image into a high-resolution space. The technique can restore image quality and enhance general image processing, and it has been widely applied in fields such as facial recognition, medical imaging, and satellite imaging. This breadth of use cases has made it one of the most popular topics in computer vision.

## Why do we need *ESPCN*?
 
In previous SR models like [SRCNN](https://arxiv.org/abs/1501.00092) and [TNRD](https://arxiv.org/pdf/1508.02848), the super-resolution operation was carried out in the high-resolution (HR) space. The sub-optimality and added computational complexity of this approach motivated the development of the **Efficient Sub-Pixel Convolutional Neural Network** (ESPCN). Upscaling the low-resolution (LR) image before the image enhancement step is the main cause of the increased computational complexity, and in convolutional networks this complexity severely reduces the speed of the implementation. Moreover, the traditional interpolation methods previously used for upscaling fail to bring in the additional, crucial information required to solve this ill-posed problem.

Notably, in the proposed ESPCN, feature maps are extracted in the LR space, and the upscaling from LR to HR is performed only in the final layer of the network. Super-resolving HR data from LR feature maps in this way greatly increases the efficiency of the SR model, because most of the computation is done in the smaller LR space. What's more, ESPCN uses no explicit interpolation filter, which means the network implicitly learns the processing necessary for super-resolution. It is thus able to learn a better mapping from the low-resolution image to the high-resolution image than a single fixed filter would provide.
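
To make the efficiency argument concrete, here is a rough back-of-the-envelope sketch that counts multiply-accumulate operations for the two feature-extraction layers when they run on an LR frame versus the same frame upscaled first. The frame size and the helper name `conv_macs` are our own illustrative assumptions, not values from the paper.

```python
def conv_macs(height, width, in_ch, out_ch, kernel):
    """Multiply-accumulates for one stride-1, same-padded convolution layer."""
    return height * width * in_ch * out_ch * kernel * kernel


r = 3                    # upscaling factor
lr_h, lr_w = 180, 320    # illustrative LR frame size (assumption)

# (in_channels, out_channels, kernel_size) of the two feature-extraction layers
layers = [(1, 64, 5), (64, 32, 3)]

# Cost when the layers see the LR image directly
macs_lr = sum(conv_macs(lr_h, lr_w, i, o, k) for i, o, k in layers)
# Cost when the image is upscaled by r before the same layers are applied
macs_hr = sum(conv_macs(lr_h * r, lr_w * r, i, o, k) for i, o, k in layers)

ratio = macs_hr / macs_lr
print(f"HR-space cost is {ratio:.0f}x the LR-space cost")
```

Because the cost of a convolution grows with the number of pixels, running the same layers after upscaling costs a factor of `r * r` more.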

![ESPCN with 2 CNN layers and 1 Sub-Pixel Convolution Layer](thumbnails/fig1.png)

##### ESPCN with 2 CNN layers and 1 Sub-Pixel Convolution Layer
In the proposed architecture, an *L*-layer convolutional neural network is first applied directly to the LR image, after which a *sub-pixel convolutional layer* upscales the LR feature maps to generate the super-resolved image. The paper also discusses a *deconvolution layer*, which is a more generic form of the interpolation filter and can capture more information than a fixed filter.
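
The rearrangement performed by the sub-pixel layer can be sketched with PyTorch's `nn.PixelShuffle`, which is also what the network code later in this post uses. The snippet below is only an illustration on a toy tensor: it checks that `r * r` LR channels become one HR channel that is `r` times larger in each spatial dimension, and that the same result can be obtained by hand with a reshape and a permute.

```python
import torch
from torch import nn

r = 3  # upscaling factor
# Toy LR feature maps: batch of 1, r*r channels, 2x2 pixels
lr_features = torch.arange(r * r * 2 * 2, dtype=torch.float32).reshape(1, r * r, 2, 2)

# (N, C*r*r, H, W) -> (N, C, H*r, W*r)
shuffled = nn.PixelShuffle(r)(lr_features)

# The same rearrangement by hand: split the channels into an r x r grid
# and interleave that grid with the spatial positions.
n, c, h, w = lr_features.shape
manual = (
    lr_features.reshape(n, c // (r * r), r, r, h, w)
    .permute(0, 1, 4, 2, 5, 3)
    .reshape(n, c // (r * r), h * r, w * r)
)

print(shuffled.shape)
print(torch.equal(shuffled, manual))
```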

As part of our reproducibility effort, we verify whether the proposed ESPCN can actually outperform previous super-resolution models:

- We validate the proposed approach using images and videos from publicly available benchmark datasets.
- As an enhancement, we propose using **GELU** as the activation function for the ESPCN model.
- We also develop a **video super-resolution** pipeline from scratch (not part of the previous code).
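
As a minimal sketch of the GELU enhancement, the snippet below builds the same three-layer stack with a configurable activation. It assumes that GELU simply replaces the two `Tanh` activations; the helper name `make_espcn` is hypothetical and is not part of the original repository.

```python
import torch
from torch import nn


def make_espcn(scale_factor, num_channels=1, activation=nn.Tanh):
    """Build the 3-layer ESPCN stack with a configurable activation (illustrative)."""
    return nn.Sequential(
        nn.Conv2d(num_channels, 64, kernel_size=5, padding=5 // 2),
        activation(),
        nn.Conv2d(64, 32, kernel_size=3, padding=3 // 2),
        activation(),
        nn.Conv2d(32, num_channels * scale_factor ** 2, kernel_size=3, padding=3 // 2),
        nn.PixelShuffle(scale_factor),
    )


# Assumption: GELU is used in place of both Tanh activations
gelu_model = make_espcn(scale_factor=3, activation=nn.GELU)

with torch.no_grad():
    out = gelu_model(torch.rand(1, 1, 17, 17))

print(out.shape)  # spatial size is multiplied by the scale factor
```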

## Experiment Setup

The two image datasets used for evaluation are publicly available benchmark datasets. The first is the Timofte dataset, which contains 91 training images and two test datasets. The second consists of 50,000 randomly selected ImageNet images used for training.

For the video experiments, the authors of the paper use the publicly available Xiph database. *INPUT OUR VIDEO DATASET HERE*

According to the paper, the authors ran their experiments on a K2 GPU, whereas we ran ours on a local computer, which is *INPUT YOUR COMPUTER GPU HERE*

## Running the experiment
### Network framework

```python
# Design of our ESPCN network, including the weight initialization
# and the forward method.
import math
from torch import nn

# Based on the explanation given in the paper:
# number of layers (l) = 3 -> 2 CNN + 1 sub-pixel.
# Kernel input is of the form (f_i, n_i), where (5, 64) -> (3, 32) -> 3.


class ESPCN(nn.Module):
    def __init__(self, scale_factor, num_channels=1):
        super(ESPCN, self).__init__()
        # Feature extraction in LR space
        self.first_part = nn.Sequential(
            nn.Conv2d(num_channels, 64, kernel_size=5, padding=5//2),
            nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=3//2),
            nn.Tanh(),
        )
        # Sub-pixel convolution: produce r^2 channels, then rearrange them to HR
        self.last_part = nn.Sequential(
            nn.Conv2d(32, num_channels * (scale_factor ** 2), kernel_size=3, padding=3 // 2),
            nn.PixelShuffle(scale_factor)
        )

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                if m.in_channels == 32:
                    # Last convolution (the only one with 32 input channels)
                    nn.init.normal_(m.weight.data, mean=0.0, std=0.001)
                    nn.init.zeros_(m.bias.data)
                else:
                    nn.init.normal_(m.weight.data, mean=0.0, std=math.sqrt(2/(m.out_channels*m.weight.data[0][0].numel())))
                    nn.init.zeros_(m.bias.data)

    def forward(self, x):
        x = self.first_part(x)
        x = self.last_part(x)
        return x
```

### Hyperparameter Settings

Our chosen hyperparameters are as follows:

| Hyper-parameters | Value |
| :--- | ------ |
| `Scale` | **3** |
| `learning rate` | **1e-3** |
| `batch-size` | **16** |
| `number of epochs` | **200** |
| `number of workers` | **8** |
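
Note that the learning rate in the table is only the starting value: the training script applies a step decay that multiplies it by 0.1 once 80% of the epochs have run. The small helper below, whose name `learning_rate` is ours, reproduces that formula so the schedule can be inspected without running training.

```python
def learning_rate(epoch, base_lr=1e-3, num_epochs=200):
    """Step schedule: scale the rate by 0.1 once 80% of the epochs have run."""
    return base_lr * (0.1 ** (epoch // int(num_epochs * 0.8)))


# With the defaults, the drop happens at epoch 160 (0-indexed)
for epoch in (0, 159, 160, 199):
    print(epoch, learning_rate(epoch))
```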

### Image super resolution
#### Training

```python
import argparse
import os
import copy

import torch
from torch import nn
import torch.optim as optim
import torch.backends.cudnn as cudnn
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm

from models import ESPCN
from datasets import TrainDataset, EvalDataset
from utils import AverageMeter, calc_psnr


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-file', type=str, required=True)
    parser.add_argument('--eval-file', type=str, required=True)
    parser.add_argument('--outputs-dir', type=str, required=True)
    parser.add_argument('--weights-file', type=str)
    parser.add_argument('--scale', type=int, default=3)
    parser.add_argument('--lr', type=float, default=1e-3)
    parser.add_argument('--batch-size', type=int, default=16)
    parser.add_argument('--num-epochs', type=int, default=200)
    parser.add_argument('--num-workers', type=int, default=8)
    parser.add_argument('--seed', type=int, default=123)
    args = parser.parse_args()

    # Keep the checkpoints of each scale factor in their own sub-directory
    args.outputs_dir = os.path.join(args.outputs_dir, 'x{}'.format(args.scale))
    os.makedirs(args.outputs_dir, exist_ok=True)

    cudnn.benchmark = True
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    torch.manual_seed(args.seed)

    model = ESPCN(scale_factor=args.scale).to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam([
        {'params': model.first_part.parameters()},
        {'params': model.last_part.parameters(), 'lr': args.lr * 0.1}
    ], lr=args.lr)

    train_dataset = TrainDataset(args.train_file)
    train_dataloader = DataLoader(dataset=train_dataset,
                                  batch_size=args.batch_size,
                                  shuffle=True,
                                  num_workers=args.num_workers,
                                  pin_memory=True)
    eval_dataset = EvalDataset(args.eval_file)
    eval_dataloader = DataLoader(dataset=eval_dataset, batch_size=1)

    best_weights = copy.deepcopy(model.state_dict())
    best_epoch = 0
    best_psnr = 0.0

    for epoch in range(args.num_epochs):
        # Step decay: multiply the rate by 0.1 after 80% of the epochs.
        # Note: this sets the same rate for every parameter group, so it
        # overrides the 0.1x factor given to last_part in the optimizer above.
        for param_group in optimizer.param_groups:
            param_group['lr'] = args.lr * (0.1 ** (epoch // int(args.num_epochs * 0.8)))

        model.train()
        epoch_losses = AverageMeter()

        with tqdm(total=(len(train_dataset) - len(train_dataset) % args.batch_size), ncols=80) as t:
            t.set_description('epoch: {}/{}'.format(epoch, args.num_epochs - 1))

            for data in train_dataloader:
                inputs, labels = data

                inputs = inputs.to(device)
                labels = labels.to(device)

                preds = model(inputs)

                loss = criterion(preds, labels)

                epoch_losses.update(loss.item(), len(inputs))

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                t.set_postfix(loss='{:.6f}'.format(epoch_losses.avg))
                t.update(len(inputs))

        # Save a checkpoint after every epoch
        torch.save(model.state_dict(), os.path.join(args.outputs_dir, 'epoch_{}.pth'.format(epoch)))

        # Evaluate PSNR on the held-out set
        model.eval()
        epoch_psnr = AverageMeter()

        for data in eval_dataloader:
            inputs, labels = data

            inputs = inputs.to(device)
            labels = labels.to(device)

            with torch.no_grad():
                preds = model(inputs).clamp(0.0, 1.0)

            epoch_psnr.update(calc_psnr(preds, labels), len(inputs))

        print('eval psnr: {:.2f}'.format(epoch_psnr.avg))

        # Keep the weights of the best-performing epoch
        if epoch_psnr.avg > best_psnr:
            best_epoch = epoch
            best_psnr = epoch_psnr.avg
            best_weights = copy.deepcopy(model.state_dict())

    print('best epoch: {}, psnr: {:.2f}'.format(best_epoch, best_psnr))
    torch.save(best_weights, os.path.join(args.outputs_dir, 'best.pth'))
```
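
The evaluation loop relies on `calc_psnr` from `utils`, which is not shown in this post. As a rough sketch of what such a metric computes, assuming images scaled to the range [0, 1], PSNR can be written as follows. The function name `psnr` and the `max_val` parameter are our own and may differ from the helper in the repository.

```python
import torch


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for tensors scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)


# Toy check: a constant error of 0.1 gives an MSE of 0.01, i.e. about 20 dB
target = torch.full((1, 1, 8, 8), 0.5)
pred = target + 0.1
value = psnr(pred, target).item()
print('psnr: {:.2f} dB'.format(value))
```

A higher value means the prediction is closer to the ground truth, which is why the training script keeps the weights of the epoch with the highest average PSNR.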

Building on the open-source code, we made some improvements to the network ourselves. By changing the network structure, we achieved better results than the original implementation.

Image SR reproducibility results 

Additional note about GELU

### Video super resolution