# Video Super Resolution


Anatol Kaczmarek 156038\
Dawid Siera 156044


## Problem Description

The problem we chose for this project is super resolution, specifically video super resolution rather than the more common single-image variant. The goal is to take a low-resolution video, say 640x360, and produce a higher-resolution one, say 1280x720, with accuracy and image quality as high as possible, ideally beating standard interpolation. What the video setting brings to the table is that, although a video is just a sequence of images, consecutive frames usually share a lot of content, so by looking at the previous and next frames we can try to predict the current frame more accurately. The downside is increased computation time, since we need to process multiple input images per output frame; limited computational resources were the main problem we encountered. Because our solution uses information spread over time, we will sometimes refer to it as temporal super resolution (TSR).


### Dataset

We used the Inter4K video dataset, which consists of 1000 videos recorded in 4K at 60 fps. The videos are short, around 5 s each, which amounts to about 300 frames per video and 300,000 frames in the whole dataset. A dataset of this size was infeasible for us to process in a reasonable time, so we decided to use only part of it. Moreover, to save computational resources we resized the videos to smaller resolutions; we experimented with several resolutions before settling on 360p to 720p upscaling. Although our model is a fully convolutional neural network, which makes it agnostic to image size, we assumed that focusing on a specific resolution would give the best results. That is why all results from now on refer to 360p->720p upscaling unless stated otherwise.


You can view one of the videos in both resolutions below:


In [1]:
from IPython.display import Video

Video('resources/test360.mp4')

In [2]:
Video('resources/test720.mp4')

## Model

We use a simple 6-layer fully convolutional model built from convolutional layers and ReLU activations. The crucial first step is to upscale the input to the desired resolution. Since we do not use convolutional layers to increase the image size, we rely on regular bicubic interpolation to bring the image to the target size, and then use our stack of trainable layers to reconstruct a high-quality image from the lower-quality interpolated result. This also makes bicubic interpolation an ideal baseline for comparison.

The convolutional layers use no padding because we did not want padded zeros to interfere with the reconstruction. This means that the output image is a few pixels smaller than the target resolution, although this could be remedied by upsampling to a slightly larger base size.

The model is initialized with the number of past and future frames to use and with upscale_factor (the ratio of output to input resolution). The frames are concatenated in the forward function, where we stack the input frames along the channel dimension. We tried models using 1 and 2 past/future frames, but the 2-frame variant did not improve the results enough to justify the higher computational demand, so we stuck with 1 past and 1 future frame.

We also created a bigger variant of the architecture with larger kernels and more filters for comparison. Unfortunately, its longer training time forced us to reduce the number of epochs, and the result was a clearly underfitted model. Although we can imagine someone with more computational resources making good use of this bigger model, we abandoned it in favor of the smaller one.


Here is a diagram of our model's architecture (the smaller variant):

![](resources/model_diagram.png)

Below is the code for our model:


In [3]:
from torch import nn, Tensor
import torch


class TSRCNN_small(nn.Module):
    def __init__(self, frames_backward=2, frames_forward=2, upscale_factor=1.5):
        super(TSRCNN_small, self).__init__()
        self.layers = nn.Sequential(
            nn.Upsample(scale_factor=upscale_factor, mode='bicubic', align_corners=False),
            nn.Conv2d(3 * (frames_backward + 1 + frames_forward), 64, kernel_size=9, padding=0),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=1, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=5, padding=0)
        )

    def forward(self, back_frames: Tensor, low_res_frame: Tensor, forward_frames: Tensor) -> Tensor:
        x = torch.cat([back_frames, low_res_frame, forward_frames], dim=1)
        return self.layers(x)


class TSRCNN_large(nn.Module):
    def __init__(self, frames_backward=2, frames_forward=2, upscale_factor=1.5):
        super(TSRCNN_large, self).__init__()
        self.layers = nn.Sequential(
            nn.Upsample(scale_factor=upscale_factor, mode='bicubic', align_corners=False),
            nn.Conv2d(3 * (frames_backward + 1 + frames_forward), 128, kernel_size=11, padding=0),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=7, padding=0)
        )

    def forward(self, back_frames: Tensor, low_res_frame: Tensor, forward_frames: Tensor) -> Tensor:
        x = torch.cat([back_frames, low_res_frame, forward_frames], dim=1)
        return self.layers(x)

According to torchtools, our smaller model has 51,203 parameters, all of which are trainable, and they take 0.2 MB of memory. The bigger model has 222,723 trainable parameters, which take 0.85 MB of memory.
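These counts can also be checked by hand. The sketch below is only an illustration (the helper names `conv2d_params` and `count_params` are ours, not part of the project); it assumes one past and one future frame and 4 bytes per float32 parameter.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return out_channels * in_channels * kernel_size ** 2 + out_channels


def count_params(layers):
    """Sum parameters over (in_channels, out_channels, kernel_size) triples."""
    return sum(conv2d_params(*layer) for layer in layers)


# One past frame, the current frame and one future frame, 3 RGB channels each.
in_channels = 3 * (1 + 1 + 1)

small = count_params([(in_channels, 64, 9), (64, 32, 1), (32, 3, 5)])
large = count_params([(in_channels, 128, 11), (128, 64, 3), (64, 3, 7)])
print(small, large)  # 51203 222723

# Memory in MiB, assuming 4 bytes per float32 parameter.
small_mb = round(small * 4 / 2 ** 20, 2)
large_mb = round(large * 4 / 2 ** 20, 2)
print(small_mb, large_mb)  # 0.2 0.85
```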


In [4]:
from torchtools.utils import print_summary
model = TSRCNN_small(1, 1, 2)
back_frames = torch.randn(1, 3, 360, 640)
low_res_frame = torch.randn(1, 3, 360, 640)
forward_frames = torch.randn(1, 3, 360, 640)
print_summary(model, back_frames, low_res_frame, forward_frames)

---------------------------------------------------------------------------------------------------------------------
                            Layer (type)    Output shape     Param shape      Param #     FLOPs basic           FLOPs
                                 Input *     1x3x360x640
                                 Input *     1x3x360x640
                                 Input *     1x3x360x640
                   layers.0 (Upsample) *    1x9x720x1280                            0               0               0
                     layers.1 (Conv2d) *   1x64x712x1272     64x9x9x9+64       46,720  42,254,659,584  42,312,622,080
                       layers.2 (ReLU) *   1x64x712x1272                            0               0               0
                     layers.3 (Conv2d) *   1x32x712x1272    32x64x1x1+32        2,080   1,854,799,872   1,883,781,120
                       layers.4 (ReLU) *   1x32x712x1272                            0               0               0
   

{'flops': 46353682032,
 'flops_basic': 46264045056,
 'params': 51203,
 'params_with_aux': 51203}

In [5]:
model = TSRCNN_large(1, 1, 2)
print_summary(model, back_frames, low_res_frame, forward_frames)

---------------------------------------------------------------------------------------------------------------------
                            Layer (type)    Output shape     Param shape      Param #     FLOPs basic           FLOPs
                                 Input *     1x3x360x640
                                 Input *     1x3x360x640
                                 Input *     1x3x360x640
                   layers.0 (Upsample) *    1x9x720x1280                            0               0               0
                     layers.1 (Conv2d) *  1x128x710x1270 128x9x11x11+128      139,520 125,689,766,400 125,805,184,000
                       layers.2 (ReLU) *  1x128x710x1270                            0               0               0
                     layers.3 (Conv2d) *   1x64x708x1268   64x128x3x3+64       73,792  66,188,869,632  66,246,325,248
                       layers.4 (ReLU) *   1x64x708x1268                            0               0               0
   

{'flops': 200388940012,
 'flops_basic': 200213409024,
 'params': 222723,
 'params_with_aux': 222723}
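The output shapes in the summaries above follow directly from the unpadded convolutions: with stride 1, each layer trims `kernel_size - 1` pixels from every spatial dimension. A small sketch of that arithmetic (the helper `valid_conv_output` is a hypothetical name used only for illustration):

```python
def valid_conv_output(height, width, kernel_sizes):
    """Spatial size after a stack of stride-1 convolutions without padding."""
    for k in kernel_sizes:
        # Each unpadded stride-1 convolution removes k - 1 pixels per dimension.
        height -= k - 1
        width -= k - 1
    return height, width


# Small variant: kernels 9, 1 and 5 applied to a frame upsampled to 720p.
small_out = valid_conv_output(720, 1280, [9, 1, 5])
# Large variant: kernels 11, 3 and 7.
large_out = valid_conv_output(720, 1280, [11, 3, 7])
print(small_out, large_out)  # (708, 1268) (702, 1262)
```

The intermediate sizes agree with the summaries: 712x1272 after the first two convolutions of the small model, and 710x1270 followed by 708x1268 for the large one.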

## Training

The training procedure is controlled by the two main classes of the program, MultiTrainer and MultiVideoDataset, and consists of the following steps:

1. Grouping the movies into batches
2. Loading a batch of movies and creating the dataset
3. Splitting the dataset into training and validation sets
4. Training the model for n epochs, each epoch consists of:
   1. Training
   2. Validation

### Grouping the movies into batches

Ideally we would create a single dataset out of all the training movies, but because of very high memory usage during video loading (resulting from a bug in the library) we had to split the videos into batches and run the whole training procedure on each batch separately. This is not ideal, since the model never sees the whole dataset at once, but it was the only way to make it work. The parameter video_batch_size controls the number of videos in a single batch; on a machine with 32 GB of RAM we were able to load 5 videos at once, but your mileage may vary.
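The grouping itself can be as simple as slicing the list of video paths. The following is a minimal sketch, not the project's actual code; the function name `group_videos` and the file names are made up for illustration.

```python
def group_videos(video_paths, video_batch_size):
    """Split a list of paths into consecutive groups of at most video_batch_size."""
    return [
        video_paths[i:i + video_batch_size]
        for i in range(0, len(video_paths), video_batch_size)
    ]


paths = [f"videos/{i}.mp4" for i in range(1, 13)]  # hypothetical file names
groups = group_videos(paths, video_batch_size=5)
print([len(g) for g in groups])  # [5, 5, 2]
```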

### Loading of movie batches and creating the dataset

In this step, each movie in the batch is loaded from disk and read frame by frame, with each frame appended to an array. We then use a sliding window to build, for each frame in the movie (except the first and last frames, which lack a full set of neighbors), a set containing its past and future frames. The number of past and future frames is controlled by the parameters frames_back and frames_forward. The resulting arrays from all movies are then concatenated into a single array. When the dataset is asked for an example, it first resizes the frames to the resolution the model is supposed to upscale from, and then returns a tuple of a tensor containing the past and future frames, a tensor containing the current frame, and the current frame in the original resolution as the ground truth.
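The sliding window can be sketched as follows. This is an illustrative version that works on any sequence of frames; the function name `sliding_windows` is an assumption, and strings stand in for decoded frames.

```python
def sliding_windows(frames, frames_back=1, frames_forward=1):
    """Yield (past, current, future) for every frame with a full neighbourhood."""
    for i in range(frames_back, len(frames) - frames_forward):
        past = frames[i - frames_back:i]
        future = frames[i + 1:i + 1 + frames_forward]
        yield past, frames[i], future


frames = ["f0", "f1", "f2", "f3", "f4"]  # stand-ins for decoded frames
windows = list(sliding_windows(frames))
print(len(windows))  # 3
print(windows[0])    # (['f0'], 'f1', ['f2'])
```

With one past and one future frame, a 5-frame clip yields 3 examples, because the first and last frames have no complete neighbourhood.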

### Splitting the dataset into training and validation sets

The MultiVideoDataset object is then split into training and validation sets, both of which are wrapped in DataLoader objects.

### Training

During training we first iterate over the training set: for each retrieved batch we run the model, compare its output against the ground truth to calculate the metrics and the loss, then backpropagate the loss and update the model's weights. Next we iterate over the validation set and do the same, but without backpropagation or weight updates. The validation loss and metric values are averaged over the whole validation set and reported as the result of the epoch. This process is repeated for n epochs.
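The epoch structure described above can be sketched in PyTorch roughly as below. This is a simplified illustration rather than the MultiTrainer code: `run_epoch` and `TinyModel` are hypothetical names, the loaders are plain lists of random tensors, and each batch is assumed to be a `(back, current, forward, target)` tuple.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Stand-in model that takes three frames, like the TSRCNN classes."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(9, 3, kernel_size=3, padding=1)

    def forward(self, back, current, forward):
        return self.conv(torch.cat([back, current, forward], dim=1))


def run_epoch(model, loader, loss_fn, optimizer=None):
    """One pass over loader; weights are updated only if an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    total_loss, batches = 0.0, 0
    with torch.set_grad_enabled(training):
        for back, current, forward, target in loader:
            prediction = model(back, current, forward)
            loss = loss_fn(prediction, target)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item()
            batches += 1
    # Average loss over all batches of this pass.
    return total_loss / max(batches, 1)


torch.manual_seed(0)
model = TinyModel()
batch = tuple(torch.rand(2, 3, 8, 8) for _ in range(4))  # random stand-in data
train_loader, val_loader = [batch] * 3, [batch]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(2):
    train_loss = run_epoch(model, train_loader, nn.MSELoss(), optimizer)
    val_loss = run_epoch(model, val_loader, nn.MSELoss())
    print(epoch, train_loss, val_loss)
```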


Our program also uses MLflow to log the training process and save the models, as well as a simple GUI built with Streamlit, which allows both training and testing models and is the recommended way to run the program:
![train](resources/train.png)


The GUI lets you choose all the necessary hyperparameters. If you already have a trained model in the models/ directory, you can select it to "uptrain" it; otherwise you can train a new model from scratch. This page expects the videos you provide to be at least as large as the High resolution parameter, and they need to have a 16:9 aspect ratio.


After you click start training, progress bars and graphs appear, allowing you to monitor the training process. The program generates a 3x3 grid of graphs, where the first row covers successive training batches, the second row validation batches, and the third row epochs. The first column shows the PSNR metric, the second column the SSIM metric, and the third column the currently chosen loss function. The graphs are updated after each batch and epoch. Epoch losses and metrics are also logged to MLflow and can be retrieved later. MLflow also saves the model after each epoch, together with the parameters used in training.

![plots](resources/Adam_PNSR_252_10.png)


## Evaluation

Before we move on to the evaluation procedure, let's describe the metrics and loss functions we used:


**PSNR (Peak Signal-to-Noise Ratio):**

$$
\text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}^2}{\text{MSE}}\right)
$$

Where:

- $\text{MAX}$ is the maximum possible pixel value of the image.
- $\text{MSE}$ is the Mean Squared Error between images
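As a worked example of the formula, here is a minimal pure-Python sketch that treats an image as a flat sequence of pixel values and assumes 8-bit data ($\text{MAX} = 255$). The function names are illustrative; the project itself may compute PSNR on tensors.

```python
import math


def mse(a, b):
    """Mean squared error between two equally long pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(a, b)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


reference = [0, 64, 128, 255]
degraded = [0, 60, 130, 250]
print(mse(reference, degraded))   # 11.25
print(psnr(reference, degraded))  # roughly 37.6 dB
```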


**SSIM (Structural Similarity Index):**
$$\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

Where:

- $\mu_x, \mu_y$ are the mean intensities of $x$ and $y$.
- $\sigma_x^2, \sigma_y^2$ are the variances of $x$ and $y$.
- $\sigma_{xy}$ is the covariance between $x$ and $y$.
- $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$ are constants to stabilize the division (typically $k_1 = 0.01$, $k_2 = 0.03$, and $L$ is the dynamic range of pixel values).
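To make the terms concrete, the sketch below evaluates the formula once over a whole flat sequence of pixel values. This is a simplification we assume for illustration: SSIM is usually computed over local windows and then averaged, which this sketch does not do, and `ssim_global` is a hypothetical name.

```python
def ssim_global(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM evaluated once over the whole input (single global window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


image = [10, 50, 90, 130, 170, 210]
inverted = [255 - v for v in image]
same = ssim_global(image, image)         # 1.0 up to rounding
different = ssim_global(image, inverted)  # negative covariance lowers the score
print(same, different)
```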


Based on these two metrics we introduce two loss functions:

- PNSR (Peak Noise-to-Signal Ratio): $\frac{1}{\text{PSNR}}$
- DSSIM (Structural Dissimilarity Index): $1 - \text{SSIM}$
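As a minimal sketch of how these two losses relate to the metrics, the functions below take already computed metric values. The function names are illustrative and not taken from the project code, and the reciprocal is only meaningful for a positive, finite PSNR.

```python
def pnsr_loss(psnr_value: float) -> float:
    """Reciprocal of PSNR (in dB); shrinks as PSNR grows."""
    return 1.0 / psnr_value


def dssim_loss(ssim_value: float) -> float:
    """Structural dissimilarity; 0 when SSIM is 1."""
    return 1.0 - ssim_value


# Higher quality gives a lower loss for both functions.
print(pnsr_loss(30.0) > pnsr_loss(40.0))    # True
print(dssim_loss(0.90) > dssim_loss(0.95))  # True
```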


Evaluation works similarly to the training procedure; the only difference is that a single video is loaded, and after the model upscales each frame, the frame is written to a new high-resolution video.
Evaluation can be run in test mode, which expects a high-resolution video as input and reports metric values just as during training, or in inference mode, which expects a low-resolution video as input and generates an upscaled video.

![test](resources/test.png)


### Our Results

We wanted to test 3 optimizers (AdamW, Adagrad and SGD) and 3 loss functions (MSE, PNSR and DSSIM), and additionally the larger version of the model. However, due to computational limitations we could not train every model on the whole dataset, so we started with a smaller subset of 36 videos (~10,000 frames) and uptrained only the models that gave good results. In this first round, on 36 videos and 10 epochs (3 for the large model), we trained 7 models in total:

- small model with AdamW optimizer and MSE loss
- small model with AdamW optimizer and PNSR loss
- small model with AdamW optimizer and DSSIM loss
- large model with AdamW optimizer and MSE loss
- small model with Adagrad optimizer and MSE loss
- small model with Adagrad optimizer and PNSR loss
- small model with SGD optimizer and MSE loss

**Hyperparameters**\
For each model we used a learning rate of 0.0001, as it is a popular choice, and a single past and future frame. We tried using 2 frames, but the results were not good enough to justify the increased computational demand. Each model was trained and tested on 360p->720p upscaling, because the smaller image size allowed us to train the models faster. The batch size was set to 8, which was dictated by the VRAM capacity of the GPU.

**Training and inference times**\
Training a single model in this configuration took around 2.5 h. This was also true for the larger model trained for fewer epochs, and later, when we trained models on 108 videos for 3 epochs, the time was similar. Inference with the smaller model takes around 50 s per video (6 frames per second), and with the larger model around 75 s per video (4 frames per second).

Already during training, the model trained with the SGD optimizer gave results so poor that we gave up on training it with the two remaining losses. Adagrad's performance was similarly poor, so we skipped training it with the DSSIM loss. As mentioned before, the large model was trained for only 3 epochs instead of 10 due to computational limitations, which resulted in underfitting. After training, we evaluated the models on 10 videos from the dataset (none of which were used in training); for each video we calculated the average values of the metrics over all frames. Since bicubic interpolation is our baseline, we also calculated the same metrics for it and then took the difference between the model metrics and the bicubic metrics, so positive values mean the model beats the baseline. Additionally, since having two separate metrics made it difficult to compare models, we introduced an additional metric, QM (Quality Measure), defined as the geometric mean of PSNR and SSIM:
$$\text{Quality Measure (QM)} = \sqrt{\text{PSNR} \cdot \text{SSIM}}$$
To compare models, we also averaged the results over all 10 test videos and calculated a score as the fraction of videos on which a given model outperformed bicubic interpolation. The results are presented in the table below:


In [6]:
import pandas as pd
pd.read_csv('resources/results1.csv')

Unnamed: 0,model,100.mp4_psnr,100.mp4_qm,100.mp4_ssim,1000.mp4_psnr,1000.mp4_qm,1000.mp4_ssim,200.mp4_psnr,200.mp4_qm,200.mp4_ssim,...,800.mp4_psnr,800.mp4_qm,800.mp4_ssim,900.mp4_psnr,900.mp4_qm,900.mp4_ssim,average_psnr,average_ssim,average_qm,score
0,large_360_720_36videos_AdamWopt_MSELossloss_1f...,-3.734211,-0.300641,-0.006046,-0.452009,-0.297245,-0.089115,0.450476,0.048059,0.002965,...,-0.11538,-0.022719,-0.004397,0.430492,0.071991,0.012013,-0.963966,-0.026104,-0.154099,0.5
1,small_360_720_36videos_Adagradopt_MSELossloss_...,-6.165242,-0.527323,-0.017874,-4.967614,-0.918187,-0.169263,-4.230602,-0.824915,-0.159152,...,-2.170785,-0.433223,-0.08458,-2.959594,-0.57853,-0.111705,-4.707401,-0.11174,-0.721222,0.0
2,small_360_720_36videos_Adagradopt_PNSRloss_1fb...,-5.767887,-0.491491,-0.0164,-4.04604,-0.665161,-0.10871,-3.62527,-0.693373,-0.131711,...,-1.761099,-0.347364,-0.067251,-2.554079,-0.48514,-0.091525,-3.921213,-0.082794,-0.569042,0.0
3,small_360_720_36videos_AdamWopt_DSSIMloss_1fb_...,-4.487203,-0.33939,-8.8e-05,0.177847,-0.037914,-0.018717,0.310632,0.070956,0.01538,...,0.103374,0.041477,0.011194,0.412186,0.097714,0.021601,-0.94944,-0.002913,-0.082781,0.6
4,small_360_720_36videos_AdamWopt_MSELossloss_1f...,-6.452513,-0.500426,-0.002212,0.173088,-0.048839,-0.022369,0.308065,0.034897,0.002735,...,0.070231,0.007311,0.000445,0.29406,0.054356,0.01002,-1.333145,-0.007172,-0.125245,0.4
5,small_360_720_36videos_AdamWopt_PNSRloss_1fb_1...,-1.699418,-0.134431,-0.00242,-0.463035,-0.281817,-0.083579,0.523455,0.05818,0.004271,...,0.133572,0.00814,-0.001122,0.514038,0.077649,0.01144,-0.700565,-0.029648,-0.145092,0.6
6,small_360_720_36videos_SGDopt_MSELossloss_1fb_...,-18.034602,-1.647082,-0.056695,-9.68254,-1.876009,-0.359083,-9.181472,-1.794721,-0.345958,...,-6.394648,-1.161814,-0.210196,-8.182921,-1.561708,-0.295159,-11.706562,-0.240023,-1.70424,0.0
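The averaged columns and the score can be derived from per-video differences along the following lines. This is only a sketch under assumptions: the function name `summarize` is ours, the values are made up, and we assume the score counts videos with a positive difference in a single chosen metric.

```python
def summarize(differences):
    """Average difference and fraction of videos where the model beats the baseline.

    differences maps a video name to (model metric - bicubic metric).
    """
    values = list(differences.values())
    average = sum(values) / len(values)
    score = sum(d > 0 for d in values) / len(values)
    return average, score


# Made-up per-video differences, not taken from the results files.
qm_diff = {"a.mp4": 0.12, "b.mp4": -0.03, "c.mp4": 0.05, "d.mp4": 0.02}
average, score = summarize(qm_diff)
print(round(average, 3), score)  # 0.04 0.75
```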


Unfortunately, in this first stage none of the trained models outperformed the interpolation on average. However, the AdamW models trained with DSSIM and PNSR outperformed bicubic interpolation on 6 out of 10 videos, so we selected them for the next stage.


In the next stage we uptrained the selected models on 108 more videos for 3 more epochs. The results are presented in the table below:


In [7]:
pd.read_csv('resources/results2.csv')

Unnamed: 0,model,100.mp4_psnr,100.mp4_qm,100.mp4_ssim,1000.mp4_psnr,1000.mp4_qm,1000.mp4_ssim,200.mp4_psnr,200.mp4_qm,200.mp4_ssim,...,800.mp4_psnr,800.mp4_qm,800.mp4_ssim,900.mp4_psnr,900.mp4_qm,900.mp4_ssim,average_psnr,average_ssim,average_qm,score
0,small_360_720_144videos_AdamWopt_DSSIMloss_1fb...,-7.91064,-0.612011,-0.000482,0.885894,0.105123,0.009631,0.22164,0.073816,0.019194,...,0.071826,0.060105,0.018558,0.352403,0.103225,0.025399,-0.626638,0.011379,-0.01292,0.8
1,small_360_720_144videos_AdamWopt_PNSRloss_1fb_...,0.397138,0.02776,-0.000419,0.752685,0.056118,-0.003395,0.967591,0.136301,0.018012,...,0.529495,0.073506,0.009657,0.798743,0.128736,0.0206,0.855944,0.004731,0.084644,0.9


Below are the plots of the metrics for the DSSIM model, in the format described earlier:

![plots](resources/Adam_DSSIM_144_10.png)


And for the PNSR model:

![plots](resources/Adam_PNSR_144_10.png)


This time the model trained with the PNSR loss outperformed bicubic interpolation on average and on 9 out of 10 videos, so we decided to give it even more training, once again uptraining on 108 more videos for 3 epochs. The results are presented in the table below:


In [8]:
pd.read_csv('resources/results3.csv')

Unnamed: 0,model,100.mp4_psnr,100.mp4_qm,100.mp4_ssim,1000.mp4_psnr,1000.mp4_qm,1000.mp4_ssim,200.mp4_psnr,200.mp4_qm,200.mp4_ssim,...,800.mp4_psnr,800.mp4_qm,800.mp4_ssim,900.mp4_psnr,900.mp4_qm,900.mp4_ssim,average_psnr,average_ssim,average_qm,score
0,small_360_720_252videos_AdamWopt_PNSRloss_1fb_...,0.511012,0.038215,0.000199,1.020389,0.106209,0.00593,0.931552,0.134291,0.018415,...,0.796978,0.111497,0.014835,0.782417,0.128722,0.021093,0.959765,0.008715,0.105079,1.0


Finally, our model came out on top on all 10 videos and increased the average QM.


Here are the plots for the final training session:

![plots](resources/Adam_PNSR_252_10.png)


### Extras: Resolution Agnostic Upscaling

Although we chose 360p->720p upscaling for training and testing, our model architecture allows videos of arbitrary size to be upscaled, so we ran two additional rounds of testing, 720p->1080p and 720p->1440p, to find out how the model would fare at resolutions different from the one it was trained on. We chose those resolutions because 720p->1440p has the same upscaling factor as 360p->720p (x2), while 720p->1080p has a scaling factor of x1.5, and we wanted to find out whether that would influence the results. The results for 720p->1440p are presented in the table below:


In [9]:
pd.read_csv('resources/resultsQHD.csv')

Unnamed: 0,model,100.mp4_psnr,100.mp4_qm,100.mp4_ssim,1000.mp4_psnr,1000.mp4_qm,1000.mp4_ssim,200.mp4_psnr,200.mp4_qm,200.mp4_ssim,...,800.mp4_psnr,800.mp4_qm,800.mp4_ssim,900.mp4_psnr,900.mp4_qm,900.mp4_ssim,average_psnr,average_ssim,average_qm,score
0,small_360_720_252videos_AdamWopt_PNSRloss_1fb_...,0.037415,0.001239,-0.000393,1.152993,0.096681,0.000642,1.248082,0.128421,0.007854,...,1.262106,0.145414,0.012803,1.251351,0.148825,0.014457,1.191194,0.004677,0.110589,1.0


The results for 720p->1080p are as follows:


In [10]:
pd.read_csv('resources/resultsFHD.csv')

Unnamed: 0,model,videos/FHD/100.mp4_psnr,videos/FHD/100.mp4_qm,videos/FHD/100.mp4_ssim,videos/FHD/1000.mp4_psnr,videos/FHD/1000.mp4_qm,videos/FHD/1000.mp4_ssim,videos/FHD/200.mp4_psnr,videos/FHD/200.mp4_qm,videos/FHD/200.mp4_ssim,...,videos/FHD/800.mp4_psnr,videos/FHD/800.mp4_qm,videos/FHD/800.mp4_ssim,videos/FHD/900.mp4_psnr,videos/FHD/900.mp4_qm,videos/FHD/900.mp4_ssim,average_psnr,average_ssim,average_qm,score
0,fine-models/small_360_720_252videos_AdamWopt_P...,-0.534127,-0.041081,-0.000958,0.66914,0.041786,-0.004324,1.06125,0.10055,0.003861,...,0.954282,0.099231,0.006251,0.944094,0.103377,0.007901,0.555545,0.000368,0.047386,0.8


We can see that a model trained this way works well outside of its native resolution as long as the upscale factor is the same: the 720p->1440p results are even slightly better in average PSNR and QM than those for 360p->720p. With 720p->1080p, however, the performance drops significantly, with the model losing to bicubic interpolation on 2 out of 10 videos.


### Extras: Own Dataset

We also recorded our own videos of variable length, amounting to around 40 s or 2400 frames in total, and tested our model on them. The results are presented in the table below:


In [11]:
pd.read_csv('resources/resultsCustom.csv')

Unnamed: 0,model,1_psnr,1_qm,1_ssim,2_psnr,2_qm,2_ssim,3_psnr,3_qm,3_ssim,...,5_psnr,5_qm,5_ssim,6_psnr,6_qm,6_ssim,average_psnr,average_ssim,average_qm,score
0,small_360_720_252videos_AdamWopt_PNSRloss_1fb_...,-0.286556,-0.020512,-0.000108,-0.287387,-0.02039,-7.3e-05,-0.327161,-0.023278,-0.000102,...,0.831936,0.096908,0.009307,0.735983,0.063764,0.002339,0.20695,0.005782,0.035566,0.5


### Runtime Environment

- Ubuntu 24.04.1, Linux 6.8.0
- Intel Core i5-13500
- 32GB RAM
- NVIDIA GeForce RTX 3060
- NVIDIA Driver 535.183.01
- CUDA 12.0


### References

- Dong, C., Loy, C.C., He, K. and Tang, X., 2015. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2), pp.295-307.
- Kappeler, A., Yoo, S., Dai, Q. and Katsaggelos, A.K., 2016. Video super-resolution with convolutional neural networks. IEEE transactions on computational imaging, 2(2), pp.109-122.
- https://medium.com/coinmonks/review-srcnn-super-resolution-3cb3a4f67a7c
- https://www.v7labs.com/blog/image-super-resolution-guide
- https://alexandrosstergiou.github.io/datasets/Inter4K/index.html (Dataset)


### Points

| Item                                         | Points |
| -------------------------------------------- | ------ |
| Super-Resolution                             | 3      |
| Own Architecture                             | 2      |
| Non-trivial solution (using multiple frames) | 1      |
| At least 10,000 images                       | 1      |
| Our own dataset                              | 1      |
| Testing 3 optimizers                         | 1      |
| Testing 3 loss functions                     | 1      |
| MLflow                                       | 1      |
| Streamlit GUI                                | 1      |
| DVC                                          | 2      |
| HPO                                          | 1      |
| Total                                        | 15     |


### [Github Repository](https://github.com/Dawid64/super-enhanced-resolution)
