# SRCNN Using Theano

This notebook walks through a Theano-based implementation of image super-resolution, following this [paper](https://arxiv.org/abs/1501.00092).
It is based on the implementation by ['corochann'](https://github.com/corochann).
The original code can be found [here](https://github.com/corochann/theanonSR).
The images used have been taken from [here](https://github.com/jbhuang0604/SelfExSR).
All credit for the method goes to the original authors of the paper. The purpose of this notebook is only to make the process easier to follow and to give a toy example of how to implement it.

## Introduction to Image Super-Resolution

Single image super-resolution (SR), which aims at recovering a high-resolution image from a single low-resolution image, is a classical problem in computer vision. This problem is inherently ill-posed since a multiplicity of solutions exists for any given low-resolution pixel.

Such a problem is typically mitigated by constraining the solution space with strong prior information. To learn the prior, recent state-of-the-art methods mostly adopt an example-based strategy. These methods either exploit internal similarities of the same image or learn mapping functions from external low- and high-resolution exemplar pairs.

In this paper a deep convolutional neural network has been developed to solve this problem. The proposed Super-Resolution Convolutional Neural Network (SRCNN) has several appealing properties. First, its structure is intentionally designed with simplicity in mind, and yet it provides superior accuracy compared with state-of-the-art example-based methods. Second, with moderate numbers of filters and layers, the method is fast enough for practical on-line usage even on a CPU.

## The model

Given a single low-resolution image, we first upscale it to the desired size using bicubic interpolation, which is the only pre-processing we perform. Let us denote the interpolated image as $Y$. Our goal is to recover from $Y$ an image $F(Y)$ that is as similar as possible to the ground truth high-resolution image $X$. For ease of presentation, we still call $Y$ a “low-resolution” image, although it has the same size as $X$. We wish to learn a mapping $F$, which conceptually consists of three operations:

1) <b>Patch extraction and representation:</b> this operation extracts (overlapping) patches from the low-resolution image $Y$ and represents each patch as a high-dimensional vector. These vectors comprise a set of feature maps, whose number equals the dimensionality of the vectors.

2) <b>Non-linear mapping:</b> this operation nonlinearly maps each high-dimensional vector onto another high-dimensional vector. Each mapped vector is conceptually the representation of a high-resolution patch. These vectors comprise another set of feature maps. 

3) <b>Reconstruction:</b> this operation aggregates the above high-resolution patch-wise representations to generate the final high-resolution image. This image is expected to be similar to the ground truth $X$. 



![model image](model.png)


This can be represented mathematically as

1) 
\begin{equation}
F_1(Y) = \max(0, W_1 * Y + B_1)
\end{equation}
where $W_1$ and $B_1$ represent the filters and biases respectively, and ’*’ denotes the convolution operation. Here, $W_1$ corresponds to $n_1$ filters of support $c\times f_1 \times f_1$, where $c$ is the number of channels in the input image and $f_1$ is the spatial size of a filter.

2) 
\begin{equation}
F_2(Y) = \max(0, W_2 * F_1(Y) + B_2)
\end{equation}
Here, $W_2$ contains $n_2$ filters of size $n_1\times f_2 \times f_2$, and $B_2$ is $n_2$-dimensional.

3) 
\begin{equation}
F(Y) = W_3*F_2(Y) +B_3
\end{equation}
Here, $W_3$ corresponds to $c$ filters of size $n_2 \times f_3\times f_3$, and $B_3$ is a $c$-dimensional vector.
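
The three operations above can be sketched in plain NumPy to make the shapes concrete. This is only an illustration, not the Lasagne network built later in this notebook: the helper names `conv2d_same` and `srcnn_forward`, the random weights and the sizes ($f_1=5$, $f_2=1$, $f_3=3$, $n_1=n_2=4$) are assumptions chosen for the example. The loop computes a zero-padded cross-correlation; a true convolution would only flip the kernels.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Zero-padded, stride-1 cross-correlation.

    x: (c_in, H, W), w: (c_out, c_in, f, f) with odd f, b: (c_out,).
    Returns an array of shape (c_out, H, W).
    """
    c_out, c_in, f, _ = w.shape
    pad = f // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    _, H, W = x.shape
    out = np.empty((c_out, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + f, j:j + f]              # (c_in, f, f)
            out[:, i, j] = np.tensordot(w, patch, axes=3) + b
    return out

def srcnn_forward(y, params):
    """Apply the three layers F_1, F_2 and F in sequence."""
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = np.maximum(0, conv2d_same(y, w1, b1))    # patch extraction and representation
    f2 = np.maximum(0, conv2d_same(f1, w2, b2))   # non-linear mapping
    return conv2d_same(f2, w3, b3)                # linear reconstruction

rng = np.random.default_rng(0)
c, n1, n2 = 3, 4, 4                               # assumed toy sizes
params = [
    (rng.normal(0, 0.1, (n1, c, 5, 5)), np.zeros(n1)),
    (rng.normal(0, 0.1, (n2, n1, 1, 1)), np.zeros(n2)),
    (rng.normal(0, 0.1, (c, n2, 3, 3)), np.zeros(c)),
]
y = rng.random((c, 8, 8))                         # stand-in for an upscaled image
out = srcnn_forward(y, params)
print(out.shape)                                  # same channels and size as the input
```

Because every layer pads by half the filter size, the output keeps the spatial size of $Y$, which is what allows a pixel-wise comparison with $X$ during training.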

## Training

Learning the end-to-end mapping function $F$ requires the estimation of network parameters $\Theta = \{ W_1;W_2;W_3;B_1;B_2;B_3\} $. This is achieved through minimizing the loss between the reconstructed images $F(Y; \Theta)$ and the corresponding ground truth high-resolution images $X$. Given a set of high-resolution images $\{X_i\}$ and their corresponding low-resolution images $\{Y_i\}$, we use Mean Squared Error (MSE) as the loss function:

\begin{equation}
L(\Theta) = \frac{1}{n} \sum_{i=1}^{n}{{|| F(Y_i; \Theta) - X_i ||}^2},
\end{equation}
where $n$ is the number of training samples.
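
As a small numerical illustration of this loss, the sketch below evaluates it with NumPy on made-up arrays. The function name `mse_loss` and the toy data are assumptions for the example. Note that the Lasagne code later in this notebook averages over every pixel (`loss.mean()`), which differs from the formula only by a constant factor.

```python
import numpy as np

def mse_loss(pred, target):
    """(1/n) * sum_i ||pred_i - target_i||^2, with samples along axis 0."""
    n = pred.shape[0]
    diff = (pred - target).reshape(n, -1)
    return np.sum(diff ** 2) / n

pred = np.zeros((2, 3, 4, 4))     # two "reconstructed" samples
target = np.ones((2, 3, 4, 4))    # two "ground truth" samples
loss_value = mse_loss(pred, target)
print(loss_value)                 # each sample contributes 3*4*4 squared errors of 1
```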

In [2]:
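## Load the data files
import os
h_r_folder_name = 'image-dataset/h_r'
l_r_folder_name = 'image-dataset/l_r'
# os.listdir returns names in arbitrary order, so sort both lists to keep
# each low-resolution file paired with its high-resolution counterpart
# (this assumes matching file names in the two folders)
h_r_files_list = sorted(os.listdir(h_r_folder_name))
l_r_files_list = sorted(os.listdir(l_r_folder_name))
h_r_files_list = [os.path.join(h_r_folder_name, x) for x in h_r_files_list]
l_r_files_list = [os.path.join(l_r_folder_name, x) for x in l_r_files_list]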

In [3]:
# Let's look at some files
import numpy as np
import cv2
img_hr = cv2.imread(h_r_files_list[3])
img_lr = cv2.imread(l_r_files_list[3])
cv2.imshow('image',img_lr)
cv2.imshow('image2',img_hr)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [None]:
# If we apply the YCbCr transformation, the images are converted to another color space.
# imshow still interprets the channels as BGR, so they are displayed with (wrong) false colors, as follows
img_lr = cv2.cvtColor(img_lr, cv2.COLOR_BGR2YCR_CB)
img_hr = cv2.cvtColor(img_hr, cv2.COLOR_BGR2YCR_CB)
cv2.imshow('image',img_lr)
cv2.imshow('image2',img_hr)
cv2.waitKey(0)
cv2.destroyAllWindows()
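
To see what the conversion does numerically, here is a NumPy sketch of the 8-bit BGR to YCrCb formula as I understand it from the OpenCV documentation: Y = 0.299 R + 0.587 G + 0.114 B, Cr = 0.713 (R - Y) + 128, Cb = 0.564 (B - Y) + 128. The function name `bgr_to_ycrcb` is hypothetical, and rounding may differ slightly from `cv2.cvtColor`.

```python
import numpy as np

def bgr_to_ycrcb(img_bgr):
    """Convert an 8-bit BGR image (H, W, 3) to YCrCb, channel order Y, Cr, Cb."""
    img = img_bgr.astype(np.float64)
    b, g, r = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b       # luma
    cr = (r - y) * 0.713 + 128.0                # red-difference chroma
    cb = (b - y) * 0.564 + 128.0                # blue-difference chroma
    out = np.stack([y, cr, cb], axis=-1)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

gray = np.full((2, 2, 3), 100, dtype=np.uint8)
converted = bgr_to_ycrcb(gray)
print(converted[0, 0])   # a grey pixel keeps its brightness and gets neutral chroma (128)
```

This also explains the odd colors in the display: the first channel now carries brightness and the other two hover around 128, yet `imshow` draws them as blue, green and red.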

In [None]:
# Now we need to zoom the low resolution image using bicubic interpolation
# and convert them to Ycbcr format
# Along with that we need to make subsample pairs

no_of_files = len(h_r_files_list)

## For subsampling, let the sub-images be 32px*32px, taken every 64px (patch size + stride)
subimg_length = 32
stride_length = 32
stride_height = 32
subimg_height = 32

sample_img = cv2.imread(h_r_files_list[0])
# Integer division, so the counts can be used as array sizes and loop bounds
subimg_for_length = int(np.shape(sample_img)[0]) // (subimg_length+stride_length)
subimg_for_height = int(np.shape(sample_img)[1]) // (subimg_height+stride_height)

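# Final arrays: X holds the upscaled low-resolution inputs and Y the high-resolution targets
# (note that the text above uses the opposite letters)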
X = np.ndarray(shape = (no_of_files*subimg_for_length*subimg_for_height,subimg_length,subimg_height,3))
Y = np.ndarray(shape = (no_of_files*subimg_for_length*subimg_for_height,subimg_length,subimg_height,3)) 

In [None]:
for i in range(no_of_files):
    # Zooming the images
    img_lr = cv2.imread(l_r_files_list[i])
    img_lr_zoomed = cv2.resize(img_lr, None,fx=2,fy=2, interpolation = cv2.INTER_CUBIC)
    img_hr = cv2.imread(h_r_files_list[i])
    # Now we need to convert it to YCbCr
    img_lr_zoomed = cv2.cvtColor(img_lr_zoomed, cv2.COLOR_BGR2YCR_CB)
    img_hr = cv2.cvtColor(img_hr, cv2.COLOR_BGR2YCR_CB)
    #img_lr_zoomed.resize(3,480,320)
    #img_hr.resize(3,480,320)
    # Now we need to make subsamples
    for j in range(subimg_for_length):
        for k in range(subimg_for_height):
            img_current_lr = img_lr_zoomed[j*(stride_length+ subimg_length):j*(stride_length+ subimg_length)+subimg_length,
                                           k*(stride_height+ subimg_height):k*(stride_height+ subimg_height)+subimg_height,:]
            img_current_hr = img_hr[j*(stride_length+ subimg_length):j*(stride_length+ subimg_length)+subimg_length,
                                    k*(stride_height+ subimg_height):k*(stride_height+ subimg_height)+subimg_height,:]
            X[i*subimg_for_height*subimg_for_length+j*subimg_for_height +k] = img_current_lr
            Y[i*subimg_for_height*subimg_for_length+j*subimg_for_height +k] = img_current_hr
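
The indexing in the loop above can be checked on a small synthetic array. The sketch below uses the same rule, one patch every `subimg + stride` pixels, on a made-up 128x192 image; the helper name `extract_patches` is an assumption for illustration. With this rule the stride acts as a gap between patches, so only part of each image is sampled.

```python
import numpy as np

def extract_patches(img, size=32, gap=32):
    """Cut size x size patches whose top-left corners are size + gap apart."""
    step = size + gap
    rows = img.shape[0] // step
    cols = img.shape[1] // step
    patches = [img[j * step:j * step + size, k * step:k * step + size, :]
               for j in range(rows) for k in range(cols)]
    return np.stack(patches)

img = np.arange(128 * 192 * 3).reshape(128, 192, 3)   # synthetic stand-in image
patches = extract_patches(img)
print(patches.shape)   # 2 rows x 3 columns of 32x32 patches
```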

## Next step
Now that our data is ready, we need to build the convolutional neural network. We use Lasagne on top of Theano for this.

In [None]:
### Now we have sub-images.
## Next we can build the 3-layer convolutional network with filter sizes 5, 1 and 3
import theano
import theano.tensor as T
import lasagne

#from layer import ConvLayer
#from tools.image_processing import preprocess

In [None]:
# Prepare Theano variables for inputs and targets
input_var = T.tensor4('inputs')
target_var = T.tensor4('targets')

In [None]:
network = lasagne.layers.InputLayer(shape=(None, 3,None, None), input_var=input_var)
network = lasagne.layers.Conv2DLayer(
        network, num_filters=32,pad=2, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.rectify,
        W=lasagne.init.GlorotUniform())
network = lasagne.layers.Conv2DLayer(
        network, num_filters=32, filter_size=(1, 1),
        nonlinearity=lasagne.nonlinearities.rectify,
        W=lasagne.init.GlorotUniform())
# The reconstruction layer is linear (no ReLU), matching F(Y) = W_3 * F_2(Y) + B_3
network = lasagne.layers.Conv2DLayer(
        network, pad=1, num_filters=3, filter_size=(3, 3),
        nonlinearity=None,
        W=lasagne.init.GlorotUniform())
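
The `pad` values above are chosen so that every layer keeps the spatial size of its input, which is why the output can be compared pixel by pixel with the target. Below is a quick check of that arithmetic for stride-1 convolutions (output = input + 2*pad - filter + 1), together with a count of the trainable parameters for the layer sizes used in this notebook. The helper names are hypothetical.

```python
def conv_output_size(size, filter_size, pad):
    """Spatial output size of a stride-1 convolution with symmetric zero padding."""
    return size + 2 * pad - filter_size + 1

def conv_param_count(in_channels, num_filters, filter_size):
    """Number of weights plus one bias per filter."""
    return num_filters * in_channels * filter_size ** 2 + num_filters

layers = [  # (in_channels, num_filters, filter_size, pad)
    (3, 32, 5, 2),
    (32, 32, 1, 0),
    (32, 3, 3, 1),
]
size, total = 32, 0           # start from a 32x32 sub-image
for c_in, n, f, pad in layers:
    size = conv_output_size(size, f, pad)
    total += conv_param_count(c_in, n, f)
print(size, total)            # final spatial size and total parameter count
```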

In [None]:
# Create a loss expression for training, i.e., a scalar objective we want
# to minimize (here, the mean squared error between prediction and target):
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.squared_error(prediction, target_var)
loss = loss.mean()

In [None]:
# Create update expressions for training, i.e., how to modify the
# parameters at each training step. Here, we'll use Stochastic Gradient
# Descent (SGD) with Nesterov momentum, but Lasagne offers plenty more.
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01,momentum=0.9)
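
For intuition, the update rule can be written out in NumPy. The sketch follows the formulation given in the Lasagne documentation for `nesterov_momentum` as I understand it: velocity := momentum * velocity - learning_rate * gradient, then param := param + momentum * velocity - learning_rate * gradient. The function name `nesterov_step` and the toy quadratic objective are assumptions for illustration.

```python
import numpy as np

def nesterov_step(param, velocity, grad, learning_rate=0.01, momentum=0.9):
    """One Nesterov-momentum update; returns the new (param, velocity)."""
    velocity = momentum * velocity - learning_rate * grad
    param = param + momentum * velocity - learning_rate * grad
    return param, velocity

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = nesterov_step(w, v, grad=w)
print(np.abs(w).max())   # shrinks toward zero, the minimum of the toy objective
```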

In [None]:
# Create a loss expression for validation/testing. It uses a deterministic
# forward pass; this network has no dropout layers, so it matches the
# training expression.
test_prediction = lasagne.layers.get_output(network,deterministic=True)
test_loss = lasagne.objectives.squared_error(test_prediction,target_var)
test_loss = test_loss.mean()

In [None]:
# Compile a function performing a training step on a mini-batch (by giving
# the updates dictionary) and returning the corresponding training loss:
train_fn = theano.function([input_var, target_var], loss, updates=updates)

# Compile a second function computing the validation loss:
val_fn = theano.function([input_var, target_var], [test_loss])
# Compile a third function returning the network output, for new images
predict_fn = theano.function([input_var],[test_prediction])


In [None]:
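## We need to change the data layout from (N, rows, cols, channels)
## to the (N, channels, rows, cols) layout that Theano/Lasagne expect
no_of_imgs = int(np.shape(X)[0])
# np.transpose moves the channel axis; np.reshape would scramble the pixels
X = np.transpose(X, (0, 3, 1, 2)).astype(theano.config.floatX)
Y = np.transpose(Y, (0, 3, 1, 2)).astype(theano.config.floatX)
np.shape(Y)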
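
When moving from OpenCV's (N, rows, cols, channels) layout to the (N, channels, rows, cols) layout, it is worth checking that each channel plane survives intact. The sketch below compares `np.transpose` and `np.reshape` on a tiny made-up array. Both give the same shape, but only the transpose keeps the channels separate.

```python
import numpy as np

# One 2x2 "image" with 3 channels, channel-last as OpenCV stores it
img = np.zeros((1, 2, 2, 3))
img[..., 0] = 10   # first channel
img[..., 1] = 20   # second channel
img[..., 2] = 30   # third channel

transposed = np.transpose(img, (0, 3, 1, 2))   # permutes the axes
reshaped = np.reshape(img, (1, 3, 2, 2))       # only reinterprets the memory order

print(transposed.shape, reshaped.shape)        # identical shapes
print(transposed[0, 0])                        # a clean plane of the first channel
print(reshaped[0, 0])                          # values from all three channels mixed
```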

In [None]:
import time

def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    assert len(inputs) == len(targets)
    if shuffle:
        indices = np.arange(len(inputs))
        np.random.shuffle(indices)
    for start_idx in range(0, len(inputs) - batchsize + 1, batchsize):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield inputs[excerpt], targets[excerpt]
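
One detail of the loop bound `len(inputs) - batchsize + 1` is that only full batches are produced, so a trailing partial batch is skipped in each epoch (with `shuffle=True`, a different set of samples is skipped each time). A tiny check of the arithmetic, with made-up sizes:

```python
n_samples, batchsize = 250, 100   # assumed sizes for illustration

# Same range expression as in iterate_minibatches
starts = list(range(0, n_samples - batchsize + 1, batchsize))
skipped = n_samples - len(starts) * batchsize

print(starts)    # start index of each full batch
print(skipped)   # samples left out of this epoch
```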

In [None]:
num_epochs = 5
for epoch in range(num_epochs):
        # In each epoch, we do a full pass over the training data:
        train_err = 0
        train_batches = 0
        start_time = time.time()
        for batch in iterate_minibatches(X, Y, 100, shuffle=True):
            inputs, targets = batch
            train_err += train_fn(inputs, targets)
            train_batches += 1

        # And a full pass over the validation data:
        #val_err = 0
        #val_acc = 0
        #val_batches = 0
        #for batch in iterate_minibatches(X, Y, 100, shuffle=False):
        #    inputs, targets = batch
        #    err, acc = val_fn(inputs, targets)
        #    val_err += err
        #    val_acc += acc
        #    val_batches += 1

        # Then we print the results for this epoch:
        print("Epoch {} of {} took {:.3f}s".format(
            epoch + 1, num_epochs, time.time() - start_time))
        print("  training loss:\t\t{:.6f}".format(train_err / train_batches))
        #print("  validation loss:\t\t{:.6f}".format(val_err / val_batches))
        #print("  validation accuracy:\t\t{:.2f} %".format(
        #    val_acc / val_batches * 100))


## Result

Now let us compare the results of bicubic interpolation and SRCNN.
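
A visual comparison can be complemented by a number. PSNR is the metric reported in the SRCNN paper; this notebook does not compute it, so the sketch below is an addition, with a hypothetical helper `psnr` and synthetic arrays standing in for real images.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref = reference.astype(np.float64)
    est = estimate.astype(np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")       # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.full((4, 4), 100, dtype=np.uint8)   # stand-in for ground truth
estimate = np.full((4, 4), 110, dtype=np.uint8)    # every pixel off by 10
score = psnr(reference, estimate)
print(round(score, 2))
```

With real data, the same function would be applied once to the ground-truth image and the bicubic result and once to the ground truth and the SRCNN output; the higher value indicates the closer reconstruction.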

In [6]:
# Now we have trained our CNN
# Let's compare its performance to bicubic interpolation

img_lr = cv2.imread(l_r_files_list[1],1)
img_hr_bicubic = cv2.resize(img_lr, None,fx=2,fy=2, interpolation = cv2.INTER_CUBIC)
cv2.imshow('image2',img_hr_bicubic)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [4]:
img_lr = cv2.imread(l_r_files_list[1],1)
img_hr_bicubic = cv2.resize(img_lr, None,fx=2,fy=2, interpolation = cv2.INTER_CUBIC)
img_length = int(np.shape(img_hr_bicubic)[0])
img_height = int(np.shape(img_hr_bicubic)[1])
img_hr_bicubic_ycbcr = cv2.cvtColor(img_hr_bicubic, cv2.COLOR_BGR2YCR_CB)
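# The network expects a float batch of shape (1, 3, rows, cols), so move the
# channel axis to the front and add a batch axis before predicting
net_input = np.transpose(img_hr_bicubic_ycbcr, (2, 0, 1))[np.newaxis].astype(theano.config.floatX)
# predict_fn returns a list holding one array of shape (1, 3, rows, cols)
img_hr_srcnn = predict_fn(net_input)[0]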

In [5]:
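# Move the channel axis back to the end: (3, rows, cols) -> (rows, cols, 3)
img_hr_srcnn_correct_dim = np.transpose(
    np.asarray(img_hr_srcnn).reshape((3, img_length, img_height)), (1, 2, 0))
# cvtColor and imshow expect 8-bit values, so clip to [0, 255] and cast
img_hr_srcnn_correct_dim = np.clip(img_hr_srcnn_correct_dim, 0, 255).astype(np.uint8)
img_hr_srcnn_correct_dim_BGR = cv2.cvtColor(img_hr_srcnn_correct_dim, cv2.COLOR_YCR_CB2BGR)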
cv2.imshow('image2',img_hr_srcnn_correct_dim_BGR)
cv2.waitKey(0)
cv2.destroyAllWindows()

## Conclusion

As is typical in the third world, I don't have the hardware to train a CNN properly. However, we were able to lay out the simple pipeline from start to finish.

Another major issue is that the model gets trapped in local minima far from the global minimum. With a proper training rig, this might be addressed by reinitializing and retraining (or perhaps we need a better initialization than Glorot's?).
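
For reference, Glorot (Xavier) uniform initialization draws weights from $U(-a, a)$ with $a = \sqrt{6 / (fan_{in} + fan_{out})}$. The sketch below applies this to a convolution kernel, counting the receptive field in both fans, which is my understanding of what `lasagne.init.GlorotUniform` does with its default gain. The function name `glorot_uniform` is hypothetical.

```python
import numpy as np

def glorot_uniform(shape, rng):
    """Sample a conv kernel of shape (num_filters, in_channels, f, f)."""
    num_filters, in_channels = shape[:2]
    receptive_field = int(np.prod(shape[2:]))
    fan_in = in_channels * receptive_field
    fan_out = num_filters * receptive_field
    limit = np.sqrt(6.0 / (fan_in + fan_out))     # bound of the uniform range
    return rng.uniform(-limit, limit, size=shape), limit

rng = np.random.default_rng(0)
w1, limit = glorot_uniform((32, 3, 5, 5), rng)    # shape of the first layer's filters
print(w1.shape, round(float(limit), 4))
```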

I hope that the entire process of image super-resolution through convolutional neural networks is clear to anyone who reads through this notebook.