# Fast Image Processing Using Fully Convolutional Networks (FCN)
<div style = "text-align:justify"> This exercise is based on our PyTorch implementation of the ICCV 2017 paper by [Qifeng Chen et al.](https://arxiv.org/pdf/1709.00643.pdf) Unlike the previous three exercises, where the network always ended in a set of fully connected layers, here we will build a fully convolutional network (FCN) that performs two image processing operations: photographic style transfer and pencil drawing. You will be asked to: </div>

- code the adaptive batch normalization function
- fill in part of the FCN class named *FastIP* and its *forward()* method
- understand how data is preprocessed and loaded (as promised in the last exercise)

So, let's march ahead!

## Multi-Scale Context Aggregation [[2]](#references_cell)
<div style = "text-align:justify"> Let's look back at our three exercises. In the first exercise, we predicted the coordinates of 15 keypoints on the face, which is a regression problem. In the second and third exercises, we classified hand signs into digits and recognized faces respectively, which are classification problems. But many computer vision problems require dense prediction, i.e. a prediction for every pixel. For example, in semantic segmentation, we need to predict the object category associated with each pixel. How do we go about this?</div>
<br>
<div style = "text-align:justify">One way to solve dense prediction is to adapt CNN architectures designed for classification by adding upsampling layers. A typical classification CNN integrates multi-scale information through pooling layers that successively downsample the input until a global prediction is obtained. To obtain a dense prediction from such a network, upsampling layers are added to bring the output back to the input resolution.</div>
<br>
<div style = "text-align:justify"> Another way to approach dense prediction is to provide multiple rescaled versions of the image as input to the network and combine the predictions obtained for these multiple inputs.</div>
<br>
<div style = "text-align:justify"> Is severe downsampling necessary if our goal is dense prediction? Do we need multiple rescaled versions of the input? In [[2]](#references_cell) the authors propose a dedicated FCN architecture for dense prediction that outperforms the above two approaches. This architecture aggregates multi-scale context through dilated convolutions.</div>

## Dilated Convolutions
<div style = "text-align:justify"> First, let us define "receptive field". Suppose an image is passed through a convolutional layer with kernel_size = (3, 3). Each element in the output is computed using the 3x3 kernel centred at that element in the image. So we say that the receptive field of each element after 1 layer of convolution is 3x3. Now if this output is passed through another convolutional layer with a 3x3 kernel, then the receptive field of each element in the second output is 3x3 with respect to the first output, but 5x5 with respect to the input image. The receptive field is always defined with respect to the input image. In general, for a stack of stride-1 convolutions, the receptive field of each element at the output of layer $l$ is $(l - 1)(f - 1) + f$ per side, where $f$ is the kernel size. So with vanilla convolution layers the receptive field size grows only linearly with depth, and hence the aggregation of multi-scale context is slow, which is a severe limitation for high resolution images.</div>
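
As a quick check of the formula above, the sketch below computes the receptive field of a stack of identical stride-1 convolutions in two ways: with the closed form, and by growing it one layer at a time. The function names are our own and are not part of the exercise.

```python
def receptive_field_closed_form(num_layers, kernel_size):
    """Receptive field (per side) after num_layers stride-1 convolutions: (l - 1)(f - 1) + f."""
    return (num_layers - 1) * (kernel_size - 1) + kernel_size


def receptive_field_iterative(num_layers, kernel_size):
    """Same quantity, grown one layer at a time: each layer adds f - 1 pixels."""
    size = 1  # before any convolution, an element sees only itself
    for _ in range(num_layers):
        size += kernel_size - 1
    return size


for layers in (1, 2, 3, 10):
    print(layers, receptive_field_closed_form(layers, 3))  # 3, 5, 7 and 21 pixels per side
```

Ten 3x3 layers reach only 21 pixels per side, which is tiny compared with a 1080p image.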

<div style = "text-align:justify"> Let us see what a dilated convolution is. The animation in Fig 1 should explain what it is.</div>
<img src="images/dilation.gif" style="width:300px;height:300px;">
<caption><center> <u> <font color='purple'> **Figure 1** </u><font color='purple'>  : **Dilated convolution with dilation factor d = 2, stride s = 1 and kernel size f = 3 [(source)](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)** </center></caption>
<br>
<div style = "text-align:justify"> As is clear from Fig 1, in a dilated convolution with dilation factor d the kernel's elements are applied to image elements that are spaced d pixels apart (d = 1 gives the ordinary convolution). How does the receptive field size grow for dilated convolutions? Let's look at Fig 2.</div>
<img src="images/dilated_recpField.png" style="width:500px;height:200px;">
<caption><center> <u> <font color='purple'> **Figure 2** </u><font color='purple'>  : **Growth of receptive field size with exponential dilation factors (f = 3, s = 1) [(source)](https://arxiv.org/pdf/1511.07122.pdf)** </center></caption>
<br>
<div style = "text-align:justify"> With dilation factors increasing from $2^{0}$ to $2^{2}$ (parts (a) through (c) in Fig 2), the receptive field size grows from 3x3 to 7x7 to 15x15, i.e. exponentially with depth. Hence exponentially increasing dilations can aggregate multi-scale context much more efficiently than vanilla convolutions.</div>
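
The growth described above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes stride 1 throughout, in which case a layer with dilation d and kernel size f widens the receptive field by d(f - 1); the function name is our own.

```python
def dilated_receptive_fields(dilations, kernel_size=3):
    """Receptive field (per side) after each stride-1 layer in a stack of dilated convolutions."""
    sizes = []
    size = 1  # before any convolution, an element sees only itself
    for d in dilations:
        size += d * (kernel_size - 1)  # a dilated kernel spans d * (f - 1) + 1 pixels
        sizes.append(size)
    return sizes


print(dilated_receptive_fields([1, 2, 4]))  # exponential dilations: [3, 7, 15]
print(dilated_receptive_fields([1, 1, 1]))  # vanilla convolutions:  [3, 5, 7]
```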

## Fully Convolutional Model for Image Processing
<div style = "text-align:justify"> Given an image of size m x n, we want to process the image, say, change the style of the input image to the style of a reference image or convert the input image to a pencil drawing. See Fig 3.</div>
<img src="images/image_processing.png" style="width:500px;height:500px;">
<caption><center> <u> <font color='purple'> **Figure 3** </u><font color='purple'>  : **Photographic style (left column) and Pencil drawing (right column)** </center></caption>
<br>
<div style = "text-align:justify">There are already state-of-the-art algorithms available for these operations [[3]](#references_cell) and [[4]](#references_cell). But they are computationally very expensive, especially for high resolution images. Can we approximate those existing image processing operators with an FCN, achieving results on par with the original operators but in real time? Note that this is a dense prediction problem, i.e. both input and output are images (of the same size). We also impose the constraint that a single network architecture should work for both photographic style transfer and pencil drawing: the set of parameters and the flow of computation are the same for both operations, but the network is trained separately for each operation and so ends up with different parameter values. In fact, the authors of [[1]](#references_cell) have come up with a network for ten different image processing operations. Let us look at the details of this network, which we will be coding below. We will call it *FastIP*, though in the paper it is called *CAN24*. CAN stands for context aggregation network, and 24 refers to the number of feature maps, which is fixed to 24 in all the intermediate layers (see below).</div>

### Details of FastIP
- *FastIP* has 10 layers $L^{0}$,...., $L^{9}$: the input followed by 9 convolutional layers
- $L^{0}$ is the input layer with dimension m x n x 3 (color images). m and n can vary across images
- $L^{9}$ is the output layer with dimension m x n x 3. Input resolution is retained
- All other layers are of dimension m x n x 24, i.e. the number of feature maps in the intermediate layers is fixed to 24
- Each intermediate layer consists of a dilated convolution with kernel_size = (3, 3) followed by adaptive batch normalization and [leaky relu](http://pytorch.org/docs/master/nn.html#leakyrelu) activation. negative_slope for leaky relu is set to 0.2
- Dilation factors for the 9 convolutions are fixed to be 1, 2, 4, 8, 16, 32, 64, 1, 1
- Loss function used is MSELoss
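
It can help to see the list above as a table before writing any PyTorch. The sketch below only tabulates channel counts and dilations for the nine convolutions implied by the nine dilation factors; it builds no layers. The names `layer_specs`, `DILATIONS` and `WIDTH` are our own, and we assume that only the final convolution is not followed by adaptive batch normalization and leaky relu.

```python
DILATIONS = [1, 2, 4, 8, 16, 32, 64, 1, 1]
WIDTH = 24          # feature maps in every intermediate layer
IMAGE_CHANNELS = 3  # RGB input and output


def layer_specs(dilations=DILATIONS, width=WIDTH, image_channels=IMAGE_CHANNELS):
    """Return one dict per convolution: channels in/out, dilation, and whether
    adaptive batch norm + leaky relu follow it (assumed for all but the last)."""
    specs = []
    last = len(dilations) - 1
    for index, dilation in enumerate(dilations):
        specs.append({
            "name": "conv{}".format(index + 1),
            "in_channels": image_channels if index == 0 else width,
            "out_channels": image_channels if index == last else width,
            "dilation": dilation,
            "abn_and_leaky_relu": index != last,
        })
    return specs


for spec in layer_specs():
    print(spec)
```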

<div style = "text-align:justify"> The ground truth has been generated by running MATLAB scripts that implement the existing state-of-the-art image processing algorithms. For example, to generate ground truth results for the photographic style transfer and pencil drawing operations, we ran the scripts implementing [[3]](#references_cell) and [[4]](#references_cell) respectively. Implementations were provided by the author. </div>

So we are all set for coding!

First let's load the required packages.

In [1]:
# Run this cell
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

import numpy as np
import matplotlib.pyplot as plt
import h5py
import time
import glob
import cv2

%load_ext autoreload
%autoreload 2
%matplotlib inline

### Adaptive Batch Normalization
<div style = "text-align:justify"> Before we implement the class *FastIP* we will define the class *AdaptiveBatchNorm2d* for adaptive batch normalization (ABN). Batch normalization is available in PyTorch but ABN is not. By implementing this yourself, you will gain the confidence to implement your own normalization layers, initialization functions etc. In PyTorch every layer is implemented as a class (a subclass of *nn.Module*) with a *forward* method. Before we do this let's see what ABN is. </div><br>

<div style = "text-align:justify"> The authors found that batch normalization improved accuracy for a couple of operations but degraded the performance on others. Therefore they defined an ABN strategy with learnable parameters that lets the network lean towards or away from batch normalization depending on the operation. The equation to implement for ABN is $\psi(x) = ax + bBN(x)$ where $a$ and $b$ are learnable scalar parameters and $BN$ is the batch normalization operator.</div>

<div style = "text-align:justify">**You will be implementing below a class** *AdaptiveBatchNorm2d*, where in the *init* method $a$ and $b$ will be created as learnable parameters and in the *forward* method the above equation will be implemented and its output returned. To make $a$ and $b$ learnable parameters use [[torch.Tensor]](http://pytorch.org/docs/master/tensors.html) and [[nn.Parameter]](http://pytorch.org/docs/master/nn.html#parameters). Make $a$ and $b$ 4d for consistency in dimensions with the other parameters, i.e. $a$ and $b$ are of dimensions 1 x 1 x 1 x 1.</div>
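
To get a feel for the ABN equation before writing the PyTorch class, here is a plain-Python sketch on a single channel stored as a list. It is an illustration only and not the solution to the exercise: it uses batch statistics alone, leaves out the affine parameters and running averages that *nn.BatchNorm2d* maintains, and its function names are our own.

```python
import math


def batch_norm_1ch(values, eps=1e-5):
    """Normalize one channel to zero mean and unit variance using batch statistics only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def adaptive_batch_norm_1ch(values, a, b, eps=1e-5):
    """psi(x) = a * x + b * BN(x), applied element-wise."""
    normalized = batch_norm_1ch(values, eps)
    return [a * v + b * n for v, n in zip(values, normalized)]


x = [1.0, 2.0, 3.0, 4.0]
print(adaptive_batch_norm_1ch(x, a=1.0, b=0.0))  # a = 1, b = 0: the input passes through unchanged
print(adaptive_batch_norm_1ch(x, a=0.0, b=1.0))  # a = 0, b = 1: plain batch normalization
```

Any other pair (a, b) blends the two behaviours, and during training the network learns which blend suits the operation.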

In [4]:
# Replace None in the rhs by your code
class AdaptiveBatchNorm2d(nn.Module):
    
    def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True):
        super(AdaptiveBatchNorm2d, self).__init__()
        self.bn = nn.BatchNorm2d(num_features, eps, momentum, affine)
        self.a = None # nn.Parameter requires a tensor. Create it using torch.Tensor and supply it.
        self.b = None

    def forward(self, x):
        # complete the forward method(). Max 2 lines of code
        pass  # remove this pass statement once you have written your code

<div style = "text-align:justify"> Now let's build the *FastIP* class. Details of this class/model have already been provided above. To create dilated convolutions with dilation factor d, use [[nn.Conv2d]](http://pytorch.org/docs/master/nn.html#conv2d) and set the *dilation* argument to d. Note that at every stage from input to output, the height and width have to be retained. This means you have to choose the *padding* argument accordingly. *stride* is always (1, 1). We leave it to you to compute *padding* correctly. **Complete the code below.**</div>
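
A small helper can check a padding choice before you build the layers. The sketch below applies the output-size formula given in the *nn.Conv2d* documentation to one spatial dimension. The function name `conv_output_size` is our own, and the example values are deliberately different from those used in *FastIP*.

```python
def conv_output_size(size, kernel_size, dilation=1, padding=0, stride=1):
    """Output length along one spatial dimension of a 2-d convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# A 5x5 kernel with dilation 3 spans 3 * (5 - 1) + 1 = 13 pixels
print(conv_output_size(100, kernel_size=5, dilation=3, padding=0))  # 88: the map shrinks by 12
print(conv_output_size(100, kernel_size=5, dilation=3, padding=6))  # 100: padding 6 per side restores it
```

Call it with your own kernel size, dilation and padding for each layer and confirm that the size comes back unchanged.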

In [33]:
class FastIP(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.conv1 = None
        self.abn1 = None
        self.l_relu1 = None
        self.conv2 = None
        self.abn2 = None
        self.l_relu2 = None       
        self.conv3 = None
        self.abn3 = None
        self.l_relu3 = None
        self.conv4 = None
        self.abn4 = None
        self.l_relu4 = None
        self.conv5 = None
        self.abn5 = None
        self.l_relu5 = None
        self.conv6 = None
        self.abn6 = None
        self.l_relu6 = None
        self.conv7 = None
        self.abn7 = None
        self.l_relu7 = None
        self.conv8 = None
        self.abn8 = None
        self.l_relu8 = None
        self.conv9 = None
    
    def forward(self, x):
        # remove pass statement and complete the forward method
        pass
    
    

In [8]:
# Let's check if you get the expected output; Run this cell
torch.manual_seed(23)
inputs = Variable(torch.randn(1, 3, 3, 3))
fip = FastIP()
outputs = fip(inputs)
print(outputs)

**Expected Output:**
<br>
 <span style = "color:green">
 <br>
 Variable containing:
 <br>
(0 ,0 ,.,.) = 
 <br>
1.00000e-02 \*
 <br>
  -2.8771 -2.8771 -2.8771
   <br>
  -2.8771 -2.8771 -2.8771
   <br>
  -2.8771 -2.8771 -2.8771
   <br>
(0 ,1 ,.,.) = 
 <br>
1.00000e-02 \*
 <br>
   2.5805  2.5805  2.5805
    <br>
   2.5805  2.5805  2.5805
    <br>
   2.5805  2.5805  2.5805
    <br>
(0 ,2 ,.,.) = 
 <br>
1.00000e-02 \*
 <br>
   0.0056  0.0056  0.0056
    <br>
   0.0056  0.0056  0.0056
    <br>
   0.0056  0.0056  0.0056
    <br>
[torch.FloatTensor of size 1x3x3x3]</span>

<div style = "text-align:justify"> Now, let's come to data preprocessing and loading. In PyTorch, to create a custom dataset, you have to inherit from the *Dataset* class defined in the *torch.utils.data* module. *Dataset* is an abstract class with two methods, `__len__` and `__getitem__`, which have to be overridden in the inheriting class. `__len__` should return the length of the dataset, for example the number of images in it. `__getitem__` should support indexing such that dataset[i] returns the i$^{th}$ sample, i.e. the i$^{th}$ image and the corresponding ground truth (ground truth only in the case of a training/validation dataset). This makes our dataset an iterable. We will give an idea about creating a custom dataset class for the problem at hand. </div>
<br>
<div style = "text-align:justify"> Let us focus on the photographic style transfer problem. Let's say our training images are in *data/MIT-Adobe_train_random/*. The images are of some resolution m x n, where m and n may vary across images. Also, let's say the ground truth for the training images (which are themselves images of the same resolution as their corresponding input images) is available in *data/original_results/Photographic_style/MIT-Adobe_train_random/*. </div>
<br>
<div style = "text-align:justify">In the `__init__` method we extract the input and ground truth image file names from the two folders into two different lists and keep both sorted, so that, assuming the files are named identically in both folders, the i$^{th}$ entries of the two lists correspond to each other. We have used the *glob* package for this. `__len__` simply returns the length of either of these lists, which is the number of training images. `__getitem__(i)` reads the i$^{th}$ image and the i$^{th}$ ground truth, whose names are already available in the lists. Reading an image is done using *imread* from the cv2 package. The images are then flipped along the channel dimension, since *imread* in cv2 reads in BGR format and we need RGB. Further, the images are normalized to [0, 1].</div>
<br>
<div style = "text-align:justify"> Now, images at hand are numpy arrays with dimension *height x width x num_of_channels* while PyTorch requires them to be tensors with dimensions *num_of_channels x height x width*. So we need to transpose the dimensions accordingly and then convert them to tensors. To convert from numpy to PyTorch tensor, use  [[torch.from_numpy]](http://pytorch.org/docs/master/search.html?q=from_numpy&check_keywords=yes&area=default). Then store the image and groundtruth tensors in a dictionary. Our custom dataset named *FastIPTrainDataset* is ready. **See the code below.**</div>
<br>
<div style = "text-align:justify"> The point to keep in mind is that `__len__` should return the length of the dataset and `__getitem__(i)` should return the i$^{th}$ item of the dataset. As long as this is taken care of, you may implement the dataset differently; we have shown just one way of doing it here. Many other transformations of the data can also be done through this class and other supplementary classes if required, but we are skipping that for now. You may look at the [[Data Loading Tutorial]](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html) for more details.</div>
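
The two-method contract described above is plain Python and can be tried without PyTorch or any image files. In the sketch below, `ToyPairs` is a made-up stand-in that returns file paths instead of tensors; it does not inherit from *Dataset* and only shows how length, indexing and iteration fall out of the two methods.

```python
class ToyPairs:
    """Stand-in for a dataset: not a torch Dataset, just the same two methods."""

    def __init__(self, names):
        self.names = sorted(names)  # sorted, so inputs and targets line up by index

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        # Indexing the list past its end raises IndexError, which is what ends a for-loop
        name = self.names[i]
        return {"ip_image": "input/" + name, "op_image": "target/" + name}


toy = ToyPairs(["b.png", "a.png"])
print(len(toy))            # 2
print(toy[0]["ip_image"])  # input/a.png
for sample in toy:         # iteration works through __getitem__ alone
    print(sample["op_image"])
```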

In [28]:
# Run this cell
class FastIPTrainDataset(Dataset):
    
    def __init__(self, task, random = True):
        super().__init__()
        if random:
            
            self.train_ip_folder = "data/MIT-Adobe_train_random/" 
            self.train_op_folder = "data/original_results/" + task + "/MIT-Adobe_train_random/" 
                                
        else:
            self.train_ip_folder = "data/MIT-Adobe_train_480p/"
            self.train_op_folder = "data/original_results/" + task + "/MIT-Adobe_train_480p/"
            
        self.ip_image_names = sorted(glob.glob(self.train_ip_folder + "*.png"))
        self.op_image_names = sorted(glob.glob(self.train_op_folder + "*.png"))
        
    def __len__(self):
        return len(self.ip_image_names)
    
    def __getitem__(self, i):
        ip_path = self.ip_image_names[i]
        op_path = self.op_image_names[i]
                
        ip_image = cv2.imread(ip_path, 1)
        op_image = cv2.imread(op_path, 1)
        
        ip_image = ip_image[...,::-1]
        op_image = op_image[...,::-1]
        
        ip_image = np.around(np.transpose(ip_image, (2,0,1)) / 255.0, decimals=12)
        op_image = np.around(np.transpose(op_image, (2,0,1)) / 255.0, decimals=12)
        
        ip_image = torch.from_numpy(ip_image).float()
        op_image = torch.from_numpy(op_image).float()
           
        sample = {"ip_image":ip_image, "op_image":op_image}
        
        return sample

Let's instantiate this class and iterate through some samples.

In [31]:
# Run this cell
fast_ip_dataset = FastIPTrainDataset('Photographic_style', random = False)

fig = plt.figure()
for i in range(len(fast_ip_dataset)): # len() calls the __len__ method of our dataset
    sample = fast_ip_dataset[i] # indexing calls the __getitem__ method of our dataset

    print(i, sample['ip_image'].shape, sample['op_image'].shape)

    ax = plt.subplot(1, 2, i + 1)
    plt.tight_layout()
    ax.set_title('Image #{}'.format(i))
    ax.axis('off')
    plt.imshow(sample['ip_image'].numpy().transpose(1, 2, 0))
    ax = plt.subplot(1, 2, i + 2)
    plt.tight_layout()
    ax.set_title('Groundtruth #{}'.format(i))
    ax.axis('off')
    plt.imshow(sample['op_image'].numpy().transpose(1, 2, 0))
    break

<div style = "text-align:justify"> Now that we have created the dataset, how do we load it for training? Do we manually iterate as above, group the data into mini-batches and then supply them to the train function? We could, but that would not be efficient. Instead we will rely on the *DataLoader* class defined in the *torch.utils.data* module, which provides features for batching, sampling, shuffling, multi-processing etc. You may look at [[DataLoader]](http://pytorch.org/docs/master/data.html) for details. For the problem at hand, since the images are big, with the minimum dimension being 480 pixels, we set batch_size to 1; larger batches would need too much memory. A batch size of 1 also lets the image size vary from one sample to the next. See the *load_dataset* function below.</div>
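
To put a rough number on the memory concern, the sketch below estimates the activation memory needed to train on a single image. It is a back-of-the-envelope figure under our own assumptions, not a measurement: we assume float32 values, 8 intermediate layers of 24 feature maps, and about three full-size tensors kept per layer for the backward pass (the outputs of the convolution, the ABN and the leaky relu). Weights, gradients and framework overhead are ignored.

```python
def activation_megabytes(height, width, channels=24, tensors_per_layer=3,
                         layers=8, bytes_per_value=4):
    """Rough activation memory in MB for one image under the stated assumptions."""
    values = height * width * channels * tensors_per_layer * layers
    return values * bytes_per_value / 1e6


print(round(activation_megabytes(480, 640)))    # about 708 MB for a 480 x 640 image
print(round(activation_megabytes(1080, 1920)))  # about 4778 MB for a 1080p image
```

Under these assumptions the figure scales linearly with the batch size, so even a small batch of large images adds up quickly.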

In [23]:
# Run this cell
def load_dataset(task, random = True):    
    X = FastIPTrainDataset(task, random)
    data_loader = DataLoader(X, batch_size = 1)
    data_size = len(data_loader)
    return data_loader, data_size

Now we can define train_model. If you have come this far, you should be able to follow train_model on your own.

In [32]:
def train_model(model, data_loader, dataset_size, criterion, optimizer):
    
    since = time.time()
    train_loss_history = [] 
    num_epochs = model.num_epochs
    
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        
        model.train(True)
        
        running_loss = 0.0
        
        for data in data_loader:
            ip_imgs = data["ip_image"]
            orig_op_imgs = data["op_image"]
            
            # Skip very large images: dims 2 and 3 of the (N, C, H, W) batch are height and width
            if ip_imgs.size()[2] * ip_imgs.size()[3] > 2200000:
                continue
            ip_imgs, orig_op_imgs = Variable(ip_imgs), Variable(orig_op_imgs)
            
            # Don't uncomment the 4 commented lines below. We will be working with CPU.
            
            #if torch.cuda.is_available():
             #   ip_imgs, orig_op_imgs = Variable(ip_imgs.cuda()), Variable(orig_op_imgs.cuda())
            #else:
             #   ip_imgs, orig_op_imgs = Variable(ip_imgs), Variable(orig_op_imgs)
            
            optimizer.zero_grad()
            op_imgs = model.forward(ip_imgs)
            
            loss = criterion(op_imgs, orig_op_imgs)
        
            loss.backward()
            optimizer.step()
            
            running_loss += loss.data[0]  # PyTorch <= 0.3 API; on PyTorch >= 0.4 use loss.item()
            
        epoch_loss = running_loss / dataset_size           
        train_loss_history.append(epoch_loss)        
        print('Train Loss: {:.8f}'.format(epoch_loss))
        
    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
            time_elapsed // 60, time_elapsed % 60))    
    return train_loss_history

In [10]:
# load training data
data_loader, data_size = load_dataset('Photographic_style', random = False)


<div style = "text-align:justify"> We have trained two separate models, one for photographic style transfer and the other for pencil drawing, on the MIT-Adobe dataset [[5]](#references_cell). Each of them took around 33 to 35 hours of training for 180 epochs (close to 500K iterations) on a Titan-X GPU. It is roughly 9-10 minutes per epoch. Since you are working on a CPU now, it would take a very long time to train even one epoch, so we are not asking you to train. We generated results for both photographic style transfer and pencil drawing using our pre-trained models. A couple of results are shown in Figures 4 and 5. For more results, click on File-->Open-->results to see some of the results generated for both tasks.</div>

<div style = "text-align:justify"> By the way, why was the title "Fast Image Processing using FCN"? The reason is that photographic style transfer and pencil drawing using the traditional hand-crafted algorithms, even with parallelization (MATLAB parallel toolbox), take around 6000 and 5000 milliseconds respectively [[1]](#references_cell) for images at 1080p resolution from the MIT-Adobe test set [[5]](#references_cell), while the FCN takes a constant 190 milliseconds for both operations. Runtime was measured on a workstation with an Intel i7-5960X 3.0GHz CPU and an Nvidia Titan-X GPU [[1]](#references_cell). It's super fast "Chennai Express"!!!</div>

Hearty Congratulations!! Drained out or pumped up??? Anyway, we are planning to cover the latest in CNN - just 30 days old, after dinner at 7.45 p.m. - Capsule Networks!!! But the session is optional. Let us see how many turn up.

Wish you all the best and lots of love. Thank You for your cooperation.



In [15]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='results/Photographic_style/001166.png'></td>\
               <td><img src='results/Photographic_style/001166.jpg'> <caption><center>\
               <u> <font color='purple'> <font size = 4>Figure 4 </u><font color='purple'>\
               : Photographic style transfer</center></caption></td></tr></table>"))
display(HTML("<table><tr><td><img src='results/Pencil_drawing/000004.png'></td>\
               <td><img src='results/Pencil_drawing/000004.jpg'> <caption><center>\
               <u> <font color='purple'> <font size = 4>Figure 5 </u><font color='purple'>\
               : Pencil Drawing</center></caption></td></tr></table>"))

Figure 4 : Photographic style transfer

Figure 5 : Pencil Drawing


<a id='references_cell'></a>
### References
1. [Qifeng Chen, Jia Xu and Vladlen Koltun - Fast Image Processing with Fully-Convolutional Networks (2017)](https://arxiv.org/pdf/1709.00643.pdf)

2. [Fisher Yu and Vladlen Koltun - Multi-Scale Context Aggregation by Dilated Convolutions (2016)](https://arxiv.org/pdf/1511.07122.pdf)

3. [M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand - Fast local Laplacian filters: Theory and applications (2014)](http://people.csail.mit.edu/sparis/publi/2014/tog/Aubry_14-Fast_Local_Laplacian_Filters.pdf)

4. [C. Lu, L. Xu, and J. Jia - Combining sketch and tone for pencil drawing production (2012)](http://www.cse.cuhk.edu.hk/leojia/projects/pencilsketch/npar12_pencil.pdf)

5. [V. Bychkovsky, S. Paris, E. Chan, and F. Durand - Learning photographic global tonal adjustment with a database of input / output image pairs (2011).](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.208.67&rep=rep1&type=pdf)