In [19]:
import torch
import torch.nn as nn
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
import torch.nn.functional as F


<h2>Purpose</h2>
<ul>
<li>Reviewing why fully connected networks (FCNs) are not a good choice for analyzing image data</li>
<li>Understanding convolution and why convolutional neural networks (CNNs) are a better choice for image data</li>
<li>Creating a CNN</li>
<li>Understanding the difference between the module and the functional APIs</li>
<li>Designing a neural network</li>
</ul>

<h3>Why are FCNs not a good choice for analyzing images?</h3>
<ul>
    <li>FCNs have too many parameters. As such, they can memorize the details of the training set, so the resulting models will most probably not generalize well.</li>
    <li>Another issue with having too many parameters is that such a network is computationally expensive to train.</li>
    <li>Every pixel is connected to every other pixel regardless of location. Therefore, FCNs have no built-in way to capture local patterns.</li>
</ul>
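<p>To make the parameter argument concrete, the sketch below counts the weights and biases of a single fully connected layer and of a single convolutional layer applied to a 3 x 32 x 32 image. The hidden width of 1024 is an illustrative assumption and does not come from the text; the convolution uses 16 kernels of size 3 x 3, as in the model built later in this notebook.</p>

```python
# Parameter count of one layer applied to a 3 x 32 x 32 image.
# NOTE: hidden_units = 1024 is an assumed, illustrative width.
in_features = 3 * 32 * 32            # every pixel value is an input
hidden_units = 1024

# Fully connected: one weight per (input, output) pair plus one bias per output
fc_params = in_features * hidden_units + hidden_units

# Convolution: one small kernel per (input channel, output channel) pair
# plus one bias per output channel
in_channels, out_channels, k = 3, 16, 3
conv_params = out_channels * in_channels * k * k + out_channels

print(fc_params)    # 3146752
print(conv_params)  # 448
```

<p>Under these assumptions, the fully connected layer needs several thousand times more parameters than the convolutional layer.</p>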
<p><span style='color:green;'>Alternative solution:</span> Using <span style='color:red;'>convolution</span>, which is a different linear operation.</p>

<h3>How do CNNs tackle the above issues?</h3>
<p><b>Learning more with fewer parameters:</b> One of the main building blocks of CNNs is the kernel. Kernels are small matrices; therefore, CNNs are able to learn features with far fewer parameters.</p>
<p><b>Downsampling:</b> CNNs use downsampling methods, such as max pooling or average pooling, to keep the most informative part of the image and pass it to the next layer, which then needs fewer parameters.</p>



<h3>Convolution</h3>
<p>The main building blocks of CNNs are <span style='background-color:yellow'><b>kernels</b></span>. In image processing, a kernel is a 2D (or 3D in the case of color images) signal or matrix that is used for different purposes, such as sharpening, blurring, and edge detection. In a CNN, a kernel makes it possible to capture the same pattern anywhere in an image, regardless of its location. The book calls this property 'translation invariance'. To explain further, a kernel slides over the image, and at each position the elements of the kernel (remember that a kernel is a matrix) are multiplied element-wise with the pixels underneath and summed. This operation is convolution, which is one of the main operations in a CNN and gives these networks their name.
<br/>
    </p>
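<p>The sliding operation described above can be sketched in plain Python. This is a minimal illustration, not an efficient implementation: it uses no padding and a stride of 1, and, like <i><b>nn.Conv2d</b></i>, it does not flip the kernel. The function name, the edge-detection kernel, and the two toy images are assumptions made up for this example.</p>

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A kernel that responds to a dark-to-bright vertical edge
edge_kernel = [[-1, 1],
               [-1, 1]]

# The same edge placed at two different horizontal positions
image_a = [[0, 0, 1, 1, 1, 1] for _ in range(4)]
image_b = [[0, 0, 0, 0, 1, 1] for _ in range(4)]

response_a = conv2d_valid(image_a, edge_kernel)
response_b = conv2d_valid(image_b, edge_kernel)
print(response_a[0])  # [0, 2, 0, 0, 0]
print(response_b[0])  # [0, 0, 0, 2, 0]
```

<p>The same four kernel weights detect the edge in both images; only the position of the peak in the output changes.</p>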
    <p>
<b>Notes</b>
    <ul>
        <li>In the context of CNNs, kernels are closely tied to <span style='background-color:yellow'><b>channels</b></span>. An RGB color image has three channels (red, green, and blue), meaning that three values are associated with every pixel. Similarly, every convolutional layer of a CNN produces several output channels, one per kernel, which is why the two terms are often used interchangeably. Each kernel is responsible for learning a specific pattern.</li>
        <li>At every layer, a CNN has several kernels and hence is able to capture more than one pattern.</li>
        <li>Although there are several kernels at every layer, each kernel is small (e.g., 3*3 or 5*5). Thus, a CNN can have far fewer parameters than an FCN and still reach better results.</li>
        <li>Stacking several layers, each consisting of several kernels, can be interpreted as repeatedly transforming a multi-channel image into another multi-channel image.</li>
        <li>The <i><b>torch.nn</b></i> package provides several modules to perform convolution: <i><b>nn.Conv1d</b></i> for time series or 1D signals, <i><b>nn.Conv2d</b></i> for images or <i><b>2D signals</b></i>, and <i><b>nn.Conv3d</b></i> for volumes and videos or 3D signals.</li>
    </ul>
    </p>


<h3>Downsampling</h3>
<ul>
    <li><b>Average pooling:</b> Averaging the pixels within a specified window into a single pixel value reduces the size of an image. To illustrate, performing average pooling with a window of size 2*2 means we take the average of the four pixels within this window, so every four pixels are mapped to one pixel value. Therefore, an image of size 4*4 with 16 pixels is transformed into a 2*2 image with 4 pixels.</li>
    <li><b>Max pooling:</b> This operation is similar to average pooling, but rather than taking the average, it selects the maximum pixel value within the specified window.</li>
    <li><b>Striding:</b> If we use a step size larger than one when sliding a kernel over an image, the result is an image smaller than the original.</li>
</ul>
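<p>The two pooling operations can be sketched in a few lines of plain Python. This is only an illustration on a hypothetical 4*4 image with non-overlapping 2*2 windows; the helper name <i>pool2d</i> is made up for this example and is not a PyTorch function.</p>

```python
def pool2d(image, window, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping window x window patch."""
    out = []
    for i in range(0, len(image) - window + 1, window):
        row = []
        for j in range(0, len(image[0]) - window + 1, window):
            patch = [image[i + di][j + dj]
                     for di in range(window) for dj in range(window)]
            row.append(reduce_fn(patch))
        out.append(row)
    return out

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

avg_pooled = pool2d(image, 2, lambda p: sum(p) / len(p))
max_pooled = pool2d(image, 2, max)
print(avg_pooled)  # [[3.5, 5.5], [11.5, 13.5]]
print(max_pooled)  # [[6, 8], [14, 16]]
```

<p>In both cases the 16 input pixels are reduced to 4 output pixels.</p>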

<h3>Deeper layers generate more complex features</h3>
<p>
 At every layer, the network tries to compress information into fewer values and pass it on to deeper layers. As a result, kernels in the first layers operate over small neighborhoods of pixels, while kernels deeper in the network look at the compacted information generated by previous layers. Therefore, the deeper layers can capture more complex patterns and features, because they effectively look at a wider range of pixels than the kernels in the first layers.
</p>
<p>
<b>Receptive field:</b> The region of the input image (measured in pixels) that influences one output pixel of a layer is called the receptive field of that layer.
</p>
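<p>As a rough sketch of how the receptive field grows with depth, the function below applies the standard recurrence for a stack of convolution and pooling layers: each layer widens the receptive field by (kernel - 1) times the accumulated stride of the layers before it. The function name and the example layer stack (3*3 convolution, 2*2 pooling, 3*3 convolution, 2*2 pooling) are assumptions chosen for illustration.</p>

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1              # one pixel sees itself; step between pixels is 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump    # widen by the kernel extent, in input pixels
        jump *= stride               # later layers step over more input pixels
    return rf

one_conv = receptive_field([(3, 1)])
two_convs = receptive_field([(3, 1), (3, 1)])
conv_pool_stack = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)])
print(one_conv, two_convs, conv_pool_stack)  # 3 5 10
```

<p>A single 3*3 kernel sees 3 pixels along each axis, whereas one output value at the end of the four-layer stack depends on a 10*10 region of the input.</p>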

<h3>A simple CNN</h3>
<p>Putting together all the concepts above, a CNN is a sequence of layers, each consisting of several kernels that perform convolution, followed by activation functions and pooling operations. Let's create a simple model. Before that, we need to prepare the dataset. As in Chapter 7, we will work with the CIFAR2 data, which we create from CIFAR10 by keeping only two categories of images and excluding the rest.</p>

In [22]:
# Per-channel mean and standard deviation used to normalize the images
cifar_mean = torch.tensor([0.4914, 0.4822, 0.4465])
cifar_std = torch.tensor([0.2470, 0.2435, 0.2616])
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=cifar_mean, std=cifar_std)])
cifar10 = CIFAR10('../data', train=True, download=False, transform=transform)
cifar10_val = CIFAR10('../data', train=False, download=False, transform=transform)
# Keep only airplanes (CIFAR10 label 0) and birds (label 2), remapped to 0 and 1
label_map = {0: 0, 2: 1}
class_names = ['airplane', 'bird']
cifar2 = [(img, label_map[label]) for img, label in cifar10 if label in label_map]
cifar2_val = [(img, label_map[label]) for img, label in cifar10_val if label in label_map]


In [23]:
sample_img, sample_label = cifar2[0]
print(sample_img.shape)

torch.Size([3, 32, 32])



In [32]:
# Output size of a layer: (size - kernel + 2*padding) / stride + 1
# Size of the image after the first convolution layer: (32 - 3 + 2*1)/1 + 1 = 32
# Size of the image after the first max pooling operation: 32 / 2 = 16
# Size of the image after the second convolution layer: (16 - 3 + 2*1)/1 + 1 = 16
# Size of the image after the second max pooling operation: 16 / 2 = 8
model = nn.Sequential(
    nn.Conv2d(3,16,kernel_size=(3,3), stride=1, padding=1), 
    nn.Tanh(),
    nn.MaxPool2d(2),
    nn.Conv2d(16,8,kernel_size=(3,3),stride=1, padding=1),
    nn.Tanh(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8*8*8, 2)
    
)

<span style='color:red'><b>Note:</b></span> In the above simple model, pay attention to <i><b>nn.Flatten()</b></i>, which comes after the last max pooling operation and before the linear layer. The output of the last max pooling layer is still a multi-channel image (8 channels of size 8*8 per sample), which cannot be consumed directly by the linear layer that expects a 1D vector of 8*8*8 = 512 features per sample. Hence, we must use nn.Flatten() in between.
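<p>The size bookkeeping behind <i>nn.Linear(8*8*8, 2)</i> can be checked with a small helper. This is a sketch in plain Python; the helper name <i>out_size</i> is made up for this example. It applies the usual output-size formula to the layers of the model above, treating max pooling as a window of size 2 moved with stride 2.</p>

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (one axis)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                             # CIFAR images are 32 x 32
size = out_size(size, kernel=3, stride=1, padding=1)  # first convolution  -> 32
size = out_size(size, kernel=2, stride=2)             # first max pooling  -> 16
size = out_size(size, kernel=3, stride=1, padding=1)  # second convolution -> 16
size = out_size(size, kernel=2, stride=2)             # second max pooling -> 8

out_channels = 8                                      # channels after the second convolution
flat_features = out_channels * size * size
print(size, flat_features)  # 8 512
```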

Let's do some inference:

In [33]:
model(sample_img.unsqueeze(0))

tensor([[ 0.0358, -0.1954]], grad_fn=<AddmmBackward0>)

<h3>The functional API</h3>
<p>
        <i><b>nn.Module</b></i> is the base class of all neural network modules, and other modules such as <i><b>nn.Linear</b></i> and <i><b>nn.Sequential</b></i> are subclasses of it. The custom classes that we design are also subclasses of <i><b>nn.Module</b></i> and can therefore be considered modules. A module, as defined in the book, can have state and a set of parameters. Keeping this in mind, some operations, such as <i><b>nn.MaxPool2d</b></i>, have no parameters to store and no state to keep track of. Treating them as modules is somewhat overkill, and this is where the PyTorch functional API comes into play. As stated in the book, <span style='background-color:yellow;'>'<i>functional</i>' here means 'having no state'.</span> Thus, the rule of thumb is that when we simply want to perform an operation, we can use the functions of <b><i>nn.functional</i></b>, such as <i><b>nn.functional.max_pool2d</b></i>, rather than <i><b>nn.MaxPool2d</b></i>.
    <br/>
    However, when keeping track of state and parameters is important, we should use the modules of <b><i>nn</i></b> rather than the functional API. For example, the recommendation is to use <i><b>nn.Linear</b></i> rather than its counterpart <b><i>nn.functional.linear</i></b>.
</p>
<p><b><span style='color:red;'>Note:</span></b> Some functional tools, such as <b><i>nn.functional.tanh</i></b>, have been marked as deprecated in favor of counterparts such as <b><i>torch.tanh</i></b>, and the recommendation is not to use them even though they behave more like functions than modules. Interested readers can read <a href='https://discuss.pytorch.org/t/torch-tanh-vs-torch-nn-functional-tanh/15897'>this thread</a> in the PyTorch forum.</p>
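<p>As a sketch of this rule of thumb, the Sequential model from earlier could be rewritten as an <i>nn.Module</i> subclass in which the layers with parameters remain modules, while pooling and the activation are called as plain functions. The class name <i>Net</i> and the random input are assumptions made for illustration, and <i>torch.tanh</i> is used in place of <i>nn.functional.tanh</i> following the note above.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers with parameters are kept as modules so they get registered
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8 * 8 * 8, 2)

    def forward(self, x):
        # Stateless operations are called through the functional API
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        out = out.flatten(start_dim=1)   # same role as nn.Flatten()
        return self.fc(out)


net = Net()
out = net(torch.randn(1, 3, 32, 32))     # a random stand-in for one image
out_shape = tuple(out.shape)
n_params = sum(p.numel() for p in net.parameters())
print(out_shape, n_params)
```

<p>Only the two convolutions and the linear layer contribute parameters; the pooling and activation calls add none.</p>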

<h3>More to read:</h3>
    <a href='https://discuss.pytorch.org/t/torch-tanh-vs-torch-nn-functional-tanh/15897'>https://discuss.pytorch.org/t/torch-tanh-vs-torch-nn-functional-tanh/15897</a>
    