In [2]:
from IPython.display import HTML, Image


For this exercise, we have chosen the Inception V3 architecture, trained on the ImageNet dataset. Inception V3 is a convolutional neural network (CNN), a class of networks built from the mathematical operations convolution and pooling. In a CNN, a filter is applied to the input of a layer, and what the filter computes depends on its type. For a convolutional filter, each filter cell is multiplied by the corresponding cell of the input, and the products are added together. For a pooling filter, the output is the maximum, minimum or average of the values that the filter "covers". Below you can see an example of a 2x2 filter from "A guide to convolution arithmetic for deep learning" (Dumoulin & Visin), https://arxiv.org/abs/1603.07285v2.

In [14]:
Image(url= "https://miro.medium.com/max/441/1*BMngs93_rm2_BpJFH2mS0Q.gif")
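
To make the two filter types concrete, here is a minimal pure-Python sketch of a convolutional filter and a max-pooling filter sliding over a small input. The function names, the 3x3 input and the 2x2 filter values are made up for illustration, and the sketch assumes a square single-channel input without padding. Like most deep learning libraries, it does not flip the filter before multiplying.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding) and
    sum the elementwise products at each position."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    return [
        [
            sum(
                image[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


def max_pool2d(image, size, stride=1):
    """Return the maximum of the values covered by the filter at each position."""
    out_size = (len(image) - size) // stride + 1
    return [
        [
            max(
                image[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Illustrative 3x3 input and 2x2 filter (values chosen arbitrarily).
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, 1],
]

conv_out = convolve2d(image, kernel)
pool_out = max_pool2d(image, size=2)
print(conv_out)  # [[6, 8], [12, 14]]
print(pool_out)  # [[5, 6], [8, 9]]
```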

The filter slides over the input with a predefined step size, called the stride, and outputs one number for each position. Usually several filters are used at once. The outputs of the filters are then stacked on top of each other and make up the output channels. Each individual filter, however, takes all the input channels into account and adds their contributions together. This gives an output of size OxO, where O = (W-K+2P)/S + 1, independent of the number of channels in the input. Here W is the width (and height) of the input, K is the filter size, S is the stride and P is the padding size. Padding is a technique where one adds "empty" pixels along the edges of the input, which effectively increases the influence of the outermost pixels. In our chosen architecture, the padding is "same", meaning that the padding is chosen so that, for a stride of 1, the output has the same height and width as the input.
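
The output-size formula can be checked with a few lines of Python. This is only a sketch: the helper names `conv_output_size` and `same_padding` and the example sizes are our own, and the result is rounded down when the stride does not divide evenly.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output width O = (W - K + 2P) / S + 1, rounded down."""
    return (w - k + 2 * p) // s + 1


def same_padding(k):
    """Padding per side that keeps O equal to W for stride 1 and odd K."""
    return (k - 1) // 2


# No padding, stride 2: the output shrinks.
print(conv_output_size(299, 3, p=0, s=2))  # 149

# "Same" padding, stride 1: the output keeps the input size.
print(conv_output_size(35, 5, p=same_padding(5), s=1))  # 35
```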



Inception V3 is, as the name suggests, an architecture of the Inception type. These are built from so-called Inception modules. The idea is that instead of choosing a number of filters of a single size for a convolutional layer, you use several filter sizes in parallel and then stack their outputs into one. This allows objects of different sizes in images to be detected efficiently. The Inception module is illustrated in the figure below, from the paper "Going deeper with convolutions" (Szegedy et al., 2014), where the Inception network was first introduced: https://arxiv.org/abs/1409.4842

In [15]:
Image(url= "https://miro.medium.com/max/1257/1*DKjGRDd_lJeUfVlY50ojOA.png")

Call this module a, the naive version. It uses three types of convolutional filters, of size 1x1, 3x3 and 5x5, and a pooling filter. To ensure that the outputs of the different filters have the same height and width, "same" padding is used on both the convolutional filters and the pooling filter.

This is, however, a computationally costly layer: the number of operations for each convolutional filter type is the dimension of the input (height * width * number of channels) times the dimension of the filter bank (height * width * number of filters). To reduce the number of operations, module b is proposed.
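
As a rough illustration of this cost, the sketch below counts multiplications for the three convolutional branches of a naive module. The input size and filter counts are illustrative values chosen by us, not numbers from this notebook, and the count assumes "same" padding with stride 1, so every input position produces one output position.

```python
def conv_cost(height, width, in_channels, kernel_size, num_filters):
    """Multiplications for one convolutional layer with 'same' padding and
    stride 1: input dimension times filter dimension."""
    return height * width * in_channels * kernel_size * kernel_size * num_filters


# Illustrative input size and filter counts (assumptions, not from the text).
H, W, C = 28, 28, 192
filters_per_size = {1: 64, 3: 128, 5: 32}  # kernel size -> number of filters

costs = {k: conv_cost(H, W, C, k, n) for k, n in filters_per_size.items()}
total = sum(costs.values())

for k, cost in costs.items():
    print(f"{k}x{k} branch: {cost:,} multiplications")
print(f"total: {total:,} multiplications")
```

Even with only 32 filters, the 5x5 branch costs more than twelve times as much as the 1x1 branch with 64 filters in this example.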

In [16]:
Image(url = "https://miro.medium.com/max/1235/1*U_McJnp7Fnif-lw9iIC5Bw.png")

Here the inventors have added a 1x1 filter before the 3x3 and 5x5 filters. This reduces the dimension of the input, specifically the number of channels: the output of the 1x1 filters has a number of channels equal to the number of 1x1 filters. The cost of going from the input to the output of the 5x5 filters then becomes the cost of the 1x1 step, which is the dimension of the input (height * width * number of channels) times 1x1 times the number of 1x1 filters, plus the cost of the 5x5 step, which equals the original 5x5 cost in module a divided by the ratio (number of input channels) / (number of 1x1 filters).
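
The saving can be made concrete with a small calculation. The sketch below compares a direct 5x5 branch with a 1x1 reduction followed by the same 5x5 filters. All sizes and filter counts are illustrative assumptions of ours, and stride 1 with "same" padding is assumed throughout.

```python
def conv_cost(height, width, in_channels, kernel_size, num_filters):
    """Multiplications for one convolutional layer with 'same' padding and stride 1."""
    return height * width * in_channels * kernel_size * kernel_size * num_filters


# Illustrative values (assumptions, not taken from the text).
H, W, C = 28, 28, 192
n_reduce = 16  # number of 1x1 filters placed before the 5x5 filters
n_5x5 = 32     # number of 5x5 filters

# Module a: 5x5 filters applied directly to all C input channels.
direct = conv_cost(H, W, C, 5, n_5x5)

# Module b: 1x1 reduction first, then 5x5 filters on the reduced channels.
reduction = conv_cost(H, W, C, 1, n_reduce)
second = conv_cost(H, W, n_reduce, 5, n_5x5)
bottleneck = reduction + second

print(f"direct 5x5:      {direct:,}")
print(f"1x1 then 5x5:    {bottleneck:,}")
print(f"speed-up factor: {direct / bottleneck:.2f}")
```

With these numbers the 5x5 step alone becomes C / n_reduce = 12 times cheaper, and the added 1x1 step is small by comparison.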

Modules of type b are used in the Inception architecture from the paper, Inception V1, resulting in the architecture in the figure below (source: https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d#81e0). This is also the baseline for Inception V3, which makes several further changes to the Inception modules, explained below.

In [17]:
Image(url = "https://miro.medium.com/max/2591/1*53uKkbeyzJcdo8PE5TQqqw.png")

As we can see, the architecture consists of a stem of traditional convolutional and pooling layers, followed by Inception modules with pooling layers in between. At the end, there is a pooling layer, a fully connected layer and a softmax layer.

As mentioned earlier, the Inception V3 that we use is based on Inception V1, with a couple of improvements. First of all, 5x5 filters are factorized into two stacked 3x3 filters. As a 5x5 filter is more than two times as computationally expensive as a 3x3 filter (25 versus 9 weights), this decreases the number of operations. nxn filters are further factorized into a 1xn and an nx1 filter, which the paper reports to be 33% cheaper than one nxn filter for n = 3. To avoid making the Inception modules too deep, the filters are instead spread out, widening the modules. The full network is shown in the figure below (source: https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d#81e0).
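
The savings from both factorizations follow from counting weights per filter. The sketch below does this count under the simplifying assumption that the number of channels is the same before and after each layer, so only the filter footprint matters. The function name is our own.

```python
def factorization_saving(n):
    """Fraction of weights saved by replacing one nxn filter
    with a 1xn filter followed by an nx1 filter."""
    full = n * n
    factorized = 1 * n + n * 1
    return 1 - factorized / full


# One 5x5 filter versus two stacked 3x3 filters (same 5x5 receptive field).
weights_5x5 = 5 * 5
weights_two_3x3 = 2 * (3 * 3)
saving_two_3x3 = 1 - weights_two_3x3 / weights_5x5

print(f"5x5 -> two 3x3:   {saving_two_3x3:.0%} cheaper")
print(f"3x3 -> 1x3 + 3x1: {factorization_saving(3):.0%} cheaper")
print(f"7x7 -> 1x7 + 7x1: {factorization_saving(7):.0%} cheaper")
```

The saving from the 1xn and nx1 factorization is 1 - 2/n, so it grows with the filter size.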

In [18]:
Image(url = "https://miro.medium.com/max/3012/1*ooVUXW6BIcoRdsF7kzkMwQ.png")

As we can see, the Inception V3 architecture also includes reduction modules, which in principle are the same as Inception modules, except that they are designed to decrease the height and width of the input. In total, Inception V3 has about 24M parameters. It is also worth mentioning that V3 takes a 299x299x3 input by default and uses an RMSProp optimizer. As it is designed for the ImageNet dataset, it outputs 1000 different classes, but since we use it for the plankton dataset, we replace the last layers to fit our desired output.
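
To show what the final layers compute, here is a tiny pure-Python sketch of a classification head: global average pooling, one fully connected layer and a softmax. It is not the code used in this project. The feature-map size, the number of classes (`NUM_CLASSES`) and the random weights are hypothetical stand-ins, and in practice a deep learning framework would handle this step.

```python
import math
import random


def global_average_pool(feature_map):
    """Average each channel over all spatial positions.
    feature_map[i][j] is the list of channel values at position (i, j)."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [
        sum(feature_map[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
        for ch in range(c)
    ]


def dense(x, weights, biases):
    """Fully connected layer: one weighted sum of the inputs per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
NUM_CLASSES = 5    # hypothetical number of target classes
H, W, C = 2, 2, 8  # tiny stand-in for the last feature map of the network

feature_map = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weights = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

pooled = global_average_pool(feature_map)
probs = softmax(dense(pooled, weights, biases))
print(len(pooled), len(probs), round(sum(probs), 6))
```

Changing the number of output classes only changes the shape of `weights` and `biases`, which is why the rest of a pretrained network can be reused.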

In [3]:
Image(url= "https://raw.githubusercontent.com/JakobKallestad/InceptionV3-on-plankton-images/master/images/individualImage.png")