In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Convolutional Neural Networks

Recall the layers of convolutional neural networks

#### Dilated convolution

Example of dilated convolution
<img src="images/ft/dilation_1.gif" height="600" width="600" />
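
As a minimal sketch of the idea in the animation, a dilated convolution reads its inputs `dilation` steps apart, so the same three weights cover a wider window. The helper `dilated_conv1d` and the 1-D toy input are illustrative assumptions, not part of the original notebook:

```python
def dilated_conv1d(x, w, dilation=1):
    """Valid 1-D cross-correlation whose taps are `dilation` steps apart."""
    span = (len(w) - 1) * dilation + 1  # receptive field of a single output
    return [
        sum(w[m] * x[i + m * dilation] for m in range(len(w)))
        for i in range(len(x) - span + 1)
    ]


x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 1, 1]

dense = dilated_conv1d(x, w, dilation=1)    # each output sees inputs i, i+1, i+2
dilated = dilated_conv1d(x, w, dilation=2)  # each output sees inputs i, i+2, i+4
print(dense)    # [6, 9, 12, 15, 18]
print(dilated)  # [9, 12, 15]
```

With `dilation=2` the receptive field grows from 3 to 5 inputs while the number of weights stays the same.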

#### Weight sharing

The first way to make a model invariant to different (acceptable) aspects of the input is weight sharing
<br>
Instead of using a different weight for each neuron, let's reuse the same weight values again and again

<img src="images/cnn/weight_sharing_1.png" height="600" width="600" />

How can we achieve weight sharing?
Add a restriction per layer so that neurons hold copies of the same weights, or use another approach
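
A small sketch of what weight sharing means in practice, assuming a 1-D input and a hypothetical helper `shared_weight_layer`: every output neuron reuses the same three weights, so the layer has 3 parameters instead of the 15 that a fully connected layer from 5 inputs to 3 outputs would need.

```python
def shared_weight_layer(x, w):
    """Apply the same weights `w` to every window of `x` (stride 1, no padding)."""
    k = len(w)
    return [sum(w[m] * x[i + m] for m in range(k)) for i in range(len(x) - k + 1)]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.0, -0.5]  # one set of weights shared by all output neurons

y = shared_weight_layer(x, w)
n_shared = len(w)            # parameters with weight sharing
n_dense = len(x) * len(y)    # parameters of a fully connected 5 -> 3 layer
print(y, n_shared, n_dense)  # [-1.0, -1.0, -1.0] 3 15
```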

#### Convolutions on matrices

What if we want the input of the convolution to have depth:
$$I \in \mathbb{R}^{h \times w \times C}$$
<br>
and the smaller weights matrix:
  $$\begin{align} K_i &= \begin{pmatrix}
           K_{11}, K_{12}, \dots, K_{1k_2} \\
           K_{21}, K_{22}, \dots, K_{2k_2} \\
           \vdots \\
           K_{k_11}, K_{k_12}, \dots, K_{k_1k_2} \\
         \end{pmatrix}
  \end{align}$$

For each position of the sliding window we calculate the convolution with:
$$
\begin{align}
(I \ast K)_{ij} &= \sum_{m = 0}^{k_1 - 1} \sum_{n = 0}^{k_2 - 1} \sum_{c = 1}^{C} K_{m,n,c} \cdot I_{i+m, j+n, c} + b \tag {4}
\end{align}
$$
<br>
Note that the convolution does not stride over channels: it sums over all of them and always produces one two-dimensional matrix per kernel
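
The formula above can be sketched directly with nested loops. This is a plain-Python illustration under stated assumptions: 0-based indices, stride $1$, no padding, nested lists indexed as `I[row][col][channel]`, and a hypothetical function name `conv2d_channels`.

```python
def conv2d_channels(I, K, b=0.0):
    """Sum K[m][n][c] * I[i+m][j+n][c] over m, n, c and add the bias b."""
    h, w, C = len(I), len(I[0]), len(I[0][0])
    k1, k2 = len(K), len(K[0])
    out = []
    for i in range(h - k1 + 1):
        row = []
        for j in range(w - k2 + 1):
            s = b
            for m in range(k1):
                for n in range(k2):
                    for c in range(C):  # channels are summed, not strided over
                        s += K[m][n][c] * I[i + m][j + n][c]
            row.append(s)
        out.append(row)
    return out


# 3 x 3 input with 2 channels and a 2 x 2 x 2 kernel, all ones, bias 1
I = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]
K = [[[1.0, 1.0] for _ in range(2)] for _ in range(2)]

out = conv2d_channels(I, K, b=1.0)
print(out)  # [[9.0, 9.0], [9.0, 9.0]]
```

The result is a single $2 \times 2$ matrix even though the input has two channels.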

Recall deltas in backpropagation
$$
\delta^l_j = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j}
$$

Weight sharing makes backpropagation through a convolution complicated: it's hard to remember and easy to get lost in deltas and gradients. I found an excellent <a href="https://mc.ai/backpropagation-for-convolution-with-strides/">blog</a> post which makes it easy to understand, and I'll follow it here.

Consider the input feature map as an $\mathbb{R}^{5 \times 5}$ matrix:
$$\begin{align} X &= \begin{pmatrix}
           x_{11}, x_{12}, x_{13}, x_{14}, x_{15} \\
           x_{21}, x_{22}, x_{23}, x_{24}, x_{25} \\
           x_{31}, x_{32}, x_{33}, x_{34}, x_{35} \\
           x_{41}, x_{42}, x_{43}, x_{44}, x_{45} \\
           x_{51}, x_{52}, x_{53}, x_{54}, x_{55} \\
         \end{pmatrix}
  \end{align}$$
<br>
We assume that depth is $1$ for better understanding

Weights as $\mathbb{R}^{3 \times 3}$ matrix:
$$\begin{align} W &= \begin{pmatrix}
           w_{11}, w_{12}, w_{13} \\
           w_{21}, w_{22}, w_{23} \\
           w_{31}, w_{32}, w_{33} \\
         \end{pmatrix}
  \end{align}$$
<br>
With stride $s = 2$ and no padding ($P = 0$)

Then, according to the formulas:
$$
H_o = \frac{H - F_h + 2 P}{S_h} + 1
$$
<br>
$$
W_o = \frac{W - F_w + 2 P}{S_w} + 1
$$

we'll have:
$$
H_o = \frac{5 - 3 + 2 \cdot 0}{2} + 1 = 2 = W_o
$$
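
The output-size formula is easy to check in code. A minimal sketch, assuming a hypothetical helper `conv_output_size` and floor division for the cases where the stride does not divide evenly:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size - kernel + 2 * padding) // stride + 1


h_out = conv_output_size(5, 3, stride=2, padding=0)       # the example above
same_size = conv_output_size(224, 3, stride=1, padding=1)  # 3 x 3 kernel with padding 1 keeps the size
print(h_out, same_size)  # 2 224
```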

And the output is an $\mathbb{R}^{2 \times 2}$ matrix:
$$\begin{align} Y &= \begin{pmatrix}
           y_{11}, y_{12} \\
           y_{21}, y_{22} \\
         \end{pmatrix}
  \end{align}$$

Now calculate forward propagation:

$$\begin{align} \begin{pmatrix}
           x_{11} \cdot w_{11}, x_{12} \cdot w_{12}, x_{13} \cdot w_{13}, x_{14}, x_{15} \\
           x_{21} \cdot w_{21}, x_{22} \cdot w_{22}, x_{23} \cdot w_{23}, x_{24}, x_{25} \\
           x_{31} \cdot w_{31}, x_{32} \cdot w_{32}, x_{33} \cdot w_{33}, x_{34}, x_{35} \\
           x_{41}, x_{42}, x_{43}, x_{44}, x_{45} \\
           x_{51}, x_{52}, x_{53}, x_{54}, x_{55} \\
         \end{pmatrix}
  \end{align}$$

The first convolution step will be:
<table align="center">
    <tr>
        <td>$x_{11} \cdot w_{11}$</td>
        <td>$x_{12} \cdot w_{12}$</td>
        <td>$x_{13} \cdot w_{13}$</td>
        <td>$x_{14}$</td>
        <td>$x_{15}$</td>
    </tr>
    <tr>
        <td>$x_{21} \cdot w_{21}$</td>
        <td>$x_{22} \cdot w_{22}$</td>
        <td>$x_{23} \cdot w_{23}$</td>
        <td>$x_{24}$</td>
        <td>$x_{25}$</td>
    </tr>
    <tr>
        <td>$x_{31} \cdot w_{31}$</td>
        <td>$x_{32} \cdot w_{32}$</td>
        <td>$x_{33} \cdot w_{33}$</td>
        <td>$x_{34}$</td>
        <td>$x_{35}$</td>
    </tr>
    <tr>
        <td>$x_{41}$</td>
        <td>$x_{42}$</td>
        <td>$x_{43}$</td>
        <td>$x_{44}$</td>
        <td>$x_{45}$</td>
    </tr>
    <tr>
        <td>$x_{51}$</td>
        <td>$x_{52}$</td>
        <td>$x_{53}$</td>
        <td>$x_{54}$</td>
        <td>$x_{55}$</td>
    </tr>
</table>

For the second step:
<table align="center">
    <tr>
        <td>$x_{11}$</td>
        <td>$x_{12}$</td>
        <td>$x_{13} \cdot w_{11}$</td>
        <td>$x_{14} \cdot w_{12}$</td>
        <td>$x_{15} \cdot w_{13}$</td>
    </tr>
    <tr>
        <td>$x_{21}$</td>
        <td>$x_{22}$</td>
        <td>$x_{23} \cdot w_{21}$</td>
        <td>$x_{24} \cdot w_{22}$</td>
        <td>$x_{25} \cdot w_{23}$</td>
    </tr>
    <tr>
        <td>$x_{31}$</td>
        <td>$x_{32}$</td>
        <td>$x_{33} \cdot w_{31}$</td>
        <td>$x_{34} \cdot w_{32}$</td>
        <td>$x_{35} \cdot w_{33}$</td>
    </tr>
    <tr>
        <td>$x_{41}$</td>
        <td>$x_{42}$</td>
        <td>$x_{43}$</td>
        <td>$x_{44}$</td>
        <td>$x_{45}$</td>
    </tr>
    <tr>
        <td>$x_{51}$</td>
        <td>$x_{52}$</td>
        <td>$x_{53}$</td>
        <td>$x_{54}$</td>
        <td>$x_{55}$</td>
    </tr>
</table>

For the third step:
<table align="center">
    <tr>
        <td>$x_{11}$</td>
        <td>$x_{12}$</td>
        <td>$x_{13}$</td>
        <td>$x_{14}$</td>
        <td>$x_{15}$</td>
    </tr>
    <tr>
        <td>$x_{21}$</td>
        <td>$x_{22}$</td>
        <td>$x_{23}$</td>
        <td>$x_{24}$</td>
        <td>$x_{25}$</td>
    </tr>
    <tr>
        <td>$x_{31} \cdot w_{11}$</td>
        <td>$x_{32} \cdot w_{12}$</td>
        <td>$x_{33} \cdot w_{13}$</td>
        <td>$x_{34}$</td>
        <td>$x_{35}$</td>
    </tr>
    <tr>
        <td>$x_{41} \cdot w_{21}$</td>
        <td>$x_{42} \cdot w_{22}$</td>
        <td>$x_{43} \cdot w_{23}$</td>
        <td>$x_{44}$</td>
        <td>$x_{45}$</td>
    </tr>
    <tr>
        <td>$x_{51} \cdot w_{31}$</td>
        <td>$x_{52} \cdot w_{32}$</td>
        <td>$x_{53} \cdot w_{33}$</td>
        <td>$x_{54}$</td>
        <td>$x_{55}$</td>
    </tr>
</table>

And for the fourth step:
<table align="center">
    <tr>
        <td>$x_{11}$</td>
        <td>$x_{12}$</td>
        <td>$x_{13}$</td>
        <td>$x_{14}$</td>
        <td>$x_{15}$</td>
    </tr>
    <tr>
        <td>$x_{21}$</td>
        <td>$x_{22}$</td>
        <td>$x_{23}$</td>
        <td>$x_{24}$</td>
        <td>$x_{25}$</td>
    </tr>
    <tr>
        <td>$x_{31}$</td>
        <td>$x_{32}$</td>
        <td>$x_{33} \cdot w_{11}$</td>
        <td>$x_{34} \cdot w_{12}$</td>
        <td>$x_{35} \cdot w_{13}$</td>
    </tr>
    <tr>
        <td>$x_{41}$</td>
        <td>$x_{42}$</td>
        <td>$x_{43} \cdot w_{21}$</td>
        <td>$x_{44} \cdot w_{22}$</td>
        <td>$x_{45} \cdot w_{23}$</td>
    </tr>
    <tr>
        <td>$x_{51}$</td>
        <td>$x_{52}$</td>
        <td>$x_{53} \cdot w_{31}$</td>
        <td>$x_{54} \cdot w_{32}$</td>
        <td>$x_{55} \cdot w_{33}$</td>
    </tr>
</table>
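
The four steps above can be reproduced numerically. This is a sketch with made-up values (the numbers in `X` and `W` and the helper `conv2d_strided` are assumptions chosen only to make the arithmetic easy to follow):

```python
def conv2d_strided(X, W, stride):
    """Valid cross-correlation of a square input X with a square kernel W."""
    k = len(W)
    out_size = (len(X) - k) // stride + 1
    return [
        [
            sum(W[m][n] * X[i * stride + m][j * stride + n]
                for m in range(k) for n in range(k))
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# X[i][j] = 5 * i + j + 1, i.e. the numbers 1..25 laid out row by row
X = [[5 * i + j + 1 for j in range(5)] for i in range(5)]
W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

Y = conv2d_strided(X, W, stride=2)
print(Y)  # [[411, 501], [861, 951]]
```

Each entry of `Y` corresponds to one of the four window positions shown in the tables.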

We need to compute the backward step, i.e. the gradient with respect to the input of the layer:
<br>
From:
$$\begin{align} \nabla Y &= \begin{pmatrix}
           \frac{\partial C}{ \partial y_{11}}, \frac{\partial C}{\partial y_{12}} \\
           \frac{\partial C}{\partial y_{21}}, \frac{\partial C}{\partial y_{22}} \\
         \end{pmatrix}
  \end{align}$$

We should be able to calculate:
$$\begin{align} \nabla X &= \begin{pmatrix}
           \frac{\partial C}{ \partial x_{11}}, \frac{\partial C}{ \partial x_{12}}, \frac{\partial C}{ \partial x_{13}}, \frac{\partial C}{ \partial x_{14}}, \frac{\partial C}{ \partial x_{15}} \\
           \frac{\partial C}{ \partial x_{21}}, \frac{\partial C}{ \partial x_{22}}, \frac{\partial C}{ \partial x_{23}}, \frac{\partial C}{ \partial x_{24}}, \frac{\partial C}{ \partial x_{25}} \\
           \frac{\partial C}{ \partial x_{31}}, \frac{\partial C}{ \partial x_{32}}, \frac{\partial C}{ \partial x_{33}}, \frac{\partial C}{ \partial x_{34}}, \frac{\partial C}{ \partial x_{35}} \\
           \frac{\partial C}{ \partial x_{41}}, \frac{\partial C}{ \partial x_{42}}, \frac{\partial C}{ \partial x_{43}}, \frac{\partial C}{ \partial x_{44}}, \frac{\partial C}{ \partial x_{45}} \\
           \frac{\partial C}{ \partial x_{51}}, \frac{\partial C}{ \partial x_{52}}, \frac{\partial C}{ \partial x_{53}}, \frac{\partial C}{ \partial x_{54}}, \frac{\partial C}{ \partial x_{55}} \\
         \end{pmatrix}
  \end{align}$$

Each $x_{i,j}$ contributes to one or several results in $Y$, so according to the chain rule:
$$
\frac{\partial C}{\partial x_{m, n}} = \sum_{i, j}\frac{\partial C}{\partial y_{i, j}} \cdot \frac{\partial y_{i, j}}{\partial x_{m, n}}
$$

So we have:
$$
y_{11} = x_{11} \cdot w_{11} + x_{12} \cdot w_{12} + x_{13} \cdot w_{13} + x_{21} \cdot w_{21} + x_{22} \cdot w_{22} + x_{23} \cdot w_{23} + x_{31} \cdot w_{31} + x_{32} \cdot w_{32} + x_{33} \cdot w_{33} \\
y_{12} = x_{13} \cdot w_{11} + x_{14} \cdot w_{12} + x_{15} \cdot w_{13} + x_{23} \cdot w_{21} + x_{24} \cdot w_{22} + x_{25} \cdot w_{23} + x_{33} \cdot w_{31} + x_{34} \cdot w_{32} + x_{35} \cdot w_{33} \\
y_{21} = x_{31} \cdot w_{11} + x_{32} \cdot w_{12} + x_{33} \cdot w_{13} + x_{41} \cdot w_{21} + x_{42} \cdot w_{22} + x_{43} \cdot w_{23} + x_{51} \cdot w_{31} + x_{52} \cdot w_{32} + x_{53} \cdot w_{33} \\
y_{22} = x_{33} \cdot w_{11} + x_{34} \cdot w_{12} + x_{35} \cdot w_{13} + x_{43} \cdot w_{21} + x_{44} \cdot w_{22} + x_{45} \cdot w_{23} + x_{53} \cdot w_{31} + x_{54} \cdot w_{32} + x_{55} \cdot w_{33}
$$

Let's calculate the gradient for the first element $x_{11}$:
<br>
$$
y_{11} = \pmb{x_{11}} \cdot w_{11} + x_{12} \cdot w_{12} + x_{13} \cdot w_{13} + x_{21} \cdot w_{21} + x_{22} \cdot w_{22} + x_{23} \cdot w_{23} + x_{31} \cdot w_{31} + x_{32} \cdot w_{32} + x_{33} \cdot w_{33} \\
$$
<br>
Here $x_{11}$ contributes only to $y_{11}$ and therefore:
$$
\frac{\partial C}{\partial x_{11}} = \frac{\partial C}{\partial y_{11}}\frac {\partial y_{11}} {\partial {x_{11}}} = \frac{\partial C}{\partial y_{11}} \cdot w_{11}
$$

Now, let's consider $x_{12}$:
<br>
$$
y_{11} = x_{11} \cdot w_{11} + \pmb{x_{12}} \cdot w_{12} + x_{13} \cdot w_{13} + x_{21} \cdot w_{21} + x_{22} \cdot w_{22} + x_{23} \cdot w_{23} + x_{31} \cdot w_{31} + x_{32} \cdot w_{32} + x_{33} \cdot w_{33} \\
$$
<br>
Here $x_{12}$ also contributes only to $y_{11}$ and therefore:
$$
\frac{\partial C}{\partial x_{12}} = \frac{\partial C}{\partial y_{11}}\frac {\partial y_{11}} {\partial {x_{12}}} = \frac{\partial C}{\partial y_{11}} \cdot w_{12}
$$

The input $x_{13}$ contributes to both $y_{11}$ and $y_{12}$:
<br>
$$
y_{11} = x_{11} \cdot w_{11} + x_{12} \cdot w_{12} + \pmb{x_{13}} \cdot w_{13} + x_{21} \cdot w_{21} + x_{22} \cdot w_{22} + x_{23} \cdot w_{23} + x_{31} \cdot w_{31} + x_{32} \cdot w_{32} + x_{33} \cdot w_{33} \\
y_{12} = \pmb{x_{13}} \cdot w_{11} + x_{14} \cdot w_{12} + x_{15} \cdot w_{13} + x_{23} \cdot w_{21} + x_{24} \cdot w_{22} + x_{25} \cdot w_{23} + x_{33} \cdot w_{31} + x_{34} \cdot w_{32} + x_{35} \cdot w_{33} \\
$$
<br>
And thus:
$$
\frac{\partial C}{\partial x_{13}} = \frac{\partial C}{\partial y_{11}}\frac {\partial y_{11}}{\partial x_{13}} + \frac{\partial C}{\partial y_{12}} \frac {\partial y_{12}} {\partial x_{13}} = \frac{\partial C}{\partial y_{11}} \cdot w_{13} +  \frac{\partial C}{\partial y_{12}} \cdot w_{11}
$$


The input $x_{33}$ contributes to all four outputs:
$$
y_{11} = x_{11} \cdot w_{11} + x_{12} \cdot w_{12} + x_{13} \cdot w_{13} + x_{21} \cdot w_{21} + x_{22} \cdot w_{22} + x_{23} \cdot w_{23} + x_{31} \cdot w_{31} + x_{32} \cdot w_{32} + \pmb{x_{33}} \cdot w_{33} \\
y_{12} = x_{13} \cdot w_{11} + x_{14} \cdot w_{12} + x_{15} \cdot w_{13} + x_{23} \cdot w_{21} + x_{24} \cdot w_{22} + x_{25} \cdot w_{23} + \pmb{x_{33}} \cdot w_{31} + x_{34} \cdot w_{32} + x_{35} \cdot w_{33} \\
y_{21} = x_{31} \cdot w_{11} + x_{32} \cdot w_{12} + \pmb{x_{33}} \cdot w_{13} + x_{41} \cdot w_{21} + x_{42} \cdot w_{22} + x_{43} \cdot w_{23} + x_{51} \cdot w_{31} + x_{52} \cdot w_{32} + x_{53} \cdot w_{33} \\
y_{22} = \pmb{x_{33}} \cdot w_{11} + x_{34} \cdot w_{12} + x_{35} \cdot w_{13} + x_{43} \cdot w_{21} + x_{44} \cdot w_{22} + x_{45} \cdot w_{23} + x_{53} \cdot w_{31} + x_{54} \cdot w_{32} + x_{55} \cdot w_{33}
$$
<br>
and therefore:
$$
\frac{\partial C}{\partial x_{33}} = \frac{\partial C}{\partial y_{11}}\frac {\partial y_{11}}{\partial x_{33}} + \frac{\partial C}{\partial y_{12}} \frac {\partial y_{12}} {\partial x_{33}} + \frac{\partial C}{\partial y_{21}}\frac {\partial y_{21}}{\partial x_{33}} + \frac{\partial C}{\partial y_{22}} \frac {\partial y_{22}} {\partial x_{33}} = \frac{\partial C}{\partial y_{11}} \cdot w_{33} +  \frac{\partial C}{\partial y_{12}} \cdot w_{31} + \frac{\partial C}{\partial y_{21}} \cdot w_{13} +  \frac{\partial C}{\partial y_{22}} \cdot w_{11}
$$
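
The chain rule above can be applied mechanically: every output gradient is sent back to the inputs of its window, weighted by the kernel entry that multiplied that input. Below is a sketch with made-up numbers, where `dY` stands for $\nabla Y$ and the helper name `input_gradient` is an assumption:

```python
def input_gradient(dY, W, stride, in_size):
    """Accumulate dC/dx by the chain rule over all windows (square shapes)."""
    k = len(W)
    dX = [[0] * in_size for _ in range(in_size)]
    for i, row in enumerate(dY):
        for j, g in enumerate(row):          # g = dC/dy_ij
            for m in range(k):
                for n in range(k):           # y_ij used x[i*s+m][j*s+n] * W[m][n]
                    dX[i * stride + m][j * stride + n] += g * W[m][n]
    return dX


W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]            # w_11 = 1, ..., w_33 = 9
dY = [[1, 10],
      [100, 1000]]         # dC/dy_11, dC/dy_12, dC/dy_21, dC/dy_22

dX = input_gradient(dY, W, stride=2, in_size=5)
print(dX[0][0])  # x_11: 1 * w_11 = 1
print(dX[0][2])  # x_13: 1 * w_13 + 10 * w_11 = 13
print(dX[2][2])  # x_33: 1 * w_33 + 10 * w_31 + 100 * w_13 + 1000 * w_11 = 1379
```

The powers of ten in `dY` make it easy to read off which weight each output gradient was multiplied by.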

Finally, we can compute the whole $\nabla X$ with a dilated convolution. Dilate $\nabla Y$ with $s - 1 = 1$ zero between its elements and pad it with $k - 1 = 2$ zeros on each side:
<br>
$$\begin{align} \nabla Y &= \begin{pmatrix}
           0, 0, 0, 0, 0, 0, 0 \\
           0, 0, 0, 0, 0, 0, 0 \\
           0, 0, \frac{\partial C}{ \partial y_{11}}, 0, \frac{\partial C}{\partial y_{12}}, 0, 0 \\
           0, 0, 0, 0, 0, 0, 0 \\
           0, 0, \frac{\partial C}{\partial y_{21}}, 0, \frac{\partial C}{\partial y_{22}}, 0, 0 \\
           0, 0, 0, 0, 0, 0, 0 \\
           0, 0, 0, 0, 0, 0, 0 \\
         \end{pmatrix}
  \end{align}$$
<br>
and convolve it (with stride $1$) with the flipped kernel:
$$\begin{align} W &= \begin{pmatrix}
           w_{33}, w_{32}, w_{31} \\
           w_{23}, w_{22}, w_{21} \\
           w_{13}, w_{12}, w_{11} \\
         \end{pmatrix}
  \end{align}$$

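Note that the zeros are inserted into $\nabla Y$, not into the kernel. Keeping $\nabla Y$ dense:
<br>
$$\begin{align} \nabla Y &= \begin{pmatrix}
           0, 0, 0, 0, 0, 0 \\
           0, 0, 0, 0, 0, 0 \\
           0, 0, \frac{\partial C}{ \partial y_{11}}, \frac{\partial C}{\partial y_{12}}, 0, 0 \\
           0, 0, \frac{\partial C}{\partial y_{21}}, \frac{\partial C}{\partial y_{22}}, 0, 0 \\
           0, 0, 0, 0, 0, 0 \\
           0, 0, 0, 0, 0, 0 \\
         \end{pmatrix}
  \end{align}$$
<br>
and dilating the kernel instead:
$$\begin{align} W &= \begin{pmatrix}
           w_{33}, 0, w_{32}, 0, w_{31} \\
           0, 0, 0, 0, 0 \\
           w_{23}, 0, w_{22}, 0, w_{21} \\
           0, 0, 0, 0, 0 \\
           w_{13}, 0, w_{12}, 0, w_{11} \\
         \end{pmatrix}
  \end{align}$$
<br>
is not equivalent: in the dense $\nabla Y$ neighbouring gradients are one step apart, while the non-zero weights of the dilated kernel are two steps apart, so a sum such as $\frac{\partial C}{\partial y_{11}} \cdot w_{13} + \frac{\partial C}{\partial y_{12}} \cdot w_{11}$ for $x_{13}$ can never be formed.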

So we have a dilated convolution with the kernel flipped by $180$ degrees.
<br>
Dilating the next layer's gradient tensor with $s - 1$ zeros between its elements, padding it with $(k_1 - 1, k_2 - 1)$ zeros and convolving it with the flipped kernel backpropagates the error
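
The recipe can be verified against the element-by-element chain rule. The sketch below uses made-up numbers and hypothetical helper names (`chain_rule_gradient`, `dilate`, `pad`, `rot180`, `correlate`); it assumes square inputs and kernels, stride $2$ in the forward pass and no forward padding.

```python
def chain_rule_gradient(dY, W, stride, in_size):
    """Reference: send each dC/dy_ij back to the inputs of its window."""
    k = len(W)
    dX = [[0] * in_size for _ in range(in_size)]
    for i, row in enumerate(dY):
        for j, g in enumerate(row):
            for m in range(k):
                for n in range(k):
                    dX[i * stride + m][j * stride + n] += g * W[m][n]
    return dX


def dilate(G, stride):
    """Insert stride - 1 zeros between neighbouring elements of G."""
    size = (len(G) - 1) * stride + 1
    out = [[0] * size for _ in range(size)]
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            out[i * stride][j * stride] = g
    return out


def pad(G, p):
    """Surround G with p zeros on every side."""
    size = len(G) + 2 * p
    out = [[0] * size for _ in range(size)]
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            out[i + p][j + p] = g
    return out


def rot180(W):
    """Flip the kernel by 180 degrees."""
    return [row[::-1] for row in W[::-1]]


def correlate(X, W):
    """Stride-1 valid cross-correlation."""
    k = len(W)
    size = len(X) - k + 1
    return [
        [sum(W[m][n] * X[i + m][j + n] for m in range(k) for n in range(k))
         for j in range(size)]
        for i in range(size)
    ]


W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dY = [[1, 10], [100, 1000]]
stride, k = 2, len(W)

dX_reference = chain_rule_gradient(dY, W, stride, in_size=5)
dX_conv = correlate(pad(dilate(dY, stride), k - 1), rot180(W))
print(dX_conv == dX_reference)  # True
```

Both routes give the same $5 \times 5$ gradient, including the value $1379$ for $x_{33}$ that follows from the formula derived earlier.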

#### Task

Do the same with channels

Here is the example of convolution, where input has three channels:
<img src="images/cnn/convolution_4.gif" height="600" width="600" />

#### Hint
The picture here is similar. For every input channel $d = 1, \dots, D$:
$$
\frac{\partial C}{\partial x_{33d}} = \frac{\partial C}{\partial y_{11}}\frac {\partial y_{11}}{\partial x_{33d}} + \frac{\partial C}{\partial y_{12}} \frac {\partial y_{12}} {\partial x_{33d}} + \frac{\partial C}{\partial y_{21}}\frac {\partial y_{21}}{\partial x_{33d}} + \frac{\partial C}{\partial y_{22}} \frac {\partial y_{22}} {\partial x_{33d}} = \frac{\partial C}{\partial y_{11}} \cdot w_{33d} +  \frac{\partial C}{\partial y_{12}} \cdot w_{31d} + \frac{\partial C}{\partial y_{21}} \cdot w_{13d} +  \frac{\partial C}{\partial y_{22}} \cdot w_{11d}
$$

If we replace $x_{ij}$ with $a^l_{ij}$, the activation of the previous layer, and $y_{ij}$ with $z^{l+1}_{ij}$, and recall that:
<br>
$$
\delta^l_j = \frac{\partial C}{\partial z^l_j}
$$
then we get:
$$\begin{align} \nabla Y &= \begin{pmatrix}
           \delta^{l+1}_{11}, \delta^{l+1}_{12} \\
           \delta^{l+1}_{21}, \delta^{l+1}_{22} \\
         \end{pmatrix}
  \end{align}$$

We can conclude that backpropagation through a convolutional layer is a convolution of $\delta^{l+1}$ with the flipped kernel

#### Pooling

Pooling layers have no learnable parameters. For instance, after max-pooling only the maximum values affect the error:
<img src="images/cnn/pooling_1.png" height="600" width="600" />

For a max-pooling layer, backpropagation passes the gradient only to the position of the maximum value in each sliding window; all other positions receive zero gradient

For average pooling every input in the window receives an equal share of the error, namely the output gradient multiplied by:
$$
\frac{1}{K_1 \times K_2}
$$
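
Both backward rules can be sketched in a few lines. The example assumes non-overlapping $k \times k$ windows (stride equal to the window size), made-up values, and hypothetical function names:

```python
def maxpool_backward(X, dY, k):
    """Route each output gradient to the position of the window maximum."""
    dX = [[0.0] * len(X[0]) for _ in X]
    for i, row in enumerate(dY):
        for j, g in enumerate(row):
            window = [(X[i * k + m][j * k + n], i * k + m, j * k + n)
                      for m in range(k) for n in range(k)]
            _, r, c = max(window, key=lambda t: t[0])  # first maximum wins on ties
            dX[r][c] += g
    return dX


def avgpool_backward(dY, k):
    """Give every input of a window 1 / (k * k) of the output gradient."""
    size = len(dY) * k
    dX = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(dY):
        for j, g in enumerate(row):
            for m in range(k):
                for n in range(k):
                    dX[i * k + m][j * k + n] += g / (k * k)
    return dX


X = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [2, 6, 3, 1]]
dY = [[1.0, 2.0],
      [3.0, 4.0]]

dX_max = maxpool_backward(X, dY, k=2)
dX_avg = avgpool_backward(dY, k=2)
for row in dX_max:
    print(row)
```

In `dX_max` only the four window maxima (4, 5, 6 and 7 in `X`) receive a gradient, while in `dX_avg` every input gets a quarter of its window's gradient.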

## Feature map visualization

Convolutional neural networks have a hierarchical structure, so we can expect that learning happens hierarchically as well:
- The first layers detect simple, local edges
- The middle layers detect more complex edges and color maps
- The last layers detect object patterns
- Then linear classifiers distinguish objects according to the extracted features

Let's visualize the weights of the first-layer filters / kernels of different models:
<img src="images/ft/weights_vis_1.png" height="800" width="800" />

If we visualize weights layer by layer:
<img src="images/ft/weights_vis_2.png" height="800" width="800" />

If we visualize the learned feature map activations as images, we can observe this hierarchy:
<img src="images/ft/features_1.png" height="800" width="800" />

Here we see different feature maps visualization:
<img src="images/ft/features_2.jpg" height="600" width="600" />

Different models extract features in a different hierarchy, but the overall pattern is preserved:
<img src="images/ft/features_3.jpg" height="200" width="600" />

Here we can observe features for different images:
<img src="images/ft/features_4.jpg" height="800" width="800" />

## Feature extraction / embedding

Let's take one of the models pre-trained on ImageNet (VGG, Inception, ResNet, etc.) and remove the last layers that come after the convolutional layers:
- For VGG16 remove the fully connected layers
- For Inception and ResNet remove all the layers after the adaptive (global) average pooling
<br>
So our model generates a vector from the image

In [14]:
import torch
from torch import nn
from torchvision.models import resnet50, resnet34, vgg16


In [21]:
net = vgg16(pretrained=True)
net

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

In [4]:
?? net

In [22]:
model = nn.Sequential(*list(net.children())[:-1])
model

Sequential(
  (0): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

In [23]:
x = torch.randn(1, 3, 399, 399)
with torch.no_grad():
    y1 = net(x)
    y2 = model(x)

In [24]:
y1.size()

torch.Size([1, 1000])

In [25]:
y2.size()

torch.Size([1, 512, 7, 7])

In [26]:
y2 = torch.flatten(y2, 1)
y2.size()

torch.Size([1, 25088])

So we have $25088$-dimensional vectors: we can run our model on our own dataset of images and generate a $25088$-dimensional vector for each image.
$$
f: \mathbb{R}^{3 \times H \times W} \mapsto \mathbb{R}^d
$$
<br>
Our model maps each $C \times H \times W$ image ($H$ and $W$ may vary thanks to adaptive average pooling) to a fixed $d$-dimensional vector

Vectors have a "distance" property.
<br>
If we store these vectors and run a K-nearest neighbor search, we can observe that similarity search works even if our dataset was not used during training.
<br>
Note: search results depend on the model and on how close the domain of the training set is to our dataset
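
A minimal sketch of such a search, assuming cosine similarity as the measure and a tiny made-up index of 3-dimensional "embeddings" (real ones would come from the model above; the names `index`, `query` and `nearest_neighbors` are illustrative):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up embeddings: two "cat" images and two "car" images
index = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
    "car_2": [0.1, 0.0, 0.8],
}
query = [1.0, 0.1, 0.05]  # embedding of a new cat-like image

neighbors = nearest_neighbors(query, index, k=2)
print(neighbors)  # ['cat_1', 'cat_2']
```

A brute-force scan like this is fine for small collections; larger ones usually need an approximate nearest neighbor index.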

Similarity search examples:
<img src="images/ft/sim_1.png" height="1000" width="1000" />

Dimensionality reduction and clustering:
<img src="images/ft/sim_2.png" height="1000" width="1000" />

## Transfer-learning

We can see that the first layers extract basic features which are pretty similar for all images. The middle layers extract more complex features and the last layers extract more domain-specific ones
<br>
Can we use this information for a different task? Would a model pre-trained on a different dataset provide enough information, enough features?

With the following approach:
- We extract features from the images with the pre-trained model
- Train a different model on these features

It turns out that this approach works, and it's called transfer learning.
For transfer learning we should consider the following:
- Is the model trained on a similar domain?
- Is the model trained on enough data?

State-of-the-art results were achieved with models pre-trained on the ImageNet classification task:
- It has diverse and well-distributed images
- It is more precisely labeled
<br>
In other words, it has enough images to extract "all possible" features

There are several approaches:
- Use the extracted features and train a different model
- Freeze the weights and train only the classifier
- Fine-tune the whole model with discriminative learning rates

The first approach needs pre-extraction of the feature vectors and training a different model on them:
- Extract features
- Train a different classifier (SVM, RF, GB) on them

For the second approach we put our layers on top of the model and train them:
- Put custom layers on top of the model
- Freeze the weights of the feature extraction layers
- Train the custom layers

In [33]:
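for param in model.parameters():  # freeze the pre-trained feature extractor
    param.requires_grad = False
# nn.Flatten reshapes the [N, 512, 7, 7] feature map to [N, 25088] for the linear layers
model_fn = nn.Sequential(*model.children(), nn.Flatten(), nn.Linear(25088, 500),
                         nn.Dropout(p=0.3), nn.Linear(500, 20))
model_fn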

Sequential(
  (0): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

For the third approach, we put our layers on top of the model and train it with different learning rates:
- Put custom layers on top of the model
- Train the full model using a larger learning rate for the last layers and smaller ones for the rest, maybe $\frac{1}{100}$ of it for the middle layers and $\frac{1}{1000}$ for the first layers
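
In PyTorch, discriminative learning rates can be expressed with per-parameter-group options of the optimizer. The sketch below uses a tiny stand-in network instead of the pre-trained model, and the split into "first", "middle" and "head" blocks, the base learning rate and the choice of SGD are all assumptions made for illustration:

```python
import torch
from torch import nn

# Stand-in for a pre-trained backbone with a custom head on top
first = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
middle = nn.Sequential(nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 20))
model = nn.Sequential(first, middle, head)

base_lr = 1e-2
optimizer = torch.optim.SGD(
    [
        {"params": first.parameters(), "lr": base_lr / 1000},  # first layers
        {"params": middle.parameters(), "lr": base_lr / 100},  # middle layers
        {"params": head.parameters(), "lr": base_lr},          # custom layers
    ],
    lr=base_lr,
    momentum=0.9,
)

learning_rates = [group["lr"] for group in optimizer.param_groups]
print(learning_rates)
```

Each dictionary becomes one parameter group, so a single `optimizer.step()` updates the early layers with much smaller steps than the newly added head.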

Pre-trained classifiers are also used for different tasks:
- Segmentation
- Detection
- Image search / metric learning
- Auto-encoders
- GAN
- etc

Transfer learning also works in other domains, for example with NLP models

<img src="images/cnn/questions.jpg" height="600" width="600" />

#### Thank you