# Convolutional Neural Network step-by-step

## Pooling Layer



### Local Down-Pooling Layers 

<font size=4.5pt  face = 'georgia' style='Line-height :2.5'><div style='text-align: justify; padding:35px;margin-bottom:-35px'>
    The main idea of the <b>pooling (down-sampling)</b> layer is to reduce the feature map dimensions without a significant loss of information.<br>
This idea is based on the assumption that each feature of the feature map is spread over a group of several neighboring pixels.
    <br>
Thus we can reduce the image size by keeping one representative value for each pixel group.<br>

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px;margin-bottom:-35px'>
    The most popular pooling technique is <b>max-pooling</b>.<br>
<div style='text-align: justify; padding:55px'>    
    <em>The main idea of max-pooling is that the most intense (brightest) pixel carries the most information.</em><br>
    <u>Max pooling also performs noise suppression</u> (under the assumption that the higher the intensity, the less it is affected by noise, artifacts and other random effects).<br> In other words, max-pooling makes the model more invariant (robust) to certain distortions.
    
<div>
    <img src="attachment:image.png" width="400">
    </div>
<!-- ![image.png](attachment:image.png)     -->    


<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px;margin-bottom:-35px'>
<b>The advantages of pooling</b>
<li> enhances invariance to transformations and robustness
<li> reduces the number of parameters in the subsequent layers
<li> speeds up computation
<li> helps to avoid overfitting
    
    

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
    
The size reduction produced by a pooling layer (without padding) can be expressed as 
$$
\begin{aligned}
&W^{out} = (W^{in}-F)/S + 1; \\
&H^{out} = (H^{in}-F)/S + 1; \\
&D^{out} = D^{in},
\end{aligned}
$$
where: 
<li> $W^{out}$, $H^{out}$ and $D^{out}$ are the dimensions of output; 
<li> $W^{in}$, $H^{in}$ and $D^{in}$   are the dimensions of input;  
<li> $F$ is the kernel (filter) size and $S$ is the stride.
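
The size formula above can be checked with a few lines of Python. This is only a minimal sketch: the function name `pooled_size` is hypothetical, no padding is assumed, and floor division is assumed for the case when the window does not fit an integer number of times.

```python
def pooled_size(w_in, h_in, d_in, f, s):
    """Return (W_out, H_out, D_out) for kernel size f and stride s, no padding."""
    if f > w_in or f > h_in:
        raise ValueError("kernel is larger than the input")
    w_out = (w_in - f) // s + 1   # floor division: incomplete windows are dropped
    h_out = (h_in - f) // s + 1
    return w_out, h_out, d_in     # pooling does not change the depth


print(pooled_size(32, 32, 16, f=2, s=2))  # (16, 16, 16)
```

With $F=2$ and $S=2$ each spatial dimension is halved, while the depth stays the same.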

<!-- https://ai.plainenglish.io/pooling-layer-beginner-to-intermediate-fa0dbdce80eb -->

<font size=4.5pt  face = 'georgia' style='Line-height :3'><div style='text-align: justify; padding:35px'>
    
Several types of pooling layers can be distinguished, such as
<ul>    
    <li> <b>Max Pooling</b>: 
        <ul>
            The maximum value of each kernel window is taken,
            $$ \tilde r = \max[r_{ij}],$$ 
            where <br>
            <ul>
                <li>$ \tilde r$ is the output of the pooling layer;
                <li>$r_{ij}$ is the input region.
            </ul>
            This is the most popular type of pooling. <br> 
            Other variants are 
            <ul>
                <b>Median Pooling</b>: the central value of the sorted region is taken,
                $$ \tilde r = \text{sort}[r_{ij}][\text{center value}].$$             
                <b>k-Max Pooling</b>: the $k$ most intense values of each kernel window are kept (the region is sorted in descending order),
                $$ \tilde r = \text{sort}[r_{ij}][:k].$$ 
                <b>Min Pooling</b>: the minimum value of each kernel window is taken,
                $$ \tilde r = \min[r_{ij}].$$            
            </ul> 
        </ul>        
    <li><b>Average Pooling (or mean-pooling)</b>: 
        <ul> 
            The average value of the kernel window is calculated,
            $$ \tilde r = \frac{1}{size(r_{ij})}\sum_{ij}r_{ij},$$ 
            which is less robust than max pooling, but more sensitive to features. <br> 
            Other variations of Average Pooling are:
            <ul>       
                 <b>L2 Pooling</b>: the (normalized) L2, or Frobenius, norm is calculated as
                    $$ \tilde r = \sqrt{\frac{1}{size(r_{ij})}\sum_{ij}r_{ij}^2},$$ 
                 <b>Lp norm (Generalized Mean Pooling, GeM) pooling</b> is calculated as
                    $$ \tilde r = \left(\frac{1}{size(r_{ij})}\sum_{ij}r_{ij}^p\right)^{1/p},$$ 
                    which, for non-negative inputs, reduces to average pooling for $p=1$ and approaches max pooling for $p\to\infty$ (or simply for a high degree $p$).   
            </ul>
       </ul>       
    <li> <b>Mixed Pooling</b>: <br> a combination of several pooling types;    
        <ul>
            for the average + max case 
            $$ \tilde r  = \lambda \max[r_{ij}] + \frac{1-\lambda}{size(r_{ij})}\sum_{ij}r_{ij} ,$$
            where $\lambda\in[0,1]$ is a mixing hyperparameter. 
       </ul>        
    <li> <b>Soft Pooling</b> (softmax-weighted pooling): 
        <ul>   
            $$\tilde r = \sum_{ij}r_{ij}p_{ij}$$<br>
            where $p_{ij}=e^{r_{ij}}/\sum_{ij}e^{r_{ij}}$.
            <ul>
                Another variant is <b>stochastic (random) pooling</b>, where a single value $r_{ij}$ is sampled from the region according to the multinomial distribution $p_{ij}$. 
            </ul>            
        </ul>        
    <li> <b>Local Importance Pooling (adaptive pooling)</b>: 
        <ul>
            $$\tilde r = \sum_{ij}r_{ij}w_{ij},$$<br>
            where $w_{ij}=e^{G(r_{ij})}/\sum_{ij}e^{G(r_{ij})}$ (the weights already sum to one, so no extra normalization is needed),<br>
                and $G(r_{ij})$ is a multilayer perceptron (or another neural network) that scores the importance of each feature.
        </ul>
</ul>    
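
The pooling variants listed above differ only in how one window is reduced to one value, which a short NumPy sketch can make concrete. The names `extract_windows` and `pool2d` are hypothetical helpers, not part of any library; a single 2-D map and no padding are assumed, the Lp variant uses the generalized-mean form and assumes non-negative inputs, and the median is computed with `np.median` (which averages the two central values when the window has an even number of elements).

```python
import numpy as np


def extract_windows(x, f, s):
    """Split a 2-D map into flattened windows of shape (H_out, W_out, f*f)."""
    h_out = (x.shape[0] - f) // s + 1
    w_out = (x.shape[1] - f) // s + 1
    windows = np.empty((h_out, w_out, f * f), dtype=float)
    for i in range(h_out):
        for j in range(w_out):
            windows[i, j] = x[i * s:i * s + f, j * s:j * s + f].ravel()
    return windows


def pool2d(x, f=2, s=2, mode="max", p=3.0, lam=0.5):
    """Reduce every f x f window of x to one value according to `mode`."""
    r = extract_windows(x, f, s)
    if mode == "max":
        return r.max(axis=-1)
    if mode == "min":
        return r.min(axis=-1)
    if mode == "median":
        return np.median(r, axis=-1)
    if mode == "avg":
        return r.mean(axis=-1)
    if mode == "lp":      # generalized mean; assumes non-negative inputs
        return (r ** p).mean(axis=-1) ** (1.0 / p)
    if mode == "mixed":   # lam * max + (1 - lam) * average
        return lam * r.max(axis=-1) + (1.0 - lam) * r.mean(axis=-1)
    if mode == "soft":    # softmax-weighted sum, shifted for numerical stability
        e = np.exp(r - r.max(axis=-1, keepdims=True))
        w = e / e.sum(axis=-1, keepdims=True)
        return (r * w).sum(axis=-1)
    raise ValueError(f"unknown mode: {mode}")


x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```

The explicit loops are chosen for readability; a practical implementation would vectorize the window extraction.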
    
<!-- <li><b>Adaptive Pooling</b>: In this type of pooling, the the kernel size rule of value extraction and stride learning as hyper parameters.  -->   
    
<!-- <li> <b>Min Pooling</b>: In this type, the minimum value of each kernel in each depth slice is captured and passed on to the next layer.
    <ul>minimum value of each kernel $ \tilde r = \text{min}r_{ij},$</ul> -->
    
<!--     https://arxiv.org/ftp/arxiv/papers/2009/2009.07485.pdf -->    
<!--     https://arxiv.org/pdf/2101.00440v3.pdf -->
<!-- https://iq.opengenus.org/pooling-layers/ -->

![image.png](attachment:image.png)

<!-- https://arxiv.org/pdf/2101.00440v3.pdf -->

<div>
    <img src="attachment:image.png" width="1200">
    </div>
<!-- ![image.png](attachment:image.png)     -->



<!-- https://iq.opengenus.org/pooling-layers/ -->

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
    
The forward and backward propagation through max-pooling are illustrated below.

![image.png](attachment:image.png)

<!-- https://mukulrathi.co.uk/demystifying-deep-learning/conv-net-backpropagation-maths-intuition-derivation/ -->

![image.png](attachment:image.png)
<!-- https://www.jeremyjordan.me/semantic-segmentation/ -->
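
The forward and backward passes through max-pooling can also be sketched in NumPy. The function names are hypothetical, a single 2-D map without padding is assumed, and ties are resolved by taking the first maximum (the behaviour of `np.argmax`). In the forward pass the position of each maximum is remembered; in the backward pass the incoming gradient is routed only to that position.

```python
import numpy as np


def maxpool_forward(x, f=2, s=2):
    """Return the pooled map and the (row, col) position of every maximum."""
    h_out = (x.shape[0] - f) // s + 1
    w_out = (x.shape[1] - f) // s + 1
    out = np.empty((h_out, w_out))
    argmax = np.empty((h_out, w_out, 2), dtype=int)
    for i in range(h_out):
        for j in range(w_out):
            win = x[i * s:i * s + f, j * s:j * s + f]
            k, l = np.unravel_index(np.argmax(win), win.shape)
            out[i, j] = win[k, l]
            argmax[i, j] = (i * s + k, j * s + l)
    return out, argmax


def maxpool_backward(grad_out, argmax, in_shape):
    """Route every output gradient to the input element that won the max."""
    grad_in = np.zeros(in_shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            r, c = argmax[i, j]
            grad_in[r, c] += grad_out[i, j]   # all other elements get zero
    return grad_in


x = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 5.],
              [7., 1., 3., 2.],
              [0., 6., 4., 8.]])
out, argmax = maxpool_forward(x)
print(out)                                            # [[4. 5.] [7. 8.]]
print(maxpool_backward(np.ones_like(out), argmax, x.shape))
```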

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
    
The backward propagation through average-pooling is illustrated below.


![image.png](attachment:image.png)

<!-- https://www.programmersought.com/article/18724036732/  -->
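
For average-pooling the backward pass spreads each output gradient equally over its window. A minimal sketch, assuming a single 2-D map, no padding and a hypothetical function name `avgpool_backward`:

```python
import numpy as np


def avgpool_backward(grad_out, in_shape, f=2, s=2):
    """Distribute every output gradient uniformly over its f x f window."""
    grad_in = np.zeros(in_shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            # each of the f*f inputs contributed 1/(f*f) to the average
            grad_in[i * s:i * s + f, j * s:j * s + f] += grad_out[i, j] / (f * f)
    return grad_in


grad_out = np.array([[4., 8.], [12., 16.]])
print(avgpool_backward(grad_out, (4, 4)))
```

With non-overlapping windows the total gradient is preserved: the sum over the input gradient equals the sum over the output gradient.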

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
    
The forward and backward propagation through Soft Pooling are illustrated below.

![image.png](attachment:image.png)

<!-- https://arxiv.org/pdf/2101.00440v3.pdf -->
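
For one window, the soft pooling output $\tilde r=\sum_{ij}r_{ij}p_{ij}$ with softmax weights $p_{ij}$ can be differentiated directly, which gives $\partial\tilde r/\partial r_{ij}=p_{ij}(1+r_{ij}-\tilde r)$. The sketch below implements exactly this formula for a single window; the function names are hypothetical, and the gradient is the analytic derivative of the expression used in this notebook, which is an assumption and may differ in detail from a particular published implementation.

```python
import numpy as np


def softpool_window(r):
    """Soft pooling of one window: softmax-weighted sum of its elements."""
    r = np.asarray(r, dtype=float).ravel()
    e = np.exp(r - r.max())          # shift by the maximum for stability
    p = e / e.sum()                  # softmax weights, sum to one
    return float((r * p).sum()), p


def softpool_window_grad(r):
    """Analytic derivative of the soft pooling output w.r.t. every input."""
    r = np.asarray(r, dtype=float).ravel()
    out, p = softpool_window(r)
    return p * (1.0 + r - out)


r = np.array([1.0, 2.0, 3.0, 4.0])
out, p = softpool_window(r)
print(out, p)
print(softpool_window_grad(r))
```

Unlike max-pooling, every element of the window receives a non-zero gradient, with larger activations receiving more of it.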

### Global Pooling Layers

![image.png](attachment:image.png)

<!-- https://codelabs.developers.google.com/codelabs/keras-flowers-tpu#8 -->

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
    
The conventional way to transform feature maps into a <b>flat</b> vector (flattening) is the `flatten` (or `ravel`, `vect`, etc.) operation, which consists of the following steps:<ul>
    <li> split each 2d feature map into vectors (its rows or columns),
    <li> concatenate the vectors of one 2d feature map,
    <li> concatenate the vectors of all feature maps into one large vector.
   </ul>
The main drawbacks of this approach are: <ul>
    <li> a lot of parameters (in the subsequent fully connected layer),
    <li> the size of the fully connected layer depends on the number of inputs, 
        <ul> thus only images of a predefined size can be used. </ul>
 </ul>
      

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>

As an alternative to the flattening layer, global pooling layers can be considered.
The main advantage of replacing flatten with global down-pooling is that the requirement on the input image size is relaxed: it is enough that each $MAP$ contains at least one pixel 
    ($\text{size}(\text{MAP})\ge 1$).
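
The difference between flattening and global pooling is easy to see on two inputs of different spatial size. The sketch below assumes feature maps stored as a NumPy array of shape `(C, H, W)`; the function names are hypothetical.

```python
import numpy as np


def flatten(maps):
    """Concatenate all feature maps into one vector of length C*H*W."""
    return maps.reshape(-1)


def global_avg_pool(maps):
    """One value per channel: the mean over all spatial positions."""
    return maps.mean(axis=(1, 2))


def global_max_pool(maps):
    """One value per channel: the maximum over all spatial positions."""
    return maps.max(axis=(1, 2))


small = np.random.default_rng(0).random((8, 5, 5))
large = np.random.default_rng(1).random((8, 13, 9))
print(flatten(small).shape, flatten(large).shape)                  # (200,) (936,)
print(global_avg_pool(small).shape, global_avg_pool(large).shape)  # (8,) (8,)
```

The flattened length changes with the input size, so a fully connected layer after it would need a different number of weights, while the globally pooled vector always has one element per channel.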

![image.png](attachment:image.png)




            


<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
    
The following <b>Global Pooling types</b> can be distinguished:
<ul>
    <li> <b>Global Average Pooling (GAP)</b>.
        <ul> For each 2d feature map: $$\tilde r_c = \frac{1}{\text{size}(\text{MAP}_c)}\sum_{ij\in\text{MAP}_c}MAP_c[i,j],$$
            where $c$ is the channel index of the feature map ($MAP_c$).<br>
            This technique is one of the most popular.
        </ul> 
    <li> <b>Global Max-Pooling</b>.    
        <ul> For each 2d feature map: $$\tilde r_c = \max(MAP_c),$$           
            less popular, because during back-propagation the gradient reaches only the single maximal element of each map.
        </ul>
    <li> <b>Global Covariance Pooling</b>.    
        <ul> For the whole set of feature maps: $$R = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T,$$
            where $X_c=[x_{c1},...,x_{cn}]=\text{vec}(\text{MAP}_c)$, $n$ is the size of $\text{vec}(\text{MAP}_c)$, $x_i=[x_{1i},...,x_{Ci}]^T$ collects the values of all $C$ channels at position $i$, and $\mu=\frac{1}{n}\sum_{i}x_i$.
            <ul>In more complex cases the normalized matrix $$\tilde R=U\Lambda^\alpha V $$ is used, 
                where $U\Lambda V$ is the <b>SVD</b> of $R$ and $\Lambda$ is the matrix of singular values; as a rule $\alpha=0.5$.<br>
            It is also possible to use the transformation $\Lambda \to \log(\Lambda)$, <br>
            and to truncate $\Lambda$ as $\textit{sort}[\Lambda][:k]$ in order to discard near-zero values.
            </ul>
        </ul>   
   
   <li> <b>Spatial Pyramid Pooling (SPP)</b>.    
        <ul> For each 2d feature map: $$\tilde r_c = vec\{\tilde r_1,...,\tilde r_n\},$$            
            where $n$ is the required number of outputs and $\tilde r_i$ is the max-pooling output over the $i$-th region of size $\text{size}(MAP_c)/n$, i.e. 
            <ul>
                <li> divide ${MAP_c}$ into $n$ regions,
                <li> obtain $\tilde r_i$ as the max-pooling output over region $i$,
                <li> concatenate all pooled values into the output vector.    
           </ul>
          A worthy replacement for global max-pooling and flatten layers.
        </ul>        
</ul>     

<em> Note: </em> In fact, each conventional (local) pooling type can be extended to a global pooling layer or to Spatial Pyramid Pooling.
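
The Spatial Pyramid Pooling steps listed above can be sketched as follows. Several assumptions are made here: feature maps are stored as a `(C, H, W)` NumPy array, each pyramid level divides a map into an `n x n` grid of regions (so a level yields `n*n` values per map), the results of several levels are concatenated, and every map side is at least as large as the largest `n`. The function names are hypothetical.

```python
import numpy as np


def spp_level(fmap, n):
    """Max-pool one 2-D map over an n x n grid of regions -> n*n values."""
    h, w = fmap.shape
    # region borders; with h >= n and w >= n no region is empty
    rows = np.linspace(0, h, n + 1).astype(int)
    cols = np.linspace(0, w, n + 1).astype(int)
    return np.array([fmap[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].max()
                     for i in range(n) for j in range(n)])


def spatial_pyramid_pool(maps, levels=(1, 2, 4)):
    """Concatenate the pooled values of all maps and all pyramid levels."""
    return np.concatenate([spp_level(m, n) for m in maps for n in levels])


a = np.random.default_rng(0).random((3, 7, 9))
b = np.random.default_rng(1).random((3, 13, 5))
print(spatial_pyramid_pool(a).shape, spatial_pyramid_pool(b).shape)  # (63,) (63,)
```

The output length depends only on the number of channels and on the chosen levels, here 3 * (1 + 4 + 16) = 63, and not on the spatial size of the input.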

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>


![image.png](attachment:image.png)

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>

Global Covariance Pooling 

![image.png](attachment:image.png)
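
A minimal NumPy sketch of global covariance pooling with the power normalization $\Lambda^\alpha$ is given below. It assumes feature maps of shape `(C, H, W)`, treats every spatial position as one sample of a `C`-dimensional vector, and uses the hypothetical function name `global_cov_pool`; the logarithmic and truncated variants are not shown.

```python
import numpy as np


def global_cov_pool(maps, alpha=0.5):
    """Return the power-normalized (C, C) covariance matrix of the channels."""
    c = maps.shape[0]
    x = maps.reshape(c, -1)                  # (C, n): each row is vec(MAP_c)
    x = x - x.mean(axis=1, keepdims=True)    # centre every channel
    cov = x @ x.T / x.shape[1]               # (C, C) covariance matrix
    u, lam, vt = np.linalg.svd(cov)          # cov = u @ diag(lam) @ vt
    return (u * lam ** alpha) @ vt           # u @ diag(lam**alpha) @ vt


maps = np.random.default_rng(0).random((4, 6, 6))
pooled = global_cov_pool(maps)
print(pooled.shape)  # (4, 4)
```

With $\alpha=0.5$ the result is the matrix square root of the covariance matrix, so multiplying the output by itself recovers the covariance. The output size is `C x C` regardless of the spatial size of the maps.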

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>

Spatial Pyramid Pooling (SPP)

![image-4.png](attachment:image-4.png)

<!-- https://paperswithcode.com/method/spatial-pyramid-pooling#:~:text=Spatial%20Pyramid%20Pooling%20(SPP)%20is,a%20fixed%2Dsize%20input%20image.&text=last%20convolutional%20layer.-,The%20SPP%20layer%20pools%20the%20features%20and%20generates%20fixed%2Dlength,layers%20(or%20other%20classifiers) -->

### Upsampling Layer

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
    
Besides pooling (downsampling), in some cases the inverse operation, <b>upsampling</b>, is needed. <br>
There are many upsampling techniques. <br>
In essence, upsampling is the task of mapping one input value $\tilde r$ onto the region $R$<br> (or onto its elements $r_{ij}$, where $r_{ij}\in R$). 
    
   

![image-2.png](attachment:image-2.png)

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
The three main approaches to upsampling are:<ul>
    <li> <b>Transposed Convolution </b>;
    <li> <b>High-resolution Convolution</b>;
    <li> <b>Classical up-sampling (pooling up-sampling, Interpolation up-sampling).   </b> 

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
<b> Classical up-sampling (pooling upsampling)</b>.    <ul>  
<li><b>Nearest Neighbors</b> interpolation - fill the whole upsampled region with the input value.
  <ul>  
    $$r_{ij} = \tilde{r} \text{ for }ij\in R, $$
    where:<ul>
    <li>$r_{ij}$ are the output elements of the upsampled region $R$;
    <li> $\tilde{r}$ is the input element, $\tilde{r} \to r_{ij}\in R$.        
    </ul>
  </ul>  
<li><b>"Bed Of Nails"</b> - fill only the first element of the output region with the input value and set the rest to zero. 
    <ul>  
    $$r_{ij} = \begin{cases}\tilde{r}, & \text{if } i = 0 \text{ and } j=0 \text{ (within } R\text{)}; \\ 0, & \text{otherwise.}\end{cases} $$
    </ul>
<li><b>Bilinear Interpolation</b> - interpolate values using the bilinear equation for the 4 nearest points. 
    <li><b>Bicubic Interpolation</b> - interpolate values using the bicubic equation for the 16 nearest points. 
    <li><b>Low-pass filtration Interpolation</b> (<b>Lanczos interpolation</b>) - interpolate using "Bed Of Nails" followed by low-pass filtering.         
<!-- <li>Generalized Bicubic Interpolation - interpolate values using bicubick equation for 16 nearest points.  -->
    
<!-- https://medium.com/jun-devpblog/dl-12-unsampling-unpooling-and-transpose-convolution-831dc53687ce -->
<!--     https://towardsdatascience.com/transposed-convolution-demystified-84ca81b4baba -->
    

![image-2.png](attachment:image-2.png)

<!-- https://www.jeremyjordan.me/semantic-segmentation/ -->
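
The two simplest classical schemes, nearest neighbors and "Bed Of Nails", can be sketched in a few lines of NumPy. The function names are hypothetical, and an integer upsampling factor `k` and a single 2-D map are assumed.

```python
import numpy as np


def upsample_nearest(x, k=2):
    """Copy every input value into its whole k x k output region."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)


def upsample_bed_of_nails(x, k=2):
    """Put every input value in the first cell of its region, zeros elsewhere."""
    out = np.zeros((x.shape[0] * k, x.shape[1] * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out


x = np.array([[1, 2],
              [3, 4]])
print(upsample_nearest(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
print(upsample_bed_of_nails(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```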

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
<b>Bilinear interpolation</b> is one of the most popular upsampling methods.

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
To understand bilinear interpolation, let us first consider the linear case.  

![image.png](attachment:image.png)
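
In the linear (1-D) case the new value is a weighted average of the two known neighbors, where the weights are the relative distances to the opposite neighbor. A minimal sketch with a hypothetical function name `lerp`, assuming $x_1 \ne x_2$:

```python
def lerp(x, x1, x2, f1, f2):
    """Linearly interpolate between (x1, f1) and (x2, f2) at position x."""
    w1 = (x2 - x) / (x2 - x1)   # weight of f1: large when x is close to x1
    w2 = (x - x1) / (x2 - x1)   # weight of f2: large when x is close to x2
    return w1 * f1 + w2 * f2


print(lerp(1.5, 1.0, 2.0, 10.0, 20.0))  # 15.0
```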



<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>

The bilinear case is the 2-dimensional interpolation:<ul>
    The idea is to obtain the new value as a linear interpolation of the <br>
    values at the known points. 
<li> Represent the region as a 2d plot.
<li> Use linear interpolation along the horizontal axis on the edges.  
<li> Use linear interpolation along the vertical axis on the edges. 
<li> Use linear interpolation for the intermediate values, based on the values obtained in the previous steps.     

![image-5.png](attachment:image-5.png)
    



![image.png](attachment:image.png)

<font size=4.5pt  face = 'georgia' style='Line-height :2'><div style='text-align: justify; padding:35px'>
Bilinear interpolation can also be written analytically:
$$
\begin{aligned}
    &\text{for the new point } P=(x,y) \text{ and} \\
    &\text{the points with known (old) values: } \\ \\
        &\begin{matrix}
            Q_{11}=(x_1,y_1) & Q_{12}=(x_{1},y_2) \\
            Q_{21}=(x_2,y_1) & Q_{22}=(x_{2},y_2)
        \end{matrix}\\ \\
     &\text{calculate the new value } f(P): \\ \\    
    &f(R_1) = \frac{x_2-x}{x_2-x_1}f(Q_{11})+\frac{x-x_1}{x_2-x_1}f(Q_{21})\\ \\
    &f(R_2) = \frac{x_2-x}{x_2-x_1}f(Q_{12})+\frac{x-x_1}{x_2-x_1}f(Q_{22})\\ \\
    &f(P) = \frac{y_2-y}{y_2-y_1}f(R_{1})+\frac{y-y_1}{y_2-y_1}f(R_{2})\\ \\
\end{aligned}
$$
   
where $R_1=(x,y_1)$ and $R_2=(x,y_2)$ are auxiliary points.
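
A minimal NumPy sketch of bilinear interpolation follows. It assumes the corner convention $Q_{11}=(x_1,y_1)$, $Q_{21}=(x_2,y_1)$, $Q_{12}=(x_1,y_2)$, $Q_{22}=(x_2,y_2)$: two interpolations along $x$ give the auxiliary values, and one interpolation along $y$ gives the result. The function names are hypothetical, and `upsample_bilinear` additionally assumes that the corner pixels of the input and output grids coincide and that both grids have at least 2 pixels per side; other libraries may use a different alignment convention.

```python
import numpy as np


def bilinear(x, y, x1, x2, y1, y2, f11, f21, f12, f22):
    """Interpolate at (x, y) from the four corner values f(Q11)..f(Q22)."""
    fr1 = (x2 - x) / (x2 - x1) * f11 + (x - x1) / (x2 - x1) * f21  # along x at y1
    fr2 = (x2 - x) / (x2 - x1) * f12 + (x - x1) / (x2 - x1) * f22  # along x at y2
    return (y2 - y) / (y2 - y1) * fr1 + (y - y1) / (y2 - y1) * fr2  # along y


def upsample_bilinear(img, out_h, out_w):
    """Resize a 2-D map; the corner pixels of input and output coincide."""
    h, w = img.shape
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y = i * (h - 1) / (out_h - 1)      # output pixel in input coordinates
            x = j * (w - 1) / (out_w - 1)
            y1 = min(int(y), h - 2)            # top-left corner of the cell
            x1 = min(int(x), w - 2)
            out[i, j] = bilinear(x, y, x1, x1 + 1, y1, y1 + 1,
                                 img[y1, x1], img[y1, x1 + 1],
                                 img[y1 + 1, x1], img[y1 + 1, x1 + 1])
    return out


img = np.array([[0., 1.],
                [2., 3.]])
print(upsample_bilinear(img, 3, 3))
# [[0.  0.5 1. ]
#  [1.  1.5 2. ]
#  [2.  2.5 3. ]]
```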
    

![image-5.png](attachment:image-5.png)

![image.png](attachment:image.png)
<!-- https://github.com/YasinEnigma/Image_Interpolation -->


![image.png](attachment:image.png)
<!-- https://github.com/pytorch/pytorch/issues/25039 -->