<a id="motivation"></a>
# Motivation: Problems of computer vision 

Computer vision has been a central field of AI research since the earliest perceptrons.
One reason is that the perceptual "circuitry" of human vision has received considerable attention in many fields (such as cognitive science) and is directly relevant to the "biological inspiration" of AI methods.
It is also a challenging field: model performance stayed well below the human baseline until quite recently (around 2014-15), even though vision is a common ability of many animals.
Finally, the field is interesting because we have a strong understanding of the physical processes that form its basis, and so would "expect it to be simple". It is not.


## Invariances in the field of vision

"Pictures come from objects, not random pixels"

<img src="https://cms-assets.tutsplus.com/uploads/users/108/posts/19997/image/color-fundamentals-value-1.png" width=400 heigth=400>

The basic mechanisms of light reflection and absorption are responsible for the "pixel distribution" detected in a picture, so we can assume that the empirically observed distribution comes from a not directly observed, but very well defined, "generative distribution" of physical objects. While "practicing" vision, we reconstruct the domain of objects from the visible pixel data. We would expect the same of AI systems.

<img src="http://www.mstworkbooks.co.za/natural-sciences/gr8/images/gr8ec04-gd-0052.png" width=400 heigth=400>

Colour vision, and especially how the abstract categories of "colour" arise from raw perception, is non-trivial. (We perceive the same object as "red" even though the lighting conditions strongly influence the perception, and we categorise different things as pink or red according to culture, or even see things differently [based on our culture](https://en.wikipedia.org/wiki/Linguistic_relativity_and_the_color_naming_debate). More [here](https://books.google.hu/books?id=O9SIAgAAQBAJ&dq=Color+vision:+A+case+study+in+the+Foundations+of+Cognitive+Science).)

It follows that there are many "invariances" in the visual field, such as rotation, translation and lighting, which a model should handle well.

Another, "adjacent" field of computer science concerns itself with the realistic generation of pictures (or sequences, as animations) from a known set of objects. This is how computer graphics and special effects are made, using the inverse procedure, "ray tracing".
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/83/Ray_trace_diagram.svg/600px-Ray_trace_diagram.svg.png" width=400 heigth=400>

It is important to note that the development of Graphics Processing Units was a direct result of the need for better computer graphics, so it is no coincidence that the first applications of GPUs to deep learning came from the field of image recognition. The hardware was already somewhat specialized for the task.


## Compositionality

"Objects are made of parts, not just blobs"

<img src="https://kevinkaixu.net/projects/scores/teaser.jpg" width=600 heigth=600>

It is also crucially important that "objects" (though sometimes only vaguely defined) are the result of a compositional structure: they are often composed of "parts", which are themselves objects. A good model therefore has to handle a hierarchical composition of objects.

(Recent advances in machine learning have come "full circle", and deep learning methods now aid the composition of computer graphics, as in: ["SCORES: Shape Composition with Recursive Substructure Priors"](https://kevinkaixu.net/projects/scores.html).)

Further complicating the problem, objects in the visual field can be described by edges, shapes and textures, so their description is non-trivial.

<img src="http://drive.google.com/uc?export=view&id=1zgVDLXoJCPTIS1JAwbwlGR5MnxDsvNBt"  width=600 heigth=600>


## Why did traditional methods suffer?

"Could you please build up a list of cat parts with visual specification, please?"

Image input is notoriously high-dimensional, yet, annoyingly, we know that it is very compressible data; we just don't know how to compress it.

<img src="http://drive.google.com/uc?export=view&id=1VJZaGkHUVAZXpS-vRMRgEmzIQPia5JWV"  width=600 heigth=600>


I can describe a scene as "there are two cardinal red spheres of 1cm diameter in an empty white space", and this describes the scene to a very good approximation, though it can have literally a myriad of instantiations under different viewing angles, lighting conditions,...

<img src="https://cdn8.bigcommerce.com/s-e2p82/images/stencil/500x659/products/2992/7729/Q010145__80365.1458131489.jpg?c=2" width=400 height=400>

As stated before: compressibility, "understandability" and "discernibility" are deeply connected.


## Crucial importance of end-to-end learning

In the visual field it is all the more obvious that historical progress did not come from better or stronger classifiers (an SVM will do fine); rather, the elaboration of features, that is, the representation of the input, is essential. Years upon years of human engineering went into feature extractors (see some at [wikipedia](https://en.wikipedia.org/wiki/Feature_detection_(computer_vision)#Feature_extraction), though there are many, many more), and the big breakthrough for deep learning in 2012 came from the fact that a neural net with no engineered features, working on raw pixels (though with a clever architecture), significantly beat the state of the art.

<img src="http://drive.google.com/uc?export=view&id=1v8yieDdObp74czCbGMg1J50xPYL20XhZ"  width=600 heigth=600>

[source](https://www.youtube.com/watch?v=o8otywnWwKc&t=301s)

<a id="architecture"></a>
# Convolutional Neural Network (CNN) architecture

The breakthrough solution for image recognition was the introduction of the **Convolutional Neural Network** architecture: 
- De-facto standard now
- Numerous applications outside of image recognition

## Some sung and unsung heroes 

### Fukushima

Worked on the **neocognitron model** in the 1980s, which predates CNNs.
- Hierarchical, multilayered artificial neural network. 
- Used for handwritten character recognition and other pattern recognition tasks.
- Served as inspiration for convolutional neural networks.

<img src="http://personalpage.flsi.or.jp/fukushima/files/FukushimaImg.jpg" width=300 heigth=300>



Inspired by the model proposed by Hubel & Wiesel in 1959. 
- Found two types of cells in the primary visual cortex: simple cells (S-cells) and complex cells (C-cells)
- Simple cell: Responds primarily to oriented edges and gratings (bars of particular orientations) in small regions of the visual field
- Complex cell: Takes input from simple cells. Responds (like a simple cell) to oriented edges and gratings; however, it has a degree of spatial invariance. This means that its receptive field cannot be mapped into fixed excitatory and inhibitory zones. Rather, it will respond to patterns of light in a certain orientation within a large receptive field, regardless of the exact location. Some complex cells respond optimally only to movement in a certain direction.
- Cascading model of these two types of cells for use in pattern recognition tasks: one is in a sense a local edge detector and the other a more global one.




In [1]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Cw5PKV9Rj3o" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')




The neocognitron is a natural extension of these cascading models. 
- Consists of multiple types of cells, most importantly S-cells and C-cells. 
- Local features are extracted by S-cells (convolutional layer), and deformations of these features, such as local shifts, are tolerated by C-cells (downsampling layer). 
- Local features in the input are integrated gradually and classified in the higher layers ([Wikipedia](https://en.wikipedia.org/wiki/Neocognitron))

<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/581528b2215e017eba96ef4ee16d33a74645755f/3-Figure1-1.png" width=400 heigth=400>

Many of the features of later CNNs are already present in Fukushima's work, but his ideas were based on the learning rules of the classical era, predating backprop.



### Yann LeCun

It is common to attribute the elaboration of Convolutional Neural Nets - as well as many advances to this day - to [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun), a former postdoc in Hinton's group, who worked extensively on image recognition tasks, especially on handwritten digit recognition for bank cheques. He is also credited with creating our "workhorse", the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.

<img src="https://qph.fs.quoracdn.net/main-thumb-48611829-200-wjnraisajwlkqlmolpgmqnkfnxvuezwr.jpeg" width=400 heigth=400>

To this day he is one of the most influential figures of the deep learning revolution; his talks and publications are well worth following.

## Basic building blocks of CNNs

Basic building blocks of a CNN:

- Convolutional layers - edge detectors; analogous to simple cells
- Subsampling layers ("Pool" or "Pooling") - going beyond exact location/ reducing information (analogous to complex cells)
- Fully connected layers - combining the information

A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.

<img src="https://cdnpythonmachinelearning.azureedge.net/wp-content/uploads/2017/09/lenet-5-825x285.png?x31195" width=600 heigth=600>

A good summary introduction of CNN elements can be found [here](https://www.wikiwand.com/en/Convolutional_neural_network) and [here](https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/).

### Basic ideas

Remember the idea of *"Embed and cut"* from representation learning? Convolutional and pooling layers essentially build up a pipeline that captures a hierarchy of more and more abstract features of the image, until the "embedding space" provides enough information for the final classifier - the fully connected layers. 

<img src="https://www.researchgate.net/profile/Siva_Chaitanya_Mynepalli/publication/281607765/figure/fig1/AS:284643598323714@1444875730488/Learning-hierarchy-of-visual-features-in-CNN-architecture.png" width=600 heigth=600>

[source](https://www.researchgate.net/publication/281607765_Hierarchical_Deep_Learning_Architecture_For_10K_Objects_Classification)

<img src="https://devblogs.nvidia.com/wp-content/uploads/2015/11/hierarchical_features.png" width=600 heigth=600>

We could not have defined these features so exactly by hand!

#### Hierarchy of "detectors" or "filters"

Alternatively, think of the architecture as a hierarchy of "filters" that "scan" through the input field and give the most "signal" when the input part is overlaid with the specific pattern they are looking for.

<img src="http://mathworld.wolfram.com/images/gifs/convgaus.gif" height="300">


#### Engineered prior for invariances

Usage of filters and convolution (see below):
- Realize a kind of invariance, since **the filters detect the appropriate pattern irrespective of its location in the picture**. 
- Combined with the successive abstraction of the hierarchy - can effectively solve problems of location invariance.
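
A minimal sketch of this idea, in plain Python and with made-up numbers: sliding one small filter over a 1D signal gives the same peak response wherever the pattern sits, and only the position of the peak moves. The helper name `correlate_valid` and the toy signals are assumptions for illustration, not part of any library.

```python
def correlate_valid(signal, kernel):
    """Slide `kernel` over `signal` and return the dot product at every
    position where the kernel fits entirely inside the signal."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical "pattern detector": responds most strongly to the pattern 1, 2, 1.
kernel = [1, 2, 1]

# The same pattern placed at two different locations in an otherwise empty signal.
pattern_left = [0, 1, 2, 1, 0, 0, 0, 0]
pattern_right = [0, 0, 0, 0, 1, 2, 1, 0]

response_left = correlate_valid(pattern_left, kernel)
response_right = correlate_valid(pattern_right, kernel)

# The strongest response has the same value in both cases; only its position differs.
peak_left, peak_right = max(response_left), max(response_right)
pos_left = response_left.index(peak_left)
pos_right = response_right.index(peak_right)

print(peak_left, pos_left)    # 6 1
print(peak_right, pos_right)  # 6 4
```

Taking the maximum over positions throws the location away and keeps only the fact that the pattern was found, which is the step that turns this "shifted response" into an actual invariance.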

#### Weight sharing

Applying "subsampling" layers together with weight sharing in the convolutions allows a radical decrease in the number of weights used by a model, making training time feasible for _huge_ neural networks and making it possible to work on _high resolution_ (high dimensionality) input data. 

(By the way, weight sharing helps to **decrease the number of parameters**, thus can be considered as an implicit "regularization".)



### Convolution 

#### Mathematical definition
General definition:


$$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty}x(\tau)h(t - \tau)d\tau$$

- x and h are both functions which operate on the same domain of inputs
- x(t) is the input, h(t) is the impulse response
- In the context of probability, you might frame these as density functions; in audio processing, they might be waveforms
- When you convolve two functions, the convolution is itself a function, insofar as we can calculate its value at a particular point t

- To calculate the convolution of x and h at point t, we integrate over all values of $\tau$ between negative and positive infinity and, at each $\tau$, multiply the value of x at $\tau$ by the value of h at $t - \tau$, that is, at the difference between the point for which the convolution is being calculated and the current $\tau$

- We have $h(t - \tau)$, as one function is reversed and "moves through the other"




<img src="http://mathworld.wolfram.com/images/gifs/convgaus.gif" height="300">







### Convolution, cross correlation and auto-correlation
Some features of convolution are similar to cross-correlation: for real-valued functions of a continuous or discrete variable, it differs from cross-correlation only in that either $f(x)$ or $g(x)$ is reflected about the y-axis; thus it is a cross-correlation of $f(x)$ and $g(-x)$, or of $f(-x)$ and $g(x)$.


__Convolution:__

Continuous function:
$$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty}x(\tau)h(t - \tau)d\tau$$

Discrete function (1d):
$${\displaystyle (f*g)[n]=\sum _{m=-\infty }^{\infty }f[m]g[n-m]}$$




__Cross-correlation:__

Continuous function:
$${\displaystyle (f\star g)(\tau )\ \triangleq \int _{-\infty }^{\infty }{\overline {f(t)}}g(t+\tau )\,dt}$$	


where ${\displaystyle {\overline {f(t)}}}$ denotes the complex conjugate of $f(t)$, and $\tau$ is the displacement, also known as lag (a feature in $f$ at $t$ occurs in $g$ at $t+\tau$).

Discrete function (1d):

$${\displaystyle (f\star g)[n]\ \triangleq \sum _{m=-\infty }^{\infty }{\overline {f[m]}}g[m+n]}$$	


<img src="https://upload.wikimedia.org/wikipedia/commons/2/21/Comparison_convolution_correlation.svg" height="300">


Visual comparison of convolution, cross-correlation, and autocorrelation. For the operations involving function f, and assuming the height of f is 1.0, the value of the result at 5 different points is indicated by the shaded area below each point. 


If $X$ and $Y$ are two independent random variables with probability density functions $f$ and $g$, respectively, then the probability density of the difference $Y-X$ is formally given by the cross-correlation (in the signal-processing sense) $f\star g$; however, this terminology is not used in probability and statistics. In contrast, the convolution $f*g$ (equivalent to the cross-correlation of $f(t)$ and $g(-t)$) gives the probability density function of the sum $X+Y$.
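
The discrete analogue of this statement can be checked with two fair dice. The sketch below is written in plain Python under the assumption that probability mass functions stand in for the densities above; the function names `convolve` and `cross_correlate` are chosen here for illustration. It also confirms that cross-correlating $f$ with $g$ equals convolving the reversed $f$ with $g$.

```python
from fractions import Fraction


def convolve(f, g):
    """Discrete convolution of two finite sequences (zero outside their range):
    (f * g)[n] = sum_m f[m] * g[n - m]."""
    out = [0] * (len(f) + len(g) - 1)
    for m, fm in enumerate(f):
        for k, gk in enumerate(g):
            out[m + k] += fm * gk
    return out


def cross_correlate(f, g):
    """Discrete cross-correlation of two finite real sequences,
    (f star g)[n] = sum_m f[m] * g[m + n],
    returned for the lags n = -(len(f) - 1), ..., len(g) - 1."""
    return [
        sum(f[m] * g[m + n] for m in range(len(f)) if 0 <= m + n < len(g))
        for n in range(-(len(f) - 1), len(g))
    ]


# PMF of a fair six-sided die; index i stands for the face value i + 1.
die = [Fraction(1, 6)] * 6

# Index i of the result stands for the sum X + Y = i + 2 (sums 2..12).
pmf_sum = convolve(die, die)
# Index i of the result stands for the difference Y - X = i - 5 (differences -5..5).
pmf_diff = cross_correlate(die, die)

print(pmf_sum[7 - 2])   # P(X + Y = 7) -> 1/6
print(pmf_diff[0 + 5])  # P(Y - X = 0) -> 1/6

# Cross-correlation is convolution with one argument reversed.
f, g = [1, 2, 3], [0, 1, 2]
print(cross_correlate(f, g) == convolve(f[::-1], g))  # True
```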

Note that in statistics cross-correlation is always normalized.

#### 2D example
Discrete 2D convolution:



$$o[m,n]=f[m,n]*g[m,n]=\sum_{u=-\infty}^{\infty}\sum_{v=-\infty}^{\infty}f[u,v]g[m-u,n-v]$$


Generally the idea of convolution is as follows:

<img src="https://qph.fs.quoracdn.net/main-qimg-578748437404fe6733bc7823755e813c-c" width=600 heigth=600>




**Or more in detail:**

This is the pixel intensity representation of the input image:

<img src="https://ujwlkarn.files.wordpress.com/2016/07/screen-shot-2016-07-24-at-11-25-13-pm.png?w=127&h=115" height="200">

This is the weight matrix of the neuron, that is the "filter":
<img src="https://ujwlkarn.files.wordpress.com/2016/07/screen-shot-2016-07-24-at-11-25-24-pm.png?w=74&h=64" height="100">

Convolution is the "shifting" of the "filters" throughout the input: 

<img src="https://ujwlkarn.files.wordpress.com/2016/07/convolution_schematic.gif?w=268&h=196" height="300">

[source](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)
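
The sliding procedure can be sketched in a few lines of plain Python. The 5x5 image and 3x3 filter below are illustrative values chosen for this sketch (they are not read off the figures), and `correlate2d_valid` is a hypothetical helper name. Note that, like most CNN libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation; for this symmetric filter the result is the same as a true convolution.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the
    sum of element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                kernel[u][v] * image[r + u][c + v]
                for u in range(kh)
                for v in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Illustrative binary "image" (pixel intensities) and filter (weights).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = correlate2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```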

The convolution of filters on an original image:

<img src="https://ujwlkarn.files.wordpress.com/2016/08/giphy.gif?w=748" height="300">




#### Another intuition for discrete convolutions (and cross-correlation): dot product

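Each output value of a discrete convolution (or cross-correlation) is a dot product between the filter (flipped, in the case of convolution) and one patch of the input, so the dot product gives a useful intuition for what a filter computes. Algebraically, the dot product is the sum of the products of the corresponding entries of the two sequences of numbers. Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them. These definitions are equivalent when using Cartesian coordinates.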

##### Algebraically

The dot product of two vectors $\textbf{a}=[a_1, a_2, \ldots, a_n]$ and $\textbf{b}=[b_1, b_2, \ldots, b_n]$ is defined as:

$$\textbf{a}\cdot\textbf{b}=\sum_{i=1}^{n} a_i b_i=a_1 b_1+a_2 b_2+\cdots+a_n b_n$$ 

where $\sum$ denotes summation and $n$ is the dimension of the vector space. For instance, in three-dimensional space the dot product of vectors $[1, 3, -5]$ and $[4, -2, -1]$ is:

$$[1, 3, -5]\cdot[4, -2, -1]=(1\cdot 4)+(3\cdot -2)+(-5\cdot -1)=4-6+5=3$$

##### Geometrically

$$a\cdot b=a_xb_x + a_yb_y = |a||b|\cos{\theta}$$

Note that if a and b are two sides of a triangle and θ is the angle between them, the third side is b−a and, by the cosine rule,

$$|b-a|^2=|a|^2+|b|^2-2|a||b|\cos \theta$$

so that

$$2|a||b|\cos\theta=\Sigma a_i^2+\Sigma b_i^2-\Sigma (b_i-a_i)^2=2\Sigma a_ib_i$$

so that

$$|a||b|\cos\theta=\Sigma a_ib_i$$
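
As a quick numerical check of the derivation, the sketch below computes the dot product of $[1, 3, -5]$ and $[4, -2, -1]$ algebraically and then again through the cosine rule, using the length of the third side $b-a$. It uses only the Python standard library; the helper names are chosen here for illustration.

```python
import math


def dot(a, b):
    """Algebraic definition: sum of the products of corresponding entries."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


a = [1, 3, -5]
b = [4, -2, -1]

algebraic = dot(a, b)

# Cosine rule: |b - a|^2 = |a|^2 + |b|^2 - 2 |a| |b| cos(theta)
third_side = [y - x for x, y in zip(a, b)]
cos_theta = (norm(a) ** 2 + norm(b) ** 2 - norm(third_side) ** 2) / (
    2 * norm(a) * norm(b)
)
geometric = norm(a) * norm(b) * cos_theta

print(algebraic)            # 3
print(round(geometric, 9))  # 3.0
```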





<img src="https://betterexplained.com/wp-content/uploads/dotproduct/dot_product_components.png" height="300">


<img src="https://betterexplained.com/wp-content/uploads/dotproduct/dot_product_rotation.png" height="300">





#### More detailed example of 1D convolutional neural network
["Source"](https://colah.github.io/posts/2014-07-Understanding-Convolutions/) 

So, how does convolution relate to convolutional neural networks?

Consider a 1-dimensional convolutional layer with inputs $\{x_n\}$ and outputs $\{y_n\}$:


<img src="https://colah.github.io/posts/2014-07-Understanding-Convolutions/img/Conv-9-Conv2-XY.png" width=400 heigth=400>

As we observed, we can describe the outputs in terms of the inputs:

$$y_n = A(x_{n}, x_{n+1}, ...)$$

Generally, $A$ would be multiple neurons. But suppose it is a single neuron for a moment.

Recall that a typical neuron in a neural network is described by:

$$\sigma(w_0x_0 + w_1x_1 + w_2x_2 ~...~ + b)$$

Where $x_0$, $x_1$,... are the inputs. The weights, $w_0$, $w_1$, ... describe how the neuron connects to its inputs. A negative weight means that an input inhibits the neuron from firing, while a positive weight encourages it to. The weights are the heart of the neuron, controlling its behavior. Saying that multiple neurons are identical is the same thing as saying that the weights are the same.

It’s this wiring of neurons, describing all the weights and which ones are identical, that convolution will handle for us.

Typically, we describe all the neurons in a layer at once, rather than individually. The trick is to have a weight matrix, $W$:

$$y = \sigma(Wx + b)$$

For example, we get:

$$y_0 = \sigma(W_{0,0}x_0 + W_{0,1}x_1 + W_{0,2}x_2 ...)$$

$$y_1 = \sigma(W_{1,0}x_0 + W_{1,1}x_1 + W_{1,2}x_2 ...)$$

Each row of the matrix describes the weights connecting a neuron to its inputs.

Returning to the convolutional layer, though, because there are multiple copies of the same neuron, many weights appear in multiple positions.

<img src="https://colah.github.io/posts/2014-07-Understanding-Convolutions/img/Conv-9-Conv2-XY-W.png" width=400 heigth=400>

Which corresponds to the equations:

$$y_0 = \sigma(w_0x_0 + w_1x_1 + b)$$
$$y_1 = \sigma(w_0x_1 + w_1x_2 + b)$$

So while, normally, a weight matrix connects every input to every neuron with different weights:

$$W = \left[\begin{array}{ccccc} 
W_{0,0} & W_{0,1} & W_{0,2} & W_{0,3} & ...\\
W_{1,0} & W_{1,1} & W_{1,2} & W_{1,3} & ...\\
W_{2,0} & W_{2,1} & W_{2,2} & W_{2,3} & ...\\
W_{3,0} & W_{3,1} & W_{3,2} & W_{3,3} & ...\\
...     &   ...   &   ...   &  ...    & ...\\
\end{array}\right]$$

The matrix for a convolutional layer like the one above looks quite different. The same weights appear in a bunch of positions. And because neurons don’t connect to many possible inputs, there’s lots of zeros.

$$ W = \left[\begin{array}{ccccc} 
w_0 & w_1 &  0  &  0  & ...\\
 0  & w_0 & w_1 &  0  & ...\\
 0  &  0  & w_0 & w_1 & ...\\
 0  &  0  &  0  & w_0 & ...\\
... & ... & ... & ... & ...\\
\end{array}\right]$$


Multiplying by the above matrix is the same thing as convolving with $[...0, w_1, w_0, 0...]$. The function sliding to different positions corresponds to having neurons at those positions.
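
The equivalence can be checked numerically. The following sketch builds a finite version of the banded matrix above and compares the matrix-vector product with sliding the two weights over the input. The bias and the activation $\sigma$ are left out, and the weights, the input and the function names are made-up values for illustration.

```python
def conv_layer_matrix(weights, n_inputs):
    """Weight matrix of a 1D convolutional layer without padding: every row
    holds the same weights, shifted one position to the right."""
    k = len(weights)
    n_outputs = n_inputs - k + 1
    return [
        [weights[col - row] if 0 <= col - row < k else 0 for col in range(n_inputs)]
        for row in range(n_outputs)
    ]


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def slide(weights, x):
    """Apply the same small neuron to every window of the input."""
    k = len(weights)
    return [
        sum(weights[j] * x[i + j] for j in range(k))
        for i in range(len(x) - k + 1)
    ]


w = [2, -1]          # hypothetical w_0, w_1
x = [1, 4, 2, 5, 3]  # hypothetical inputs x_0 .. x_4

W = conv_layer_matrix(w, len(x))
for row in W:
    print(row)
# [2, -1, 0, 0, 0]
# [0, 2, -1, 0, 0]
# [0, 0, 2, -1, 0]
# [0, 0, 0, 2, -1]

print(matvec(W, x))  # [-2, 6, -1, 7]
print(slide(w, x))   # [-2, 6, -1, 7]
```

The matrix has 20 entries but only 2 distinct non-zero values, which is exactly the saving that weight sharing buys.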

What about two-dimensional convolutional layers?

<img src="https://colah.github.io/posts/2014-07-Understanding-Convolutions/img/Conv2-5x5-Conv2-XY.png" width=400 heigth=400>

The wiring of a two dimensional convolutional layer corresponds to a two-dimensional convolution.

Consider our example of using a convolution to detect edges in an image, above, by sliding a kernel around and applying it to every patch. Just like this, a convolutional layer will apply a neuron to every patch of the image.

#### Receptive fields: "size of the convolution" in x*y pixels

#### Stride and padding

"Step size" (in number of pixels of picture, in our case) of filters, the so called _stride_ is a crucial parameter.

- Pixels at the edges of the picture taking part in fewer convolutions: nothing to be convolved on one of their sides. 
- Effect can be mitigated by _"padding"_ input with virtual pixels. 
- Most widespread padding is _zero-padding_: fill up the space with zeros, but other solutions like "wrapping", "clamping" and "mirroring" also exist. (Further reading [here](https://www.cs.toronto.edu/~urtasun/courses/CV/lecture02.pdf))

<img src="https://adeshpande3.github.io/assets/Pad.png" height="300">

With a careful choice of padding and stride we can keep the width of the successive layers of the network constant.

The size of the new representation, the _"activation map"_, is calculated as follows:

Let the picture be of size $W\times W$, the size of the filter (_"receptive field"_) $F\times F$, the stride $S$ and the padding $P$. The resulting activation map then has size $O\times O$, where
$O=(W-F+2P)/S+1$. 


For example, if we convolve a 3x3 filter over a 7x7 picture with stride 1 and 0 padding, we get a $5\times 5$ output. If the stride is 2 pixels, we get a $3\times 3$ output. 
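
The formula is easy to turn into a small helper. The sketch below is one possible version, with a hypothetical function name; it raises an error when the filter does not tile the padded input evenly, which is a choice made for this sketch (many libraries round down instead).

```python
def conv_output_size(w, f, s=1, p=0):
    """Width of the activation map: O = (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError(
            f"filter {f} with stride {s} and padding {p} does not fit width {w}"
        )
    return span // s + 1


print(conv_output_size(7, 3, s=1, p=0))  # 5
print(conv_output_size(7, 3, s=2, p=0))  # 3
print(conv_output_size(7, 3, s=1, p=1))  # 7, the width stays constant
```

The last call shows the "constant width" case: a 3x3 filter with stride 1 and one pixel of padding leaves the size unchanged.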



#### Weight sharing

- Can also look at the output in the above example as $5\times 5$ neurons forming the new representation.
- Major difference from fully connected layers, since in case of CNNs **all neurons of an activation map have the same weights**. 
- Called _weight sharing_.
- Different neurons receive inputs from different parts of the picture. 

#### Spatial arrangement of convolutional units

- Convolution operation can be extended to more dimensions.
- At the end always get a scalar value. 
- Including the color channels (RGB), the picture can be described as a $W\times W \times D$ 3D matrix
- "Depth" equals number of channels (in most cases 3)
- In such a case, the filters are also 3D. 
- If we follow reasoning, we can stack layers of activation maps on top of each-other
- We again get a 3D input at the next layer, like:


<img src="http://deeplearning.net/tutorial/_images/cnn_explained.png" height="300">

#### Connection between hyperparameters

Input size: $W_1 \times H_1 \times D_1$  (width, height, depth - we were using square inputs above, but that is not necessary)

Hyperparameters:
 - Number of filters: K
 - Filter size: F
 - Stride: S
 - Padding: P
 
 Output: $W_2\times H_2 \times D_2$,
  where 
  
  - $W_2 = (W_1-F+2P)/S+1$
  - $H_2 = (H_1-F+2P)/S+1$
  - $D_2 = K$


  Total parameters for a layer: $(F\times F\times D_1 +1)\times K$, where the 1 represents the bias. 

  (In case of a fully connected layer between the same input and output it is $W_1 \times H_1 \times D_1 \times W_2 \times H_2 \times D_2$ weights, which is orders of magnitude larger...)
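
To get a feeling for the difference, the sketch below counts the parameters of one convolutional layer and of a fully connected layer that maps the same input to the same output. The layer sizes (a 32x32 RGB input, 16 filters of size 3x3) and the function names are assumptions picked for illustration; unlike the weight count above, `dense_params` also includes one bias per output unit.

```python
def conv_layer(w1, h1, d1, k, f, s, p):
    """Return the output shape (W2, H2, D2) and the parameter count
    (F * F * D1 + 1) * K of a convolutional layer."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    params = (f * f * d1 + 1) * k
    return (w2, h2, k), params


def dense_params(n_in, n_out):
    """Parameters of a fully connected layer: one weight per input-output
    pair plus one bias per output."""
    return (n_in + 1) * n_out


# Hypothetical layer: 32x32 RGB input, 16 filters of size 3x3, stride 1, padding 1.
shape, conv_params = conv_layer(32, 32, 3, k=16, f=3, s=1, p=1)
w2, h2, d2 = shape
fc_params = dense_params(32 * 32 * 3, w2 * h2 * d2)

print(shape)        # (32, 32, 16)
print(conv_params)  # 448
print(fc_params)    # 50348032
```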
  
  
#### Comparison to fully connected layer

<html>
<head>
<style>
table {
    font-family: arial, sans-serif;
    border-collapse: collapse;
    width: 100%;
}

td, th {
    border: 1px solid #dddddd;
    text-align: left;
    padding: 8px;
}

tr:nth-child(even) {
    background-color: #dddddd;
}
</style>
</head>
<body>

<h2>Connection between sub parts</h2>

<table>
  <tr>
    <th>MLP</th>
    <th>Convolutional network</th>
  </tr>
  <tr>
    <td><img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg" height="300"></td>
    <td><img src="http://cs231n.github.io/assets/cnn/cnn.jpeg" height="300"></td>
  </tr>
</table>

</body>
</html>

Left side: traditional 3 layer network. 
Right side: ConvNet, where we arrange inputs in 3 dimensions:
- The output of a filter for each part of the picture is a single scalar; these values form an activation map for the given filter over the whole image. 
- The maps are stacked on top of each other to get a 3D representation, forming the input for the next layers. 
- In case of pictures, RGB input can be considered 3D by default. [Source](http://cs231n.github.io/convolutional-networks/)


### Pooling

- Follow logic of "detectors" in case of convolutional filters
- Accept premise that there is a compositional hierarchy of object parts, 

--> Most relevant information whether filter detected certain pattern somewhere in the region of the input picture. 

Exact location is not important, only the fact that there is "strong" (confident) signal.

Following this logic additional _pooling layer_ introduced in CNNs:
- Downsampling strategy, reducing the dimensionality of the inputs for a next convolutional layer, 
- Aiding the "invariant behavior" of the network as well as a hierarchic form of compressing.

Different pooling strategies exist (like "average pooling"), but most widespread form is _max pooling_:

- The pooling layer does no learning, has no weights. 
- For each $k\times k$ input region it outputs a single value, the maximum of the activations in that region.
- For an input with dimensions $N\times N$ it will then output an $\frac{N}{k}\times\frac{N}{k}$ layer, as each $k\times k$ region yields only a single scalar because of the application of the $\max$ function.

<img src="https://adeshpande3.github.io/assets/MaxPool.png" width=400 heigth=400>

Some more justification for this procedure is that, within a certain local region, a feature is unlikely to be present "twice" unless the picture is somehow blurry, so we only need a decision about its maximal presence.  Additional resources [here](http://yann.lecun.com/exdb/publis/pdf/boureau-icml-10.pdf).
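
Max pooling with non-overlapping windows can be sketched as follows. The 4x4 activation map holds illustrative values, and `max_pool2d` is a hypothetical helper written in plain Python, not a library function.

```python
def max_pool2d(feature_map, k):
    """Non-overlapping k x k max pooling: keep only the largest activation
    of every k x k block (window size and stride are both k)."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + u][c + v] for u in range(k) for v in range(k))
            for c in range(0, n_cols - k + 1, k)
        ]
        for r in range(0, n_rows - k + 1, k)
    ]


# Illustrative 4x4 activation map.
activation_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(activation_map, 2)
print(pooled)  # [[6, 8], [3, 4]]
```

The function has no weights at all, in line with the remark that the pooling layer does no learning.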

#### Sidenote: some disadvantages of CNN

Recently, researchers at Uber's AI lab started to investigate the failings of ConvNets, especially their inability to solve tasks requiring exact coordinates, for example single-pixel tasks (given a coordinate pair, the network should output a single white pixel at that coordinate). The basic convolutional architectures fail miserably, which implies that we have laid too much emphasis on invariances and, with it, crippled the ability of the networks to use accurate coordinates.

<img src="https://eng.uber.com/wp-content/uploads/2018/07/image15.png" width=300 heigth=300>

The researchers propose a simple extension of connectivity, a new architecture they call ["CoordConv"](https://eng.uber.com/coordconv/), to solve this issue.
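
A rough sketch of the core idea, assuming that the extension consists of giving the convolution two extra input channels that hold the row and the column coordinate of every pixel, rescaled to $[-1, 1]$. The function name and the nested-list image layout (rows, then pixels, then channels) are assumptions of this sketch, not the authors' implementation.

```python
def add_coord_channels(image):
    """Append two channels to every pixel: its row and its column coordinate,
    both rescaled to the range [-1, 1]."""
    height, width = len(image), len(image[0])

    def scale(index, size):
        # Map 0 .. size-1 linearly onto -1 .. 1 (a single row/column maps to 0).
        return 2 * index / (size - 1) - 1 if size > 1 else 0.0

    return [
        [pixel + [scale(r, height), scale(c, width)] for c, pixel in enumerate(row)]
        for r, row in enumerate(image)
    ]


# Hypothetical 3x3 image with a single (all-zero) channel.
image = [[[0.0] for _ in range(3)] for _ in range(3)]
with_coords = add_coord_channels(image)

print(with_coords[0][0])  # [0.0, -1.0, -1.0]  top-left corner
print(with_coords[1][1])  # [0.0, 0.0, 0.0]    centre
print(with_coords[2][2])  # [0.0, 1.0, 1.0]    bottom-right corner
```

An ordinary convolution applied to this enlarged input can then read off where it is in the picture, which a plain convolution cannot do.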

<img src="https://eng.uber.com/wp-content/uploads/2018/07/image8-768x321.jpg" width=400 heigth=400>

#### Visualization of a convolutional neural network
["Click here for a simple visualization of a convolutional neural network"](http://scs.ryerson.ca/~aharley/vis/conv/flat.html) 


<a id="specarch"></a>
# Specific architecture examples

After the initial success of ConvNet a _huge_ amount of development went into refining techniques and architectures that fueled the rapid development of the visual processing field, which surpassed (on certain tasks!!!) human performance.

<img src="https://pbs.twimg.com/media/DDwtxn6U0AADbz0.jpg" width=400 heigth=400>

One obvious source of development was that, with the "tricks" discussed before and the increase in computational power, the number of layers in modern image recognition models also rose rapidly.

<img src="https://www.researchgate.net/profile/Kien_Nguyen26/publication/321896881/figure/fig1/AS:573085821489153@1513645715549/The-evolution-of-the-winning-entries-on-the-ImageNet-Large-Scale-Visual-Recognition.png" width=600 height=600>

[Source](https://www.researchgate.net/figure/The-evolution-of-the-winning-entries-on-the-ImageNet-Large-Scale-Visual-Recognition_fig1_321896881)

But some structural modifications were also necessary for these results to become achievable.

--------------------------------------------
## LeNet

[Original paper](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)

This is the original ConvNet by LeCun et al. with the first implementation of the conv, pool and FC layers used for character recognition.

<img src="http://drive.google.com/uc?export=view&id=1Yw1NECPVDmrO2a3JoPwJVvnsNU8BU1Ul" height="500">


---------------------------------
## AlexNet

Papers:

- [Here](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
- [and here](https://kratzert.github.io/2017/02/24/finetuning-alexnet-with-tensorflow.html)


<img src="https://cdn-images-1.medium.com/max/1600/1*qyc21qM0oxWEuRaj-XJKcw.png" height="300">


Winner of the "ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012" with a 15.3% top-5 test error; on the ILSVRC-2010 test set it reached 37.5% top-1 and 17.0% top-5 error rates. 650,000 neurons, 60 million parameters.

### Main ideas:
- **ReLU in all layers**

<img src="http://drive.google.com/uc?export=view&id=14p7awodIcVsqEj-N2buGty74uz2doXuA" height="500">

- **Parallel processing on two GPUs**

The processing is done in two separate channels. This "innovation" came about because GPU memory was extremely limited back then, and the only way to train a bigger model was to "divide it in half". Nonetheless, a mechanism had to be implemented that ensured communication between the two channels at some layers.

- **Local response normalization**

As a form of "lateral inhibition", we divide the activation of a neuron at a given position by a scalar that depends on the activations of neurons at the same spatial position in neighbouring feature maps. It has some faint connections to batch norm, and resulted in a 1.2-1.4% reduction in error. This is also called "brightness normalization" and has since fallen out of favour.
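
A sketch of this normalization for a single spatial position, following the formula of the AlexNet paper, $b^i = a^i / \left(k + \alpha \sum_j (a^j)^2\right)^{\beta}$, where the sum runs over the $n$ neighbouring feature maps. The default constants below are the values reported in the paper; the function name and the toy activations are assumptions made for this sketch.

```python
def local_response_norm(acts, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize the activations of all feature maps at ONE spatial position.
    Each activation is divided by a term that grows with the squared
    activations of up to `n` neighbouring feature maps (itself included)."""
    half = n // 2
    out = []
    for i, a in enumerate(acts):
        lo = max(0, i - half)
        hi = min(len(acts) - 1, i + half)
        sq_sum = sum(acts[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * sq_sum) ** beta)
    return out


# The same activation (1.0) with a quiet and with a very active neighbour.
alone = local_response_norm([1.0, 0.0, 0.0])[0]
crowded = local_response_norm([1.0, 100.0, 0.0])[0]

# The strongly active neighbour "inhibits" the first unit.
print(crowded < alone)  # True
```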

- **Overlapping pooling**

The stride is smaller than the pooling window, so neighbouring windows overlap. This caused a 0.3-0.4% increase in accuracy.

- **Techniques against overfitting**

    - Data augmentation
    - Dropout (only in the fully connected layers; it nonetheless roughly doubled the number of training iterations needed for convergence)

-----------------------------
## VGG (Simonyan and Zisserman, Visual Geometry Group, University of Oxford)

Sources:
- https://arxiv.org/abs/1409.1556
- http://www.robots.ox.ac.uk/~vgg/research/very_deep/
- https://github.com/machrisaa/tensorflow-vgg
- https://github.com/tensorflow/models/blob/master/research/slim/nets/vgg.py
- https://gist.github.com/omoindrot/dedc857cdc0e680dfb1be99762990c9c

<img src="http://www.hirokatsukataoka.net/research/cnnfeatureevaluation/cnnarchitecture.jpg" width=600 heigth=600>

Second place in the 2014 ImageNet classification challenge. Its deepest configuration (VGG-19) had 16 convolutional and 3 FC layers, with 3x3 convolutional and 2x2 pooling filters. The model's appeal was its simplicity, though the FC layers at the end made it quite demanding in computational terms. Reducing the FC layers was later found to mitigate this problem without meaningful loss of performance.
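
To see how heavy the FC layers are in terms of parameters, the sketch below counts them. It assumes the layer sizes given in the VGG paper (a final $7\times 7\times 512$ feature map followed by FC layers with 4096, 4096 and 1000 units); these numbers are not stated in this text, and the helper names are chosen for illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(f, d_in, k):
    """Weights plus biases of a convolutional layer with k filters of size f x f."""
    return (f * f * d_in + 1) * k


# Assumed sizes: flattened 7x7x512 feature map -> 4096 -> 4096 -> 1000 classes.
fc_sizes = [7 * 7 * 512, 4096, 4096, 1000]
fc_total = sum(dense_params(a, b) for a, b in zip(fc_sizes, fc_sizes[1:]))

# For comparison: one of the largest convolutional layers (3x3, 512 -> 512 channels).
largest_conv = conv_params(3, 512, 512)

print(fc_total)      # 123642856
print(largest_conv)  # 2359808
```

Under these assumptions the three FC layers alone hold well over a hundred million parameters, while even a wide 3x3 convolutional layer stays in the low millions.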

Main ideas:

- more convolutional layers
- only small, but overlapping filters (3x3, stride=1)
- no local response normalization
- Sampling from multi-scale pictures
- usage of 1x1 convolutional layers
- ensemble of multiple models (2 made a large improvement, no need for large ensembles)
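
Why stack small filters instead of using one large one? The arithmetic below is a sketch (helper names are illustrative, biases are ignored, and `C = 256` is an arbitrary channel count): three stacked 3x3 convolutions with stride 1 see the same 7x7 region as a single 7x7 convolution, but need fewer weights and contain more non-linearities.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels):
    """Weights of one kernel x kernel conv with `channels` in and out."""
    return kernel * kernel * channels * channels

C = 256  # assumed number of channels, for illustration only
three_3x3 = 3 * conv_weights(3, C)   # 27 * C^2
one_7x7 = conv_weights(7, C)         # 49 * C^2

print(stacked_receptive_field(2), stacked_receptive_field(3))
print(three_3x3, one_7x7)
```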

VGG is still a widely used architecture - especially in transfer learning.

---------------------------------------------
## GoogLeNet-Inception models

Winner of ILSVRC 2014 in both detection and classification; the model of the Google team.

**Sources:**
- https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
- https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/googlenet.html


**Design objective:**
Lower level of energy and memory consumption

**Problems:**
1. More layers mean higher accuracy, but also too many parameters (overfitting, training time, computational power,...)
2. ConvNet libraries (as well as general numeric libraries and hardware) are designed to deal with dense matrices, but computer vision representations work with sparse distributions

**Ideas:**

1. idea, based on: M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013. Using 1x1 convolutions increases depth without increasing dimensions. Concatenating this structure repeatedly resulted in GoogLeNet.
2. idea: Inspired by biology, trying to reach scale invariance by applying filters of various sizes in parallel. (This was also present in the classical HMAX model of Serre, just with non-learned filters.) Optimally we would work with many filters that are only sparsely active.

The Inception model combines the two ideas: it does not enforce sparsity explicitly, but approximates it instead.

The building blocks, "inception modules", are composed of pooling and convolutional filters of different sizes; their outputs are concatenated to become the input for the next layer. The ratio of the different filter sizes changes as we get further from the original input, since we can expect to see lower and lower levels of spatial correlation in the higher levels of representation. The number of parameters is controlled by 1x1 dimension reduction convolutions:

<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_5/inception_1x1.png" width=700 height=700>
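
To get a feeling for how much the 1x1 reductions save, here is a rough weight count for a single module, with and without the reduction layers. The filter counts are assumptions for illustration (they follow the first inception module of the paper as far as we can tell), and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (no bias)."""
    return k * k * c_in * c_out

c_in = 192                      # input channels (assumed)
n1, n3, n5 = 64, 128, 32        # 1x1, 3x3 and 5x5 output filters
r3, r5, pool_proj = 96, 16, 32  # 1x1 reductions and pooling projection

# Naive module: 3x3 and 5x5 filters applied directly to the input
naive = (conv_params(1, c_in, n1)
         + conv_params(3, c_in, n3)
         + conv_params(5, c_in, n5))

# With 1x1 reductions before the expensive 3x3 and 5x5 filters
reduced = (conv_params(1, c_in, n1)
           + conv_params(1, c_in, r3) + conv_params(3, r3, n3)
           + conv_params(1, c_in, r5) + conv_params(5, r5, n5)
           + conv_params(1, c_in, pool_proj))

out_channels = n1 + n3 + n5 + pool_proj  # branches are concatenated
print(naive, reduced, out_channels)
```

Under these assumptions the reduced module needs less than half of the weights of the naive one, even though it contains three additional 1x1 layers.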

The structure of GoogLeNet:

<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_5/GoogleNet.png" width=700 height=700>

Because of memory constraints, they used "classical" conv and pool layers in the first two layers, then chained multiple inception modules after each other. Finally they used _avg pooling_ (instead of FC layers), dropout and softmax. Counting only the layers with parameters, the network is 22 layers deep; counting every building block, the number of layers is about 100.

3. idea: Put auxiliary classifiers on some of the intermediate layers; their losses contribute to the error minimized during training. This ensures that a strong gradient flow is maintained in the bottom layers as well. (The motivation came from the fact that shallower models also work, so there is a useful gradient signal that can be extracted from the intermediate layers.) During testing, the auxiliary classifiers are not used.

4. idea: The usage of photometric distortions helps the model become invariant to the conditions under which the photos were taken (camera, detector type,...) [source](https://arxiv.org/abs/1312.5402)

5. idea: They trained 7 (6) identical models on subsets of the training data, and finally made an ensemble of them. (Ensembling always helps a bit. :-( )
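
The auxiliary classifiers of idea 3 only change the training objective. A minimal sketch of such a combined loss, assuming two auxiliary heads and a weight of 0.3 for each (the weight reported in the GoogLeNet paper); the function names are illustrative:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    shifted = logits - logits.max()            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def training_loss(main_logits, aux_logits_list, label, aux_weight=0.3):
    """Main loss plus down-weighted losses of the auxiliary heads."""
    loss = cross_entropy(main_logits, label)
    for aux_logits in aux_logits_list:
        loss += aux_weight * cross_entropy(aux_logits, label)
    return loss

def inference_logits(main_logits, aux_logits_list):
    """At test time the auxiliary heads are simply ignored."""
    return main_logits
```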




---------------------------------------------
## Highway Networks

**Sources:**
- https://arxiv.org/abs/1505.00387
- http://people.idsia.ch/~rupesh/very_deep_learning/
- https://arxiv.org/abs/1507.06228
- https://medium.com/jim-fleming/highway-networks-with-tensorflow-1e6dfa667daa
- https://github.com/wujysh/highway-networks-tensorflow


**Problem:**

Training very deep networks is difficult (unsurprisingly).

**Idea:**

The flow of information could be enhanced by passing through **the original as well as the transformed input**, thus the signal is essentially not weakened during the process. (And by the way, gradients can flow backwards more easily!). This general idea is also called a skip connection, although highway networks are just one particular way of implementing skip connections.

$\mathbf y=H(\mathbf{x}, \mathbf{W_H})\odot T(\mathbf{x}, \mathbf{W_T})+ \mathbf{x}\odot C(\mathbf{x}, \mathbf{W_C})$, 

where T is the "transform" gate, C is the "carry over" gate. 

The simplest form is when

$C()=1-T()$, 

thus 

$\mathbf y=H(\mathbf{x}, \mathbf{W_H})\odot T(\mathbf{x}, \mathbf{W_T})+ \mathbf{x}\odot (1-T(\mathbf{x}, \mathbf{W_T}))$.

In the original paper, the transform gate is concretely a sigmoid layer: $T(\mathbf x)=\sigma(\mathbf{W_T}^T \mathbf x + \mathbf{b_T})$.
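
A minimal NumPy sketch of one such layer in its simplest form ($C = 1 - T$). The choice of `tanh` for $H$ and all variable names are assumptions made for illustration; only the gating formula itself comes from the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)) for a single input vector x."""
    h = np.tanh(W_H.T @ x + b_H)   # H: assumed affine map followed by tanh
    t = sigmoid(W_T.T @ x + b_T)   # T: transform gate, values in (0, 1)
    return h * t + x * (1.0 - t)

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
W_H, b_H = rng.normal(size=(d, d)), np.zeros(d)
W_T = np.zeros((d, d))

# Strongly negative gate bias: the layer (almost) copies its input
y_carry = highway_layer(x, W_H, b_H, W_T, np.full(d, -20.0))
# Strongly positive gate bias: the layer behaves like a plain layer
y_transform = highway_layer(x, W_H, b_H, W_T, np.full(d, 20.0))
```

The two extreme cases show why the gate bias matters: with a negative bias the network starts out close to the identity, so the signal can pass through many layers unchanged.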

<img src="http://drive.google.com/uc?export=view&id=17asvsfH4S6O2TVmE8dLQBF7uhJmalCNt" width=500 height=500>

<img src="http://drive.google.com/uc?export=view&id=1ab6Z6Sr8_tc30s6ZtvaBy-f4ycdyXWOI" width=500 height=500>




------------------------------------------
## ResNet, Wide ResNet, ResNeXt


**Sources:**
- https://github.com/ppwwyyxx/tensorpack/tree/master/examples/ResNet
- https://github.com/tensorflow/models/tree/master/research/resnet
- https://arxiv.org/abs/1512.03385
- https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
- https://arxiv.org/pdf/1602.07261.pdf   combining Inception and ResNet leads to quicker training and higher accuracy
- https://blog.waya.ai/deep-residual-learning-9610bb62c355


Originally a result of Microsoft Research Asia, the lead author, Professor He is now at Facebook AI...

ResNets @ ILSVRC & COCO 2015: winner in all categories!
- ImageNet Classification: “Ultra-deep” 152-layer nets
- ImageNet Detection: 16% better than 2nd
- ImageNet Localization: 27% better than 2nd
- COCO Detection: 11% better than 2nd
- COCO Segmentation: 12% better than 2nd

<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_5/ImagenetTable.png" width=500 height=500>


### ResNet vs Highway networks

Motivation:

Depth has a limit: "degradation problem"

<img src="http://drive.google.com/uc?export=view&id=1g9laFRfDIdW0Bl9PC58et2z9DqyB5DNf" width=600 height=600>


You need more and more tricks to keep up the gradient flow in case of deeper networks.

<img src="http://drive.google.com/uc?export=view&id=141aDrpQAtXdN8HLKg3eopbO3RB7YSXeo" width=400 height=400>


ResNets and Highway networks try to realise the same goal, so the concept is similar but the realization differs:

<img src="https://www.researchgate.net/publication/311842587/figure/fig1/AS:442294890438658@1482462727725/Illustrating-our-usage-of-blocks-and-stages-in-Highway-and-Residual-networks-Note-also.png" width=400 height=400>

[Source](https://www.researchgate.net/publication/311842587_Highway_and_Residual_Networks_learn_Unrolled_Iterative_Estimation)

Both solutions implement a "shortcut" across depth, though in Highway networks this is done through a "gating function". The gates depend on the input and on learned parameters. If the value of a gate converges to 0, the layer does not learn a residual mapping (so in essence it has a variable amount of "residuality"), while ResNet always operates on the basis of residuals. It is not clear that a Highway net can always capitalize on the increased depth.

The big idea of ResNets is to make the simplest of "transformations", namely the _identity mapping_, easy to learn: we introduce "skip connections", so that the stacked layers only have to learn the "residual" $F(\mathbf x) = H(\mathbf x) - \mathbf x$ of the desired mapping $H$. (It had some precursors in [VLAD (Vector of Locally Aggregated Descriptors)](http://www.vlfeat.org/api/vlad-fundamentals.html).)

<img src="http://drive.google.com/uc?export=view&id=17vkfFKkbNBxtc4v_tqM9GFRonE-hbPNw" width=400 height=400>

Introduction of identity transformation helps optimization.
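
A toy illustration of why the identity is easy to represent: with the skip connection, setting the weights of the residual branch to zero already gives the identity mapping. This is a sketch with fully connected layers instead of convolutions and without batch norm; the names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with residual branch F(x) = W2 relu(W1 x)."""
    f = W2 @ relu(W1 @ x)   # residual branch
    return relu(f + x)      # skip connection adds the unchanged input

d = 5
# Inputs of a block are outputs of a previous ReLU, hence non-negative
x = relu(np.random.default_rng(0).normal(size=d))

# With all-zero weights the block is exactly the identity
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```

A plain (non-residual) stack of layers would have to learn specific non-zero weights to reproduce its input, which is a much harder target for the optimizer.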

<img src="http://drive.google.com/uc?export=view&id=1NWGv-2LtzBHZODghTTJsVlccye3CaEpT" width=400 height=400>


<img src="http://drive.google.com/uc?export=view&id=1MLyeQ6fpeQbLn71jmuclDn0ECO5i2utC" width=400 height=400>


Building blocks of original ResNet, "improved" and "generalized to more layers":

<img src="http://drive.google.com/uc?export=view&id=1dWnfQSp0JBR1jLQ-iPXzBlB9V-vQ1FqU" width=600 height=600>

The difference lies in the transformations used. In smaller networks they use non-linearities (ReLU or ReLU+Batchnorm), and in bigger networks identity.

The whole network is made up of such units, but before the classification there are no FC layers, only average pooling.


The bottleneck structure for reducing parameters is useful in big tasks:
<img src="http://drive.google.com/uc?export=view&id=19W-y5pAJCNTPO0iRzb5tRfguUj3P1Jq_" width=600 height=600>


The two ResNet building blocks are:
- basic: two convolutional layers with 3x3 filters, batch normalization and ReLU
- bottleneck: a 1x1 dimensionality-reducing convolution, a 3x3 convolution in between, and a 1x1 expanding convolution
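
A quick weight count shows what the bottleneck buys. The channel numbers (64 and 256) are the ones used as an example in the ResNet paper; biases and batch norm parameters are ignored, and the helper name is illustrative.

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (no bias)."""
    return k * k * c_in * c_out

# Basic block on 64 channels: two 3x3 convolutions
basic_64 = 2 * conv_params(3, 64, 64)

# Bottleneck block on 256 channels: 1x1 reduce, 3x3, 1x1 expand
bottleneck_256 = (conv_params(1, 256, 64)
                  + conv_params(3, 64, 64)
                  + conv_params(1, 64, 256))

# For comparison: a basic block applied directly to 256 channels
basic_256 = 2 * conv_params(3, 256, 256)

print(basic_64, bottleneck_256, basic_256)
```

The bottleneck block works on four times as many channels as the 64-channel basic block at roughly the same cost, while a basic block on 256 channels would be far more expensive.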


**Results:**

<img src="http://drive.google.com/uc?export=view&id=1Wp9eO5k0KV8BPRhdqx89I2CtCkAlb6KM" width=600 height=600>





## Wide ResNet

[**Source**](https://arxiv.org/abs/1605.07146)

The introduction of identity mappings was the key idea in ResNets, but with enough layers the so-called "diminishing feature reuse" problem arises: information flows preferentially through the skip connections, and little or nothing is learned on the other branch.


**Ideas:**
- Changing the order of operators: BN->ReLU->conv, results in faster training
- Since the bottleneck layer is "narrow", and the goal is to build wide nets, they left it out
- Increase of representation power
    - more layers
    - more conv layers in one ResNet module
    - change of receptive fields of filters (smaller filters are generally better)
    - Widening: more filters inside the conv layer of the module
- Widening (more filters) allowed for better GPU utilization, drastic speed increase in comparison to classic ResNet
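- Batch Norm had largely replaced DropOut in convolutional layers; since widening increases the number of parameters, they reintroduced DropOut between the convolutions inside the residual block as a regularizer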
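
Widening is not free: the number of weights in a block grows quadratically with the widening factor $k$. A small sketch, assuming a basic block of two 3x3 convolutions and a base width of 16 channels (biases ignored, helper name illustrative):

```python
def basic_block_weights(width):
    """Weights of two 3x3 convolutions with `width` channels in and out."""
    return 2 * 3 * 3 * width * width

base_width = 16  # assumed width of the first group for k = 1
for k in (1, 2, 4, 10):
    print(k, basic_block_weights(k * base_width))

ratio = basic_block_weights(10 * base_width) / basic_block_weights(base_width)
```

So a widening factor of 10 means 100 times more weights per block, but these are large, dense operations that parallelize well on a GPU, unlike adding more sequential layers.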


<img src="http://drive.google.com/uc?export=view&id=1txjchc8PNT9_UGxNYMdAQaG39Pe7e3UO" width=500 height=500>



------------------------------------------------------
## DenseNet

**Sources:** 
- https://arxiv.org/abs/1608.06993
- https://github.com/liuzhuang13/DenseNet
- https://github.com/YixuanLi/densenet-tensorflow

**Problem:** 
- Many other models experimented with "shortening the paths" between layers, but in many different variants

**Idea:**
- What if we connect every layer to all subsequent layers, so that the diverse filters together increase the representation potential?
- Few filters per layer, further compression between "dense" blocks
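
The connectivity pattern can be sketched in a few lines of NumPy. This is a simplification under stated assumptions: each "layer" is a random 1x1 convolution followed by ReLU (the real model uses batch norm, ReLU and 3x3 convolutions), and the names `dense_block` and `growth_rate` are illustrative. The point is only to show that every layer receives the concatenation of all earlier feature maps and adds a small, fixed number of new ones.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """x: (channels, height, width). Returns all feature maps concatenated."""
    features = x
    for _ in range(num_layers):
        c_in = features.shape[0]  # grows by growth_rate in every step
        # Stand-in for a learned layer: random 1x1 convolution + ReLU
        W = rng.normal(scale=0.1, size=(growth_rate, c_in))
        new = np.maximum(np.einsum('oc,chw->ohw', W, features), 0.0)
        # Dense connectivity: keep everything seen so far, append the new maps
        features = np.concatenate([features, new], axis=0)
    return features

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))
out = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
print(out.shape)
```

Because earlier feature maps are reused rather than recomputed, each layer can get away with few filters; the channel count still grows linearly, which is why compression between the blocks is needed.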


<img src="https://cloud.githubusercontent.com/assets/8370623/17981494/f838717a-6ad1-11e6-9391-f0906c80bc1d.jpg" width=500 height=500>

<img src="https://cloud.githubusercontent.com/assets/8370623/17981496/fa648b32-6ad1-11e6-9625-02fdd72fdcd3.jpg" width=700 height=500>


**Comparison with other models:**

<img src="http://drive.google.com/uc?export=view&id=1av2mGDY1aH0GAq7bpyjwdA9sN7d2C09u" width=500 height=500>





