### Why Convolution?
Classifying images with a fully connected network works poorly for two reasons.
- There is no sense of "localization"
- The number of parameters explodes

#### Localization
Our input image is a rectangle of pixels:
```
[[(0, 0), (1, 0), (2, 0)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 2), (1, 2), (2, 2)],
 [(0, 3), (1, 3), (2, 3)]]
```
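By unraveling the image into a vector, we lose spatial context: pixels that were vertical neighbors, such as (0, 0) and (0, 1), end up three positions apart (note that a consistent raster-scan order can help).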
```
[(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (0, 2), (1, 2), (2, 2), (0, 3), (1, 3), (2, 3)]
```
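As a small illustration of the point above, the sketch below flattens the same 4-row, 3-column grid and measures how far apart two neighboring pixels land in the vector. The helper name `flat_distance` is made up for this example.

```python
# Image coordinates (x, y) laid out as 4 rows of 3 pixels, as in the text.
WIDTH, HEIGHT = 3, 4
image = [[(x, y) for x in range(WIDTH)] for y in range(HEIGHT)]

# Unravel row by row (raster order) into a single vector.
flat = [pixel for row in image for pixel in row]

def flat_distance(a, b):
    """Distance between two pixels' positions in the flattened vector."""
    return abs(flat.index(a) - flat.index(b))

# Horizontal neighbors stay adjacent; vertical neighbors end up WIDTH apart.
horizontal_gap = flat_distance((0, 0), (1, 0))
vertical_gap = flat_distance((0, 0), (0, 1))
print(horizontal_gap, vertical_gap)  # 1 3
```

For a wider image the vertical gap grows with the width, so two pixels that touch in the image can sit far apart in the vector.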

#### Number of Parameters
We are passing an image $x_i$ of shape $L \times W \times 3$, thus each neuron in the first hidden layer must have $L \cdot W \cdot 3$ weights. (Note that each "neuron" in this context corresponds to a single row of the weight matrix in our linear layer.) Thus a 32x32 image would require only 3,072 weights per neuron, but a 200x200 image would require 120,000 weights per neuron.

A real network would also have more than one hidden layer, and the number of neurons may even increase between layers, so the total climbs further.
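The arithmetic above can be checked with a short sketch. The function names and the layer width of 100 neurons are assumptions chosen only for illustration, and biases are ignored.

```python
def fc_weights_per_neuron(length, width, channels=3):
    """Weights one fully connected neuron needs for a flattened image."""
    return length * width * channels

def fc_layer_weights(length, width, neurons, channels=3):
    """Total weights (biases ignored) in a first hidden layer with `neurons` units."""
    return fc_weights_per_neuron(length, width, channels) * neurons

small = fc_weights_per_neuron(32, 32)    # 3,072
large = fc_weights_per_neuron(200, 200)  # 120,000

# Hypothetical layer width of 100 neurons, chosen only for illustration.
layer_total = fc_layer_weights(200, 200, neurons=100)  # 12,000,000
print(small, large, layer_total)
```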

### Convolution Operation (Basic)
The solution is to apply an operation called **convolution**, using small weight matrices called **filters**.

A filter $F$ is simply a small matrix (e.g. 5x5x3). During convolution we slide $F$ across the image $I$ and take the dot product of $F$ with each same-sized subsection of $I$ (e.g. 5x5x3). The results form an **activation map**, $A$.
```
    [0, 1, 2]
    [3, 4, 5]     [0, 1]     [4, 6]
I = [6, 7, 8] F = [1, 0] A = [10, 12]
```
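To make the sliding dot product concrete, here is a minimal pure-Python sketch that reproduces the example above. The helper `conv2d` is written for this illustration only; it assumes a single-channel image, a square filter, and that the filter moves one pixel at a time.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` and take a dot product at each position."""
    k = len(kernel)  # assumes a square k x k kernel
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for top in range(out_rows):
        row = []
        for left in range(out_cols):
            # Dot product of the kernel with the patch whose corner is (top, left).
            row.append(sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k) for j in range(k)
            ))
        output.append(row)
    return output

I = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
F = [[0, 1], [1, 0]]
A = conv2d(I, F)
print(A)  # [[4, 6], [10, 12]]
```

The top-left entry, for instance, is $0\cdot0 + 1\cdot1 + 3\cdot1 + 4\cdot0 = 4$.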

### Intuition
We are trying to solve the two problems above: **Localization** and **Number of Parameters**.

#### Localization
Localization improves because each filter takes its dot product over a small patch of neighboring pixels, so our filters are effectively trained to pick up on local features (e.g. corners, blobs of color).

#### Number of Parameters
Filters are typically small, for instance 5x5x3 (75 weights), and their size does not depend on the size of the image, so the number of parameters is greatly reduced. Even with 12 filters in a layer we would end up with only 900 weights.
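A quick check of that count, as a sketch: `conv_layer_weights` is a hypothetical helper, biases are ignored, and the fully connected figure reuses the 200x200 image from earlier.

```python
def conv_layer_weights(kernel_size, channels, num_filters):
    """Weights (biases ignored) in a layer of square filters."""
    return kernel_size * kernel_size * channels * num_filters

conv_total = conv_layer_weights(kernel_size=5, channels=3, num_filters=12)
fc_single_neuron = 200 * 200 * 3  # one fully connected neuron on a 200x200 image

print(conv_total)        # 900
print(fc_single_neuron)  # 120000
```

Note that the image size never enters `conv_layer_weights`, which is why the count stays at 900 whether the image is 32x32 or 200x200.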

### Convolution (Medium)

We skipped a couple of things in the basic overview.

#### Stride
Say we have an Image, $I$, and a Filter $F$
```
    [0, 1, 2, 3]
    [4, 5, 6, 7]
    [8, 9, 10, 11]       [0, 1]
I = [12, 13, 14, 15] F = [1, 0]
```

Typically we would convolve as follows
```
[0, 1]   [0, 1]
[4, 5] * [1, 0]

[1, 2]   [0, 1]
[5, 6] * [1, 0]

[2, 3]   [0, 1]
[6, 7] * [1, 0]
...
```

This is a stride of 1: the filter moves one pixel between positions. With a stride of 2 the filter moves two pixels at a time, so we would convolve as
```
[0, 1]   [0, 1]
[4, 5] * [1, 0]

[2, 3]   [0, 1]
[6, 7] * [1, 0]
...
```
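The two cases can be compared by listing every patch the filter visits. This is a sketch only; `patches` is a helper name invented for the example.

```python
def patches(image, k, stride):
    """Return every k x k patch the filter visits, left to right, top to bottom."""
    rows, cols = len(image), len(image[0])
    return [
        [row[left:left + k] for row in image[top:top + k]]
        for top in range(0, rows - k + 1, stride)
        for left in range(0, cols - k + 1, stride)
    ]

I = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
stride_1 = patches(I, k=2, stride=1)
stride_2 = patches(I, k=2, stride=2)

print(len(stride_1), len(stride_2))  # 9 4
print(stride_2[:2])  # [[[0, 1], [4, 5]], [[2, 3], [6, 7]]]
```

With a stride of 1 the 2x2 filter visits 9 positions on the 4x4 image; with a stride of 2 it visits only 4, so the activation map shrinks from 3x3 to 2x2.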

#### Zero Padding
Some combinations of filter size and stride do not fit the image evenly, so we add a border of zeros around the image, e.g.
```
[0, 0, 0, 0, 0, 0]
[0, 0, 1, 2, 3, 0]
[0, 4, 5, 6, 7, 0]
[0, 8, 9, 10, 11, 0]
[0, 12, 13, 14, 15, 0]
[0, 0, 0, 0, 0, 0]
```
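Padding like this is simple to sketch by hand. The helper `zero_pad` below is illustrative, not a library function, and assumes the same border width on every side.

```python
def zero_pad(image, p=1):
    """Surround `image` with a border of zeros that is `p` pixels wide."""
    width = len(image[0]) + 2 * p
    top = [[0] * width for _ in range(p)]
    bottom = [[0] * width for _ in range(p)]
    body = [[0] * p + list(row) + [0] * p for row in image]
    return top + body + bottom

I = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
padded = zero_pad(I, p=1)
for row in padded:
    print(row)
```

The 4x4 image becomes 6x6, matching the grid shown above.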

### Convolution Formula
If the input image is $W \times W \times 3$, and we have $K \times K \times 3$ filters with a stride of $S$ and padding $P$, then each side of the output activation map has length
$$\frac{W - K + 2P}{S} + 1$$

(Note that this can also be used to check when padding is needed: if the result is not a whole number, the filter does not fit the image evenly.)
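The formula translates directly into a small checker. In this sketch `conv_output_size` is a hypothetical helper that returns `None` when the division is uneven, following the rule of thumb in the text; libraries such as PyTorch instead round the result down.

```python
def conv_output_size(w, k, s=1, p=0):
    """Side length of the activation map, or None if the filter does not fit evenly."""
    span = w - k + 2 * p
    if span < 0 or span % s != 0:
        return None
    return span // s + 1

print(conv_output_size(4, 2, s=1))        # 3
print(conv_output_size(4, 2, s=2))        # 2
print(conv_output_size(7, 3, s=3))        # None: (7 - 3) / 3 is not a whole number
print(conv_output_size(7, 3, s=3, p=1))   # 3: padding of 1 makes it fit
print(conv_output_size(32, 5, s=1, p=2))  # 32: output keeps the input size
```

The first two calls match the stride examples above: a 2x2 filter on the 4x4 image gives a 3x3 map at stride 1 and a 2x2 map at stride 2.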

### Coding Example
Let's reproduce the 2D convolution from the basic example above:
```
    [0, 1, 2]
    [3, 4, 5]     [0, 1]
I = [6, 7, 8] F = [1, 0]
```

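```python
import torch
import torch.nn.functional as f

# 3x3 single-channel image holding 0..8, shaped (batch=1, channels=1, 3, 3)
I = torch.Tensor([[[[i + j*3 for i in range(3)] for j in range(3)]]])
# One 2x2 filter, shaped (out_channels=1, in_channels=1, 2, 2)
F = torch.Tensor([[[[0, 1], [1, 0]]]])

# Slide F over I with stride 1 and no padding; A has shape (1, 1, 2, 2)
A = f.conv2d(input=I, weight=F, bias=None, stride=1, padding=0)
```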

In the above example we declared our filter as
```
[[[[0, 1], [1, 0]]]] # shape = (1, 1, 2, 2)
```

Rather than simply as
```
[[0, 1], 
 [1, 0]]
```

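This is because `conv2d` expects the weight to have shape (number of filters, input channels, filter height, filter width), and in training we typically pass multiple filters, e.g. two 3x3 filters for a single-channel image:
```
[[[[1., 1., 1.],
   [1., 0., 1.],
   [1., 1., 1.]]],

 [[[0., 1., 0.],
   [1., 0., 1.],
   [0., 1., 0.]]]] # shape = (2, 1, 3, 3)
```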

(The input data is likewise 4-dimensional, with shape (batch size, channels, height, width), on account of training data being passed in mini-batches.)
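The nesting is easier to see by measuring it. The sketch below uses a made-up helper, `shape`, on plain Python lists, and assumes the weight layout (number of filters, input channels, height, width) that `conv2d` documents.

```python
def shape(nested):
    """Shape of a regularly nested list, e.g. [[0, 1], [1, 0]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# One 2x2 filter for a single-channel image.
single = [[[[0, 1], [1, 0]]]]

# Two 3x3 filters for a single-channel image: the outermost list holds the filters.
pair = [
    [[[1., 1., 1.], [1., 0., 1.], [1., 1., 1.]]],
    [[[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]],
]

print(shape(single))  # (1, 1, 2, 2)
print(shape(pair))    # (2, 1, 3, 3)
```

Each filter produces its own activation map, so the pair of filters would yield an output with two channels.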