# Convolutional Neural Networks
- Mainly for image data
- Image data is high dimensional (e.g., 224x224x3)
- Nearby pixels are statistically correlated, but fully connected networks (FCNs) have no notion of 'nearby' and treat the relationship between every pair of inputs equally
- The interpretation of an image is stable under geometric transformations
- Convolutional Neural Networks
  - Process each local image region independently
  - Fewer parameters than fully connected layers
  - Exploit spatial relations between nearby pixels

## Invariance and Equivariance
- **Invariance**
  - A function $f[x]$ of an image $x$ is invariant to a transformation $t[x]$ if $$f[t[x]] = f[x]$$
  - The output of the function is the same whether or not the transformation $t$ is applied to the input
- **Equivariance**
  - A function $f[x]$ of an image $x$ is equivariant to a transformation $t[x]$ if $$f[t[x]] = t[f[x]]$$
  - Networks for pixel-to-pixel image segmentation should be equivariant to transformations like rotation and translation
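
The two definitions can be checked numerically. The snippet below is a minimal sketch, assuming a 1D signal and a circular shift as the transformation $t$; the function names (`shift`, `total_brightness`, `threshold`) are illustrative and not part of the original notes.

```python
import numpy as np

def shift(x, k=1):
    """Transformation t[x]: circular translation by k positions."""
    return np.roll(x, k)

def total_brightness(x):
    """Invariant example: a sum ignores where the values are located."""
    return x.sum()

def threshold(x):
    """Equivariant example: an elementwise operation (here a ReLU)."""
    return np.maximum(x, 0.0)

x = np.array([0.5, -1.0, 2.0, 0.0, -0.3])

# Invariance: f[t[x]] == f[x]
invariant_ok = np.isclose(total_brightness(shift(x)), total_brightness(x))

# Equivariance: f[t[x]] == t[f[x]]
equivariant_ok = np.array_equal(threshold(shift(x)), shift(threshold(x)))

print(invariant_ok, equivariant_ok)
```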

## Convolutional network for 1D inputs
- CNNs consist of a series of convolutional layers, each of which is equivariant to translation
- They typically also include pooling mechanisms that induce partial invariance to translation
- Convolution
  - Transforms an input vector $x$ to an output vector $z$ such that each output $z_i$ is a weighted sum of its nearby inputs
  - Convolution kernel: the weights of the weighted sum. The same weights are used at every position
  - Kernel size: the size of the region over which the inputs are weighted and summed
  - *Strictly speaking this is cross-correlation, but in machine learning it is normally called convolution*
- Padding
  - How do we compute the first and last outputs with a kernel of size 3, where the kernel extends beyond the input?
  - Zero padding: assume the input is zero outside its valid range. This introduces extra information at the borders
  - Circular padding: treat the input as circular, so it wraps around
  - Discard the output positions where the kernel exceeds the range of input positions. The representation size decreases
- Stride
  - When the output is evaluated at every position, the stride is 1. With a stride of 2, the kernel is shifted by two positions each time, so the number of outputs is roughly halved
- Kernel size
  - Normally an odd number, so that the kernel can be centered on the current position
  - Larger kernel sizes lead to more weights
  - Dilated (atrous) convolution covers a larger region without adding weights
    - A kernel of size 5 becomes a dilated kernel of size 3 by setting the second and fourth weights to zero
    - The number of zeros interspersed between the weights determines the dilation rate (no zeros corresponds to a dilation rate of one)
- Convolutional Layers
  - Convolves the input, adds a bias, and passes each result through an activation function $a$
  - With kernel size 3, stride one and dilation rate one, the $i^{\text{th}}$ hidden unit $h_i$ would be computed as $$h_i = a\left[\beta + w_1x_{i-1} + w_2x_{i} + w_3x_{i+1}\right]$$ $$ = a\left[\beta + \sum_{j=1}^3 w_jx_{i+j-2}\right]$$
  - Kernel weights $w$ and bias $\beta$ are trainable parameters
  - It's a special case of a fully connected layer, which computes $$h_{i} = a\left[\beta_i + \sum_{j=1}^D w_{ij}x_j \right]$$
  - This fully connected layer would need
    - $D^2$ weights and $D$ biases. The convolutional layer only uses three weights and one bias.
    - The fully connected layer can reproduce the convolutional layer if most weights are set to zero and the others are constrained to be identical
- Channels
  - Several convolutions in parallel
  - Each convolution produces a set of hidden variables, termed *feature map* or *channel*
  - If the incoming layer has $C_{i}$ channels and the kernel size is $K$, each output channel needs weights $\Omega \in \mathbb{R}^{C_i \times K}$ and one bias
  - If there are $C_o$ channels in the next layer, we need $\Omega \in \mathbb{R}^{C_i \times C_o \times K}$ weights and $\beta \in \mathbb{R}^{C_o}$ biases
- CNNs and Receptive fields
  - Receptive field
    - Receptive field of a hidden unit is the region of the original input that feeds into it
    - CNN with kernel size $3$
      - Units in the first layer take a weighted sum of the three closest inputs $\rightarrow$ receptive field of size 3
      - Units in the second layer take a weighted sum of the three closest positions in the first layer, each of which is itself a weighted sum of three inputs $\rightarrow$ receptive field of size 5
      - The receptive field of units in successive layers increases
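
The pieces of this section (kernel, padding, stride, dilation, the link to a fully connected layer, and receptive fields) can be tried out in a few lines of NumPy. This is a minimal sketch under stated assumptions: `conv1d`, `as_fully_connected` and `receptive_field` are hypothetical helper names, the kernel size is assumed odd, and the activation function is left out so that only the pre-activation weighted sum is shown.

```python
import numpy as np

def conv1d(x, w, beta=0.0, stride=1, dilation=1, padding="zero"):
    """Hypothetical helper: weighted sum of nearby inputs (cross-correlation).

    Assumes an odd kernel size so the kernel can be centred on each position.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    half = dilation * (len(w) - 1) // 2      # reach of the kernel on each side
    if padding == "zero":
        xp = np.pad(x, half)                 # input treated as zero outside its range
    elif padding == "circular":
        xp = np.pad(x, half, mode="wrap")    # input treated as circular
    elif padding == "valid":
        xp = x                               # drop positions where the kernel overhangs
    else:
        raise ValueError(f"unknown padding: {padding}")
    n_positions = len(xp) - 2 * half
    out = [beta + w @ xp[i : i + 2 * half + 1 : dilation]
           for i in range(0, n_positions, stride)]
    return np.array(out)

def as_fully_connected(w, D):
    """D x D weight matrix reproducing a zero-padded, stride-1 convolution."""
    W = np.zeros((D, D))
    centre = len(w) // 2
    for i in range(D):
        for j, w_j in enumerate(w):
            col = i + j - centre
            if 0 <= col < D:
                W[i, col] = w_j              # the same weights repeat on every row
    return W

def receptive_field(n_layers, K=3):
    """Receptive field after stacking stride-1, dilation-1 layers of kernel size K."""
    return 1 + n_layers * (K - 1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = np.array([1.0, 2.0, 1.0])

print(conv1d(x, w))                          # zero padding: 6 outputs
print(conv1d(x, w, padding="valid"))         # 4 outputs
print(conv1d(x, w, stride=2))                # 3 outputs
print(conv1d(x, w, dilation=2))              # kernel spans 5 inputs with 3 weights
print(np.allclose(as_fully_connected(w, len(x)) @ x, conv1d(x, w)))
print([receptive_field(n) for n in (1, 2, 3)])
```

The matrix built by `as_fully_connected` has $D^2$ entries, but only three distinct non-zero values, which is the weight sharing described in the notes.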
## CNNs for 2D input
- Kernel is now a $2D$ object
- A $3\times3$ kernel applied to a $2D$ input with elements $x_{ij}$ computes a single hidden output $h_{ij}$ as $$h_{ij} = a\left[\beta + \sum_{m=1}^3\sum_{n=1}^3 w_{mn}x_{i+m-2,j+n-2}\right]$$
- Weighted sum over a square $3\times3$ region of the input
- Often the input is an RGB image with three channels, so the kernel would be $3\times3\times3$
- With a kernel size of $K \times K$ and $C_i$ input channels
  - Each output is a weighted sum of $C_i\times K \times K$ quantities and a bias
  - To compute $C_o$ output channels, it needs $C_i \times C_o \times K \times K$ weights and $C_o$ biases
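
A 2D convolutional layer with several input and output channels can be sketched directly from the formula above. The code below is an illustrative, unoptimized version that assumes zero padding, stride one, an odd kernel size and a ReLU activation; the function name `conv2d_layer` and the array layout `(C_o, C_i, K, K)` for the weights are choices made for this example only.

```python
import numpy as np

def conv2d_layer(x, omega, beta):
    """Hypothetical 2D convolutional layer.

    x:     input of shape (C_i, H, W)
    omega: weights of shape (C_o, C_i, K, K), K odd
    beta:  biases of shape (C_o,)
    Uses zero padding and stride one, so the output has shape (C_o, H, W).
    """
    C_o, C_i, K, _ = omega.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))   # pad only the spatial axes
    _, H, W = x.shape
    h = np.empty((C_o, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + K, j:j + K]            # C_i x K x K quantities
            # Each output channel is a weighted sum over the whole patch plus a bias.
            h[:, i, j] = np.tensordot(omega, patch, axes=3) + beta
    return np.maximum(h, 0.0)                          # ReLU as the activation a

rng = np.random.default_rng(0)
C_i, C_o, K = 3, 4, 3
image = rng.normal(size=(C_i, 8, 8))                   # small stand-in for an RGB image
omega = rng.normal(size=(C_o, C_i, K, K))
beta = np.zeros(C_o)

hidden = conv2d_layer(image, omega, beta)
print(hidden.shape)                                    # (C_o, H, W)
print(omega.size, beta.size)                           # C_i*C_o*K*K weights, C_o biases
```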
## Downsampling/Upsampling
- Downsampling
  - Scaling down both dimensions by a factor of 2
  - Max pooling: take the maximum of each 2x2 block of input values
  - Mean pooling: average the values in each 2x2 block
  - Each approach is applied to each channel individually
- Upsampling
  - Scale back up
  - Duplicate all channels at each spatial position four times, so each value fills a 2x2 block
  - Bilinear interpolation between input values
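
The pooling and duplication operations above reduce to simple array reshaping. This is a minimal sketch assuming a channels-first array of shape `(C, H, W)` with even `H` and `W`; `pool2x2` and `upsample2x` are hypothetical names, and bilinear interpolation is not covered here.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Downsample (C, H, W) by 2 in both spatial dimensions, per channel."""
    C, H, W = x.shape
    blocks = x.reshape(C, H // 2, 2, W // 2, 2)   # split into 2x2 blocks
    if mode == "max":
        return blocks.max(axis=(2, 4))
    if mode == "mean":
        return blocks.mean(axis=(2, 4))
    raise ValueError(f"unknown mode: {mode}")

def upsample2x(x):
    """Upsample (C, H, W) by 2: every value is copied into a 2x2 block."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.arange(16, dtype=float).reshape(1, 4, 4)

pooled_max = pool2x2(x, "max")     # shape (1, 2, 2)
pooled_mean = pool2x2(x, "mean")   # shape (1, 2, 2)
restored = upsample2x(pooled_max)  # shape (1, 4, 4), same size as x

print(pooled_max[0])
print(pooled_mean[0])
print(restored[0])
```

Upsampling after pooling restores the spatial size but not the discarded values, which is why the result differs from `x`.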

## Applications
- Image Classification
  - Most methods reshape the image to a standard size (224x224 RGB)
  - Famous architectures
    - AlexNet
    - VGG (more depth than AlexNet)
- Object Detection
  - Identify and localize multiple objects within an image
  - YOLO (You Only Look Once)
    - Output: encodes which class is present at each location of a 7x7 grid
    - For each location, output also encodes the bounding boxes
      - Bounding boxes are defined by 5 parameters
        - x and y position of the center
        - height and width of the box
        - confidence of the prediction
- Semantic Segmentation
  - Assign a label to each pixel according to the object it belongs to, or no label if it doesn't match any class
  - Input: 224x224 RGB image
  - Output: 224x224x21 array that contains the probability of each of the 21 possible classes at each pixel
  - Encoder: Downsampling
  - Decoder: Upsampling
- **Increasing network depth indefinitely doesn't continue to help. After a certain depth, the network becomes difficult to train. This is the motivation for *residual networks***
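
For the semantic segmentation output described above, the 224x224x21 array of class probabilities can be obtained from raw per-pixel scores with a softmax over the last axis. The sketch below assumes the decoder produces unnormalized scores (logits) in a height x width x classes layout; the random `logits` array is only a stand-in for a real network output, and `pixelwise_softmax` is an illustrative name.

```python
import numpy as np

def pixelwise_softmax(logits):
    """Map (H, W, n_classes) scores to probabilities that sum to one per pixel."""
    z = logits - logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(224, 224, 21))   # stand-in for the decoder output

probs = pixelwise_softmax(logits)          # 224 x 224 x 21 class probabilities
labels = probs.argmax(axis=-1)             # 224 x 224 map of most likely classes

print(probs.shape, labels.shape)
```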