# Convolutional Networks

## Definitions:
- Convolute:
    - To roll or coil together.
- Convolving Together:
    - To roll or coil two functions together. You can think of this as a way to combine two functions: one is slid across the other, and at each position their overlapping values are multiplied and summed. Depending on the dimensions of the functions, the result can be a scalar, a vector, or a matrix.
- Convolution:
    - A mathematical operation that takes two functions and produces a third function.
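As a rough illustration of "two functions in, a third function out", here is a minimal sketch of a discrete convolution of two short sequences. The function name `convolve` and the sample values are made up for this example. Note that it flips the kernel implicitly (the textbook definition), whereas convolutional layers in deep learning libraries typically skip the flip.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences f and g."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            # Every pair of entries contributes to the output at index i + j.
            out[i + j] += fi * gj
    return out


signal = [1.0, 2.0, 3.0]   # hypothetical first function
kernel = [0.0, 1.0, 0.5]   # hypothetical second function
result = convolve(signal, kernel)
print(result)  # [0.0, 1.0, 2.5, 4.0, 1.5]
```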


## Don't Lose Information While Convolving
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-27 at 7.27.33 PM.png">


## 1D Convolutional Layers
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-27 at 7.38.33 PM.png">

## Channels
If we only apply a single convolution, information will inevitably be lost; we are averaging
nearby inputs, and the ReLU activation function clips results that are less than zero.
Hence, it is usual to compute several convolutions in parallel. Each convolution produces
a new set of hidden variables, termed a feature map or channel.
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-27 at 7.40.11 PM.png">
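A small sketch of why several channels help, assuming a 1D input and two hand-picked kernels (the weights and the helper name `conv1d_valid` are illustrative, not learned values). Each kernel followed by ReLU discards part of the signal, but the two channels together retain more of it.

```python
def relu(x):
    return max(0.0, x)


def conv1d_valid(x, w, bias=0.0):
    """Slide kernel w over x (no padding, stride 1) and apply ReLU."""
    k = len(w)
    return [
        relu(bias + sum(w[j] * x[i + j] for j in range(k)))
        for i in range(len(x) - k + 1)
    ]


x = [1.0, 3.0, 2.0, 5.0, 4.0]

# Two hypothetical kernels: one responds to decreases, one to increases.
kernels = [
    [1.0, -1.0, 0.0],
    [-1.0, 1.0, 0.0],
]

# One feature map (channel) per kernel, computed in parallel.
channels = [conv1d_valid(x, w) for w in kernels]
print(channels)  # [[0.0, 1.0, 0.0], [2.0, 0.0, 3.0]]
```

Wherever ReLU has clipped one channel to zero, the other channel still carries a value.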

### 1D Convolution
Terms:
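- **Kernel Size**: The number of weights in the filter, i.e. how many neighbouring inputs each output looks at.
- **Stride**: The number of positions the filter moves between consecutive outputs.
- **Padding**: The number of zeros added at the edges of the input so that the filter can also be applied near the boundaries.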
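The three terms above can be sketched in a few lines. This is a minimal pure-Python version, assuming zero padding applied equally to both ends; the function name `conv1d` and the sample input are made up for illustration.

```python
def conv1d(x, w, stride=1, padding=0):
    """1D convolution (no kernel flip) with zero padding on both ends."""
    padded = [0.0] * padding + list(x) + [0.0] * padding
    k = len(w)  # kernel size
    return [
        sum(w[j] * padded[i + j] for j in range(k))
        for i in range(0, len(padded) - k + 1, stride)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [1.0, 1.0, 1.0]  # kernel size 3

# Padding of 1 with stride 1 keeps the output the same length as the input.
same = conv1d(x, w, stride=1, padding=1)
print(same)     # [3.0, 6.0, 9.0, 12.0, 15.0, 11.0]

# Stride 2 evaluates the kernel at every other position, halving the length.
strided = conv1d(x, w, stride=2, padding=1)
print(strided)  # [3.0, 9.0, 15.0]
```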
  
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-27 at 7.28.32 PM.png">
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-27 at 7.37.27 PM.png">
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-27 at 7.25.31 PM.png">

## 2D Convolution
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 11.56.40 AM.png">
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 12.01.23 PM.png">

## 2D Conv. Terms:
- **Downsampling**: Reducing the spatial size (height and width) of the representation passed to the next layer. This has the benefit of increasing the receptive field of the network at each subsequent downsampled layer. This allows those layers to learn more complex features, since they can see more of the image.
- **Upsampling**: Increasing the spatial size (height and width) of the representation passed to the next layer. This has the benefit of increasing the resolution, allowing the network to represent more fine-grained features. It is also a common technique when recombining representations from two branches of a network that have different resolutions.
- **Condensing Channels**: Reducing the number of channels in the representation. This can be useful when you have a large number of channels and want to reduce the number of parameters in the network, without pooling or downsampling the image.

### Downsampling
3 Approaches:
- **Strided Convolution**: This involves applying the convolution with a stride greater than one, so that some positions are skipped. This is less common than pooling but can be useful when you want the network to learn the downsampling.
- **Max Pooling**: The most common approach to downsampling is max pooling. This involves dividing the image into non-overlapping rectangles and, for each such sub-region, outputting the maximum value.
- **Average Pooling**: This involves taking the average of the values instead of the maximum.
  
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 12.06.23 PM.png">
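The two pooling approaches can be sketched as follows, assuming non-overlapping 2x2 blocks on a small grid. The helper names `pool2d` and `mean`, and the sample values, are illustrative only.

```python
def pool2d(image, size=2, reduce=max):
    """Split image into non-overlapping size x size blocks and reduce each one."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            block = [
                image[r + dr][c + dc]
                for dr in range(size)
                for dc in range(size)
            ]
            out_row.append(reduce(block))
        out.append(out_row)
    return out


def mean(values):
    return sum(values) / len(values)


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 3, 7, 8],
]

max_pooled = pool2d(image, reduce=max)
avg_pooled = pool2d(image, reduce=mean)
print(max_pooled)  # [[4, 2], [3, 8]]
print(avg_pooled)  # [[2.5, 1.0], [1.5, 6.5]]
```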

### Upsampling
- **Transposed Convolution**: A learned form of upsampling. It can be viewed as spacing the input values out with zeros and then applying a convolution. In the 1D case with a kernel size of three and a stride of two, each input contributes three values to the output, which has twice as many outputs as inputs. The associated weight matrix is the transpose of the weight matrix of the corresponding downsampling (strided) convolution. This is not the same as **Bilinear Interpolation**, which is a separate, fixed (non-learned) method that fills in the new positions by interpolating between neighbouring values.
- **Nearest Neighbour**: This involves duplicating each downsampled value across a 2x2 grid.
- **Max Unpooling**: We redistribute the values to the positions they originally came from in the max pooling layer (where the maxima were found) and fill the remaining positions with zeros.

<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 12.17.43 PM.png">
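A minimal sketch of the two non-learned approaches, nearest-neighbour duplication and max unpooling. The function names are made up, and the `positions` grid is a hypothetical record of where each maximum was found during an earlier 2x2 max pooling step.

```python
def upsample_nearest(image, factor=2):
    """Duplicate every value across a factor x factor block."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def max_unpool(pooled, positions, out_rows, out_cols):
    """Put each pooled value back where its maximum came from; zeros elsewhere."""
    out = [[0] * out_cols for _ in range(out_rows)]
    for value_row, pos_row in zip(pooled, positions):
        for value, (r, c) in zip(value_row, pos_row):
            out[r][c] = value
    return out


small = [[1, 2], [3, 4]]
big = upsample_nearest(small)
for row in big:
    print(row)

pooled = [[4, 2], [3, 8]]
# Hypothetical (row, col) locations of the maxima in the original 4x4 input.
positions = [[(1, 0), (0, 2)], [(3, 1), (3, 3)]]
restored = max_unpool(pooled, positions, out_rows=4, out_cols=4)
for row in restored:
    print(row)
```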

### Condensing Channels
- **1x1 Convolution**: This is a convolution with a kernel size of 1x1. Each output value is a weighted sum of the channels at a single position, so it changes the number of channels without changing the spatial dimensions of the image. This is useful when you have a large number of channels and want to reduce the number of parameters in later layers, or when the channel counts need to match before combining with a layer from another branch of the network.

<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 12.23.33 PM.png">
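To make the "weighted sum over channels at each position" idea concrete, here is a small sketch that condenses three channels into one. The name `conv1x1`, the data layout (a list of 2D grids, one per channel) and the weights are assumptions for this example, and biases are left out.

```python
def conv1x1(feature_maps, weights):
    """Apply a 1x1 convolution.

    feature_maps: list of C_in grids, each rows x cols.
    weights: list of C_out rows, each holding C_in weights.
    Returns C_out grids with the same rows x cols.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            [
                sum(w * fm[r][c] for w, fm in zip(w_out, feature_maps))
                for c in range(cols)
            ]
            for r in range(rows)
        ]
        for w_out in weights
    ]


# Three input channels of size 2x2.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
weights = [[1.0, 0.0, 0.5]]  # hypothetical weights: 3 channels in, 1 channel out

y = conv1x1(x, weights)
print(y)  # [[[2.0, 3.0], [4.0, 5.0]]]
```

The spatial size stays 2x2 while the channel count drops from three to one.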

## Applications
- **Image Recognition**: Convolutional networks are used in image recognition tasks, such as identifying objects in images.

### AlexNet
- **AlexNet**: A convolutional neural network that won the ImageNet Large Scale Visual Recognition Challenge in 2012. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It was the first deep learning model to win the challenge, and it significantly outperformed the second-place model.
<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 12.42.19 PM.png">

The mathematics behind the layer sizes after each convolution is worth understanding. When we design a network ourselves, it gives us a general sense of how large the layers will be at each stage of the network, based on our kernel size, stride, and padding.

So, the size of the output feature map from a convolution operation can be calculated deterministically given the input size, kernel size, stride, and padding used. The formula for calculating the output size for one dimension (assuming no padding is added) is:


$$
\text{output size} = \left\lfloor \frac{\text{input size} - \text{filter size}}{\text{stride}} \right\rfloor + 1
$$

So, applying an 11x11 kernel with a stride of 4 on a 224x224 input, you would first compute the division for each dimension (width and height):

$$
\frac{224 - 11}{4} = \frac{213}{4} = 53.25
$$

Since we can't have a fraction of a pixel, we take the floor of the division (which is implicitly done in most programming implementations through integer division):

$$
\text{output size} = \left\lfloor 53.25 \right\rfloor + 1 = 53 + 1 = 54
$$

So, you would actually end up with an output feature map that is 54x54.
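The calculation above is easy to check in code. This is a small sketch; the function name `conv_output_size` is made up, and the optional `padding` argument (zeros added to each side) is an extension beyond the no-padding formula given earlier.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension; padding is applied to each side."""
    # Integer division performs the floor in the formula.
    return (input_size + 2 * padding - kernel_size) // stride + 1


no_pad = conv_output_size(224, 11, stride=4)
print(no_pad)  # 54

padded = conv_output_size(224, 11, stride=4, padding=2)
print(padded)  # 55
```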

If you want an output of exactly 55x55, you would need to pad the input. Padding is often added to ensure that the output size remains consistent with the design of the network, particularly to preserve as much information as possible from the input or to maintain a certain output size throughout the layers. The total amount of padding `P` (summed over both sides of one dimension) needed to achieve a specific output size can be determined by rearranging the formula:

$$
P = \left( (\text{output size} - 1) \times \text{stride} \right) + \text{filter size} - \text{input size}
$$

For the example above, this gives:

$$
P = \left( (55 - 1) \times 4 \right) + 11 - 224 = 3
$$
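The rearranged formula can be sketched as a helper. The name `total_padding` is hypothetical, and splitting an odd total unevenly between the two sides is only one possible convention, assumed here for illustration.

```python
def total_padding(output_size, input_size, kernel_size, stride):
    """Smallest total zero padding (both sides combined) for the target output size."""
    return max(0, (output_size - 1) * stride + kernel_size - input_size)


P = total_padding(output_size=55, input_size=224, kernel_size=11, stride=4)
print(P)  # 3

# Assumed convention: put the extra zero on the right when P is odd.
left, right = P // 2, P - P // 2
print(left, right)  # 1 2

# Check: the padded input (224 + 3 = 227) gives the desired output size.
check = (224 + P - 11) // 4 + 1
print(check)  # 55
```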


<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 1.31.29 PM.png">

Recall that a "Fully Connected Layer" means that every unit in the input layer is connected to every unit in the output layer. This is in contrast to convolutional layers, where each hidden unit is only connected to a small region of the input.
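One way to see the difference is to count weights. The sketch below uses hypothetical layer sizes and made-up helper names, and it ignores biases.

```python
def fully_connected_weights(n_in, n_out):
    """Every input unit connects to every output unit."""
    return n_in * n_out


def conv1d_weights(kernel_size, channels_in, channels_out):
    """The same small kernel is reused at every position, so input length does not matter."""
    return kernel_size * channels_in * channels_out


# Hypothetical layer: 1000 inputs mapped to 1000 outputs.
fc = fully_connected_weights(1000, 1000)
conv = conv1d_weights(kernel_size=3, channels_in=1, channels_out=1)
print(fc)    # 1000000
print(conv)  # 3
```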

<img src="/Users/tobiahrex/code/me/learn_machine-learning/screenshots/Screenshot 2024-03-30 at 1.37.42 PM.png">