# Convolutional Neural Network Architectures

This notebook studies and compares different convolutional neural network architectures. It includes calculating the number of parameters, memory requirements, and analyzing inception modules.

## Overview
The key steps involve calculating parameters for AlexNet and VGG19, understanding inception modules, and comparing architectures.

## Procedure
- **AlexNet Parameters**: Calculate the number of parameters for each layer of AlexNet and sum them.
- **VGG19 Parameters**: Complete Table 1 for VGG19, calculating activation units and parameters for each layer.
- **Receptive Field Calculation**: Show that a stack of N convolution layers with filter size F×F has the same receptive field as one convolution layer with filter size (NF - N + 1) × (NF - N + 1). Calculate the receptive field for 3 filters of size 5x5.
- **Inception Module Analysis**:
  - General idea behind inception modules.
  - Calculate output size after each filter and filter concatenation for naive and dimensionality reduction architectures.
  - Calculate the number of convolutional operations for each inception architecture.
  - Explain the computational complexity and savings of dimensionality reduction architecture.
- **Faster R-CNN**:
  - Main difference between Fast R-CNN and Faster R-CNN.
  - Explain the architecture of Region Proposal Network (RPN).
  - Describe how region proposals are generated from RPN.
  - Technique to reduce the number of proposals and its working.

References:
- [AlexNet Paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
- [VGG Paper](https://arxiv.org/pdf/1409.1556.pdf)
- [GoogLeNet Paper](https://arxiv.org/pdf/1409.484​⬤

## 1.1

| Layer    | Output Size  | Strides | Padding | Weight          | Biases | Parameters   |
|:--------:|:------------:|:-------:|:-------:|:---------------:|:------:|:------------:|
| Input    | 227*227*3    |   -     |   -     |       0         |   0    |      0       |
| Conv-1   | 55*55*96     |   4     |   0     | 34,848          |   96   | 34,944       |
| MaxPool-1| 27*27*96     |   2     |   0     |       0         |   0    |      0       |
| Conv-2   | 27*27*256    |   1     |   2     | 614,400         |  256   | 614,656      |
| MaxPool-2| 13*13*256    |   2     |   0     |       0         |   0    |      0       |
| Conv-3   | 13*13*384    |   1     |   1     | 884,736         |  384   | 885,120      |
| Conv-4   | 13*13*384    |   1     |   1     | 1,327,104       |  384   | 1,327,488    |
| Conv-5   | 13*13*256    |   1     |   1     | 884,736         |  256   | 884,992      |
| MaxPool-3| 6*6*256      |   2     |   0     |       0         |   0    |      0       |
| FC-1     | 4096         |   -     |   -     | 37,748,736      | 4096   | 37,752,832   |
| FC-2     | 4096         |   -     |   -     | 16,777,216      | 4096   | 16,781,312   |
| FC-3     | 1000         |   -     |   -     | 4,096,000       | 1000   | 4,097,000    |
| Output   | 1000         |   -     |   -     |       0         |   0    |      0       |
| **Total**|              |         |         |                 |        | **62,378,344** |

Total parameters for AlexNet: $$34,944 + 614,656 + 885,120 + 1,327,488 + 884,992 + 37,752,832 + 16,781,312 + 4,097,000 = 62,378,344$$
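The total can be cross-checked with a short script. This is a minimal sketch, not part of the original solution: the kernel sizes are the ones from the AlexNet paper, the convolutions are treated as ungrouped (as the table does), and the names `conv_params` and `fc_params` are illustrative helpers.

```python
# (name, kernel_h, kernel_w, in_channels, out_channels) for each conv layer.
# Assumption: ungrouped convolutions, kernel sizes as in the AlexNet paper.
conv_layers = [
    ("Conv-1", 11, 11, 3, 96),
    ("Conv-2", 5, 5, 96, 256),
    ("Conv-3", 3, 3, 256, 384),
    ("Conv-4", 3, 3, 384, 384),
    ("Conv-5", 3, 3, 384, 256),
]
# (name, in_features, out_features) for each fully connected layer.
fc_layers = [
    ("FC-1", 6 * 6 * 256, 4096),
    ("FC-2", 4096, 4096),
    ("FC-3", 4096, 1000),
]


def conv_params(kh, kw, c_in, c_out):
    """Weights (kh * kw * c_in per filter) plus one bias per filter."""
    return kh * kw * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights (n_in per output unit) plus one bias per output unit."""
    return n_in * n_out + n_out


alexnet_params = {}
for name, kh, kw, c_in, c_out in conv_layers:
    alexnet_params[name] = conv_params(kh, kw, c_in, c_out)
for name, n_in, n_out in fc_layers:
    alexnet_params[name] = fc_params(n_in, n_out)

alexnet_total = sum(alexnet_params.values())
print(f"{alexnet_total:,}")
```

Pooling layers contribute nothing because they have no learnable weights.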

## 1.2


Layer | Number of Activations (Memory) | Parameters (Compute)
:-----:|:----------:|:---:
Input | 224\*224\*3 = 150K | 0
CONV3-64 | 224\*224\*64 = 3.2M | (3\*3\*3)\*64 = 1,728
CONV3-64 | 224\*224\*64 = 3.2M | (3\*3\*64)\*64 = 36,864
POOL2 |  112\*112\*64 = 800K | 0
CONV3-128 | 112\*112\*128 = 1.6M | (3\*3\*64)\*128 = 73,728
CONV3-128 | 112\*112\*128 = 1.6M | (3\*3\*128)\*128 = 147,456
POOL2 |  56\*56\*128 = 400K | 0
CONV3-256 | 56\*56\*256 = 800K | (3\*3\*128)\*256 = 294,912
CONV3-256 | 56\*56\*256 = 800K | (3\*3\*256)\*256 = 589,824
CONV3-256 | 56\*56\*256 = 800K | (3\*3\*256)\*256 = 589,824
CONV3-256 | 56\*56\*256 = 800K | (3\*3\*256)\*256 = 589,824
POOL2 | 28\*28\*256 = 200K | 0
CONV3-512 | 28\*28\*512 = 400K | (3\*3\*256)\*512 = 1,179,648
CONV3-512 | 28\*28\*512 = 400K | (3\*3\*512)\*512 = 2,359,296
CONV3-512 | 28\*28\*512 = 400K | (3\*3\*512)\*512 = 2,359,296
CONV3-512 | 28\*28\*512 = 400K | (3\*3\*512)\*512 = 2,359,296
POOL2 | 14\*14\*512 = 100K | 0
CONV3-512 | 14\*14\*512 = 100K | (3\*3\*512)\*512 = 2,359,296
CONV3-512 | 14\*14\*512 = 100K | (3\*3\*512)\*512 = 2,359,296
CONV3-512 | 14\*14\*512 = 100K | (3\*3\*512)\*512 = 2,359,296
CONV3-512 | 14\*14\*512 = 100K | (3\*3\*512)\*512 = 2,359,296
POOL2 | 7\*7\*512 = 25K | 0
FC | 4096 | (7\*7\*512)\*4096 = 102,760,448
FC | 4096 | 4096\*4096 = 16,777,216
FC | 1000 | 4096\*1000 = 4,096,000
**Total** | **≈16.5M** | **≈143.65M weights (≈143.67M with biases)**
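The per-layer counts can be recomputed from the layer list. The following is a rough sketch under stated assumptions: a 224×224×3 input, 3×3 convolutions with "same" padding, 2×2 pooling with stride 2, and weights counted separately from biases. `VGG19_CFG` and `vgg19_counts` are names made up for this example.

```python
# Integers are the output channels of a 3x3 convolution, "P" is a 2x2 max pool.
VGG19_CFG = [64, 64, "P", 128, 128, "P", 256, 256, 256, 256, "P",
             512, 512, 512, 512, "P", 512, 512, 512, 512, "P"]
FC_SIZES = [4096, 4096, 1000]


def vgg19_counts(size=224, channels=3):
    """Return (activations, weights, biases) summed over all layers."""
    activations = size * size * channels  # the input volume itself
    weights = 0
    biases = 0
    for item in VGG19_CFG:
        if item == "P":
            size //= 2  # pooling halves height and width, keeps depth
        else:
            weights += 3 * 3 * channels * item
            biases += item
            channels = item
        activations += size * size * channels  # output of this layer
    features = size * size * channels  # flattened 7*7*512 volume
    for n_out in FC_SIZES:
        weights += features * n_out
        biases += n_out
        activations += n_out
        features = n_out
    return activations, weights, biases


activations, weights, biases = vgg19_counts()
print(f"activations: {activations:,}")
print(f"weights: {weights:,}  weights + biases: {weights + biases:,}")
```

Most of the weights sit in the first fully connected layer, while most of the activation memory is used by the early convolutional layers.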



## 1.3


A single convolution with a filter of size (NF - N + 1)x(NF - N + 1) covers the same receptive field as a stack of N convolutional layers with filter size FxF (stride 1, no dilation).

Before any convolution, one unit sees exactly 1 input pixel. Each FxF layer widens the receptive field by F - 1 in each dimension, so after N layers the receptive field is 1 + N(F - 1) = NF - N + 1 pixels on a side.

Calculating Receptive Field for 3x 5x5 Filters:

* F = 5
* N = 3
* Receptive Field = (3 * 5) - 3 + 1 = 13, i.e. a 13x13 region of the input
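The formula can also be checked by growing the receptive field one layer at a time. This is a small sketch that assumes stride 1 and no dilation; `stacked_receptive_field` is an illustrative name.

```python
def stacked_receptive_field(num_layers, filter_size):
    """Side length of the receptive field of a stack of stride-1 conv layers."""
    rf = 1  # a single input pixel before any convolution is applied
    for _ in range(num_layers):
        rf += filter_size - 1  # each layer adds F - 1 pixels per dimension
    return rf


# Compare the layer-by-layer result with the closed form NF - N + 1.
for n in range(1, 6):
    for f in (1, 3, 5, 7):
        assert stacked_receptive_field(n, f) == n * f - n + 1

print(stacked_receptive_field(3, 5))
```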


## 1.4

### a

**Answer:**

The general idea behind an inception module is to apply filters of several sizes (1×1, 3×3, 5×5) and a pooling operation in parallel on the same input and concatenate their outputs along the channel dimension, instead of committing to a single filter size or stacking the filters sequentially. This lets the network capture features at multiple scales within one layer. Because a bigger model is more prone to overfitting and every added parameter costs compute and memory, the module also uses 1×1 convolutions for dimensionality reduction before the expensive 3×3 and 5×5 convolutions, which allows deeper and wider networks at a manageable computational cost.

### b

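Naive architecture: the 1x1 (128 filters), 3x3 (192 filters), and 5x5 (96 filters) convolutions and the 3x3 max pooling all use stride 1 with padding that preserves the 32x32 spatial size, which is what makes concatenation along the depth axis possible. Their outputs are 32x32x128, 32x32x192, 32x32x96, and 32x32x256 (pooling keeps the input depth, which the total implies is 256), so filter concatenation gives 32x32x672.

Dimensionality reduction architecture: assuming the standard layout (1x1 with 128 filters; 1x1 with 128 filters followed by 3x3 with 192 filters; 1x1 with 32 filters followed by 5x5 with 96 filters; 3x3 max pooling followed by 1x1 with 64 filters), the intermediate outputs are 32x32x128 and 32x32x32 after the reducing 1x1 convolutions and 32x32x256 after pooling. Only the last layer of each branch is concatenated: 32x32x128, 32x32x192, 32x32x96, and 32x32x64. The output is therefore 32x32x480, not the 32x32x896 obtained by summing every filter in the module.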
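The concatenated depth is the sum of the output channels of the last layer in each branch. The sketch below assumes an input depth of 256 (implied by 672 - 128 - 192 - 96) and the usual branch layout with a 64-filter 1x1 projection after pooling; the dictionary names are made up for illustration.

```python
INPUT_DEPTH = 256  # assumption: inferred from the 672-channel naive output
SPATIAL = (32, 32)  # stride-1, padded branches keep the spatial size

# Output channels of the LAST layer in each parallel branch.
naive_branches = {"1x1": 128, "3x3": 192, "5x5": 96, "pool": INPUT_DEPTH}
reduced_branches = {"1x1": 128, "3x3": 192, "5x5": 96, "pool_proj": 64}

naive_depth = sum(naive_branches.values())
reduced_depth = sum(reduced_branches.values())

print("naive output:  ", SPATIAL + (naive_depth,))
print("reduced output:", SPATIAL + (reduced_depth,))
```

The reducing 1x1 layers (128 and 32 filters) do not appear in the sum because their outputs feed the 3x3 and 5x5 convolutions rather than the concatenation.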


### c

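Operations are counted as one multiplication per filter weight per output position, i.e. H × W × (number of filters) × (kernel height × kernel width × input depth), assuming a 32x32x256 input.

In the naive version, the total is 1,115,684,864 operations:

* 1x1 convolutions (128): 32 × 32 × 128 × (1 × 1 × 256) = 33,554,432
* 3x3 convolutions (192): 32 × 32 × 192 × (3 × 3 × 256) = 452,984,832
* 5x5 convolutions (96): 32 × 32 × 96 × (5 × 5 × 256) = 629,145,600
* 3x3 max pooling adds no convolutional operations.

In dimension reduction, the total is 397,410,304 operations:

* 1x1 convolutions (128): 32 × 32 × 128 × (1 × 1 × 256) = 33,554,432
* 1x1 convolutions (128) before the 3x3: 32 × 32 × 128 × (1 × 1 × 256) = 33,554,432
* 3x3 convolutions (192) on 128 channels: 32 × 32 × 192 × (3 × 3 × 128) = 226,492,416
* 1x1 convolutions (32) before the 5x5: 32 × 32 × 32 × (1 × 1 × 256) = 8,388,608
* 5x5 convolutions (96) on 32 channels: 32 × 32 × 96 × (5 × 5 × 32) = 78,643,200
* 1x1 convolutions (64) after pooling: 32 × 32 × 64 × (1 × 1 × 256) = 16,777,216
* 3x3 max pooling adds no convolutional operations.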
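A short script makes the counting convention explicit. This is a sketch under assumptions, not the assignment's reference solution: a 32x32x256 input, stride 1 with size-preserving padding, one multiplication per filter weight per output position, and the usual branch layout for the reduced module. `conv_macs` is a hypothetical helper.

```python
H = W = 32
C_IN = 256  # assumption: input depth implied by the 672-channel naive output


def conv_macs(kernel, c_in, c_out, h=H, w=W):
    """Multiplications for a kernel x kernel conv producing an h x w x c_out map."""
    return h * w * c_out * kernel * kernel * c_in


naive_ops = (
    conv_macs(1, C_IN, 128)
    + conv_macs(3, C_IN, 192)
    + conv_macs(5, C_IN, 96)
)

reduced_ops = (
    conv_macs(1, C_IN, 128)                             # plain 1x1 branch
    + conv_macs(1, C_IN, 128) + conv_macs(3, 128, 192)  # 1x1 reduce, then 3x3
    + conv_macs(1, C_IN, 32) + conv_macs(5, 32, 96)     # 1x1 reduce, then 5x5
    + conv_macs(1, C_IN, 64)                            # 1x1 after max pooling
)

print(f"naive:   {naive_ops:,}")
print(f"reduced: {reduced_ops:,}")
print(f"ratio:   {naive_ops / reduced_ops:.2f}x")
```

Under these assumptions the reduced module needs roughly 2.8 times fewer multiplications, with most of the saving coming from the 5x5 branch.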


### d

The naive architecture is problematic because every 3x3 and 5x5 filter operates on the full input depth, and the pooling branch passes the full input depth through to the output, so both the number of convolutional operations and the output depth grow from module to module. This makes training slow and computationally expensive.

The dimensionality reduction architecture addresses this by first applying 1x1 convolutions to reduce the number of channels before the larger 3x3 and 5x5 filters, and by projecting the pooled output with a 1x1 convolution. The cost of a convolution scales linearly with its input depth, so reducing 256 channels to 32 before a 5x5 convolution, for example, cuts the cost of that convolution by a factor of 8. This improves computational efficiency while largely preserving the model's ability to capture intricate features.

## 1.5

### a

Faster R-CNN accelerates detection compared to Fast R-CNN by changing how region proposals are generated. Fast R-CNN relies on an external region proposal method, such as Selective Search, which is computationally heavy and becomes the bottleneck at detection time. Faster R-CNN instead integrates a Region Proposal Network (RPN) into the CNN architecture, so the RPN shares convolutional features with the detection network. Proposals are then produced at little extra cost, which speeds up object detection.


### b

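The RPN is the part of Faster R-CNN that generates region proposals. It is a small fully convolutional network that sits on top of the feature map produced by the base convolutional network (often pre-trained on ImageNet). A 3x3 convolution slides over the feature map, followed by two sibling 1x1 convolutional layers: a classification layer that outputs objectness scores and a regression layer that outputs bounding box coordinates. At each feature map position the RPN makes these predictions relative to k anchor boxes of varied scales and aspect ratios (k = 9 in the original paper: 3 scales × 3 aspect ratios).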


### c

The RPN slides over the feature map, and at each position it considers a set of anchor boxes centered there. For each anchor box it predicts coordinate offsets that refine the anchor into a candidate bounding box, together with an objectness score indicating the probability that the box contains an object. The highest-scoring candidates are selected as region proposals. These proposals, along with their objectness scores, are used in the subsequent object detection stages.
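The anchor grid itself is simple to enumerate. The sketch below only illustrates anchor placement and leaves out the network that scores and refines them. The stride of 16, the scales (128, 256, 512), and the three aspect ratios follow the defaults reported in the Faster R-CNN paper, while `generate_anchors` and its parameters are hypothetical names.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes in image coordinates.

    Each feature map cell gets len(scales) * len(ratios) anchors centered
    on the cell; `ratio` is height / width and each anchor has area scale**2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride  # cell center in image pixels
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 positions, 9 anchors each
```

In the full detector, each of these anchors would receive an objectness score and four regression offsets from the RPN head.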


### d

Non-maximum suppression (NMS) is used in Faster R-CNN to reduce overlapping region proposals from the RPN. It selects the proposal with the highest objectness score and discards all other proposals whose overlap with it, measured by intersection over union (IoU), exceeds a threshold. The process then repeats on the remaining proposals until none are left or the desired number of proposals has been kept.
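The greedy procedure can be sketched in plain Python. This is an illustrative implementation rather than the one used in any particular library: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the default IoU threshold of 0.7 is the value reported in the Faster R-CNN paper, and `max_keep` is a made-up parameter name for the proposal cap.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7, max_keep=None):
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining proposal
        keep.append(best)
        if max_keep is not None and len(keep) >= max_keep:
            break
        # Drop every remaining proposal that overlaps `best` too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_threshold=0.5))
```

In the example, the second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.5 but would survive at 0.7.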


In an image with overlapping proposals, NMS selects the proposal with the highest objectness score and removes any others that overlap significantly. This ensures that only the most promising proposals are used in subsequent object detection stages. In the image below, the green boxes represent the proposals selected by NMS.
