### Exercises

#### 1. What are the advantages of a CNN over a fully connected DNN for image classification?
**Answer:**
- **Parameter Sharing**: CNNs use shared weights in convolutional layers, which means the same filter (or set of filters) is used across different parts of the input. This greatly reduces the number of parameters compared to fully connected layers.
- **Sparsity of Connections**: In CNNs, each neuron is connected only to a local region of the input, reducing the number of connections and parameters.
- **Translation Equivariance**: Because the same filter slides across the whole image, a feature learned in one location is detected wherever it appears. A fully connected network would have to relearn that feature for every position.
- **Hierarchical Feature Learning**: CNNs automatically learn to detect low-level features (like edges) in the initial layers and high-level features (like objects) in deeper layers.
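To make the parameter-sharing point concrete, the sketch below compares the parameter count of one dense layer and one convolutional layer applied to the same image. The sizes (a 100 x 100 RGB image, 100 units, 100 filters of 3 x 3) are illustrative assumptions, not values from the exercise.

```python
def dense_params(height, width, channels, units):
    """Weights plus biases for a dense layer fed the flattened image."""
    return height * width * channels * units + units


def conv_params(kernel_size, in_channels, filters):
    """Weights plus biases for a conv layer; note it ignores the image size."""
    return kernel_size * kernel_size * in_channels * filters + filters


# Hypothetical sizes chosen only for illustration.
dense_count = dense_params(100, 100, 3, units=100)
conv_count = conv_params(3, in_channels=3, filters=100)

print(f"dense layer: {dense_count:,} parameters")
print(f"conv layer:  {conv_count:,} parameters")
```

The convolutional count does not depend on the image size at all, which is why the gap widens as images get larger.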

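#### 2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and "same" padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?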
**Answer:**

1. **First Convolutional Layer**:
   - Input: 200 x 300 x 3
   - Stride: 2
   - "Same" padding with a stride of 2 gives an output size of $\lceil \frac{\text{input size}}{2} \rceil$ along each spatial dimension
   - Output size (height, width): $\lceil \frac{200}{2} \rceil \times \lceil \frac{300}{2} \rceil$ = 100 x 150
   - Output: 100 x 150 x 100 (100 feature maps)
   
2. **Second Convolutional Layer**:
   - Input: 100 x 150 x 100
   - Stride: 2
   - Output size (height, width): $\lceil \frac{100}{2} \rceil \times \lceil \frac{150}{2} \rceil$ = 50 x 75
   - Output: 50 x 75 x 200 (200 feature maps)
   
3. **Third Convolutional Layer**:
   - Input: 50 x 75 x 200
   - Stride: 2
   - Output size (height, width): $\lceil \frac{50}{2} \rceil \times \lceil \frac{75}{2} \rceil$ = 25 x 38
   - Output: 25 x 38 x 400 (400 feature maps)
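
The shape arithmetic above can be checked with a few lines of Python. This is a minimal sketch that only encodes the ceiling rule for "same" padding; the helper name `same_padding_output` is made up for illustration.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a conv layer with "same" padding."""
    return math.ceil(size / stride)


height, width = 200, 300
shapes = []
for filters in (100, 200, 400):
    height = same_padding_output(height, stride=2)
    width = same_padding_output(width, stride=2)
    shapes.append((height, width, filters))

for layer, shape in enumerate(shapes, start=1):
    print(f"layer {layer} output: {shape}")
```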

##### Total Number of Parameters in the CNN:

1. **First Convolutional Layer**:
   - Number of filters: 100
   - Filter size: $3 \times 3 \times 3$
   - Parameters per filter: $3 \times 3 \times 3 = 27$
   - Total parameters (including biases): $100 \times (27 + 1) = 2800$

2. **Second Convolutional Layer**:
   - Number of filters: 200
   - Filter size: $3 \times 3 \times 100$
   - Parameters per filter: $3 \times 3 \times 100 = 900$
   - Total parameters (including biases): $200 \times (900 + 1) = 180200$

3. **Third Convolutional Layer**:
   - Number of filters: 400
   - Filter size: $3 \times 3 \times 200$
   - Parameters per filter: $3 \times 3 \times 200 = 1800$
   - Total parameters (including biases): $400 \times (1800 + 1) = 720400$

- **Total parameters in the CNN**: $2800 + 180200 + 720400 = 903400$
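
As a quick check of the per-layer counts, the same formula (filters times kernel weights plus one bias each) can be applied in a loop. The function name below is illustrative only.

```python
def conv_layer_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size^2 * in_channels weights plus one bias."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    return filters * (weights_per_filter + 1)


# Channel counts: RGB input, then the three layers' feature maps.
channels = [3, 100, 200, 400]
per_layer = [
    conv_layer_params(3, c_in, c_out)
    for c_in, c_out in zip(channels, channels[1:])
]
total_params = sum(per_layer)

print(per_layer, total_params)
```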

##### RAM Requirement for a Single Instance:

- **Input image size**: $200 \times 300 \times 3 = 180000$ values
- **First layer output size**: $100 \times 150 \times 100 = 1500000$ values
- **Second layer output size**: $50 \times 75 \times 200 = 750000$ values
- **Third layer output size**: $25 \times 38 \times 400 = 380000$ values

- **Total activations**: $180000 + 1500000 + 750000 + 380000 = 2810000$ values

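- **Total activations in bytes (32-bit float = 4 bytes)**: $2810000 \times 4 = 11240000$ bytes = 11.24 MB

- **Parameters in bytes**: $903400 \times 4 = 3613600$ bytes = 3.61 MB

- **Total RAM for a single instance**: $11.24 + 3.61 = 14.85$ MB if every layer's output is kept in memory. An implementation that frees a layer's input once the next layer has been computed only needs the two largest consecutive tensors at the same time ($6 + 3 = 9$ MB for the first and second layer outputs), which lowers the minimum to about $9 + 3.61 = 12.61$ MB.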

##### RAM Requirement for a Mini-Batch of 50 Images:

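- **Activations for 50 images**: $2810000 \times 50 = 140500000$ values
- **Activations in bytes**: $140500000 \times 4 = 562000000$ bytes = 562 MB
- **Parameters remain the same**: 3.61 MB

- **Total RAM for a mini-batch of 50 images**: $562 + 3.61 = 565.61$ MB. During training all activations must be kept for the backward pass, and gradients and optimizer state need additional memory, so this figure is a lower bound.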

##### Summary:

- **Total parameters in the CNN**: 903,400
- **Total RAM for a single instance**: 14.85 MB
- **Total RAM for a mini-batch of 50 images**: 565.61 MB
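
The memory figures can be reproduced with the short script below. It uses decimal megabytes (1 MB = 10^6 bytes), as the calculation above does. The last quantity, `peak_pair_mb`, rests on an extra assumption of this sketch: that at inference time a layer's input can be released as soon as its output has been computed, so only two consecutive tensors are alive at once.

```python
BYTES_PER_FLOAT32 = 4

# Number of values in the input and in each layer's output.
tensor_sizes = [200 * 300 * 3, 100 * 150 * 100, 50 * 75 * 200, 25 * 38 * 400]
param_count = 903_400


def to_mb(n_values):
    """Size in decimal megabytes of n_values 32-bit floats."""
    return n_values * BYTES_PER_FLOAT32 / 1e6


activations_mb = to_mb(sum(tensor_sizes))
params_mb = to_mb(param_count)

single_instance_mb = activations_mb + params_mb
mini_batch_mb = 50 * activations_mb + params_mb

# Assumption: only two consecutive tensors are held in memory at inference.
peak_pair_mb = max(to_mb(a + b) for a, b in zip(tensor_sizes, tensor_sizes[1:]))

print(f"single instance, all activations kept: {single_instance_mb:.2f} MB")
print(f"mini-batch of 50:                      {mini_batch_mb:.2f} MB")
print(f"single instance, two tensors at once:  {peak_pair_mb + params_mb:.2f} MB")
```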


#### 3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?
**Answer:**
1. **Reduce the Mini-Batch Size**: Decreasing the number of samples in each mini-batch reduces memory usage.
2. **Use Gradient Checkpointing**: Store only a subset of the intermediate activations during the forward pass and recompute the others during backpropagation, trading extra computation for lower memory usage.
3. **Reduce the Input Image Size**: Decreasing the resolution of input images lowers memory requirements.
4. **Simplify the Model**: Reduce the number of layers or the number of filters per layer to decrease the model size.
5. **Use Mixed Precision Training**: Train the model with mixed precision (using both 16-bit and 32-bit floating point numbers) to save memory.
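
To get a feel for how much each option helps, the rough estimator below reuses the three-layer network from exercise 2 and varies one setting at a time. It counts activation memory only, and it treats 16-bit storage as exactly halving that memory, which is an idealization of mixed precision training; the function and its arguments are illustrative.

```python
import math


def activation_mb(batch_size, height, width, bytes_per_value=4,
                  filters=(100, 200, 400)):
    """Rough activation memory (decimal MB) for stride-2 'same' conv layers."""
    values = height * width * 3  # RGB input
    for n_filters in filters:
        height, width = math.ceil(height / 2), math.ceil(width / 2)
        values += height * width * n_filters
    return batch_size * values * bytes_per_value / 1e6


baseline = activation_mb(50, 200, 300)
smaller_batch = activation_mb(25, 200, 300)
smaller_images = activation_mb(50, 100, 150)
fewer_filters = activation_mb(50, 200, 300, filters=(50, 100, 200))
half_precision = activation_mb(50, 200, 300, bytes_per_value=2)

for name, value in [("baseline", baseline),
                    ("batch size 25", smaller_batch),
                    ("100 x 150 images", smaller_images),
                    ("half the filters", fewer_filters),
                    ("16-bit activations", half_precision)]:
    print(f"{name:20s} {value:8.2f} MB")
```

Halving the image side lengths cuts the estimate the most because both spatial dimensions shrink at every layer.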

#### 4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?
**Answer:**
- **No Parameters**: A max pooling layer has no weights to learn, whereas a convolutional layer with the same stride adds kernel weights and biases that must be stored and trained.
- **Cheaper Downsampling**: Both layers shrink the spatial dimensions by the same factor, but max pooling does so with simple comparisons instead of multiply-accumulate operations.
- **Feature Selection**: Max pooling keeps the strongest activation in each window and discards the rest, which can improve generalization.
- **Translation Invariance**: Max pooling introduces a small amount of translation invariance, so the output changes little when a feature shifts by a pixel or two.
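
The following NumPy sketch illustrates two of these points on a toy feature map: the pooling function has no trainable parameters, and a one-pixel shift that stays inside the same 2 x 2 window leaves the output unchanged. The helper `max_pool_2x2` is written for this example and assumes even height and width.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


image = np.zeros((4, 4))
image[0, 0] = 1.0                            # one bright pixel
shifted = np.roll(image, shift=1, axis=1)    # same pixel, one column to the right

pooled = max_pool_2x2(image)
pooled_shifted = max_pool_2x2(shifted)

print(pooled)
print(np.array_equal(pooled, pooled_shifted))
```

A shift of two pixels would move the bright pixel into the next window and change the output, so the invariance only holds for small shifts.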

#### 5. When would you want to add a local response normalization layer?
**Answer:**
- **Competitive Normalization**: A local response normalization (LRN) layer makes the most strongly activated neurons inhibit neurons at the same location in neighboring feature maps. This pushes different feature maps to specialize and respond to a wider variety of features.
- **Improving Generalization**: LRN is typically placed in the lower layers of a network, where a more varied pool of low-level features gives the upper layers more to build on and can improve generalization.
- **Used in Specific Architectures**: LRN layers were particularly popular in earlier architectures like AlexNet but are less commonly used in modern architectures.
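
A minimal NumPy sketch of the normalization is shown below: each activation is divided by `(bias + alpha * sum of squares over nearby channels) ** beta`. The default hyperparameters are intended to mirror the values reported for AlexNet and should be read as an assumption of this sketch; the toy call overrides them to keep the arithmetic easy to follow.

```python
import numpy as np


def local_response_normalization(activations, depth_radius=2, bias=2.0,
                                 alpha=1e-4, beta=0.75):
    """Normalize each activation using the squared activations of nearby channels.

    activations: array of shape (height, width, channels).
    """
    activations = np.asarray(activations, dtype=float)
    n_channels = activations.shape[-1]
    squared = activations ** 2
    output = np.empty_like(activations)
    for c in range(n_channels):
        low = max(0, c - depth_radius)
        high = min(n_channels, c + depth_radius + 1)
        window_sum = squared[..., low:high].sum(axis=-1)
        output[..., c] = activations[..., c] / (bias + alpha * window_sum) ** beta
    return output


# One spatial position, five channels, with a strong activation in channel 2.
acts = np.ones((1, 1, 5))
acts[0, 0, 2] = 10.0
normalized = local_response_normalization(
    acts, depth_radius=1, bias=1.0, alpha=1.0, beta=0.5)

print(normalized[0, 0])
```

Channels 1 and 3, which sit next to the strong activation, end up smaller than channels 0 and 4, which started with the same value but are farther away.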

#### 6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet, ResNet, SENet, and Xception?
**Answer:**
- **AlexNet**: 
  - Deeper network with more filters per layer.
  - Use of ReLU activation functions instead of tanh or sigmoid.
  - Overlapping max pooling.
  - Dropout for regularization.
  - Data augmentation to reduce overfitting.
  - Use of GPUs for faster training.
  
- **GoogLeNet (Inception)**:
  - Inception modules that allow multiple convolutions with different kernel sizes to run in parallel.
  - Reduction in the number of parameters by using 1x1 convolutions.
  - Deep network with 22 layers.

- **ResNet**:
  - Introduction of residual connections (skip connections) to alleviate the vanishing gradient problem and enable training of very deep networks.
  - Identity mappings in skip connections to make the optimization easier.

- **SENet**:
  - Introduction of Squeeze-and-Excitation (SE) blocks that adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels.
  
- **Xception**:
  - Extreme version of Inception, where Inception modules are replaced with depthwise separable convolutions.
  - Efficient combination of depthwise and pointwise convolutions to reduce computational cost while maintaining performance.
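
To illustrate why depthwise separable convolutions are cheaper, the sketch below counts parameters for a regular convolution and for a separable one with the same kernel size and channel counts. The example sizes (3 x 3 kernel, 200 input channels, 400 output channels) are borrowed from the top layer of exercise 2, and the separable layer is assumed to have a single bias per output channel.

```python
def regular_conv_params(kernel_size, in_channels, out_channels):
    """Every filter spans all input channels."""
    return kernel_size ** 2 * in_channels * out_channels + out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Depthwise spatial filters followed by a 1x1 pointwise convolution."""
    depthwise = kernel_size ** 2 * in_channels   # one spatial filter per input channel
    pointwise = in_channels * out_channels       # 1x1 convolution mixing channels
    return depthwise + pointwise + out_channels  # assumed: one bias per output channel


regular = regular_conv_params(3, 200, 400)
separable = separable_conv_params(3, 200, 400)

print(f"regular:   {regular:,}")
print(f"separable: {separable:,}")
print(f"ratio:     {regular / separable:.1f}x")
```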

#### 7. What is a fully convolutional network? How can you convert a dense layer into a convolutional layer?
**Answer:**
- **Fully Convolutional Network (FCN)**: A network composed entirely of convolutional layers, without any dense (fully connected) layers, typically used for tasks like semantic segmentation where spatial information must be preserved throughout the network.
  
- **Converting a Dense Layer to a Convolutional Layer**: 
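  - A dense layer with \(n\) neurons can be replaced by a convolutional layer with \(n\) filters whose kernel size equals the size of the input feature maps, using "valid" padding. On an input of the original size, each filter then produces a single 1x1 output equal to the corresponding neuron's output. A 1x1 kernel is sufficient only when the layer's input is already 1x1 spatially, as is the case for any dense layers that follow the first converted one. On larger inputs, the converted network outputs a grid of predictions instead of a single one.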
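
The equivalence can be checked numerically. The NumPy sketch below covers the general case in which the dense layer's input is a feature map larger than 1x1: the dense weights are reshaped into filters that span the whole map, and a "valid" convolution with those filters reproduces the dense output. The sizes and the `valid_conv` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, channels, units = 7, 7, 16, 10

feature_map = rng.normal(size=(height, width, channels))
dense_weights = rng.normal(size=(height * width * channels, units))
dense_bias = rng.normal(size=units)

# Dense layer applied to the flattened feature map.
dense_output = feature_map.reshape(-1) @ dense_weights + dense_bias

# The same weights viewed as `units` filters of shape (height, width, channels).
conv_kernels = dense_weights.reshape(height, width, channels, units)


def valid_conv(inputs, kernels, bias):
    """Stride-1 'valid' convolution (cross-correlation, as in deep learning libraries)."""
    kh, kw, _, n_filters = kernels.shape
    out_h = inputs.shape[0] - kh + 1
    out_w = inputs.shape[1] - kw + 1
    output = np.empty((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            patch = inputs[i:i + kh, j:j + kw, :]
            output[i, j] = np.tensordot(patch, kernels, axes=3) + bias
    return output


conv_output = valid_conv(feature_map, conv_kernels, dense_bias)
print(conv_output.shape, np.allclose(conv_output[0, 0], dense_output))

# On a larger input, the same filters produce a grid of predictions.
larger_output = valid_conv(rng.normal(size=(9, 9, channels)), conv_kernels, dense_bias)
print(larger_output.shape)
```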

#### 8. What is the main technical difficulty of semantic segmentation?
**Answer:**
- **Precise Localization**: Semantic segmentation classifies every pixel, so the network's output must be as spatially precise as its input. Layers with a stride greater than 1 (strided convolutions and pooling) build the context needed to recognize objects but discard spatial resolution along the way. The difficulty is recovering that resolution: typical solutions upsample the coarse feature maps (with transposed convolutions, sometimes called deconvolutions, or with unpooling) and combine them with higher-resolution features from earlier layers.
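
The loss of spatial detail is easy to see on a toy mask. In the sketch below, average pooling stands in for the strided layers of a CNN and nearest-neighbor upsampling stands in for learned upsampling; both substitutions are simplifications made for this example. Upsampling restores the shape of the mask but not its boundaries.

```python
import numpy as np


def downsample(x, factor):
    """Average pooling with window and stride equal to `factor`."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample_nearest(x, factor):
    """Repeat every value `factor` times along both axes."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)


mask = np.zeros((8, 8))
mask[2:5, 3:6] = 1.0                  # a 3x3 object that straddles the 4x4 grid cells

coarse = downsample(mask, 4)          # 2x2 summary of the mask
restored = upsample_nearest(coarse, 4)

print(restored.shape == mask.shape)   # the resolution is back
print(np.array_equal(restored, mask)) # the object's outline is not
```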

#### 9. Build your own CNN from scratch and try to achieve the highest possible accuracy on MNIST.

```python
from utils import set_seed
import os

set_seed()

ROOT_DIR = './'
datapath = os.path.join(ROOT_DIR, 'datasets')
os.makedirs(datapath, exist_ok=True)
```

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the MNIST dataset using tfds
(ds_train, ds_val, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train[:80%]', 'train[80%:]', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
    data_dir=datapath
)
```
