**1. What are the advantages of a CNN over a fully connected DNN for image classification?**

Convolutional Neural Networks (CNNs) are well-suited for image classification tasks as they have several advantages over fully connected Deep Neural Networks (DNNs):

Translation Invariance: CNNs slide the same filters across the whole image, so a pattern learned in one location can be detected anywhere else. This makes them better suited for image classification tasks where the same object may appear in different positions within the image. A fully connected DNN that learns a pattern in one location can only detect it in that location.

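Parameter Efficiency: Because consecutive layers are only partially connected and the weights of each filter are reused at every position, CNNs have far fewer parameters than fully connected DNNs. This makes them faster to train, less prone to overfitting, and less demanding in training data.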

Feature Reuse: In CNNs, the same filter weights are shared across all positions of the image in a convolutional layer, so a feature learned once is applied everywhere. This weight sharing, not pooling, is what allows feature reuse and reduces the number of parameters required.

Robustness to Scale and Translation: CNNs use pooling layers to reduce the spatial dimensions of the input, making them more robust to small variations in scale and translation.

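Better Handling of Local Context: Each neuron in a convolutional layer only looks at a small receptive field, so the network is built around the fact that nearby pixels are related. Lower layers detect small local features and higher layers combine them into larger ones, which matches the structure of most images. A fully connected DNN has no prior knowledge of how pixels are arranged.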

In conclusion, the convolutional structure of CNNs, with its ability to detect patterns anywhere in the image, exploit local context, and reduce the number of parameters, makes them well-suited for image classification tasks.
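To make the parameter-efficiency point concrete, here is a rough count under illustrative assumptions: a 200 × 300 RGB input, a dense layer of 100 neurons, and a convolutional layer of 100 filters with 3 × 3 kernels. The sizes and the helper names `dense_params` and `conv_params` are hypothetical, chosen only for this comparison.

```python
def dense_params(n_inputs, n_neurons):
    """One weight per input per neuron, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Each filter spans all input channels and has one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters


n_inputs = 200 * 300 * 3          # flattened RGB image
dense = dense_params(n_inputs, 100)
conv = conv_params(3, 3, 3, 100)

print("dense layer parameters:", dense)
print("conv layer parameters: ", conv)
```

The convolutional layer's count does not depend on the image size at all, which is why the gap widens as images get larger.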

**2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of
2, and "same" padding. The lowest layer outputs 100 feature maps, the middle one outputs
200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels.
What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much
RAM will this network require when making a prediction for a single instance? What about when
training on a mini-batch of 50 images?**

To calculate the total number of parameters in the CNN, we need to count the parameters in each convolutional layer. Each filter spans all of its input channels, so a 3 × 3 filter has 3 × 3 × (number of input channels) weights plus one bias term, and each output feature map has its own filter.

The first layer takes the 3 RGB channels as input, so each filter has 3 × 3 × 3 + 1 = 28 parameters, and the layer has 28 × 100 = 2,800 parameters.

The second layer takes 100 feature maps as input, so each filter has 3 × 3 × 100 + 1 = 901 parameters, and the layer has 901 × 200 = 180,200 parameters.

The third layer takes 200 feature maps as input, so each filter has 3 × 3 × 200 + 1 = 1,801 parameters, and the layer has 1,801 × 400 = 720,400 parameters.

So the total number of parameters in the CNN is 2,800 + 180,200 + 720,400 = 903,400.
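The count can be checked with a few lines of arithmetic. This is a minimal sketch that counts, for every filter, its 3 × 3 weights across all input channels plus one bias; the helper name `conv_layer_params` is made up for illustration.

```python
def conv_layer_params(kernel_size, in_channels, out_channels):
    """Parameters of a conv layer: (k * k * in_channels + 1 bias) per filter."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    return (weights_per_filter + 1) * out_channels


# 3 input channels (RGB), then 100, 200, and 400 feature maps.
channels = [3, 100, 200, 400]
layer_params = [
    conv_layer_params(3, c_in, c_out)
    for c_in, c_out in zip(channels, channels[1:])
]
total_params = sum(layer_params)

print(layer_params)   # per-layer counts
print(total_params)   # total for the three layers
```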

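For a single prediction on a 200 × 300 RGB image, the network needs memory for its parameters and for the activations being computed. Counting each filter's 3 × 3 × (input channels) weights plus a bias, the network has 903,400 parameters, which take 903,400 × 4 = 3,613,600 bytes (about 3.6 MB) as 32-bit floats. With a stride of 2 and "same" padding, each layer halves the height and width (rounding up), so the first layer produces 100 feature maps of size 100 × 150, the second produces 200 feature maps of size 50 × 75, and the third produces 400 feature maps of size 25 × 38. These take 100 × 100 × 150 × 4 = 6,000,000 bytes, 200 × 50 × 75 × 4 = 3,000,000 bytes, and 400 × 25 × 38 × 4 = 1,520,000 bytes respectively.

When making a prediction, a layer's activations can be released as soon as the next layer has been computed, so only two consecutive layers need to be held in memory at once. The largest such pair is the first and second layers: 6,000,000 + 3,000,000 = 9,000,000 bytes. Adding the parameters gives a minimum of about 9.0 + 3.6 = 12.6 MB.

For training on a mini-batch of 50 images, every layer's activations must be kept for the backward pass: 50 × (6,000,000 + 3,000,000 + 1,520,000) = 526,000,000 bytes. The input batch takes 50 × 200 × 300 × 3 × 4 = 36,000,000 bytes, and the parameters take about 3.6 MB, for a total of roughly 566 MB. This is a lower bound, since gradients and optimizer state need additional memory.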
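The memory estimate can be reproduced as follows. This sketch assumes that inference only needs two consecutive layers in memory at a time and that training must keep every activation for the backward pass; it ignores gradients, optimizer state, and framework overhead, so the results are lower bounds. All variable names are illustrative.

```python
import math

BYTES_PER_FLOAT = 4  # 32-bit floats


def same_padding_output(size, stride):
    """Output size of a strided layer with "same" padding."""
    return math.ceil(size / stride)


in_height, in_width, in_channels = 200, 300, 3
feature_maps = [100, 200, 400]

# Parameters: (3 * 3 * input channels + 1 bias) per filter.
channels = [in_channels] + feature_maps
param_bytes = BYTES_PER_FLOAT * sum(
    (3 * 3 * c_in + 1) * c_out for c_in, c_out in zip(channels, channels[1:])
)

# Activations produced by each layer for one image.
height, width = in_height, in_width
activation_bytes = []
for n_maps in feature_maps:
    height = same_padding_output(height, 2)
    width = same_padding_output(width, 2)
    activation_bytes.append(n_maps * height * width * BYTES_PER_FLOAT)

input_bytes = in_height * in_width * in_channels * BYTES_PER_FLOAT

# Inference: hold the largest pair of consecutive layers, plus the parameters.
inference_bytes = param_bytes + max(
    a + b for a, b in zip(activation_bytes, activation_bytes[1:])
)

# Training on 50 images: keep every activation and the input batch.
batch_size = 50
training_bytes = param_bytes + batch_size * (input_bytes + sum(activation_bytes))

print(f"inference: {inference_bytes / 1e6:.1f} MB")
print(f"training:  {training_bytes / 1e6:.1f} MB (before gradients)")
```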

**3. If your GPU runs out of memory while training a CNN, what are five things you could try to
solve the problem?**

Reduce the batch size: Decreasing the number of samples in each batch can reduce the memory requirements.

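Shrink or prune the model: Removing layers or reducing the number of filters per layer cuts both the parameter count and the activations that must be stored. Pruning individual weights helps less during training unless whole filters or layers are removed.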

Use lower-precision data types: Using lower-precision data types such as float16 instead of float32 can significantly reduce memory usage.

Reduce the size of input images: Decreasing the resolution of the input images can reduce memory requirements.

Use Gradient Checkpointing: Store only a subset of the intermediate activations during the forward pass and recompute the others during the backward pass. This trades extra computation for lower memory usage.
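As a rough illustration of why the first few options help, the memory held by one layer's activations scales linearly with batch size, bytes per value, and the number of pixels. The layer dimensions below are hypothetical and `layer_activation_bytes` is a helper invented for this sketch.

```python
def layer_activation_bytes(batch_size, height, width, channels, bytes_per_value):
    """Memory needed to store one layer's activations for a whole batch."""
    return batch_size * height * width * channels * bytes_per_value


# Hypothetical layer: 100 feature maps of 100 x 150, batch of 50, float32.
baseline = layer_activation_bytes(50, 100, 150, 100, 4)

# Halving the batch size halves the memory.
smaller_batch = layer_activation_bytes(25, 100, 150, 100, 4)

# Storing activations as float16 (2 bytes) also halves it.
half_precision = layer_activation_bytes(50, 100, 150, 100, 2)

# Halving the image height and width divides it by four.
smaller_images = layer_activation_bytes(50, 50, 75, 100, 4)

for name, value in [
    ("baseline", baseline),
    ("smaller batch", smaller_batch),
    ("float16", half_precision),
    ("smaller images", smaller_images),
]:
    print(f"{name:15s} {value / 1e6:.0f} MB")
```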

**4. Why would you want to add a max pooling layer rather than a convolutional layer with the
same stride?**

Max pooling layers and strided convolutional layers both reduce the spatial dimensions of the feature maps, but they do so differently.

A max pooling layer has no parameters at all: it simply takes the maximum value over each local neighborhood. This makes it cheap in memory and computation, and it adds a small amount of invariance to local translations.

A convolutional layer with the same stride also down-samples, but it does so with learned filters, so it adds many parameters (weights and biases) that must be stored and trained.

In summary, you would choose a max pooling layer when you want to down-sample without adding any parameters. A strided convolutional layer is the alternative when you want the down-sampling itself to be learned and can afford the extra parameters.
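The difference in parameter count can be seen in a small sketch. The `max_pool_2x2` function below is a plain-Python illustration of 2 × 2 max pooling with stride 2 (it has nothing to learn), and the strided-convolution count assumes a hypothetical 3 × 3 layer with 64 input and 64 output channels.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even dimensions."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 1],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # 4x4 input reduced to 2x2

# Parameters: pooling has none; a strided 3x3 conv (64 -> 64 channels) has many.
pooling_params = 0
strided_conv_params = (3 * 3 * 64 + 1) * 64
print(pooling_params, strided_conv_params)
```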

**5. When would you want to add a local response normalization layer?**

Local Response Normalization (LRN) normalizes each activation using the activations at the same spatial position in neighboring feature maps, that is, across channels rather than within one channel. It was used in the AlexNet architecture, where it was applied after the ReLU step of some layers.

The effect is a form of competition: a strongly activated neuron inhibits the neurons at the same location in neighboring feature maps. This encourages different feature maps to specialize and respond to different features, which was reported to improve generalization in AlexNet. LRN should not be confused with Batch Normalization, which is the technique that was introduced to address internal covariate shift.

In summary, you would add an LRN layer when you want neighboring feature maps to compete and specialize. In more recent architectures it is typically replaced by Batch Normalization.
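A minimal sketch of LRN across feature maps, for a single spatial position, might look like this. The function name and argument names are made up for illustration; the default values follow the hyperparameters reported in the AlexNet paper (a window of 5 neighboring maps, k = 2, α = 10⁻⁴, β = 0.75) and are not the defaults of any particular library.

```python
def local_response_norm(activations, depth_radius=2, bias=2.0,
                        alpha=1e-4, beta=0.75):
    """Normalize each activation by its neighbors across feature maps.

    `activations` holds the values at one spatial position,
    one entry per feature map.
    """
    n_maps = len(activations)
    normalized = []
    for i, a in enumerate(activations):
        low = max(0, i - depth_radius)
        high = min(n_maps - 1, i + depth_radius)
        sum_sq = sum(activations[j] ** 2 for j in range(low, high + 1))
        normalized.append(a / (bias + alpha * sum_sq) ** beta)
    return normalized


# The same activation (1.0) is damped more when its neighbor is strong.
quiet_neighbor = local_response_norm([1.0, 0.0, 1.0], depth_radius=1)
loud_neighbor = local_response_norm([1.0, 10.0, 1.0], depth_radius=1)
print(quiet_neighbor[0], loud_neighbor[0])
```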

**6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main
innovations in GoogLeNet, ResNet, SENet, and Xception?**

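**AlexNet** is a deep Convolutional Neural Network (CNN) architecture that was introduced in 2012 and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) that year. The main innovations in AlexNet compared to LeNet-5 (an earlier architecture introduced in the 1990s) are:

Increased Depth and Size: AlexNet is much larger and deeper than LeNet-5, with five convolutional and three fully connected layers, and it stacks some convolutional layers directly on top of one another instead of following each one with a pooling layer.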

Large Number of Filters: AlexNet uses many more filters, increasing the capacity to learn complex features.

ReLU Activations: AlexNet popularized the Rectified Linear Unit (ReLU) activation function in deep CNNs. ReLU trains faster than the saturating sigmoid and tanh activation functions.

Dropout Regularization: AlexNet applied dropout in its fully connected layers, along with data augmentation, to reduce overfitting.

**GoogLeNet** is a deep CNN architecture introduced in 2014. Its main innovations compared to AlexNet are:

Inception Modules: GoogLeNet introduced Inception modules, which apply filters of several sizes in parallel and concatenate their outputs. Within each module, 1 × 1 convolutions act as bottlenecks that reduce the number of channels, so the network uses far fewer parameters than AlexNet despite being deeper.

Greater Network Depth: GoogLeNet has 22 layers, compared to AlexNet's eight layers.

**ResNet** is a deep residual network introduced in 2015. Its main innovations compared to GoogLeNet and AlexNet are:

Residual Connections: ResNet introduced residual (skip) connections, which add a block's input to its output. The block then only has to learn the residual, that is, the difference between the desired output and its input, instead of learning the full mapping from scratch.

Extreme Depth: Because the skip connections let the signal and the gradients flow easily through the network, ResNet makes it possible to train extremely deep networks, such as the 152-layer ResNet-152.
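The idea of a residual connection fits in a few lines. In this simplified sketch, `residual_unit` and `zero_layers` are hypothetical names, inputs are plain lists rather than tensors, and the stand-in `zero_layers` mimics layers whose weights are close to zero. It shows that such a unit starts out as the identity function, which is part of why very deep stacks remain trainable.

```python
def residual_unit(x, layers):
    """Return layers(x) + x, element-wise (the skip connection)."""
    fx = layers(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_layers(x):
    """Stand-in for layers whose weights are near zero: they output zeros."""
    return [0.0 for _ in x]


def doubling_layers(x):
    """Stand-in for layers that have learned some non-trivial function."""
    return [2.0 * xi for xi in x]


x = [0.5, -1.0, 2.0]
identity_out = residual_unit(x, zero_layers)       # the unit passes x through
learned_out = residual_unit(x, doubling_layers)    # residual added on top of x
print(identity_out, learned_out)
```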

**SENet** is a deep CNN architecture introduced in 2017. Its main innovation compared to ResNet is:

Squeeze-and-Excitation (SE) Blocks: SENet introduced the use of Squeeze-and-Excitation (SE) blocks, which allow the network to adaptively re-calibrate channel-wise feature responses by explicitly modeling inter-dependencies between channels.
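The recalibration performed by an SE block can be sketched in plain Python. This is a simplified illustration, not the reference implementation: `dense` and `se_block` are hypothetical helpers, feature maps are nested lists, and the weights are hand-picked so that every gate is exactly 0.5 and the result is easy to check.

```python
import math


def dense(inputs, weights, biases):
    """Dense layer: weights[j][i] connects input i to neuron j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def se_block(feature_maps, w1, b1, w2, b2):
    """Scale each feature map by a gate computed from all the maps."""
    # Squeeze: global average pooling, one number per feature map.
    squeezed = [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]
    # Excitation: bottleneck dense layer (ReLU), then one sigmoid gate per map.
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w2, b2)]
    # Recalibration: multiply every value of a map by that map's gate.
    return [
        [[gate * v for v in row] for row in fmap]
        for fmap, gate in zip(feature_maps, gates)
    ]


feature_maps = [
    [[2.0, 4.0], [6.0, 8.0]],   # feature map 0
    [[1.0, 1.0], [1.0, 1.0]],   # feature map 1
]
w1, b1 = [[1.0, 1.0]], [0.0]             # 2 maps -> 1 bottleneck neuron
w2, b2 = [[0.0], [0.0]], [0.0, 0.0]      # zero weights: sigmoid(0) = 0.5
scaled = se_block(feature_maps, w1, b1, w2, b2)
print(scaled)
```

In a trained network the gates differ per feature map, so informative maps are amplified relative to less useful ones.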

**Xception** is a deep CNN architecture introduced in 2016 as a variant of the GoogLeNet (Inception) family. Its main innovations are:

Depthwise Separable Convolutions: Xception replaces Inception modules with depthwise separable convolutions, which first apply a single spatial filter to each input channel separately and then combine the channels with a 1 × 1 convolution. This needs far fewer parameters and computations than a regular convolutional layer.

Deep Stack with Residual Connections: Xception's feature extractor is a stack of 36 convolutional layers organized into 14 modules, most of which are wrapped in residual connections.
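The saving from depthwise separable convolutions can be estimated by counting parameters. This sketch assumes a depth multiplier of 1 and a single bias per output channel; the layer sizes (3 × 3 kernel, 128 input channels, 256 output channels) and the function names are illustrative.

```python
def regular_conv_params(kernel_size, in_channels, out_channels):
    """Every filter spans all input channels, plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Depthwise spatial filters followed by a 1x1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one filter per channel
    pointwise = in_channels * out_channels               # 1x1 channel mixing
    return depthwise + pointwise + out_channels          # plus biases


regular = regular_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(regular, separable, f"{regular / separable:.1f}x fewer parameters")
```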

**7. What is a fully convolutional network? How can you convert a dense layer into a
convolutional layer?**

A fully convolutional network (FCN) is a Convolutional Neural Network (CNN) composed only of convolutional and pooling layers, with the fully connected (dense) layers of a traditional CNN replaced by convolutional layers. This lets the network operate on inputs of arbitrary size and keep spatial information, making it suitable for tasks such as semantic segmentation and object detection.

To convert a dense layer into a convolutional layer, use as many filters as the dense layer has neurons, set the kernel size equal to the size of the input feature maps, and use "valid" padding. The dense layer's weights are simply reshaped into kernels, the stride is usually 1 (it can be larger), and the activation function stays the same. A dense layer that follows another converted layer sees a 1 × 1 input, so it becomes a 1 × 1 convolution.

The converted layer produces exactly the same output as the dense layer on inputs of the original size. On larger inputs it produces one output per window position, as if the original network had been slid across the image.
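A dense layer can be reproduced by a convolutional layer whose kernel covers the whole input feature map with "valid" padding. The sketch below checks this numerically for a single neuron on a single-channel 2 × 2 feature map; the helper functions and the weight values are made up for illustration.

```python
def dense_neuron(flat_inputs, weights, bias):
    """Output of one dense neuron (before the activation function)."""
    return sum(w * x for w, x in zip(weights, flat_inputs)) + bias


def conv2d_valid(image, kernel, bias):
    """Single-channel 2D convolution with stride 1 and "valid" padding."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            total = bias
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        out.append(row)
    return out


weights, bias = [1.0, 2.0, 3.0, 4.0], 0.5
kernel = [weights[0:2], weights[2:4]]   # same weights, reshaped to 2x2

small = [[1.0, 0.0], [2.0, 1.0]]
dense_out = dense_neuron([v for row in small for v in row], weights, bias)
conv_out = conv2d_valid(small, kernel, bias)
print(dense_out, conv_out)              # same value, conv output is 1x1

# On a larger input the conv layer yields one output per window position.
large = [[1.0, 0.0, 2.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
large_out = conv2d_valid(large, kernel, bias)
print(large_out)
```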

**8. What is the main technical difficulty of semantic segmentation?**

The main technical difficulty of semantic segmentation is producing dense, per-pixel predictions at the resolution of the input image. As an image passes through a regular CNN, layers with a stride greater than 1 progressively reduce the spatial resolution, so the network may know that an object is present without knowing precisely where its boundaries are. The model therefore needs a receptive field large enough to capture long-range dependencies while still recovering precise per-pixel locations. Additionally, semantic segmentation models must deal with class imbalance and with small, meaningful objects, which can be difficult to detect and segment.

Another challenge in semantic segmentation is the need for large amounts of annotated training data, which can be time-consuming and expensive to obtain. The model also needs to be able to generalize well to unseen images and maintain a good balance between precision and recall.

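To overcome these difficulties, many segmentation models upsample the coarse feature maps back to the input resolution (for example with transposed convolutional layers) and add skip connections from lower layers to recover fine detail. Other common techniques include multi-scale prediction, context aggregation, and class-balanced loss functions. Additionally, pre-training on large-scale classification tasks and using data augmentation can help improve performance.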
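To see how much spatial information a per-pixel prediction has to recover, the sketch below tracks the resolution of a hypothetical 224 × 224 input through five stride-2 layers with "same" padding, then restores the size with nearest-neighbor upsampling. Nearest-neighbor upsampling is used here only as the simplest stand-in; real models often use learned upsampling. All names and sizes are illustrative.

```python
import math


def downsampled_size(size, stride, n_layers):
    """Spatial size after n_layers strided layers with "same" padding."""
    for _ in range(n_layers):
        size = math.ceil(size / stride)
    return size


def upsample_nearest(feature_map, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


coarse = downsampled_size(224, stride=2, n_layers=5)
upsampling_factor = 224 // coarse
print(coarse, upsampling_factor)   # coarse grid size and factor to undo it

upsampled = upsample_nearest([[1, 2], [3, 4]], factor=2)
print(upsampled)                   # 2x2 map stretched to 4x4
```

Each value of the coarse map ends up covering a large block of output pixels, which is why fine boundaries are lost unless detail from earlier layers is brought back in.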

**9. Build your own CNN from scratch and try to achieve the highest possible accuracy on MNIST.**

```python
import tensorflow as tf
from tensorflow import keras

# Load the MNIST dataset
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the pixel values to the range [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channel dimension: (samples, 28, 28) -> (samples, 28, 28, 1)
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

# Build the model: one conv + pooling block followed by a dense classifier
model = keras.Sequential()
model.add(keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10)

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print('Test accuracy:', test_accuracy)
```


```
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.9871000051498413
```
