 What are the advantages of a CNN over a fully connected DNN for image classi‐
fication?

1. Spatial Hierarchy Preservation
2. Parameter Efficiency
3. Translation Invariance
4. Better Feature Extraction
5. Scalability to High-Resolution Images
6. Improved Generalization and Accuracy
7. Less Overfitting
8. Widely Supported and Proven

CNNs are designed specifically for image data. They are more efficient, accurate, and scalable than fully connected DNNs for image classification. CNNs' ability to capture spatial features makes them the preferred choice in most computer vision applications.


2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels,
a stride of 2, and "same" padding. The lowest layer outputs 100 feature maps, the
middle one outputs 200, and the top one outputs 400. The input images are RGB
images of 200 × 300 pixels.
What is the total number of parameters in the CNN? If we are using 32-bit floats,
at least how much RAM will this network require when making a prediction for a
single instance? What about when training on a mini-batch of 50 images?

 Given:

Input: RGB image of size 200×300×3
Three convolutional layers


Kernel size = 3×3
Stride = 2
Padding = "same"
Output feature maps:
Layer 1 → 100 filters
Layer 2 → 200 filters
Layer 3 → 400 filters


Data type: 32-bit floats (4 bytes per value)

Because stride = 2 and padding = "same", output size becomes:

output_size = ceil(input_size / stride)


Layer 1:
Input: 200×300×3
Output size:
Height: ceil(200 / 2) = 100
Width: ceil(300 / 2) = 150
Channels: 100


→ Output: 100×150×100

Layer 2:


Input: 100×150×100
Output size:
Height: ceil(100 / 2) = 50
Width: ceil(150 / 2) = 75
Channels: 200


→ Output: 50×75×200

Layer 3:


Input: 50×75×200
Output size:
Height: ceil(50 / 2) = 25
Width: ceil(75 / 2) = 38 (actually 37.5, so rounded up)
Channels: 400


→ Output: 25×38×400

Number of Parameters

Each convolutional filter has:

Kernel: 3×3,
Input channels = from previous layer,
Output channels = number of filters
Don’t forget biases (1 per filter).

Layer 1:


Filters: 100
Each filter: 3×3×3 = 27
Total: 100 × 27 + 100 = 2800 parameters


Layer 2:


Filters: 200
Each filter: 3×3×100 = 900
Total: 200 × 900 + 200 = 180,200 parameters

Layer 3:


Filters: 400
Each filter: 3×3×200 = 1800
Total: 400 × 1800 + 400 = 720,400 parameters

= 2800 + 180,200 + 720,400
= 903,400 parameters

Memory Requirement for Inference

➤ For a single instance, memory is required for:
a. Parameters:

Each parameter = 4 bytes (32-bit float)
Total = 903,400 × 4 = 3.61 MB
b. Activations (intermediate outputs)

Layer outputs:

Layer 1: 100×150×100 = 1,500,000 → 6 MB
Layer 2: 50×75×200 = 750,000 → 3 MB
Layer 3: 25×38×400 = 380,000 → 1.52 MB
→ Total activations: ~10.52 MB

= Parameters + Activations
= 3.61 MB + 10.52 MB
= ~14.13 

= Activations (FWD + BWD) + Param grads + Parameters
= ~1.05 GB + 3.61 MB + 3.61 MB
≈ ~1.06 GB

| Requirement                    | Value          |
| ------------------------------ | -------------- |
| Total parameters               | **903,400**    |
| RAM for 1 prediction           | **\~14.13 MB** |
| RAM for batch of 50 (training) | **\~1.06 GB**  |



4.  If your GPU runs out of memory while training a CNN, what are five things you
could try to solve the problem?

| # | Solution                 | Impact                          |
| - | ------------------------ | ------------------------------- |
| 1 | Reduce batch size        | 💥 Huge memory savings          |
| 2 | Gradient checkpointing   | 🔁 Saves memory, slower compute |
| 3 | Mixed precision training | 🧠 Memory + speed gain          |
| 4 | Simplify model           | ⚒ Structural fix                |
| 5 | Gradient accumulation    | 🧮 Simulate large batches       |


5.  Why would you want to add a max pooling layer rather than a convolutional
layer with the same stride?

 1. Purpose Difference: Downsampling vs. Feature Extraction


Max Pooling Layer:


Designed specifically for downsampling.
It reduces spatial dimensions while preserving important features (like edges or textures).
Applies a simple max operation — no learnable parameters.


Convolutional Layer with stride > 1:
Performs feature extraction and downsampling simultaneously.
Has trainable filters, increasing the model complexity and risk of overfitting.

2. No Extra Parameters in Max Pooling


Pooling layers don’t have weights → no additional memory or training time.
Convolution layers have many parameters, especially with many filters

3. Better Control Over Translation Invariance


Max pooling introduces local translation invariance.
If a feature shifts slightly, pooling still captures it.
Strided convolution may miss small shifts unless filters are trained carefully.

4. Regularization Effect


Pooling reduces overfitting by reducing spatial size and making the network focus on the most prominent features.
Strided convolutions may extract more details but can also memorize noise if not regularized properly.


5. Computational Efficiency


Max pooling is a simple operation (just max).
Convolution with stride is more computationally expensive (involves multiplications and additions with weights).


6. When would you want to add a local response normalization layer?


What is LRN?


Local Response Normalization (LRN) is a type of normalization layer that encourages competition between adjacent neurons (usually across channels).
It was popularized by AlexNet (2012).

When to Use It
You might want to add an LRN layer in the following cases:

1. To Enhance Generalization and Reduce Overfitting (in early CNNs)
LRN acts as a regularizer, similar to dropout or batch normalization.
It helps highlight high-activation neurons while suppressing weaker ones.
Useful in older networks without batch normalization.


2. To Promote Competition Between Feature Maps
LRN applies normalization across nearby channels (depth-wise).
Neurons that strongly activate "suppress" the activation of their neighbors.
Encourages distinct, strong features to stand out.

3. When Reproducing or Extending Legacy Models (e.g., AlexNet)
If you're re-implementing or modifying older architectures, LRN is often used after ReLU activation.
Helps maintain consistency with earlier benchmarks.

4. On Shallow Networks or Small Datasets
When using shallow CNNs or training on limited data, LRN can add regularization and boost robustness.
Can slightly improve generalization when batch normalization is not used.

When Not to Use It:


Modern networks (like ResNet, VGG, MobileNet) don’t use LRN.
Batch Normalization, Layer Norm, or Group Norm have largely replaced LRN due to:
Better performance
More stable gradients
Faster convergence



6. Can you name the main innovations in AlexNet, compared to LeNet-5? What
about the main innovations in GoogLeNet, ResNet, SENet, and Xception?

1. LeNet-5 (1998) — The Foundation

One of the first CNNs, designed by Yann LeCun for digit recognition (MNIST).
Architecture:
2 convolutional layers + subsampling (pooling) + fully connected layers.
Limitations:
Shallow network.
Couldn’t handle large images or datasets like ImageNet.
Trained on CPU.


 2. AlexNet (2012) — The Breakthrough

Designed by Krizhevsky, Sutskever, Hinton — winner of ImageNet 2012.
Innovations Compared to LeNet-5:
Much deeper architecture (8 layers vs. 5 in LeNet).
Used ReLU activation (faster convergence than sigmoid/tanh).
Introduced Dropout to prevent overfitting.
Used GPU acceleration for training.
Employed Data Augmentation (cropping, flipping).
Used Local Response Normalization (LRN).
Used overlapping max pooling.


 3. GoogLeNet (Inception-v1, 2014) — The Depth Revolution

Developed by Szegedy et al., winner of ImageNet 2014.
Key Innovations:
Introduced the Inception module:
Combines 1×1, 3×3, and 5×5 convolutions in parallel.
Allows the network to learn multi-scale features.
1×1 convolutions used for:
Dimensionality reduction
Reduced computation cost.
Removed fully connected layers → replaced with Global Average Pooling.
Much deeper (~22 layers) but still computationally efficient.


4. ResNet (2015) — The Deep Learning Enabler

Introduced by Kaiming He et al., winner of ImageNet 2015.
Key Innovation:
Introduced Residual Connections (Skip Connections):
Solves the vanishing gradient problem.
Enables training of very deep networks (up to 1000+ layers).
Residual block:
Output = F(x) + x 
Allows the network to learn identity mappings easily.


5. SENet (Squeeze-and-Excitation Network, 2017) — Channel-Wise Attention

Winner of ImageNet 2017.
Key Innovations:
Introduced Squeeze-and-Excitation (SE) blocks:
Learn channel-wise attention weights.
"Squeeze": Global average pooling to summarize feature maps.
"Excite": Fully connected layers to learn importance of each channel.
Improves accuracy by recalibrating feature maps channel-wise.
Can be added to any CNN (ResNet, Inception, etc.).


6. Xception (2017) — Extreme Inception + Depthwise Separable Convolutions

Proposed by François Chollet (creator of Keras).
Key Innovations:
Replaces Inception modules with Depthwise Separable Convolutions:
Depthwise convolution → applies one filter per input channel.
Pointwise (1×1) convolution → combines outputs.
Significantly reduces computation while maintaining performance.
Fully convolutional, no Inception blocks.
Inspired by Inception but with greater modularity and efficiency.

7. What is a fully convolutional network? How can you convert a dense layer into a
convolutional layer?

A Fully Convolutional Network (FCN) is a type of neural network that consists only of convolutional layers (and possibly pooling and activation layers) — no fully connected (dense) layers.

Key Characteristics:


Accepts inputs of any size (not fixed like in traditional CNNs).
Outputs spatial predictions, making it ideal for tasks like:
Semantic segmentation
Heatmap generation
Object localization
Example Use Case:


In semantic segmentation, an FCN takes an image and outputs a pixel-wise classification map, labeling each pixel with a class.

How to Convert a Dense Layer into a Convolutional Layer

Converting a dense (fully connected) layer into a convolutional layer allows a CNN to become fully convolutional — enabling flexible input sizes and spatial output.

Conceptual Mapping:
Dense Layer	Equivalent Conv Layer
Input: Flattened vector	Input: 2D feature map
Dense: units = N	Conv2D: filters = N, kernel = H×W
Output: 1D (N values)	Output: 1×1×N (if kernel covers full input)
Key Idea:
A dense layer is a 1×1 convolution applied over a flattened feature map.

Conversion Steps:
Let’s say you have a dense layer that operates on a flattened feature map:

Original input: feature map of shape H×W×C
Flattened: vector of size H×W×C = N
Dense layer: e.g., 512 units → matrix multiplication: W × N
To convert:

Replace the dense layer with a convolutional layer:
Use Conv2D with:
Number of filters = number of units in Dense layer
Kernel size = H×W (if you want full coverage)
Or kernel size = 1×1 if spatial position should be preserved
Adjust strides and padding appropriately.
Example:
Let’s say:

You have a dense layer with 512 units.
Input before flattening was 7×7×256.
Replace:

Dense(512) → Conv2D(512 filters, kernel size = 7×7)
This will produce the same result, but allows the network to work with larger inputs, maintaining spatial structure.

8. What is the main technical difficulty of semantic segmentation?

1. Maintaining Spatial Accuracy
Semantic segmentation requires predicting a class for every pixel.
As CNNs go deeper, spatial resolution is reduced due to pooling/striding (e.g., from 256×256 to 8×8).
Recovering precise object boundaries from low-resolution features is difficult.
🔧 Solution attempts: Skip connections (FCN), dilated convolutions, encoder–decoder architectures (e.g., U-Net, DeepLab).


2. Multi-Scale Object Recognition
Objects in images appear at various sizes and shapes.
A single receptive field may not capture both small and large objects accurately.
🔧 Solution attempts: Pyramid pooling (PSPNet), Atrous Spatial Pyramid Pooling (ASPP in DeepLab), multi-scale feature fusion.


3. Class Imbalance
Background pixels often dominate the image (e.g., sky, road), while some classes (like a person or traffic light) may be very small.
This can lead to bias in the loss function, where the model ignores rare classes.
🔧 Solution attempts: Weighted loss functions, focal loss, over-sampling rare classes.


4. Ambiguous Boundaries
Adjacent classes often have unclear or fuzzy borders, like “cat” vs. “sofa”.
Slight annotation noise or motion blur makes classification harder at edges.
🔧 Solution attempts: Conditional Random Fields (CRF) post-processing, edge-aware networks.
5. High Computational Cost
Pixel-wise predictions require large memory and compute, especially for high-resolution images.


Training and inference can be slow.
🔧 Solution attempts: Model compression, efficient backbones (e.g., MobileNet, ENet).

9. Build your own CNN from scratch and try to achieve the highest possible accu‐
racy on MNIST


In [1]:
import tensorflow as tf
from tensorflow.keras import layers,models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical


In [2]:
(x_train,y_train),(x_test,y_test) = mnist.load_data()
x_train = x_train.astype('float32')/255.0
x_test = x_test.astype('float32')/255.0
x_train = x_train[...,tf.newaxis]
x_test = x_test[...,tf.newaxis]

y_train = to_categorical(y_train,10)
y_test = to_categorical(y_test,10)

In [3]:
model = models.Sequential([
    layers.Conv2D(32,(3,3),activation='relu',input_shape=(28,28,1)),
    layers.BatchNormalization(),
    layers.Conv2D(64,(3,3),activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(2,2)),
    layers.Dropout(0.25),

    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.25),

    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [4]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [5]:
history = model.fit(
    x_train,y_train,
    epochs=15,
    batch_size=128,
    validation_split = 0.1,
    verbose=2
)

Epoch 1/15
422/422 - 42s - 98ms/step - accuracy: 0.9434 - loss: 0.1889 - val_accuracy: 0.4052 - val_loss: 2.3912
Epoch 2/15
422/422 - 41s - 96ms/step - accuracy: 0.9802 - loss: 0.0663 - val_accuracy: 0.9880 - val_loss: 0.0400
Epoch 3/15
422/422 - 41s - 98ms/step - accuracy: 0.9838 - loss: 0.0529 - val_accuracy: 0.9888 - val_loss: 0.0402
Epoch 4/15
422/422 - 42s - 99ms/step - accuracy: 0.9861 - loss: 0.0439 - val_accuracy: 0.9907 - val_loss: 0.0306
Epoch 5/15
422/422 - 42s - 101ms/step - accuracy: 0.9891 - loss: 0.0358 - val_accuracy: 0.9915 - val_loss: 0.0277
Epoch 6/15
422/422 - 46s - 110ms/step - accuracy: 0.9900 - loss: 0.0323 - val_accuracy: 0.9925 - val_loss: 0.0246
Epoch 7/15
422/422 - 48s - 113ms/step - accuracy: 0.9907 - loss: 0.0292 - val_accuracy: 0.9930 - val_loss: 0.0255
Epoch 8/15
422/422 - 50s - 119ms/step - accuracy: 0.9915 - loss: 0.0270 - val_accuracy: 0.9933 - val_loss: 0.0222
Epoch 9/15
422/422 - 47s - 111ms/step - accuracy: 0.9926 - loss: 0.0243 - val_accuracy: 0.99

In [6]:
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")


Test accuracy: 0.9935


10. Use transfer learning for large image classification, going through these steps:
a. Create a training set containing at least 100 images per class. For example, you
could classify your own pictures based on the location (beach, mountain, city,
etc.), or alternatively you can use an existing dataset (e.g., from TensorFlow
Datasets).
b. Split it into a training set, a validation set, and a test set.
c. Build the input pipeline, including the appropriate preprocessing operations,
and optionally add data augmentation.
d. Fine-tune a pretrained model on this datase

In [2]:
import tensorflow_datasets as tfds
(raw_train,raw_val,raw_test),metadata = tfds.load(
    'tf_flowers',
    split=['train[:80%]','train[80%:90%]','train[90%:]'],
    with_info=True,
    as_supervised=True,
)

In [3]:
import tensorflow as tf 
IMG_SIZE  = 224
BATCH_SIZE = 32
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1./255),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1)
])

In [4]:
def preprocess_image(image,label):
    image = tf.image.resize(image,(IMG_SIZE,IMG_SIZE))
    image = tf.cast(image,tf.float32)/255.0
    return image,label
train_ds = raw_train.map(preprocess_image).cache().shuffle(1000).batch(BATCH_SIZE).prefetch(1)
val_ds = raw_val.map(preprocess_image).batch(BATCH_SIZE).prefetch(1)
test_ds = raw_test.map(preprocess_image).batch(BATCH_SIZE).prefetch(1)

In [6]:
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE,IMG_SIZE,3),
    include_top = False,
    weights='imagenet'
)
base_model.trainable = False

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224_no_top.h5
[1m9406464/9406464[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step


In [7]:
from tensorflow.keras import layers,models
model = models.Sequential([
    data_augmentation,
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128,activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(5,activation='softmax')
])

In [8]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
history = model.fit(train_ds,validation_data=val_ds,epochs=5)

Epoch 1/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m72s[0m 748ms/step - accuracy: 0.2162 - loss: 1.7261 - val_accuracy: 0.2425 - val_loss: 1.5951
Epoch 2/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m71s[0m 771ms/step - accuracy: 0.2423 - loss: 1.6078 - val_accuracy: 0.2425 - val_loss: 1.6041
Epoch 3/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m67s[0m 732ms/step - accuracy: 0.2457 - loss: 1.6053 - val_accuracy: 0.2425 - val_loss: 1.6008
Epoch 4/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m74s[0m 803ms/step - accuracy: 0.2458 - loss: 1.6036 - val_accuracy: 0.2425 - val_loss: 1.5984
Epoch 5/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m63s[0m 680ms/step - accuracy: 0.2547 - loss: 1.5999 - val_accuracy: 0.2425 - val_loss: 1.5968


In [9]:
base_model.trainable=True
fine_tune_at = 100
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable =False
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
history_fine = model.fit(train_ds,validation_data=val_ds,epochs=5)

Epoch 1/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m106s[0m 1s/step - accuracy: 0.2862 - loss: 1.5880 - val_accuracy: 0.2425 - val_loss: 1.5993
Epoch 2/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m98s[0m 1s/step - accuracy: 0.3375 - loss: 1.5313 - val_accuracy: 0.2425 - val_loss: 1.6000
Epoch 3/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m425s[0m 5s/step - accuracy: 0.3806 - loss: 1.4804 - val_accuracy: 0.2425 - val_loss: 1.6020
Epoch 4/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m92s[0m 997ms/step - accuracy: 0.3673 - loss: 1.4861 - val_accuracy: 0.2425 - val_loss: 1.5997
Epoch 5/5
[1m92/92[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m94s[0m 1s/step - accuracy: 0.4017 - loss: 1.4519 - val_accuracy: 0.2425 - val_loss: 1.5929


In [10]:
test_loss,test_accuracy = model.evaluate(test_ds)
print(f"Test accuracy: {test_accuracy:.2f}")

[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 265ms/step - accuracy: 0.2069 - loss: 1.6233
Test accuracy: 0.19
