# Question 1.1: The Manual Dimension Map

## Network Configuration
* **Input:** 64 x 64 x 3 (Tiny ImageNet)
* **Layers:** 3 Convolutional, 2 Max-Pooling, 1 Fully Connected
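* **Formula:** $H_{out} = \lfloor \frac{H_{in} + 2P - K}{S} \rfloor + 1$, where $P$ is the padding, $K$ the kernel size, and $S$ the stride. The same formula gives the output width.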
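The formula can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `conv_out_size` is illustrative and not part of the original notebook. Integer division implements the floor, which only matters when the stride does not divide the numerator evenly.

```python
def conv_out_size(h_in: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output height (or width) of a convolution or pooling layer."""
    return (h_in + 2 * padding - kernel) // stride + 1


# 3x3 conv with padding 1 and stride 1 preserves the spatial size
print(conv_out_size(64, kernel=3, stride=1, padding=1))  # 64

# 2x2 max-pool with stride 2 halves it
print(conv_out_size(64, kernel=2, stride=2))             # 32

# Floor in action: a 7-wide input pooled 2x2 with stride 2 drops the last column
print(conv_out_size(7, kernel=2, stride=2))              # 3
```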

---

## Layer-by-Layer Breakdown

### 1. Input Layer
* **Dimensions:** 64 x 64 x 3

### 2. Conv1 Layer
* **Parameters:** 3x3 kernel, Padding=1, Stride=1, 32 Filters
* **Calculation:** $(64 + 2 \cdot 1 - 3)/1 + 1 = 64$
* **Output Shape:** **64 x 64 x 32**

### 3. Pool1 Layer
* **Parameters:** 2x2 kernel, Stride=2
* **Calculation:** $(64 - 2)/2 + 1 = 32$
* **Output Shape:** **32 x 32 x 32**

### 4. Conv2 Layer
* **Parameters:** 3x3 kernel, Padding=1, Stride=1, 64 Filters
* **Calculation:** $(32 + 2 \cdot 1 - 3)/1 + 1 = 32$
* **Output Shape:** **32 x 32 x 64**

### 5. Pool2 Layer
* **Parameters:** 2x2 kernel, Stride=2
* **Calculation:** $(32 - 2)/2 + 1 = 16$
* **Output Shape:** **16 x 16 x 64**

### 6. Conv3 Layer
* **Parameters:** 3x3 kernel, Padding=1, Stride=1, 128 Filters
* **Calculation:** $(16 + 2 \cdot 1 - 3)/1 + 1 = 16$
* **Output Shape:** **16 x 16 x 128**

### 7. Flattening & FC Layer
* **Flatten:** $16 \times 16 \times 128 = 32,768$ features
* **FC Layer:** 32,768 input features $\to$ 10 outputs (one per class)
* **Final Output:** **10**

---

## Summary Table

| Layer | Input | Operation | Output |
| :--- | :--- | :--- | :--- |
| Input | - | - | 64 x 64 x 3 |
| Conv1 | 64 x 64 x 3 | 3x3, p=1 | 64 x 64 x 32 |
| Pool1 | 64 x 64 x 32 | 2x2, s=2 | 32 x 32 x 32 |
| Conv2 | 32 x 32 x 32 | 3x3, p=1 | 32 x 32 x 64 |
| Pool2 | 32 x 32 x 64 | 2x2, s=2 | 16 x 16 x 64 |
| Conv3 | 16 x 16 x 64 | 3x3, p=1 | 16 x 16 x 128 |
| FC | 32,768 | Linear | 10 |

In [5]:

import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
class SimpleCNN(nn.Module):

    # Custom CNN with an explicit forward pass (not nn.Sequential)
    # Architecture: Conv1 -> Pool1 -> Conv2 -> Pool2 -> Conv3 -> FC

    def __init__(self, num_classes=10):
        super().__init__()
        # padding=1 with a 3x3 kernel preserves the spatial size
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        # 2x2 max-pool with stride 2; it has no parameters, so one module serves both pooling steps
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # FC input size matches the flattened feature map after two poolings (16 x 16 x 128)
        self.fc = nn.Linear(16 * 16 * 128, num_classes)

    def forward(self, x):
        print(f"Input:  {x.shape}")       # (N, 3, 64, 64)

        x = F.relu(self.conv1(x))
        print(f"After Conv1:{x.shape}")   # (N, 32, 64, 64)

        x = self.pool(x)
        print(f"After Pool1:{x.shape}")   # (N, 32, 32, 32)

        x = F.relu(self.conv2(x))
        print(f"After Conv2:{x.shape}")   # (N, 64, 32, 32)

        x = self.pool(x)
        print(f"After Pool2:{x.shape}")   # (N, 64, 16, 16)

        x = F.relu(self.conv3(x))
        print(f"After Conv3:{x.shape}")   # (N, 128, 16, 16)

        # Flatten everything except the batch dimension
        x = x.view(x.size(0), -1)
        print(f"Flatten: {x.shape}")      # (N, 32768)

        x = self.fc(x)
        print(f"FC Output: {x.shape}")    # (N, 10)
        return x

# Count the trainable parameters (elements of tensors with requires_grad=True)
def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [None]:
model = SimpleCNN(num_classes=10)
print(f"Total trainable parameters: {count_parameters(model):,}")
print("Layer-by-layer shape verification")

dummy_input = torch.randn(1, 3, 64, 64)
output = model(dummy_input)

print(f"\nFinal output shape: {output.shape}")

Total trainable parameters: 420,938
Layer-by-layer shape verification
Input:  torch.Size([1, 3, 64, 64])
After Conv1:torch.Size([1, 32, 64, 64])
After Pool1:torch.Size([1, 32, 32, 32])
After Conv2:torch.Size([1, 64, 32, 32])
After Pool2:torch.Size([1, 64, 16, 16])
After Conv3:torch.Size([1, 128, 16, 16])
Flatten: torch.Size([1, 32768])
FC Output: torch.Size([1, 10])

Final output shape: torch.Size([1, 10])
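The 420,938 total can be reproduced by hand. A convolution has $K \cdot K \cdot C_{in} \cdot C_{out}$ weights plus $C_{out}$ biases, a linear layer has `n_in * n_out` weights plus `n_out` biases, and pooling layers have no parameters. The sketch below uses plain Python; the helper names are illustrative and do not come from the notebook.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    # k*k*c_in weights per filter, c_out filters, plus one bias per filter
    return k * k * c_in * c_out + c_out


def linear_params(n_in: int, n_out: int) -> int:
    # One weight per (input, output) pair, plus one bias per output
    return n_in * n_out + n_out


layer_params = {
    "conv1": conv_params(3, 32, 3),
    "conv2": conv_params(32, 64, 3),
    "conv3": conv_params(64, 128, 3),
    "fc": linear_params(16 * 16 * 128, 10),
}

for name, n in layer_params.items():
    print(f"{name:>5}: {n:>9,}")

total_params = sum(layer_params.values())
print(f"total: {total_params:>9,}")
```

The FC layer alone accounts for 327,690 of the 420,938 parameters, roughly 78% of the model.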


# Parameter Explosion Analysis

In [8]:
# WITH POOLING: the current architecture
fc_input_with_pool = 16 * 16 * 128  # 64 -> 32 -> 16 after two pooling layers
fc_params_with_pool = fc_input_with_pool * 10 + 10  # weights + biases

# WITHOUT POOLING: no spatial reduction
fc_input_no_pool = 64 * 64 * 128  # original 64 x 64 spatial size preserved
fc_params_no_pool = fc_input_no_pool * 10 + 10  # weights + biases

print("WITH POOLING:")
print(f"Feature map before FC: 16 x 16 x 128 = {fc_input_with_pool:,} features")
print(f"FC layer parameters: {fc_params_with_pool:,}")

print("WITHOUT POOLING:")
print(f"Feature map before FC: 64 x 64 x 128 = {fc_input_no_pool:,} features")
print(f"FC layer parameters: {fc_params_no_pool:,}")

print("PARAMETER EXPLOSION:")
print(f"Increase factor: {fc_params_no_pool / fc_params_with_pool:.1f}x")
print(f"Additional parameters: {fc_params_no_pool - fc_params_with_pool:,}")



WITH POOLING:
Feature map before FC: 16 x 16 x 128 = 32,768 features
FC layer parameters: 327,690
WITHOUT POOLING:
Feature map before FC: 64 x 64 x 128 = 524,288 features
FC layer parameters: 5,242,890
PARAMETER EXPLOSION:
Increase factor: 16.0x
Additional parameters: 4,915,200
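The 16x factor follows from the geometry: each 2x2, stride-2 pool halves both height and width, so it cuts the flattened vector to a quarter. The sketch below tabulates the FC parameter count for zero to three pooling layers. It assumes the last conv layer always outputs 128 channels; the three-pool case is hypothetical and not part of the network above.

```python
def fc_params_after_pools(num_pools: int, size: int = 64, channels: int = 128,
                          num_classes: int = 10) -> int:
    """FC parameter count when `num_pools` 2x2/stride-2 pools precede the flatten."""
    side = size // (2 ** num_pools)            # each pool halves height and width
    flattened = side * side * channels
    return flattened * num_classes + num_classes  # weights + biases


for n in range(4):
    print(f"{n} pooling layer(s): {fc_params_after_pools(n):,} FC parameters")
```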


# Why is this a problem?

High parameter counts, especially at the transition to the fully connected (FC) layer, create several bottlenecks that hurt both a model's accuracy and its efficiency.

---

### 1. Memory Constraints
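Every parameter occupies memory on the GPU or in RAM. For instance, the 5.2-million-parameter FC layer from the no-pooling case, stored as 32-bit floats (4 bytes each), takes about 21 MB (20 MiB) for the weights alone. Training needs more, because every parameter also has a gradient and, with optimizers such as Adam, additional per-parameter state.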
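The arithmetic is easy to reproduce. The sketch below reports decimal megabytes ($10^6$ bytes); the same weights measure 20.0 in MiB ($2^{20}$ bytes). The helper name is illustrative, and the training estimate assumes Adam with everything kept in float32, which means four tensors per parameter (weights, gradients, and two moment buffers). Activation memory is not included.

```python
BYTES_PER_FLOAT32 = 4


def param_memory_mb(num_params: int, copies: int = 1) -> float:
    """Memory in MB (10**6 bytes) for `copies` float32 tensors of `num_params` elements."""
    return num_params * BYTES_PER_FLOAT32 * copies / 1e6


fc_no_pool = 64 * 64 * 128 * 10 + 10     # 5,242,890 parameters
fc_with_pool = 16 * 16 * 128 * 10 + 10   # 327,690 parameters

print(f"No pooling, weights only:   {param_memory_mb(fc_no_pool):.1f} MB")
print(f"With pooling, weights only: {param_memory_mb(fc_with_pool):.1f} MB")

# Assumption: Adam in float32 keeps weights + gradients + two moment buffers
print(f"No pooling, Adam training:  {param_memory_mb(fc_no_pool, copies=4):.1f} MB")
```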

### 2. Computational Latency
More parameters mean more multiply-accumulate operations during every forward and backward pass. This makes training significantly slower and increases inference time, which is a major issue for real-time applications.
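To put numbers on this, the sketch below counts multiply-accumulate operations (MACs) for one forward pass of a single image, with and without the two pooling layers. It is an estimate under stated simplifications: bias additions, ReLU, and the pooling comparisons are ignored, and the function names are illustrative. Note that dropping pooling also inflates the conv layers, since Conv2 and Conv3 then run on 64 x 64 maps.

```python
def conv_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    # Each output element needs k*k*c_in multiply-accumulates
    return h_out * w_out * c_out * k * k * c_in


def network_macs(pooling: bool) -> dict:
    """Per-image forward-pass MACs for each layer of the 3-conv + FC network."""
    s1 = 64                       # Conv1 always sees the 64 x 64 input
    s2 = 32 if pooling else 64    # spatial size seen by Conv2
    s3 = 16 if pooling else 64    # spatial size seen by Conv3 and the flatten
    return {
        "conv1": conv_macs(s1, s1, 3, 32, 3),
        "conv2": conv_macs(s2, s2, 32, 64, 3),
        "conv3": conv_macs(s3, s3, 64, 128, 3),
        "fc": s3 * s3 * 128 * 10,  # one MAC per FC weight
    }


for pooling in (True, False):
    macs = network_macs(pooling)
    label = "with pooling" if pooling else "without pooling"
    print(f"{label}: {sum(macs.values()):,} MACs total")
    for name, n in macs.items():
        print(f"  {name:>5}: {n:,}")
```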

### 3. The Overfitting Trap
When a model has too much "capacity" (too many parameters) relative to the amount of training data and the complexity of the task, it can fit noise in the training set instead of learning patterns that generalize.
* **Memorization:** The model learns the specific pixels of the training set rather than the general features.
* **Poor generalization:** Training accuracy can approach 100% while accuracy on unseen data stays much lower.
* **Heavier regularization:** You end up relying on techniques such as Dropout or weight decay just to keep overfitting in check.

### 4. Data Requirements
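A classical rule of thumb asks for roughly 10 times as many training samples as parameters. Deep networks routinely break this rule and still generalize, but the underlying concern holds: with millions of parameters and only thousands of images, the model is under-determined. Many different weight settings fit the training data equally well, and nothing in the data singles out the ones that generalize.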
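A quick calculation shows how far this network sits from that heuristic. The dataset size below is a hypothetical figure (10 classes with 500 training images each); the source does not state the actual number of training images. The parameter counts come from the architecture: 896 + 18,496 + 73,856 for the three conv layers, plus the FC layer.

```python
RULE_OF_THUMB_FACTOR = 10

# Hypothetical dataset size: 10 classes x 500 training images per class
num_train_images = 10 * 500

conv_params_total = 896 + 18_496 + 73_856  # conv1 + conv2 + conv3
models = {
    "with pooling": conv_params_total + 327_690,
    "without pooling": conv_params_total + 5_242_890,
}

for name, n_params in models.items():
    needed = RULE_OF_THUMB_FACTOR * n_params
    print(f"{name}: {n_params:,} parameters, "
          f"{n_params / num_train_images:.1f} per training image, "
          f"rule of thumb asks for {needed:,} samples")
```

Under this assumption both variants are heavily over-parameterized, but removing pooling makes the gap more than ten times wider.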

---

## The Solution: Strategic Pooling

Pooling layers (max or average) address these issues by progressively reducing the spatial dimensions (height x width) of the feature maps.

By downsampling, we:
1.  **Reduce the Flattened Vector:** This keeps the input to the FC layer manageable.
2.  **Focus on Features:** Max pooling keeps the strongest activation in each window and discards its exact position, so the most important information survives while redundant spatial detail is dropped.
3.  **Stay Efficient:** The network can grow deeper without blowing the memory, compute, or data budget.
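The downsampling step itself is simple. Below is a minimal pure-Python sketch of 2x2 max pooling with stride 2 on a single-channel feature map. It is written for illustration only and assumes even height and width; the notebook itself uses `nn.MaxPool2d`.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4 x 4 map shrinks to 2 x 2: a quarter of the values remain, and each one is the strongest response in its window.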