### CNN Coding Basics

1. Imports

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt


<img src="1.png" alt="Alt text" width="700" height="400">

2. Load MNIST dataset

In [2]:
transform = transforms.ToTensor()

trainset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=64,
    shuffle=True
)


#### Explanation
```
transform = transforms.ToTensor()
What ToTensor() does in detail:

Before transformation (PIL Image):
Format: PIL Image object [PIL means Python Imaging Library]
Pixel values: 0-255 (integers)
Shape: (28, 28) # Height, Width
Data type: uint8

After transformation (PyTorch Tensor):
Format: torch.Tensor
Pixel values: 0.0-1.0 (floats)
Shape: (1, 28, 28) # Channels, Height, Width  [2D image with 1 channel]
Data type: float32

Why this is important:
Neural networks compute with floating-point tensors; scaling pixels to 0.0-1.0 keeps inputs in a small, consistent range
Adding channel dimension tells network it's grayscale
Consistent format for all images


trainset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)
Breaking down each parameter:
Parameter	 Value	           What it does
root	    './data'	       Create folder 'data' in current directory to store dataset
train	     True	           Load training set (60,000 images) not test set
download	 True	           Download if not already present in ./data
transform	transform	       Apply ToTensor() to every image automatically


What's inside trainset:
trainset is a Dataset object containing:
- 60,000 images
- 60,000 labels

Access like a list:
image, label = trainset[0]  # First image and its label
image.shape                 # torch.Size([1, 28, 28]): 1 channel (grayscale), 28x28 pixels
label                       # 5 for this sample; labels are integers 0-9



trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=64,
    shuffle=True
)
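What DataLoader does:
It wraps trainset and yields shuffled mini-batches, so no manual slicing or stacking of images is needed.

With DataLoader (automatic):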

for images, labels in trainloader:
    # images and labels are already batched tensors!
    # images.shape = (64, 1, 28, 28)
    # labels.shape = (64,)

DataLoader parameters explained:
Parameter	      Value	      What it does
batch_size	       64	      Group 64 images together
shuffle	          True	      Randomize order every epoch (prevents learning order bias)

How DataLoader organizes data:

Before shuffling (example):
trainset: [img0, img1, img2, img3, img4, img5, ..., img59999]

After shuffling (example):
shuffled = [img423, img12876, img5, img38902, img0, ...]

Batching with batch_size=64:
Batch 0: images[0:64]  → shape (64, 1, 28, 28)
Batch 1: images[64:128] → shape (64, 1, 28, 28)
...
Batch 937: images[59968:60000] → shape (32, 1, 28, 28)
Number of batches: 60000/64 = 937.5 → 938 batches (937 full batches plus a last batch of 32)
```

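The batching arithmetic can be verified on a small synthetic dataset. This is a sketch under stated assumptions: the 200 blank images and cycling labels are made up so that nothing has to be downloaded, and only the batch sizes are of interest.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST (an assumption, to avoid the download):
# 200 blank 1x28x28 "images" with labels cycling through the digits 0-9.
images = torch.zeros(200, 1, 28, 28)
labels = torch.arange(200) % 10
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Collect the size of every batch the loader yields in one epoch.
batch_sizes = [batch_images.shape[0] for batch_images, _ in loader]
print(batch_sizes)    # three full batches, then the remainder
print(len(loader))    # number of batches per epoch

# The same arithmetic for the real training set of 60,000 images.
num_batches = math.ceil(60000 / 64)
last_batch = 60000 - (num_batches - 1) * 64
print(num_batches, last_batch)
```

Passing `drop_last=True` to `DataLoader` would discard the smaller final batch instead of yielding it.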
3. Define simple CNN

In [3]:
class SimpleCNN(nn.Module):                              # Define a SimpleCNN class that inherits from nn.Module
    def __init__(self):                                  # Constructor method to initialize the layers of the SimpleCNN
        super(SimpleCNN, self).__init__()                # Call the constructor of the parent class (nn.Module) to properly initialize the SimpleCNN class
        
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3)     # Define a convolutional layer with 16 output channels and a kernel size of 3 ; it will learn 16 features from the input image 
        self.pool = nn.MaxPool2d(2, 2)                   # Define a max pooling layer with a kernel size of 2, reducing the image size by half
        self.fc1 = nn.Linear(16 * 13 * 13, 10)           # Define a linear layer with 16 * 13 * 13 inputs and 10 outputs which means it will learn to classify the input into 10 classes (digits 0-9)

    def forward(self, x):                                # Define the forward pass of the SimpleCNN, which takes an input tensor x and passes it through the layers defined in self 
        x = self.pool(torch.relu(self.conv1(x)))         # Apply conv1, then ReLU, then 2x2 max pooling: (N, 1, 28, 28) -> (N, 16, 13, 13)
        x = x.view(x.size(0), -1)                        # Flatten each sample's 16x13x13 feature maps into a vector of 2704 values, keeping the batch dimension
        x = self.fc1(x)                                  # Map the 2704 features to 10 logits, one per digit class
        return x


#### Explanation
```     
self.conv1 = nn.Conv2d(1, 16, kernel_size=3)
nn.Conv2d parameters explained:
Parameter	       Value       	Meaning
in_channels	        1	        Input has 1 channel (grayscale image)
out_channels       	16         	Output will have 16 channels (16 filters)
kernel_size        	3	        Filter size is 3×3 pixels

What happens inside Conv2d:
1. Creates 16 filters (kernels):
Filter 1: 3×3 matrix of learnable weights
Filter 2: 3×3 matrix of learnable weights
...
Filter 16: 3×3 matrix of learnable weights
Each filter shape: (1, 3, 3), i.e. in_channels × 3 × 3; conv1.weight has shape (16, 1, 3, 3)

2. Creates 16 biases (one per filter):
bias_1, bias_2, ..., bias_16

3. Total parameters in conv1:
Weights: 16 filters × 3×3 = 16 × 9 = 144 parameters
Biases: 16 parameters
Total: 160 parameters

What convolution does mathematically:
For one filter at one position:


Input patch (3×3):     Filter (3×3):        Output:
[a b c]                [w1 w2 w3]           a×w1 + b×w2 + c×w3
[d e f]         ×      [w4 w5 w6]     =     + d×w4 + e×w5 + f×w6
[g h i]                [w7 w8 w9]            + g×w7 + h×w8 + i×w9
                                           + bias

Output size calculation:

Input size: 28×28
Filter size: 3×3
Padding: 0 (default)
Stride: 1 (default)

Output size = (input - kernel + 2 × padding) / stride + 1 = (28 - 3 + 0) / 1 + 1 = 26
So after conv1 the output is (16, 26, 26): 16 feature maps, each 26×26


self.pool = nn.MaxPool2d(2, 2)  [reduces size]
nn.MaxPool2d parameters:
Parameter	    Value	      Meaning
kernel_size	      2           Pooling window is 2×2
stride	          2	          Move window by 2 pixels each step

What MaxPool2d does:
For each 2×2 region, take the maximum value:

Input (4×4):           After MaxPool (2×2):
┌───┬───┬───┬───┐     ┌───┬───┐
│ 1 │ 3 │ 2 │ 4 │     │ 7 │ 8 │
├───┼───┼───┼───┤  →  ├───┼───┤
│ 5 │ 7 │ 6 │ 8 │     │15 │16 │
├───┼───┼───┼───┤     └───┴───┘
│ 9 │11 │10 │12 │
├───┼───┼───┼───┤
│13 │15 │14 │16 │
└───┴───┴───┴───┘

Each 2×2 block becomes 1 value (the maximum)

Why MaxPool?
Reduces size: 26×26 → 13×13 (half the size)
Keeps important features: Maximum = strongest activation
Adds invariance: Small shifts in the input change the maximum only slightly
No parameters to learn: Just an operation
Output after pool: (16, 13, 13) [channel count stays 16; height and width are halved]



self.fc1 = nn.Linear(16 * 13 * 13, 10)  [input vector size, output]
Why 16 * 13 * 13?
After conv+pool, we have 16 feature maps
Each feature map is 13×13 in size
Total number of values = 16 × 13 × 13 = 2704 input features (activations, not learnable parameters)


This calculation is CRITICAL (fc1's in_features must match the flattened size, or the forward pass raises a shape error):
Input image: 28×28
After conv1 (no padding): 26×26
[Padding adds extra pixels (usually zeros) around the border of the input before the convolution is applied.]
After pool (2×2 stride 2): 13×13
After 16 filters: 16 × 13 × 13 = 2704

nn.Linear parameters:
Parameter	     Value	      Meaning
in_features 	 2704	      Input vector size
out_features	 10	          Output size (10 classes for digits 0-9)

What Linear layer does:
output = input × weight.T + bias

input shape: (2704,) per sample, or (batch_size, 2704) for a batch
weight shape: (10, 2704)   [10 output neurons, each with 2704 input weights]
bias shape: (10,)
output shape: (10,)        # Logits for each digit

Parameters in fc1:
Weights: 10 × 2704 = 27,040
Biases: 10
Total: 27,050 parameters



    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        return x
Step-by-step forward pass:
Input x shape: (batch_size, 1, 28, 28)     # Example: (64, 1, 28, 28)

Line 1: self.conv1(x)
Input: (64, 1, 28, 28)
Conv2d(1→16, kernel=3)      [each image yields 16 feature maps]
Output: (64, 16, 26, 26)

Line 1: torch.relu(...)
Input: (64, 16, 26, 26)
ReLU: max(0, value) for every element
Output: (64, 16, 26, 26)  # Same shape, negative values become 0

Line 1: self.pool(...)
Input: (64, 16, 26, 26)
MaxPool2d(2,2): reduces each spatial dimension by half
Output: (64, 16, 13, 13)

Line 2: x.view(x.size(0), -1)

x.size(0) = batch_size = 64
-1 means "infer this dimension"

Before view: (64, 16, 13, 13)
After view:  (64, 16×13×13) = (64, 2704)

This FLATTENS the 3D feature maps into 1D vectors 


Line 3: self.fc1(x)
Input: (64, 2704)
Linear(2704 → 10)
Output: (64, 10)    # 64 samples, each with 10 logits [raw, unnormalized scores, one per digit]
Return: (64, 10) tensor of logits

```
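The step-by-step shapes can be confirmed by running a batch through the same three layers one at a time. The sketch below is a minimal check that assumes a random tensor in place of real MNIST images; it also reproduces the 4×4 max-pooling illustration.

```python
import torch
import torch.nn as nn

# Same layer definitions as SimpleCNN, created standalone so each step can be inspected.
conv1 = nn.Conv2d(1, 16, kernel_size=3)
pool = nn.MaxPool2d(2, 2)
fc1 = nn.Linear(16 * 13 * 13, 10)

x = torch.randn(64, 1, 28, 28)   # random stand-in for one batch of images

shapes = {"input": tuple(x.shape)}
x = conv1(x)
shapes["conv1"] = tuple(x.shape)
x = torch.relu(x)
shapes["relu"] = tuple(x.shape)
x = pool(x)
shapes["pool"] = tuple(x.shape)
x = x.view(x.size(0), -1)
shapes["flatten"] = tuple(x.shape)
x = fc1(x)
shapes["fc1"] = tuple(x.shape)

for name, shape in shapes.items():
    print(f"{name:8s} {shape}")

# The 4x4 pooling illustration: each 2x2 block collapses to its maximum.
grid = torch.tensor([[1.0, 3.0, 2.0, 4.0],
                     [5.0, 7.0, 6.0, 8.0],
                     [9.0, 11.0, 10.0, 12.0],
                     [13.0, 15.0, 14.0, 16.0]])
pooled = pool(grid.reshape(1, 1, 4, 4)).reshape(2, 2)
print(pooled)
```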

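The size and parameter arithmetic for the model can be reproduced in a few lines. In this sketch `conv_output_size` is a hypothetical helper (not a PyTorch function) implementing the usual output-size formula without dilation, and the parameter counts are read directly from freshly created layers.

```python
import torch.nn as nn

def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

after_conv = conv_output_size(28, kernel=3)                      # conv1: 28 -> 26
after_pool = conv_output_size(after_conv, kernel=2, stride=2)    # pool: 26 -> 13
flat_features = 16 * after_pool * after_pool                     # 16 feature maps

layers = {
    "conv1": nn.Conv2d(1, 16, kernel_size=3),
    "fc1": nn.Linear(flat_features, 10),
}

# Count weights and biases per layer; numel() is the number of elements in a tensor.
counts = {
    name: sum(p.numel() for p in layer.parameters())
    for name, layer in layers.items()
}
total = sum(counts.values())

print(after_conv, after_pool, flat_features)
print(tuple(layers["conv1"].weight.shape))   # (out_channels, in_channels, 3, 3)
print(counts, total)
```

Changing the kernel size or adding padding changes `flat_features`, which is why the `in_features` of `fc1` has to be recomputed whenever the convolutional part is modified.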
4. Initialize model and optimizer

In [None]:
model = SimpleCNN()                                    # Create an instance of the SimpleCNN class
criterion = nn.CrossEntropyLoss()                      # Loss for multi-class classification; takes raw logits and integer labels, applies log-softmax internally, and averages over the batch
optimizer = optim.Adam(model.parameters(), lr=0.001)   # Adam uses the gradients of the loss to update the weights; lr is the step size


#### Explanation
```
model = SimpleCNN()
What happens here:
Calls __init__() to create all layers
Randomly initializes all weights and biases
Registers all parameters so model.parameters() can return them

Parameter count:
conv1 weights: 144  (16 filters, 3×3 matrix of learnable weights)
conv1 biases: 16
fc1 weights: 27,040 (16 * 13 * 13 * 10)
fc1 biases: 10
Total: 27,210 parameters

criterion = nn.CrossEntropyLoss()
What CrossEntropyLoss does:
Step 1: Applies Softmax to convert logits to probabilities (internally log-softmax, for numerical stability):
logits = [2.0, 1.0, 0.1, -1.0, ...]  # 10 values
softmax(logits)[i] = exp(logits[i]) / sum over j of exp(logits[j])
Example:
exp(2.0) = 7.39
exp(1.0) = 2.72
exp(0.1) = 1.11
exp(-1.0) = 0.37
...
sum = 15.6  # illustrative total over all 10 exponentials
probabilities = [0.47, 0.17, 0.07, 0.02, ...]
Step 2: Computes negative log likelihood:
If true label = 0:
loss = -log(probability[0]) = -log(0.47) = 0.76
If true label = 1:
loss = -log(probability[1]) = -log(0.17) = 1.77  # Higher: the less probability the true class gets, the larger the loss

If model is very confident and correct:
probability[0] = 0.95 → loss = -log(0.95) = 0.05

Step 3: Averages over the batch (the default reduction is 'mean'):
mean_loss = sum(loss for each sample) / batch_size   # 64 here, fewer in the last batch

optimizer = optim.Adam(model.parameters(), lr=0.001)


What Adam optimizer does:
Adam = Adaptive Moment Estimation

Parameters it manages:
model.parameters(): All 27,210 learnable weights and biases

lr=0.001: Learning rate (step size)

What Adam maintains for each parameter:
moment_1 = 0  # First moment (running mean of gradients)
moment_2 = 0  # Second moment (running mean of squared gradients, i.e. uncentered variance)
step = 0      # Number of updates

Update rule (simplified):
At each step:
step += 1
# Update moments
moment_1 = 0.9 × moment_1 + 0.1 × gradient
moment_2 = 0.999 × moment_2 + 0.001 × gradient²

# Bias correction
m_corrected = moment_1 / (1 - 0.9^step)
v_corrected = moment_2 / (1 - 0.999^step)

# Update parameter
parameter = parameter - lr × m_corrected / (sqrt(v_corrected) + 1e-8)


Why Adam:

Adapts learning rate per parameter
Works well with default settings
Often converges faster than plain SGD on problems like this
```
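The three steps described for `CrossEntropyLoss` (softmax, negative log of the true-class probability, batch average) can be compared against the built-in loss. This is a minimal sketch assuming made-up logits and labels; only the agreement between the two values matters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # made-up logits: 4 samples, 10 classes
targets = torch.tensor([3, 0, 7, 7])    # made-up true labels

# Built-in loss: takes raw logits, not probabilities.
builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: softmax -> probability of the true class -> -log -> mean over batch.
probs = F.softmax(logits, dim=1)
true_class_probs = probs[torch.arange(4), targets]
manual = (-torch.log(true_class_probs)).mean()

print(builtin.item(), manual.item())
```

Because the loss applies softmax internally, the model's `forward` should return logits; adding a softmax layer before `CrossEntropyLoss` would apply it twice.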

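The simplified Adam update rule can be checked against `torch.optim.Adam` for a single step. The sketch below assumes a made-up three-element parameter and the loss `sum(w^2)`, and uses the default coefficients 0.9, 0.999, and 1e-8 that appear in the rule.

```python
import torch

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

# A single made-up parameter tensor with loss = sum(w^2), so the gradient is 2w.
w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
optimizer = torch.optim.Adam([w], lr=lr, betas=(beta1, beta2), eps=eps)

start = w.detach().clone()
loss = (w ** 2).sum()
loss.backward()
grad = w.grad.clone()
optimizer.step()                        # PyTorch's update, applied in place to w

# Manual first step following the update rule; both moments start at zero.
step = 1
m = (1 - beta1) * grad
v = (1 - beta2) * grad ** 2
m_hat = m / (1 - beta1 ** step)         # bias correction
v_hat = v / (1 - beta2 ** step)
expected = start - lr * m_hat / (v_hat.sqrt() + eps)

print(w.detach())
print(expected)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so every element moves by roughly `lr` regardless of how large its gradient is.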
5. Training loop (very small)

In [None]:
for epoch in range(10):
    running_loss = 0.0
    for images, labels in trainloader:
        optimizer.zero_grad()                                    # Clear gradients from the previous batch; PyTorch accumulates them otherwise
        
        outputs = model(images)                                  # Forward pass: Pass the input images through the model to get the predicted outputs (logits) for each image in the batch
        loss = criterion(outputs, labels)                        # Calculate the loss between the model's predictions (outputs) and the true labels (labels) using the defined loss function (criterion)
        loss.backward()                                          # Backward pass: Compute the gradients of the loss with respect to the model's parameters (weights) using backpropagation
        optimizer.step()                                         # Update the model's parameters using the computed gradients and the defined optimization algorithm (optimizer)
        
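        running_loss += loss.item()                              # Accumulate the batch's mean loss as a Python float
    
    print(f"Epoch {epoch+1}, Loss: {running_loss:.3f}")          # Reports the sum of batch losses over the epoch, not the average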


Epoch 1, Loss: 278.862
Epoch 2, Loss: 108.298
Epoch 3, Loss: 78.554
Epoch 4, Loss: 64.227
Epoch 5, Loss: 55.757
Epoch 6, Loss: 49.155
Epoch 7, Loss: 44.339
Epoch 8, Loss: 39.983
Epoch 9, Loss: 36.913
Epoch 10, Loss: 32.992
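
The printed values are sums of per-batch losses over one epoch, so dividing by the number of batches gives a figure that is easier to compare across batch sizes. The sketch below copies the totals from the output above and assumes 938 batches per epoch (60,000 images with batch_size=64).

```python
import math

num_batches = math.ceil(60000 / 64)     # batches per epoch

# Summed losses copied from the printed training output.
epoch_totals = [278.862, 108.298, 78.554, 64.227, 55.757,
                49.155, 44.339, 39.983, 36.913, 32.992]

per_batch = [total / num_batches for total in epoch_totals]
for epoch, avg in enumerate(per_batch, start=1):
    print(f"Epoch {epoch}, mean loss per batch: {avg:.4f}")
```

This is an average over batches rather than over samples; the two differ slightly because the last batch holds fewer images than the others.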
