# Comprehensive Python Tutorial: Solving MNIST with Keras and PyTorch

In this tutorial, we will explore how to solve the MNIST handwritten digit classification problem using different approaches. We will start with simple dense (fully connected) layers in Keras, then move to Convolutional Neural Networks (CNNs) in Keras. After that, we will transition to PyTorch, where we will implement dense layers and CNNs, and finally, we will enhance our PyTorch CNN model with data augmentation.

The narrative will follow a progression where each approach builds upon the previous one, highlighting the differences between Keras and PyTorch, and explaining why certain choices are made. By the end of this tutorial, you will have a deep understanding of how to implement and improve neural networks for image classification tasks.



## 1. MNIST using Keras Dense Layers

### Introduction to MNIST
The MNIST dataset consists of 70,000 28x28 grayscale images of handwritten digits (0-9): 60,000 for training and 10,000 for testing. The goal is to classify each image into one of the 10 classes.

### Why Start with Dense Layers?
Dense layers are the simplest type of neural network layers, where each neuron is connected to every neuron in the previous layer. Starting with dense layers allows us to understand the basics of neural networks before moving to more complex architectures like CNNs.

### Step-by-Step Implementation

#### 1.1 Importing Libraries


In [1]:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical


#### 1.2 Loading and Preprocessing the Data


In [3]:

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the images to the range [0, 1]
train_images = train_images.astype('float32') / 255
test_images = test_images.astype('float32') / 255

# Flatten the 28x28 images into 784-dimensional vectors
train_images = train_images.reshape((60000, 28 * 28))
test_images = test_images.reshape((10000, 28 * 28))

# One-hot encode the labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
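To make the preprocessing concrete, here is a minimal NumPy-only sketch on a small synthetic batch. Random pixels and made-up labels stand in for MNIST, so nothing is downloaded. It assumes that indexing an identity matrix, `np.eye(10)[labels]`, yields the same one-hot layout as `to_categorical`; the names `fake_images` and `fake_labels` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a batch of MNIST images: uint8 pixels in [0, 255]
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9, 1, 3])

# Scale to [0, 1] and flatten each 28x28 image into a 784-vector
x = fake_images.astype("float32") / 255
x = x.reshape((x.shape[0], 28 * 28))

# One-hot encode: row i has a single 1 at column fake_labels[i]
y = np.eye(10, dtype="float32")[fake_labels]

print(x.shape, y.shape)  # (5, 784) (5, 10)
```

Reshaping with `x.shape[0]` instead of a literal such as `60000` keeps the code valid if the number of samples changes.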




#### 1.3 Building the Model


In [8]:

model = models.Sequential()
model.add(layers.Input((28*28,)))
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(10, activation='softmax'))
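As a rough sanity check on the model size, the parameter count can be worked out by hand. The sketch below uses a hypothetical helper, `dense_params`, and assumes each dense layer has one weight per input-output pair plus one bias per output; the total should agree with what `model.summary()` reports.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

hidden = dense_params(28 * 28, 256)  # 784 * 256 + 256
output = dense_params(256, 10)       # 256 * 10 + 10
total = hidden + output              # Dropout adds no trainable parameters

print(hidden, output, total)  # 200960 2570 203530
```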




#### 1.4 Compiling the Model


In [9]:

model.compile(optimizer='adamw',
              loss='categorical_crossentropy',
              metrics=['accuracy'])




#### 1.5 Training the Model


In [10]:

model.fit(train_images, train_labels, validation_split=0.1, epochs=10, batch_size=128)



Epoch 1/10
422/422 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8228 - loss: 0.6237 - val_accuracy: 0.9623 - val_loss: 0.1390
Epoch 2/10
422/422 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.9482 - loss: 0.1827 - val_accuracy: 0.9718 - val_loss: 0.1005
Epoch 3/10
422/422 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9628 - loss: 0.1252 - val_accuracy: 0.9782 - val_loss: 0.0833
Epoch 4/10
422/422 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9709 - loss: 0.0976 - val_accuracy: 0.9772 - val_loss: 0.0749
Epoch 5/10
422/422 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.9748 - loss: 0.0812 - val_accuracy: 0.9800 - val_loss: 0.0700
Epoch 6/10
422/422 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9795 - loss: 0.0687 - val_accuracy: 0.9813 - val_loss: 0.0632
Epoch 7/10
422/422

<keras.src.callbacks.history.History at 0x22665bd6b00>
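The `422/422` in the progress bar is the number of batches per epoch. A quick sketch of the arithmetic, assuming Keras holds out 10% of the 60,000 training samples for validation (`validation_split=0.1`) and counts a final partial batch as one step:

```python
import math

n_train_total = 60_000
n_val = n_train_total // 10        # 10% held out by validation_split=0.1
n_fit = n_train_total - n_val      # samples actually used for weight updates

steps_batch_128 = math.ceil(n_fit / 128)  # dense model, batch_size=128
steps_batch_64 = math.ceil(n_fit / 64)    # same data with batch_size=64

print(n_fit, steps_batch_128, steps_batch_64)  # 54000 422 844
```

Halving the batch size doubles the number of steps per epoch, which is why the CNN in section 2 (trained with `batch_size=64`) reports 844 steps.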


#### 1.6 Evaluating the Model


In [11]:

test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc:.4f}')


[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9761 - loss: 0.0766
Test accuracy: 0.9791




### Discussion
- **Pros**: Simple to implement, good for understanding the basics.
- **Cons**: Dense layers ignore the spatial structure of images, which typically leads to lower accuracy than CNNs (97.91% test accuracy here, versus 99.15% for the CNN in section 2).



## 2. MNIST using Keras CNN Layers

### Why Move to CNNs?
Convolutional Neural Networks (CNNs) are designed to work with image data. They use convolutional layers to extract spatial features, making them more effective for image classification tasks.

### Step-by-Step Implementation

#### 2.1 Importing Libraries


In [12]:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical




#### 2.2 Loading and Preprocessing the Data


In [13]:

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the images to the range [0, 1]
train_images = train_images.astype('float32') / 255
test_images = test_images.astype('float32') / 255

# Reshape the images to include a single channel (grayscale)
train_images = train_images.reshape((60000, 28, 28, 1))
test_images = test_images.reshape((10000, 28, 28, 1))

# One-hot encode the labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)


#### 2.3 Building the Model


In [14]:

model = models.Sequential()
model.add(layers.Input((28,28,1)))
model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
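It helps to trace how the spatial size shrinks through this stack. The sketch below is plain Python arithmetic, not Keras code; it assumes the Keras defaults used above (unpadded "valid" convolutions with stride 1, and non-overlapping 2x2 pooling). The helper names `conv_out` and `pool_out` are illustrative.

```python
def conv_out(size: int, kernel: int = 3) -> int:
    """Output size of a 'valid' (unpadded), stride-1 convolution."""
    return size - kernel + 1

def pool_out(size: int, window: int = 2) -> int:
    """Output size of non-overlapping max pooling (floor division)."""
    return size // window

s = 28
s = pool_out(conv_out(s))  # Conv2D(32) -> 26, MaxPooling2D -> 13
s = pool_out(conv_out(s))  # Conv2D(64) -> 11, MaxPooling2D -> 5
s = conv_out(s)            # Conv2D(64) -> 3
flat_features = s * s * 64 # input size of the first Dense layer after Flatten

print(s, flat_features)  # 3 576
```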




#### 2.4 Compiling the Model


In [15]:

model.compile(optimizer='adamw',
              loss='categorical_crossentropy',
              metrics=['accuracy'])




#### 2.5 Training the Model


In [16]:

model.fit(train_images, train_labels, validation_split=0.1, epochs=10, batch_size=64)


Epoch 1/10
844/844 ━━━━━━━━━━━━━━━━━━━━ 9s 8ms/step - accuracy: 0.8561 - loss: 0.4689 - val_accuracy: 0.9852 - val_loss: 0.0505
Epoch 2/10
844/844 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9830 - loss: 0.0549 - val_accuracy: 0.9890 - val_loss: 0.0398
Epoch 3/10
844/844 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9890 - loss: 0.0345 - val_accuracy: 0.9878 - val_loss: 0.0419
Epoch 4/10
844/844 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9905 - loss: 0.0298 - val_accuracy: 0.9923 - val_loss: 0.0302
Epoch 5/10
844/844 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9930 - loss: 0.0229 - val_accuracy: 0.9903 - val_loss: 0.0368
Epoch 6/10
844/844 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9935 - loss: 0.0185 - val_accuracy: 0.9907 - val_loss: 0.0384
Epoch 7/10
844/844

<keras.src.callbacks.history.History at 0x22665fd7940>



#### 2.6 Evaluating the Model


In [17]:

test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc:.4f}')


[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.9888 - loss: 0.0402
Test accuracy: 0.9915




### Discussion
- **Pros**: CNNs capture spatial features, leading to higher accuracy.
- **Cons**: More complex to implement and computationally expensive compared to dense layers.

## 3. MNIST using PyTorch Dense Layers

### Why Transition to PyTorch?
PyTorch is a powerful deep learning framework that offers more flexibility and control compared to Keras. It is particularly popular in research due to its dynamic computation graph.

### Step-by-Step Implementation

#### 3.1 Importing Libraries


In [18]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms




#### 3.2 Loading and Preprocessing the Data


In [21]:
from torch.utils.data import random_split


# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Load the full MNIST training set (60,000 images)
full_train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)

# Hold out 10% of the training set for validation (90/10 split)
train_size = int(0.9 * len(full_train_dataset))  # 90% for training
val_size = len(full_train_dataset) - train_size  # 10% for validation
train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])


test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)  # No shuffle for validation
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)
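Because `random_split` draws a fresh random permutation on every run, the validation set changes between runs unless the split is seeded. A minimal sketch of a reproducible split, using a small synthetic `TensorDataset` as a stand-in for MNIST (the dataset size of 100 and the seed 42 are arbitrary assumptions):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for the training set: 100 fake images with random labels
fake_dataset = TensorDataset(torch.randn(100, 1, 28, 28),
                             torch.randint(0, 10, (100,)))

train_size = int(0.9 * len(fake_dataset))
val_size = len(fake_dataset) - train_size

# A seeded generator makes the random split identical across runs
split_a = random_split(fake_dataset, [train_size, val_size],
                       generator=torch.Generator().manual_seed(42))
split_b = random_split(fake_dataset, [train_size, val_size],
                       generator=torch.Generator().manual_seed(42))

same_val_indices = list(split_a[1].indices) == list(split_b[1].indices)
print(len(split_a[0]), len(split_a[1]), same_val_indices)  # 90 10 True
```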


 
## **Step 1: Data Transformation**
```python
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
```
### **What is happening here?**
- The `transform` variable defines a sequence of transformations that will be applied to the MNIST images before they are fed into the model.
- `transforms.Compose([...])`: Combines multiple transformations into one pipeline.
- `transforms.ToTensor()`: Converts the images from PIL images (or NumPy arrays) into **PyTorch tensors** with values in **[0,1]**.
- `transforms.Normalize((0.5,), (0.5,))`: Normalizes the pixel values using the formula:
  
  \[
  X_{\text{normalized}} = \frac{X - \text{mean}}{\text{std}}
  \]
  
  `ToTensor()` has already scaled the pixel values to **[0, 1]**, so subtracting **0.5** and dividing by **0.5** rescales them to **[-1, 1]**. Centering the inputs around zero generally makes optimization better behaved.
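The effect of the formula can be checked with a few sample values. This is a NumPy re-implementation for illustration only, not the torchvision code; the function name `normalize` is hypothetical.

```python
import numpy as np

def normalize(x: np.ndarray, mean: float = 0.5, std: float = 0.5) -> np.ndarray:
    """Apply (x - mean) / std elementwise."""
    return (x - mean) / std

# Sample pixel values as they would look after scaling to [0, 1]
pixels = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float32)
normalized = normalize(pixels)

print(normalized.tolist())  # [-1.0, -0.5, 0.0, 1.0]
```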

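### **Alternative Options**
- Instead of `transforms.Normalize((0.5,), (0.5,))`, we could use:
  - `transforms.Normalize((0,), (1,))`: Leaves the values unchanged (no normalization).
  - `transforms.Normalize((0.1307,), (0.3081,))`: Standardizes with the commonly quoted mean and standard deviation of the MNIST training set.
- Other transforms can be added to the pipeline alongside normalization:
  - `transforms.RandomRotation(30)`: Rotates images randomly by up to 30 degrees for data augmentation.
  - `transforms.RandomAffine(30, translate=(0.1, 0.1))`: Adds random rotations and translations.
  - `transforms.Grayscale(num_output_channels=1)`: Ensures single-channel grayscale output.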

---

## **Step 2: Loading the MNIST Dataset**
```python
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
```
### **What is happening here?**
- `datasets.MNIST(...)` is a **PyTorch dataset class** that automatically downloads and loads the MNIST dataset.
- Parameters:
  - `root='./data'`: Specifies the directory where the dataset will be stored.
  - `train=True`: Loads the **training set** (60,000 images).
  - `train=False`: Loads the **test set** (10,000 images).
  - `download=True`: Downloads the dataset if it is not already present.
  - `transform=transform`: Applies the previously defined transformations.

### **Alternative Dataset Options**
- Instead of `datasets.MNIST`, we could use:
  - `datasets.FashionMNIST`: Similar to MNIST but with clothing images.
  - `datasets.CIFAR10`: A dataset with 10 object classes (e.g., dog, cat, car).
  - `datasets.ImageFolder(root='path/to/dataset', transform=transform)`: For custom datasets.

---

## **Step 3: Creating Data Loaders**
```python
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)
```
### **What is happening here?**
- The `DataLoader` class wraps the dataset in an **iterable** that yields **batches**, optionally in **shuffled** order.
- Parameters:
  - `train_dataset`: The dataset to load.
  - `batch_size=128`: The number of samples per batch (128 images per iteration).
  - `shuffle=True`: Reshuffles the training data at every epoch, so batches differ between epochs and the model cannot pick up on the sample order.
  - `shuffle=False`: Maintains a fixed order for the test set.

### **Alternative `DataLoader` Options**
- `batch_size=32` or `batch_size=256`: Smaller batches use less memory but need more steps per epoch; larger batches need fewer steps per epoch but more memory.
- `num_workers=4`: Loads data in 4 worker subprocesses instead of the main process.
- `pin_memory=True`: Places batches in pinned (page-locked) host memory, which speeds up transfers to the GPU.
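To see what a `DataLoader` actually yields, the sketch below iterates over a synthetic dataset. The 300 random "images" are an assumption chosen so that the last batch is visibly smaller than `batch_size`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 300 random "images" with random labels
images = torch.randn(300, 1, 28, 28)
labels = torch.randint(0, 10, (300,))
loader = DataLoader(TensorDataset(images, labels), batch_size=128, shuffle=True)

# Each iteration yields an (images, labels) pair for one batch
batch_sizes = [batch_images.shape[0] for batch_images, _ in loader]

print(len(loader), batch_sizes)  # 3 [128, 128, 44]
```

The final batch holds the 44 leftover samples; passing `drop_last=True` would discard it instead.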

---

## **Comparison with TensorFlow**
The equivalent TensorFlow/Keras code would be:

```python
import tensorflow as tf
from tensorflow.keras.datasets import mnist

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the images
train_images = (train_images.astype('float32') / 255.0 - 0.5) / 0.5
test_images = (test_images.astype('float32') / 255.0 - 0.5) / 0.5

# Add a channel dimension (needed for CNNs)
train_images = train_images[..., tf.newaxis]
test_images = test_images[..., tf.newaxis]

# Create TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(60000).batch(128)
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(128)
```

### **Comparison: PyTorch vs TensorFlow**
| Feature         | PyTorch (`torchvision`) | TensorFlow (`tf.data`) |
|----------------|----------------|----------------|
| **Data Loading** | `datasets.MNIST(root, download=True)` | `mnist.load_data()` (built-in) |
| **Transformations** | `transforms.Compose([...])` | Manual NumPy preprocessing |
| **Normalization** | `transforms.Normalize((0.5,), (0.5,))` | `image = (image / 255.0 - 0.5) / 0.5` |
| **Batching & Shuffling** | `DataLoader(dataset, batch_size, shuffle=True)` | `Dataset.shuffle().batch()` |
| **Customization** | More flexible for custom datasets | Easier integration with TF models |

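### **Which is better?**
Neither is strictly better. As a rule of thumb:
- **PyTorch**: The explicit, Python-level pipeline is easy to customize and debug, which makes it popular in research.
- **TensorFlow**: `tf.data` pipelines integrate tightly with Keras training and with TensorFlow's deployment tooling.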




#### 3.3 Building the Model


In [30]:

class DenseNet(nn.Module):
    def __init__(self):
        super(DenseNet, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.drop = nn.Dropout(0.25)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.relu(self.fc1(x))
        x = self.drop(x)
        x = self.fc2(x)
        return x

model = DenseNet()


## **Step 1: Understanding the PyTorch Code**
```python
import torch
import torch.nn as nn

class DenseNet(nn.Module):
    def __init__(self):
        super(DenseNet, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 256)  # 784 inputs -> 256 hidden units
        self.drop = nn.Dropout(0.25)        # zero 25% of activations during training
        self.fc2 = nn.Linear(256, 10)       # 256 hidden units -> 10 class scores

    def forward(self, x):
        x = x.view(-1, 28 * 28)             # flatten (N, 1, 28, 28) to (N, 784)
        x = torch.relu(self.fc1(x))
        x = self.drop(x)
        x = self.fc2(x)                     # raw logits, no softmax
        return x

model = DenseNet()
```

### **Breaking It Down**
1. **Class Definition (`class DenseNet(nn.Module)`)**  
   - The class **inherits** from `nn.Module`, which gives the model its **core machinery** (such as tracking layers and parameters).
   - The **class-based approach** reflects PyTorch's **object-oriented design**, which favors **flexibility, customization, and introspection**.

2. **Constructor (`__init__()`)**  
   - Defines **which layers exist** in the network.
   - Calls `super(DenseNet, self).__init__()` so that `nn.Module` is **properly initialized** before any layers are registered.

3. **Defining Layers (`self.fc1`, `self.drop`, `self.fc2`)**  
   - `self.fc1 = nn.Linear(28 * 28, 256)`: A **fully connected layer** that takes **28x28=784** input features (the flattened image) and outputs **256** hidden units.
   - `self.drop = nn.Dropout(0.25)`: Randomly zeroes **25%** of the hidden activations during training, mirroring the `Dropout(0.25)` layer of the Keras model in section 1.
   - `self.fc2 = nn.Linear(256, 10)`: Maps the **256 hidden units** to **10 classes** (digits 0-9).

4. **Forward Method (`forward(self, x)`)**  
   - This method defines the **actual computation** of the model.
   - `x = x.view(-1, 28 * 28)`: Flattens each **1x28x28 image** into a **784-element vector**.
   - `x = torch.relu(self.fc1(x))`: Applies the first layer followed by a **ReLU activation**.
   - `x = self.drop(x)`: Applies dropout (active only in training mode).
   - `x = self.fc2(x)`: No activation function here; the output is raw **logits**, which is what `nn.CrossEntropyLoss` expects.
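A quick way to confirm that a model is wired correctly is to push a random batch through it and inspect the output shape. The sketch below redefines the `DenseNet` from section 3.3 in compact form so that it runs on its own; the batch of 4 random tensors is a stand-in for real images.

```python
import torch
import torch.nn as nn

class DenseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.drop = nn.Dropout(0.25)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)        # (N, 1, 28, 28) -> (N, 784)
        x = torch.relu(self.fc1(x))
        return self.fc2(self.drop(x))  # raw logits, shape (N, 10)

model = DenseNet()
dummy_batch = torch.randn(4, 1, 28, 28)  # random stand-in for 4 images
logits = model(dummy_batch)
n_params = sum(p.numel() for p in model.parameters())

print(logits.shape, n_params)  # torch.Size([4, 10]) 203530
```

The parameter count matches the Keras dense model from section 1, since both use the same layer sizes.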

---

## **How is this Different from TensorFlow’s Sequential API?**

A similar **TensorFlow model** using `Sequential`:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input((28, 28)),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(10)  # raw logits, like the PyTorch model
])
```

### **Key Differences:**
| Feature            | PyTorch `nn.Module` (Class)  | TensorFlow `Sequential` (API) |
|--------------------|----------------------------|------------------------------|
| **Flexibility**    | Highly flexible (OOP-based) | Limited to stackable layers |
| **Extensibility**  | Can define custom layers, forward passes, and complex architectures | Harder to customize beyond simple stacks |
| **Explicit Forward Pass** | Uses `forward(self, x)`, explicitly defining computations | Automatically handles `call()` method |
| **Parameter Handling** | `self.fc1`, `self.fc2` are attributes (trackable via `model.parameters()`) | Hidden inside `model.layers[]` |
| **Best Use Case**  | Research, complex architectures, and debugging | Standard feedforward and CNN models |

### **Philosophical Difference**
- **TensorFlow’s `Sequential` is high-level**:  
  - It’s declarative: **“Just tell me what layers you want, and I’ll handle the execution.”**
  - It follows a "write less, do more" approach.
  - Great for quick prototyping.
  
- **PyTorch’s `nn.Module` is explicit and imperative**:  
  - It’s more like **manual control over computation**.
  - The `forward()` method means we’re **not just stacking layers**—we have **complete control over how data flows**.
  - This makes PyTorch **more flexible** for architectures like:
    - **Skip connections** (e.g., ResNets)
    - **Multiple inputs/outputs**
    - **Branching or data-dependent computation** (e.g., attention blocks in transformers)

---

## **Why a Class? Why Inherit from `nn.Module`?**
1. **State Management**  
   - Inheriting from `nn.Module` **automatically tracks** all layers and parameters.
   - This is why calling `model.parameters()` works effortlessly.

2. **Reusability & Encapsulation**  
   - We can **extend** this class for **custom architectures**.
   - Example:
     ```python
     class CustomDenseNet(DenseNet):
         def __init__(self):
             super().__init__()
             self.fc3 = nn.Linear(10, 5)  # Extra layer

         def forward(self, x):
             x = super().forward(x)
             return self.fc3(x)
     ```
   - A `Sequential` model cannot override its forward pass this way; the Keras counterpart is to subclass `tf.keras.Model` or to use the functional API.

3. **Explicit Forward Pass**  
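   - PyTorch builds the computation graph **dynamically** while `forward()` runs (define-by-run).
   - Because the graph is rebuilt on every call, **the forward pass can differ from call to call**. TensorFlow 1.x built static graphs up front; TensorFlow 2 runs eagerly by default and traces static graphs only inside `tf.function`.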

4. **Custom Behaviors Beyond `Sequential`**  
   - We could add:
     - **Custom activations** (beyond ReLU).
     - **Multiple forward passes** (e.g., stochastic forward).
     - **Conditionally executed layers** (e.g., if-else logic inside `forward()`).
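As an illustration of control flow inside `forward()`, here is a minimal sketch of a dense block with an optional skip connection. `ResidualDense` and its `use_skip` flag are hypothetical names invented for this example; they are not part of PyTorch or of the models in this tutorial.

```python
import torch
import torch.nn as nn

class ResidualDense(nn.Module):
    """Hypothetical block: a dense layer with an optional skip connection."""

    def __init__(self, features: int, use_skip: bool = True):
        super().__init__()
        self.fc = nn.Linear(features, features)
        self.use_skip = use_skip

    def forward(self, x):
        out = torch.relu(self.fc(x))
        if self.use_skip:   # ordinary Python control flow inside forward()
            out = out + x   # skip connection: add the input back
        return out

x = torch.randn(2, 8)
block = ResidualDense(8)
with_skip = block(x)
block.use_skip = False      # the same module now takes the other branch
without_skip = block(x)

# The two outputs differ exactly by the input that the skip connection adds
print(torch.allclose(with_skip - without_skip, x, atol=1e-5))  # True
```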

---

## **What Other Methods & Arguments Could This Class Have?**
PyTorch `nn.Module` is a base class that supports additional functionality:

### **1. Custom Weight Initialization**
```python
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model.apply(init_weights)
```

### **2. Saving & Loading Models**
```python
torch.save(model.state_dict(), "model.pth")  # Save
model.load_state_dict(torch.load("model.pth"))  # Load
```

### **3. Custom Loss Functions**
```python
class CustomLoss(nn.Module):
    def forward(self, y_pred, y_true):
        return torch.mean((y_pred - y_true) ** 2)  # Example: MSE Loss
```

---

## **The Big Takeaway:**
- **TensorFlow abstracts away details**:  
  - Encourages high-level modeling (great for production-ready code).
  - `Sequential` is **declarative**—less control over how computations happen.
  
- **PyTorch exposes raw control**:  
  - Makes **execution explicit** (`forward()`).
  - Encourages **custom architectures** that break traditional stacks.
  - Best suited for **researchers, experimental architectures, and debugging**.





#### 3.4 Defining the Loss Function and Optimizer


In [31]:

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters())


## **TensorFlow Losses vs. PyTorch Counterparts**
| **TensorFlow (Keras)**                          | **PyTorch Equivalent**                              | **Use Case** |
|-------------------------------------------------|-----------------------------------------------------|--------------|
| `categorical_crossentropy`                      | `nn.CrossEntropyLoss()`                             | Multi-class classification (one-hot labels and softmax probabilities in Keras; class indices and raw logits in PyTorch) |
| `sparse_categorical_crossentropy`               | `nn.CrossEntropyLoss()`                             | Multi-class classification (class indices in both TF and PyTorch) |
| `binary_crossentropy`                           | `nn.BCELoss()` on probabilities, or `nn.BCEWithLogitsLoss()` on raw logits (usually preferred) | Binary classification (Keras expects sigmoid probabilities unless `from_logits=True`) |
| `mean_squared_error (mse)`                      | `nn.MSELoss()`                                      | Regression (continuous targets) |
| `mean_absolute_error (mae)`                     | `nn.L1Loss()`                                       | Regression (absolute error) |

### **Key Differences**
1. **No One-Hot Encoding Needed in PyTorch for Classification**
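   - In TensorFlow, `categorical_crossentropy` expects **one-hot-encoded labels**.
   - In PyTorch, `CrossEntropyLoss` expects **integer class labels (0,1,2...)**, not one-hot encoding.
   - If your labels are one-hot encoded, convert them with `torch.argmax(y, dim=1)`.
   - The Keras loss strings assume the model ends in a softmax (probabilities), whereas `CrossEntropyLoss` expects **raw logits** and applies log-softmax itself.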

2. **`BCEWithLogitsLoss()` vs. `BCELoss()`**
   - `BCEWithLogitsLoss()` includes **sigmoid activation internally**, so it’s numerically more stable.
   - `BCELoss()` assumes the inputs are already passed through `sigmoid()`, so it’s rarely used.
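Both points can be verified numerically. The sketch below uses random logits and made-up labels (assumptions for illustration) to check that `CrossEntropyLoss` equals log-softmax followed by the mean negative log-likelihood, and that `BCEWithLogitsLoss` equals `BCELoss` applied after `sigmoid()`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)  # raw scores: 4 samples, 10 classes
one_hot = F.one_hot(torch.tensor([3, 0, 9, 1]), num_classes=10)
labels = torch.argmax(one_hot, dim=1)  # one-hot -> class indices

# CrossEntropyLoss = log-softmax followed by mean negative log-likelihood
ce = nn.CrossEntropyLoss()(logits, labels)
manual_ce = -F.log_softmax(logits, dim=1)[torch.arange(4), labels].mean()

# BCEWithLogitsLoss = sigmoid followed by BCELoss
bin_logits = torch.randn(4)
bin_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce_from_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_from_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print(torch.allclose(ce, manual_ce, atol=1e-6))                   # True
print(torch.allclose(bce_from_logits, bce_from_probs, atol=1e-6)) # True
```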

---

## **How to Use These Losses in PyTorch**

### **1. Multi-Class Classification (Categorical Crossentropy)**
#### **TensorFlow:**
```python
model.compile(loss="categorical_crossentropy", optimizer="adam")
```
#### **PyTorch Equivalent:**
```python
criterion = nn.CrossEntropyLoss()
```
#### **Example Usage in PyTorch:**
```python
logits = model(images)  # Model output (raw scores, NOT softmax)
loss = criterion(logits, labels)  # Labels are class indices (e.g., tensor([0, 2, 1, ...]))
```
- No `softmax()` is needed: `CrossEntropyLoss()` applies log-softmax internally.

---

### **2. Multi-Class Classification (Sparse Categorical Crossentropy)**
#### **TensorFlow:**
```python
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```
#### **PyTorch Equivalent:**
Same as `CrossEntropyLoss()`, since PyTorch does **not** require one-hot encoding.
```python
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)  # Labels are class indices
```

---

### **3. Binary Classification (Binary Crossentropy)**
#### **TensorFlow:**
```python
model.compile(loss="binary_crossentropy", optimizer="adam")
```
#### **PyTorch Equivalent:**
Use `BCEWithLogitsLoss()` (recommended) or `BCELoss()`.
```python
criterion = nn.BCEWithLogitsLoss()
```
#### **Example Usage in PyTorch:**
```python
logits = model(images)  # Model output (raw scores)
loss = criterion(logits, labels.float())  # Labels should be float (0 or 1)
```
- If using `BCELoss()`, apply `sigmoid()` first:
  ```python
  criterion = nn.BCELoss()
  loss = criterion(torch.sigmoid(logits), labels.float())
  ```

---

### **4. Regression (Mean Squared Error)**
#### **TensorFlow:**
```python
model.compile(loss="mse", optimizer="adam")
```
#### **PyTorch Equivalent:**
```python
criterion = nn.MSELoss()
```
#### **Example Usage in PyTorch:**
```python
predictions = model(inputs)  # Model output (continuous values)
loss = criterion(predictions, targets)
```

---

### **5. Regression (Mean Absolute Error)**
#### **TensorFlow:**
```python
model.compile(loss="mae", optimizer="adam")
```
#### **PyTorch Equivalent:**
```python
criterion = nn.L1Loss()
```




#### 3.5 Training the Model


In [24]:
simple = False
if simple:

    for epoch in range(10):
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()


In [32]:
epochs = 10
for epoch in range(epochs):
    model.train()  # Set model to training mode
    train_loss = 0
    
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()  # Reset gradients

        outputs = model(batch_X)  # Forward pass
        loss = criterion(outputs, batch_y)  # Compute loss

        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        
        train_loss += loss.item()

    model.eval()  # Set model to evaluation mode
    val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient computation for validation
        for batch_X, batch_y in val_loader:
            outputs = model(batch_X)  # Get predictions
            loss = criterion(outputs, batch_y)  # Compute validation loss
            val_loss += loss.item()

            # Get the predicted class (argmax over class dimension)
            preds = torch.argmax(outputs, dim=1)  # Convert logits to class indices

            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)

    # Calculate average losses
    train_loss /= len(train_loader)
    val_loss /= len(val_loader)
    val_acc = correct / total

    print(f"Epoch [{epoch+1}/{epochs}] - Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")

Epoch [1/10] - Train Loss: 0.4413 | Val Loss: 0.2640 | Val Acc: 0.9213
Epoch [2/10] - Train Loss: 0.2293 | Val Loss: 0.1744 | Val Acc: 0.9510
Epoch [3/10] - Train Loss: 0.1727 | Val Loss: 0.1371 | Val Acc: 0.9607
Epoch [4/10] - Train Loss: 0.1425 | Val Loss: 0.1160 | Val Acc: 0.9668
Epoch [5/10] - Train Loss: 0.1266 | Val Loss: 0.1205 | Val Acc: 0.9662
Epoch [6/10] - Train Loss: 0.1145 | Val Loss: 0.1086 | Val Acc: 0.9688
Epoch [7/10] - Train Loss: 0.1026 | Val Loss: 0.0999 | Val Acc: 0.9713
Epoch [8/10] - Train Loss: 0.1002 | Val Loss: 0.1022 | Val Acc: 0.9697
Epoch [9/10] - Train Loss: 0.0875 | Val Loss: 0.0928 | Val Acc: 0.9747
Epoch [10/10] - Train Loss: 0.0840 | Val Loss: 0.0942 | Val Acc: 0.9742
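The calls to `model.train()` and `model.eval()` in the loop matter because the model contains an `nn.Dropout` layer, which behaves differently in the two modes. A small standalone sketch on a vector of ones (an arbitrary input chosen so the effect is easy to read):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.25)
x = torch.ones(1000)

drop.train()         # training mode: zero about 25% of entries, scale the rest by 1/(1-p)
train_out = drop(x)
drop.eval()          # evaluation mode: dropout is the identity
eval_out = drop(x)

train_values = sorted(train_out.unique().tolist())
print(train_values)                # only 0.0 and roughly 1.3333 appear
print(torch.equal(eval_out, x))    # True
```

Scaling the surviving activations by `1 / (1 - p)` keeps the expected activation the same in both modes, so no rescaling is needed at evaluation time.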


### **Why Use `torch.no_grad()` During Validation?**

When performing validation (or inference), we **do not need to compute gradients** because:
1. **We are not updating the model’s parameters** (weights remain frozen).
2. **It reduces memory usage** by preventing PyTorch from storing intermediate computations needed for backpropagation.
3. **It speeds up inference** since gradient tracking adds computational overhead.


## **How Gradient Computation Works in PyTorch**
During training, PyTorch tracks all tensor operations involving model parameters to compute **gradients during backpropagation**. This tracking:
- Stores intermediate activations and gradients in memory.
- Allows `loss.backward()` to compute the **gradients** that `optimizer.step()` then uses to update the parameters.

However, during validation, we only **evaluate** the model, so:
- **Gradients are unnecessary**.
- **Saving memory is crucial**, especially for large models.


## **When Should You Use `torch.no_grad()`?**
- **Validation and testing** (e.g., after every epoch).
- **Inference/predictions** on new data.
- **Feature extraction** (e.g., using pre-trained models without modifying weights).
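The effect of `torch.no_grad()` is easy to observe on a toy computation. In this sketch, `w` plays the role of a model parameter and `x` of an input; both are random tensors made up for the example.

```python
import torch

w = torch.randn(3, requires_grad=True)  # stands in for a model parameter
x = torch.randn(3)                      # stands in for an input

y_tracked = (w * x).sum()               # recorded for backpropagation
with torch.no_grad():
    y_untracked = (w * x).sum()         # same value, but nothing is recorded

print(y_tracked.requires_grad, y_untracked.requires_grad)  # True False
print(y_untracked.grad_fn is None)                         # True
```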




#### 3.6 Evaluating the Model


In [33]:

model.eval()  # Make sure dropout is disabled during evaluation
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        predicted = torch.argmax(outputs, dim=1)  # Class with the highest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Test accuracy: {100 * correct / total:.2f}%')


Test accuracy: 97.61%
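Since the same evaluation loop is used for both validation and test data, it can be wrapped in a function. The helper below, `evaluate`, is a hypothetical convenience function rather than part of PyTorch; it is checked here on a toy dataset where an identity "model" passes its inputs through as logits, so the expected accuracy is known in advance.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model: nn.Module, loader: DataLoader) -> float:
    """Return the fraction of correctly classified samples (hypothetical helper)."""
    model.eval()  # disable dropout and similar training-only behavior
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
    return correct / total

# Toy check: the inputs already are the logits, so nn.Identity acts as the model
toy_logits = torch.eye(4)                 # sample i scores highest for class i
toy_targets = torch.tensor([0, 1, 2, 0])  # the last label is deliberately wrong
toy_loader = DataLoader(TensorDataset(toy_logits, toy_targets), batch_size=2)
accuracy = evaluate(nn.Identity(), toy_loader)

print(accuracy)  # 0.75
```

With the tutorial's objects, the call would look like `evaluate(model, test_loader)`.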



### Discussion
- **Pros**: PyTorch offers more flexibility and control over the model and training process.
- **Cons**: More verbose and requires a deeper understanding of neural networks compared to Keras.

## 4. MNIST using PyTorch CNN Layers

### Why Use CNNs in PyTorch?
CNNs are more effective for image data, and writing the `forward()` method by hand in PyTorch makes every step of the architecture explicit and easy to customize.

### Step-by-Step Implementation

#### 4.1 Importing Libraries


In [34]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms




#### 4.2 Loading and Preprocessing the Data


In [35]:

from torch.utils.data import random_split


# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Load the full MNIST training set (60,000 images)
full_train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)

# Hold out 10% of the training set for validation (90/10 split)
train_size = int(0.9 * len(full_train_dataset))  # 90% for training
val_size = len(full_train_dataset) - train_size  # 10% for validation
train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])


test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)  # No shuffle for validation
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)




#### 4.3 Building the Model


In [36]:

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(64 * 3 * 3, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.pool(torch.relu(self.conv3(x)))
        x = x.view(-1, 64 * 3 * 3)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = CNN()
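The `64 * 3 * 3` input size of `fc1` follows from how the feature maps shrink. The sketch below rebuilds the same convolution and pooling layers outside the class and traces a random stand-in image through them; with `padding=1` each 3x3 convolution keeps the height and width, and each pooling step halves them (rounding down).

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

x = torch.randn(1, 1, 28, 28)      # one random stand-in image
shapes = []
for conv in (conv1, conv2, conv3):
    x = pool(torch.relu(conv(x)))  # conv keeps H and W; pooling halves them
    shapes.append(tuple(x.shape))

flat_features = x.flatten(start_dim=1).shape[1]  # 64 * 3 * 3

print(shapes)         # [(1, 32, 14, 14), (1, 64, 7, 7), (1, 64, 3, 3)]
print(flat_features)  # 576
```

Unlike the Keras CNN in section 2, this model pads its convolutions and pools three times, yet both end with 3x3 feature maps and 576 flattened features.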




#### 4.4 Defining the Loss Function and Optimizer


In [37]:

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters())




#### 4.5 Training the Model


In [38]:

# for epoch in range(5):
#     for images, labels in train_loader:
#         optimizer.zero_grad()
#         outputs = model(images)
#         loss = criterion(outputs, labels)
#         loss.backward()
#         optimizer.step()


epochs = 10
for epoch in range(epochs):
    model.train()  # Set model to training mode
    train_loss = 0
    
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()  # Reset gradients

        outputs = model(batch_X)  # Forward pass
        loss = criterion(outputs, batch_y)  # Compute loss

        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        
        train_loss += loss.item()

    model.eval()  # Set model to evaluation mode
    val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient computation for validation
        for batch_X, batch_y in val_loader:
            outputs = model(batch_X)  # Get predictions
            loss = criterion(outputs, batch_y)  # Compute validation loss
            val_loss += loss.item()

            # Get the predicted class (argmax over class dimension)
            preds = torch.argmax(outputs, dim=1)  # Convert logits to class indices

            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)

    # Calculate average losses
    train_loss /= len(train_loader)
    val_loss /= len(val_loader)
    val_acc = correct / total

    print(f"Epoch [{epoch+1}/{epochs}] - Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")

Epoch [1/10] - Train Loss: 0.2779 | Val Loss: 0.0755 | Val Acc: 0.9758
Epoch [2/10] - Train Loss: 0.0610 | Val Loss: 0.0493 | Val Acc: 0.9848
Epoch [3/10] - Train Loss: 0.0420 | Val Loss: 0.0458 | Val Acc: 0.9862
Epoch [4/10] - Train Loss: 0.0330 | Val Loss: 0.0367 | Val Acc: 0.9885
Epoch [5/10] - Train Loss: 0.0251 | Val Loss: 0.0353 | Val Acc: 0.9893
Epoch [6/10] - Train Loss: 0.0206 | Val Loss: 0.0384 | Val Acc: 0.9882
Epoch [7/10] - Train Loss: 0.0185 | Val Loss: 0.0424 | Val Acc: 0.9887
Epoch [8/10] - Train Loss: 0.0156 | Val Loss: 0.0363 | Val Acc: 0.9897
Epoch [9/10] - Train Loss: 0.0131 | Val Loss: 0.0364 | Val Acc: 0.9895
Epoch [10/10] - Train Loss: 0.0114 | Val Loss: 0.0577 | Val Acc: 0.9862
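
In this log the validation loss is lowest at epoch 5 and jumps at epoch 10 while the training loss keeps falling. One common response, which this tutorial does not use, is to keep the best checkpoint or stop training early. A minimal sketch of the bookkeeping, replayed on the validation losses printed above (the `patience` value of 3 is an arbitrary assumption):

```python
# Validation losses copied from the training log above (epochs 1 to 10).
val_losses = [0.0755, 0.0493, 0.0458, 0.0367, 0.0353,
              0.0384, 0.0424, 0.0363, 0.0364, 0.0577]

patience = 3  # assumed: stop after 3 epochs without improvement
best_loss = float("inf")
best_epoch = None
stopped_epoch = None
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0
        # In a real training loop, the model weights would be saved here.
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            stopped_epoch = epoch
            break

print(best_epoch, best_loss, stopped_epoch)  # 5 0.0353 8
```

With these settings training would have stopped after epoch 8 and kept the epoch 5 weights.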




#### 4.6 Evaluating the Model


In [39]:

model.eval()  # Make sure the model is in evaluation mode
correct = 0
total = 0
with torch.no_grad():  # No gradients are needed for evaluation
    for images, labels in test_loader:
        outputs = model(images)
        predicted = torch.argmax(outputs, dim=1)  # Class with the highest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Test accuracy: {100 * correct / total:.2f}%')


Test accuracy: 98.83%
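
The same evaluation loop appears again in section 5.6, so it can be convenient to wrap it in a function. A minimal sketch, with `evaluate` as a hypothetical helper name and a toy `nn.Identity` model standing in for the CNN:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()  # switch off training-only behaviour such as dropout
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            preds = torch.argmax(model(inputs), dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Toy check: nn.Identity passes its inputs through unchanged as "logits".
logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]])
labels = torch.tensor([0, 1, 1, 1])  # the third label disagrees with the argmax
toy_loader = DataLoader(TensorDataset(logits, labels), batch_size=2)

toy_accuracy = evaluate(nn.Identity(), toy_loader)
print(toy_accuracy)  # 0.75
```

With the trained model, `evaluate(model, test_loader)` would return the same fraction that the cell above prints as a percentage.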




### Discussion
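- **Pros**: PyTorch gives full control over the architecture and the training loop. With three convolutional layers, the model reaches 98.83% test accuracy after 10 epochs.
- **Cons**: The model, the training loop, and the evaluation loop all have to be written by hand, which requires a deeper understanding of both CNNs and PyTorch. The validation loss also rises in the final epoch (from 0.0364 to 0.0577) while the training loss keeps falling, which is a sign of overfitting.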

## 5. MNIST using PyTorch CNN Layers with Data Augmentation

### Why Use Data Augmentation?
Data augmentation applies random transformations (here, small rotations and translations) to the training images each time they are loaded. The dataset itself does not grow, but the model rarely sees exactly the same image twice, which helps it generalize better and reduces overfitting.

### Step-by-Step Implementation

#### 5.1 Importing Libraries


In [40]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms




#### 5.2 Loading and Preprocessing the Data with Augmentation


In [45]:

from torch.utils.data import SubsetRandomSampler, random_split

# 1) Define Augmentation and Basic Transforms
transform_aug = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomAffine(0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

transform_basic = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# 2) Determine Train/Val Indices (No Transform Yet)
#    We only need the length of the dataset here, not the actual images.
dummy_mnist = datasets.MNIST(root='./data', train=True, download=True, transform=None)
total_size = len(dummy_mnist)
train_size = int(0.9 * total_size)  # 90% for training
val_size = total_size - train_size  # 10% for validation

# Split indices (no overlap)
train_indices, val_indices = random_split(
    range(total_size),
    [train_size, val_size],
    generator=torch.Generator().manual_seed(42)  # For reproducibility
)

# 3) Create Two MNIST Datasets: One Augmented, One Plain
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform_aug)
val_dataset   = datasets.MNIST(root='./data', train=True, transform=transform_basic)
test_dataset  = datasets.MNIST(root='./data', train=False, transform=transform_basic)

# 4) Create DataLoaders Using SubsetRandomSampler
batch_size = 128

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    sampler=SubsetRandomSampler(train_indices),
    num_workers=2
)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    sampler=SubsetRandomSampler(val_indices),
    num_workers=2
)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=2
)

print(f"Train samples: {len(train_indices)} | Val samples: {len(val_indices)} | Test samples: {len(test_dataset)}")



Train samples: 54000 | Val samples: 6000 | Test samples: 10000
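
The transform is attached to the dataset object, not to the loader, which is why the cell above builds two views of the same 60,000 images (one augmented, one plain) and gives each loader its own set of indices. This only avoids leakage if the two index sets are disjoint. A small sketch of that check, using 100 indices as a stand-in for the full training set:

```python
import torch
from torch.utils.data import random_split

n = 100  # small stand-in for the 60,000 training images
train_part, val_part = random_split(
    range(n), [90, 10],
    generator=torch.Generator().manual_seed(42),  # fixed seed for a repeatable split
)

# random_split returns Subset objects; .indices holds the selected positions.
train_ids = set(train_part.indices)
val_ids = set(val_part.indices)
overlap = train_ids & val_ids

print(len(train_ids), len(val_ids), len(overlap))  # 90 10 0
```

Because both parts come from a single permutation, no image index can land in both the training and the validation loader.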




#### 5.3 Building the Model


In [46]:

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(64 * 3 * 3, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.pool(torch.relu(self.conv3(x)))
        x = x.view(-1, 64 * 3 * 3)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = CNN()


#### 5.4 Defining the Loss Function and Optimizer


In [47]:

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters())




#### 5.5 Training the Model


In [48]:

epochs = 10
for epoch in range(epochs):
    model.train()  # Set model to training mode
    train_loss = 0
    
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()  # Reset gradients

        outputs = model(batch_X)  # Forward pass
        loss = criterion(outputs, batch_y)  # Compute loss

        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        
        train_loss += loss.item()

    model.eval()  # Set model to evaluation mode
    val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient computation for validation
        for batch_X, batch_y in val_loader:
            outputs = model(batch_X)  # Get predictions
            loss = criterion(outputs, batch_y)  # Compute validation loss
            val_loss += loss.item()

            # Get the predicted class (argmax over class dimension)
            preds = torch.argmax(outputs, dim=1)  # Convert logits to class indices

            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)

    # Calculate average losses
    train_loss /= len(train_loader)
    val_loss /= len(val_loader)
    val_acc = correct / total

    print(f"Epoch [{epoch+1}/{epochs}] - Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")


Epoch [1/10] - Train Loss: 0.4503 | Val Loss: 0.0920 | Val Acc: 0.9732
Epoch [2/10] - Train Loss: 0.1163 | Val Loss: 0.0564 | Val Acc: 0.9823
Epoch [3/10] - Train Loss: 0.0813 | Val Loss: 0.0553 | Val Acc: 0.9833
Epoch [4/10] - Train Loss: 0.0674 | Val Loss: 0.0816 | Val Acc: 0.9753
Epoch [5/10] - Train Loss: 0.0597 | Val Loss: 0.0440 | Val Acc: 0.9865
Epoch [6/10] - Train Loss: 0.0502 | Val Loss: 0.0358 | Val Acc: 0.9887
Epoch [7/10] - Train Loss: 0.0475 | Val Loss: 0.0298 | Val Acc: 0.9893
Epoch [8/10] - Train Loss: 0.0441 | Val Loss: 0.0320 | Val Acc: 0.9903
Epoch [9/10] - Train Loss: 0.0398 | Val Loss: 0.0460 | Val Acc: 0.9862
Epoch [10/10] - Train Loss: 0.0385 | Val Loss: 0.0358 | Val Acc: 0.9888
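
A note on how these averages are formed: `nn.CrossEntropyLoss` returns the mean loss of each batch by default, and the loop divides the sum of those batch means by `len(loader)`, the number of batches. Since 54,000 and 6,000 are not multiples of 128, the last batch is smaller, so the reported value is close to, but not exactly, the per-sample mean. The sketch below works through the arithmetic; the helper `batch_sizes` and the three example losses are hypothetical.

```python
def batch_sizes(n_samples, batch_size):
    """Batch sizes a DataLoader yields when drop_last is False (the default)."""
    full, remainder = divmod(n_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

train_batches = batch_sizes(54000, 128)
val_batches = batch_sizes(6000, 128)
print(len(train_batches), train_batches[-1])  # 422 112
print(len(val_batches), val_batches[-1])      # 47 112

# Hypothetical per-batch mean losses: a smaller last batch with an unusual loss
# pulls the mean of batch means away from the true per-sample mean.
sizes = [128, 128, 112]
batch_means = [0.10, 0.10, 0.40]
mean_of_means = sum(batch_means) / len(batch_means)
per_sample_mean = sum(s * m for s, m in zip(sizes, batch_means)) / sum(sizes)
print(f"{mean_of_means:.4f} {per_sample_mean:.4f}")
```

With 422 training batches the effect of one short batch is negligible, so the simpler average used in the loop is a reasonable choice here.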



#### 5.6 Evaluating the Model


In [49]:

model.eval()  # Make sure the model is in evaluation mode
correct = 0
total = 0
with torch.no_grad():  # No gradients are needed for evaluation
    for images, labels in test_loader:
        outputs = model(images)
        predicted = torch.argmax(outputs, dim=1)  # Class with the highest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Test accuracy: {100 * correct / total:.2f}%')


Test accuracy: 99.13%




### Discussion
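- **Pros**: Data augmentation helps the model generalize better and reduces overfitting. With the same architecture, test accuracy rises from 98.83% to 99.13%, while the final training loss stays higher (0.0385 versus 0.0114) because the model now trains on randomly perturbed images.
- **Cons**: The random transforms are recomputed for every batch, so training needs more computation, and the augmentation parameters need careful tuning: rotations or shifts that are too large can distort a digit beyond recognition.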
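
Accuracies this close to 100% are easier to compare as error counts. The short calculation below uses only the two test accuracies reported in sections 4.6 and 5.6 and the test set size printed in section 5.2.

```python
test_size = 10000  # MNIST test set size, as printed in section 5.2

# Test accuracies reported in sections 4.6 and 5.6.
acc_plain = 0.9883
acc_augmented = 0.9913

errors_plain = round(test_size * (1 - acc_plain))
errors_augmented = round(test_size * (1 - acc_augmented))
relative_reduction = (errors_plain - errors_augmented) / errors_plain

print(errors_plain, errors_augmented, f"{relative_reduction:.1%}")  # 117 87 25.6%
```

Each figure comes from a single training run, so part of the difference may be run-to-run variation rather than the effect of augmentation alone.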

## Conclusion

In this tutorial, we started with a simple dense layer model in Keras, moved to CNNs in Keras, and then transitioned to PyTorch, where we implemented dense layers and CNNs. Finally, we enhanced our PyTorch CNN model with data augmentation. Each step built upon the previous one, highlighting the differences between Keras and PyTorch, and explaining why certain choices were made.

By following this tutorial, you should now have a solid understanding of how to implement and improve neural networks for image classification tasks using both Keras and PyTorch.