# PyTorch II: Neural Network Training (MLP vs CNN on MNIST-1D)

In this notebook, we will train neural networks in PyTorch, comparing a **Multi-Layer Perceptron (MLP)** and a **Convolutional Neural Network (CNN)** on the **MNIST-1D** dataset.
MNIST-1D is a synthetic 1-dimensional analogue of MNIST, designed to be low-compute while highlighting model differences.
According to the original MNIST-1D results, a linear model reaches about 32% accuracy, an MLP about 68%, and a CNN about 94%.

What we'll cover:

- Setting up a PyTorch training loop with data loading, loss computation, backpropagation, and optimization
- Differences between MLPs and CNNs in architecture and performance
- Best practices for training and monitoring progress
- Awareness of high-level libraries like PyTorch Lightning and Hugging Face Accelerate

Let's get started by installing dependencies and preparing the dataset.


## Setup: Installing Dependencies and Loading Data

First, we'll install the `mnist1d` package and import PyTorch, NumPy, and Matplotlib for our analysis.

In [1]:
!pip install mnist1d


In [None]:
# Imports
import torch
import torch.nn as nn
import torch.optim as optim

import numpy as np
import matplotlib.pyplot as plt

# Ensure reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check if GPU is available, and use it for faster training if possible
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:", device)

Now, let's load the MNIST-1D dataset.
The dataset consists of 4000 training and 1000 test examples, each being a 1D sequence of length 40 representing a synthetic "digit" (0-9).
The label is the digit class (0-9).

In [None]:
# Load the MNIST-1D dataset
from urllib.request import urlopen
import pickle

url = 'https://github.com/greydanus/mnist1d/raw/master/mnist1d_data.pkl'
data = pickle.load(urlopen(url))

data.keys()


In [None]:
# Unpack the train/test splits from the loaded MNIST-1D dictionary
X_train, y_train, X_test, y_test = data['x'], data['y'], data['x_test'], data['y_test']


print("Training data shape:", X_train.shape)
print("Training labels shape:", y_train.shape)
print("Test data shape:", X_test.shape)
print("Test labels shape:", y_test.shape)
# Let's peek at the first training sample and label
print("First training sample (x[0]):", X_train[0])
print("First training label (y[0]):", y_train[0])

Each data sample is a 40-dimensional vector representing a digit's "pen strokes" in 1D.
Let's visualize a couple of examples:

In [None]:
# Plot a couple of sample 1D signals from the dataset
plt.figure(figsize=(6,4))
for digit in [0, 1]:  # let's visualize the first two classes for example
    idx = np.where(y_train == digit)[0][0]  # find first index of this digit in training set
    plt.plot(X_train[idx], label=f"Class {digit}")
plt.title("Example MNIST-1D signals for two classes")
plt.xlabel("Position (1D)")
plt.ylabel("Intensity")
plt.legend()
plt.show()

## MLP vs CNN: Intuition and Architecture Differences

Recall: An MLP is a fully-connected neural network in which every neuron connects to all neurons in the next layer. For our 40-dimensional input, each hidden neuron looks at the entire input at once. MLPs don't explicitly exploit spatial structure in the data: they must learn relevant patterns from scratch, which can make them prone to overfitting if not regularized.

Recall: CNNs use **sparse connectivity** and **weight sharing** in the form of kernels that slide across the input. Each filter has few weights and is reused at every input position. This makes CNNs parameter-efficient and lets them detect the same local pattern wherever it occurs (translation equivariance, which pooling turns into approximate translation invariance). This inductive bias works well for structured data like images or sequential signals.
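
To make weight sharing concrete, here is a minimal NumPy sketch (not part of the original notebook). The 3-tap kernel and the toy signal are made up for illustration; the point is only that sliding one shared kernel over a shifted input produces a correspondingly shifted response.

```python
import numpy as np

def slide_filter(signal, kernel):
    """Apply one shared kernel at every valid position (cross-correlation, no padding)."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

kernel = np.array([-1.0, 2.0, -1.0])   # hypothetical 3-tap "peak detector"
signal = np.zeros(12)
signal[3] = 1.0                         # a single bump at position 3
shifted = np.roll(signal, 4)            # the same bump moved to position 7

response = slide_filter(signal, kernel)
response_shifted = slide_filter(shifted, kernel)

peak = int(np.argmax(response))
peak_shifted = int(np.argmax(response_shifted))
print("Strongest response at:", peak, "and", peak_shifted)
```

The location of the strongest response moves by the same 4 positions as the bump, while its height stays the same. Taking a maximum over positions, as pooling does locally, is what makes the detection insensitive to where the pattern sits.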

Now, let's define two models for our task:

1. **A simple MLP** with one hidden layer.
2. **A simple 1D CNN** with two convolutional layers.

We'll then compare their performance on the MNIST-1D data.

## Define the Neural Network Models

We'll define the model architectures using PyTorch's `nn.Module`.
For the MLP, we'll use a single hidden layer with ReLU activation.
For the CNN, we'll use a small network with convolutional layers and pooling.
Both models will output 10 logits (one for each digit class 0-9).
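
As a reminder of what "logits" means here, the sketch below (NumPy only, with made-up scores) turns 10 raw scores into probabilities with a softmax and computes the cross-entropy for one example. `nn.CrossEntropyLoss` applies the log-softmax internally, which is why the models return raw logits and have no softmax layer.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()    # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up logits for one example; class 1 has the largest score
logits = np.array([0.1, 2.0, -1.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
probs = softmax(logits)

predicted_class = int(np.argmax(logits))   # argmax of logits == argmax of probs
true_class = 1                             # assumed label for this illustration
cross_entropy = -np.log(probs[true_class])

print("Predicted class:", predicted_class)
print("Cross-entropy for the true class:", cross_entropy)
```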

In [None]:
# Define a simple MLP with one hidden layer
class SimpleMLP(nn.Module):
    def __init__(self, input_size=40, hidden_size=64, num_classes=10):
        super().__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.output = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x is of shape (batch_size, input_size)
        out = self.hidden(x)
        out = self.relu(out)
        out = self.output(out)
        return out

# Define a simple 1D CNN
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # 1 input channel (since our signals are 1D sequences), let's use 8 filters for first conv
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5, padding=0)
        self.pool = nn.MaxPool1d(kernel_size=2)  # will halve the sequence length
        # Second conv: take 8 channels from conv1, output 16 filters
        self.conv2 = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding=0)
        # After conv and pooling layers, we'll flatten and use a fully connected layer
        # Compute the output dimension after conv+pool layers:
        # Input length 40 -> conv1(kernel5 no padding) -> length 36 -> pool -> length 18
        # 18 -> conv2(kernel3) -> length 16 -> pool -> length 8
        # So final feature map has size 16 (channels) x 8 (length) = 128 features
        self.fc = nn.Linear(16 * 8, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x is of shape (batch_size, 1, sequence_length)
        out = self.conv1(x)         # (batch, 8, 36)
        out = self.relu(out)
        out = self.pool(out)        # (batch, 8, 18)
        out = self.conv2(out)       # (batch, 16, 16)
        out = self.relu(out)
        out = self.pool(out)        # (batch, 16, 8)
        out = out.view(out.size(0), -1)  # flatten to (batch, 16*8)
        out = self.fc(out)          # (batch, num_classes)
        return out

# Instantiate the models
mlp_model = SimpleMLP().to(device)
cnn_model = SimpleCNN().to(device)

print(mlp_model)
print(cnn_model)


**Explanation:** The MLP has one hidden layer (40->64) with ReLU and an output layer (64->10). The CNN uses two 1D conv layers - conv1 (8 filters, size 5) and conv2 (16 filters, size 3), each followed by ReLU and max pooling. A final fully connected layer (128->10) produces the 10 outputs. Although small, the CNN benefits from weight sharing and locality, which suit the structure of the signals.
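
The sequence lengths quoted in the CNN's comments (40 -> 36 -> 18 -> 16 -> 8) follow from the standard output-length formula for unpadded, stride-1 convolutions and for pooling whose stride equals its kernel size. The helper names below are made up for this sketch:

```python
def conv_out_len(length, kernel_size, padding=0, stride=1):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def pool_out_len(length, kernel_size):
    """Output length of 1D max pooling with stride equal to kernel_size."""
    return (length - kernel_size) // kernel_size + 1

length = 40
length = conv_out_len(length, kernel_size=5)   # conv1
length = pool_out_len(length, kernel_size=2)   # first pool
length = conv_out_len(length, kernel_size=3)   # conv2
length = pool_out_len(length, kernel_size=2)   # second pool

flat_features = 16 * length   # 16 channels times the final length
print("Final length:", length, "-> flattened features:", flat_features)
```

If you change a kernel size or add padding, rerunning this arithmetic gives the new input size for the final `nn.Linear` layer.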

Let's compare their parameter counts:

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"MLP model parameters: {count_parameters(mlp_model)}")
print(f"CNN model parameters: {count_parameters(cnn_model)}")
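
The counts can also be checked by hand. A linear layer has `n_in * n_out` weights plus `n_out` biases, and a 1D convolution has `c_in * c_out * kernel_size` weights plus `c_out` biases. The helper functions below are hypothetical and only do this arithmetic for the two architectures defined above:

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out           # weights + biases

def conv1d_params(c_in, c_out, kernel_size):
    return c_in * c_out * kernel_size + c_out

mlp_total = linear_params(40, 64) + linear_params(64, 10)
cnn_total = (conv1d_params(1, 8, 5)
             + conv1d_params(8, 16, 3)
             + linear_params(16 * 8, 10))

print("MLP parameters (by hand):", mlp_total)
print("CNN parameters (by hand):", cnn_total)
```

With these layer sizes the MLP has 3274 parameters and the CNN has 1738, so the CNN is the smaller model here.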

## Preparing Data Loaders

We'll convert our NumPy arrays into PyTorch `Dataset` and `DataLoader` objects for batch loading during training, using `TensorDataset` to wrap our tensors.

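For evaluation, we'll use the provided test set each epoch to keep the notebook simple. In practice, hold out a separate validation set for monitoring and model selection, and reserve the test set for a single final evaluation.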
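
One possible way to carve out a validation set is `torch.utils.data.random_split`. The sketch below uses random stand-in tensors with the same shapes as the MNIST-1D training data, and the 3600/400 split is an arbitrary choice for illustration, not something this notebook uses.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in data with the same shapes as the MNIST-1D training set
X = torch.randn(4000, 40)
y = torch.randint(0, 10, (4000,))
full_train = TensorDataset(X, y)

# A seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(42)
train_subset, val_subset = random_split(full_train, [3600, 400], generator=generator)

print("Train examples:", len(train_subset))
print("Validation examples:", len(val_subset))
```

Both subsets can be passed to `DataLoader` exactly like a `TensorDataset`.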


In [None]:

from torch.utils.data import TensorDataset, DataLoader

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor  = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor  = torch.tensor(y_test, dtype=torch.long)

# Create TensorDataset and DataLoader for training and test
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset  = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=64, shuffle=False)

print("Batches in train loader:", len(train_loader))
print("Batches in test loader:", len(test_loader))


We set a batch size of 64.
Shuffling is enabled for training data (best practice to shuffle training examples each epoch for stochastic gradient descent).
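
With 4000 training and 1000 test examples, a batch size of 64 does not divide either set evenly, so the last batch of each loader is smaller (`DataLoader` keeps it by default). The small helper below, written just for this illustration, works out the numbers:

```python
import math

def batch_sizes(n_examples, batch_size):
    """Return (number of batches, size of the last batch)."""
    n_batches = math.ceil(n_examples / batch_size)
    last = n_examples - (n_batches - 1) * batch_size
    return n_batches, last

train_batches, train_last = batch_sizes(4000, 64)
test_batches, test_last = batch_sizes(1000, 64)

print(f"Train: {train_batches} batches, last batch has {train_last} examples")
print(f"Test:  {test_batches} batches, last batch has {test_last} examples")
```

Because batches differ in size, averaging per-batch losses directly would slightly over-weight the last batch. Weighting each batch loss by its size and dividing by the total number of examples avoids that.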

## Training the Models

We will train both models using the Cross-Entropy Loss and Adam optimizer, recording loss and accuracy metrics during training.

A typical PyTorch training loop involves:

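1. Set model to training mode (`model.train()`)
2. For each batch:
   - Zero the gradients (**important!** PyTorch accumulates gradients across `backward()` calls by default)
   - Forward pass and compute loss
   - Backpropagate (`loss.backward()`)
   - Update weights (`optimizer.step()`)
3. Evaluate on test set with `model.eval()` and gradients disabled (`torch.no_grad()`)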

We'll track both loss and accuracy to monitor training progress.
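
Why zeroing gradients matters can be seen in a tiny sketch: each `backward()` call adds into `.grad` rather than overwriting it. The tensor `w` and the toy loss below are made up for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one
(w ** 2).sum().backward()
second = w.grad.clone()

# Zero the gradient, then run backward again: back to the single-pass value
w.grad.zero_()
(w ** 2).sum().backward()
third = w.grad.clone()

print("After one backward:        ", first)
print("After two (no zeroing):    ", second)
print("After zeroing and backward:", third)
```

In a training loop, `optimizer.zero_grad()` plays the role of the manual `w.grad.zero_()` here.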

In [17]:
def train_model(model, train_loader, test_loader, epochs=10, learning_rate=0.001):
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Lists to store metrics per epoch
    train_losses, test_losses = [], []
    train_accuracies, test_accuracies = [], []

    for epoch in range(1, epochs+1):
        model.train()  # put model in training mode
        running_loss = 0.0
        correct = 0
        total = 0

        # Training loop
        for inputs, labels in train_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs if not isinstance(model, SimpleCNN) else inputs.unsqueeze(1))
            # (Note: for CNN, inputs shape [batch, 40] needs reshaping to [batch, 1, 40])

            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * inputs.size(0)  # accumulate summed loss (item() gives scalar)
            # Compute number of correct predictions for this batch
            _, predicted = torch.max(outputs, 1)  # get index of max logit
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

        # Compute average training loss and accuracy for the epoch
        train_loss = running_loss / total
        train_acc = correct / total
        train_losses.append(train_loss)
        train_accuracies.append(train_acc)

        # Evaluation on test data
        model.eval()  # evaluation mode
        test_loss = 0.0
        correct = 0
        total = 0
        # Disable gradient calculation for efficiency
        with torch.no_grad():
            for inputs, labels in test_loader:
                inputs = inputs.to(device)
                labels = labels.to(device)
                outputs = model(inputs if not isinstance(model, SimpleCNN) else inputs.unsqueeze(1))
                loss = criterion(outputs, labels)
                test_loss += loss.item() * inputs.size(0)
                _, predicted = torch.max(outputs, 1)
                correct += (predicted == labels).sum().item()
                total += labels.size(0)
        test_loss = test_loss / total
        test_acc = correct / total
        test_losses.append(test_loss)
        test_accuracies.append(test_acc)

        # Print epoch summary
        print(f"Epoch {epoch}/{epochs} -> "
                f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc*100:.2f}% || "
                f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc*100:.2f}%")

    return train_losses, train_accuracies, test_losses, test_accuracies
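
The `unsqueeze(1)` call used for the CNN adds the channel dimension that `nn.Conv1d` expects: batches come out of the loader as `(batch, 40)`, while the layer wants `(batch, channels, length)`. A minimal shape check, using random stand-in data rather than MNIST-1D:

```python
import torch
import torch.nn as nn

batch = torch.randn(64, 40)                  # what the DataLoader yields: (batch, length)
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5)

with_channel = batch.unsqueeze(1)            # insert a channel axis: (batch, 1, length)
out = conv(with_channel)                     # (batch, 8, 36)

print("Before unsqueeze:", tuple(batch.shape))
print("After unsqueeze: ", tuple(with_channel.shape))
print("After conv:      ", tuple(out.shape))
```

An alternative design, not used in this notebook, is to perform the `unsqueeze(1)` inside `SimpleCNN.forward`, so that the training loop no longer needs the `isinstance` check.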

In [None]:
print("Training MLP...")
# Train the MLP model
mlp_train_losses, mlp_train_acc, mlp_test_losses, mlp_test_acc = train_model(mlp_model, train_loader, test_loader, epochs=20, learning_rate=0.001)

We train for 20 epochs (this should be sufficient to see clear trends).
The training loop prints the loss and accuracy for both training and test sets at each epoch.
Monitoring both is a good practice to detect overfitting: if training accuracy keeps increasing but test accuracy starts decreasing, the model might be overfitting to the training data.
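
One common way to act on this signal is early stopping: stop once the held-out loss has not improved for a few epochs. The function below is a hypothetical sketch (the notebook's `train_model` does not do this), and the loss values are made up. In a real setup the monitored loss should come from a validation set rather than the test set.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: improving until epoch 4, then getting worse
made_up_losses = [1.9, 1.5, 1.2, 1.1, 1.15, 1.2, 1.3, 1.4]
stop_epoch = early_stopping_epoch(made_up_losses, patience=3)
best_epoch = made_up_losses.index(min(made_up_losses)) + 1

print("Best epoch:", best_epoch, "| would stop at epoch:", stop_epoch)
```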

After training the MLP, let's train the CNN with the same number of epochs for a fair comparison:

In [None]:
# Train the CNN model
print("\nTraining CNN...")
cnn_train_losses, cnn_train_acc, cnn_test_losses, cnn_test_acc = train_model(cnn_model, train_loader, test_loader, epochs=20, learning_rate=0.001)

## Visualizing Training Progress

Now that both models are trained, let's visualize the loss and accuracy curves over epochs for the MLP and CNN.
This will help us compare their learning speed and final performance.

In [None]:
epochs = range(1, len(mlp_train_losses) + 1)  # one point per recorded epoch (20 here)
plt.figure(figsize=(12,5))

# Plot loss curves
plt.subplot(1,2,1)
plt.plot(epochs, mlp_train_losses, 'b-o', label='MLP Train Loss')
plt.plot(epochs, mlp_test_losses, 'b--s', label='MLP Test Loss')
plt.plot(epochs, cnn_train_losses, 'r-o', label='CNN Train Loss')
plt.plot(epochs, cnn_test_losses, 'r--s', label='CNN Test Loss')
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.title('MLP vs CNN Loss')
plt.legend()

# Plot accuracy curves
plt.subplot(1,2,2)
plt.plot(epochs, [a*100 for a in mlp_train_acc], 'b-o', label='MLP Train Acc')
plt.plot(epochs, [a*100 for a in mlp_test_acc], 'b--s', label='MLP Test Acc')
plt.plot(epochs, [a*100 for a in cnn_train_acc], 'r-o', label='CNN Train Acc')
plt.plot(epochs, [a*100 for a in cnn_test_acc], 'r--s', label='CNN Test Acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.title('MLP vs CNN Accuracy')
plt.legend()
plt.show()

**Discussion:** Examine the plots to see:
- **Convergence**: Which model converges faster and reaches a lower loss? CNNs typically do better on structured data by efficiently extracting features.
- **Accuracy**: The CNN should achieve higher test accuracy than the MLP, which tends to level off lower.
- **Overfitting**: Look for gaps between training and test curves. The MLP may overfit more due to having more parameters, while the CNN's local structure helps it generalize better.

## Inspecting Learned Weights for Intuition

Let's inspect the learned weights of the first layer of each model to understand what patterns they detect:
- For the **MLP**, we'll examine weight vectors (length 40) connecting inputs to the 64 hidden units
- For the **CNN**, we'll look at the 8 convolutional filters (length 5) that detect local features

**MLP first-layer weight vectors:** Each hidden neuron may specialize to detect patterns in different parts of the input. Let's visualize some weight vectors:

In [None]:
# Get the weight matrix of the MLP hidden layer (shape: [hidden_size, input_size])
mlp_weights = mlp_model.hidden.weight.detach().cpu().numpy()

plt.figure(figsize=(8,4))
for i in range(5):  # plot first 5 hidden unit weight vectors
    plt.plot(mlp_weights[i], label=f'Neuron {i}')
plt.title("MLP Hidden Layer Weight Vectors (first 5 neurons)")
plt.xlabel("Input index (0-39)")
plt.ylabel("Weight value")
plt.legend()
plt.show()

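**CNN first-layer filters:** Each of the 8 `conv1` filters has 5 values that determine which pattern it responds to in any 5-position window of the input. For example, a filter `[0.5, 1.0, -0.2, -0.8, -0.5]` responds most strongly where the signal is high over the first two positions and low over the last three, that is, where the signal falls right after a peak.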
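
The filter values can be read directly from a `Conv1d` layer's `weight` tensor. The sketch below uses a freshly initialized stand-in layer with the same shape as `conv1`, so the printed numbers are random. In the notebook you would read `cnn_model.conv1.weight` after training instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for cnn_model.conv1 (assumption: same configuration, untrained weights)
conv1 = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5)

# weight has shape (out_channels, in_channels, kernel_size) = (8, 1, 5)
filters = conv1.weight.detach().cpu().numpy().squeeze(1)   # drop the channel axis -> (8, 5)

for i, f in enumerate(filters):
    print(f"Filter {i}:", f.round(3))
```

Each row of `filters` can be passed to `plt.plot` in the same way as the MLP weight vectors.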

Common filter patterns include:
- `[+, +, +, +, +]` (all positive): detects regions of high values
- `[+, 0, -, 0, +]` (alternating): detects up-down-up patterns  
- Large middle coefficient: detects peaks at center

CNN filters are easier to interpret since they look at local patterns. MLP weights mix information globally, making them harder to understand directly.
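
To see how a filter "detects" a pattern, the sketch below slides the example filter `[0.5, 1.0, -0.2, -0.8, -0.5]` over a made-up signal using NumPy. `np.correlate` in `"valid"` mode does not flip the kernel, which matches what `Conv1d` computes apart from the bias term.

```python
import numpy as np

kernel = np.array([0.5, 1.0, -0.2, -0.8, -0.5])

# Made-up signal: flat, then high for two positions, then low for three
signal = np.zeros(20)
signal[8:10] = 1.0
signal[10:13] = -1.0

# One response value per 5-position window (20 - 5 + 1 = 16 windows)
response = np.correlate(signal, kernel, mode="valid")
best_start = int(np.argmax(response))

print("Strongest response at window start:", best_start)
print("Response value there:", response[best_start])
```

The response peaks at the window whose shape lines up with the signs of the filter's weights, and it is smaller for windows that only partly overlap the pattern.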

**Summary:**
- Each MLP neuron sees the entire input at once, so the network must discover any local patterns on its own
- CNN filters explicitly show what local features they detect (edges, peaks, slopes) in each 5-value window

## Conclusion and Best Practices

For general best practices, read Andrej Karpathy's blog post "A Recipe for Training Neural Networks": https://karpathy.github.io/2019/04/25/recipe/

**Note on advanced tools:** Tools like **PyTorch Lightning** and **Hugging Face Accelerate** can simplify training by handling boilerplate code for loops, devices, and distributed training. While these tools speed up development, understanding the manual training process remains valuable.

Some ways to experiment further:

- Modify the MLP architecture (layers, neurons)
- Tune hyperparameters (learning rate, batch size, etc.)
- Adjust CNN parameters (filter sizes, number)
- Visualize model predictions on test data