Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
Portions of this notebook consist of AI-generated content.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Lab 2: Deep Learning Basics - Foundation Concepts for LLM Development

## Lab Overview

This lab covers the fundamental concepts of deep learning that are essential for understanding Large Language Models and transformer architectures. We'll explore tensors, operations, gradients, and the mathematical foundations of neural networks.

## Learning Objectives

By the end of this lab, you will:
- Master PyTorch tensor operations and computational graphs
- Understand automatic differentiation and gradient computation
- Learn the mathematical foundations of neural network training
- Explore forward and backward propagation mechanics
- Connect basic operations to transformer architectures

---

## 1. Environment Setup

In [None]:
# GPU Setup and Essential Imports
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

AMD GPU environment initialized successfully
Using device: cuda
PyTorch version: 2.9.1+rocm7.10.0
GPU: Radeon 8060S Graphics
GPU Memory: 103.1 GB


## 2. Tensor Fundamentals

Tensors are the fundamental data structure in PyTorch and the building blocks of deep learning. Understanding tensors is crucial for working with neural networks and LLMs.

In [None]:
# 2.1 Creating Tensors and Basic Operations

print("=== Tensor Creation and Properties ===")

# Create tensors directly on the selected device (GPU when available)
x = torch.randn(10, 5, device=device)  # Input tensor: 10 samples, 5 features each
w = torch.randn(5, 3, requires_grad=True, device=device)  # Weight matrix: 5 inputs -> 3 outputs

print(f"Input tensor shape: {x.shape} | Device: {x.device}")
print(f"Weight tensor shape: {w.shape} | Device: {w.device}")
print(f"Weight requires gradients: {w.requires_grad}")

print(f"\nInput tensor (first 3 samples):\n{x[:3]}")
print(f"\nWeight tensor:\n{w}")

print("\nKey Insight: This setup simulates a neural network layer:")
print(" - Input: 10 samples with 5 features each")
print(" - Weights: Transform 5 input features to 3 output features")
print(" - This is the foundation of deep learning transformations!")

=== Tensor Creation and Properties ===
Input tensor shape: torch.Size([10, 5]) | Device: cuda:0
Weight tensor shape: torch.Size([5, 3]) | Device: cuda:0
Weight requires gradients: True

Input tensor (first 3 samples):
tensor([[-0.2717,  0.1274,  0.2204, -1.1929,  0.7937],
        [ 0.4641,  0.9397, -0.7696,  1.1303, -0.8973],
        [-0.4169, -0.3124,  1.2784, -0.7402,  0.4081]], device='cuda:0')

Weight tensor:
tensor([[-0.0205,  0.9367,  0.1821],
        [ 1.7163, -0.7320,  0.1128],
        [-0.2981,  0.9027,  1.7040],
        [-0.2520,  0.3341, -0.8225],
        [-0.8891, -0.8946,  0.4043]], device='cuda:0', requires_grad=True)

Key Insight: This setup simulates a neural network layer:
   • Input: 10 samples with 5 features each
   • Weights: Transform 5 input features to 3 output features
   • This is the foundation of deep learning transformations!


In [None]:
# 2.2 Matrix Multiplication - The Heart of Neural Networks

print("=== Matrix Multiplication Methods ===")

# Method 1: @ operator (most commonly used)
y1 = x @ w  # Shape: (10, 5) @ (5, 3) -> (10, 3)
print(f"Method 1 (@): {x.shape} @ {w.shape} → {y1.shape}")
print(f"Result (first 3 samples):\n{y1[:3]}")

# Method 2: .matmul() method
y2 = x.matmul(w)
print(f"\nMethod 2 (.matmul()): {y2.shape}")
print(f"Results identical: {torch.allclose(y1, y2)}")

# Method 3: torch.matmul() function
y3 = torch.matmul(x, w)
print(f"\nMethod 3 (torch.matmul()): {y3.shape}")
print(f"All methods identical: {torch.allclose(y1, y3)}")

print("\nUnderstanding the Transformation:")
print(" - Each of 10 samples (rows) gets transformed")
print(" - From 5 input features to 3 output features")
print(" - This is exactly what happens in neural network layers!")
print(" - Essential for: Linear layers, Attention mechanisms, LLM computations")

=== Matrix Multiplication Methods ===
Method 1 (@): torch.Size([10, 5]) @ torch.Size([5, 3]) → torch.Size([10, 3])
Result (first 3 samples):
tensor([[-0.2464, -1.2575,  1.6425],
        [ 2.3457,  0.2325, -2.4134],
        [-1.0850,  0.3798,  2.8412]], device='cuda:0',
       grad_fn=<SliceBackward0>)

Method 2 (.matmul()): torch.Size([10, 3])
Results identical: True

Method 3 (torch.matmul()): torch.Size([10, 3])
All methods identical: True

Understanding the Transformation:
   • Each of 10 samples (rows) gets transformed
   • From 5 input features to 3 output features
   • This is exactly what happens in neural network layers!
   • Essential for: Linear layers, Attention mechanisms, LLM computations
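
To make the shape rule concrete, the product that `@` computes can be written as explicit loops. The following is a minimal pure-Python sketch; the helper name `matmul` and the small matrices are illustrative and not part of the lab. PyTorch performs the same arithmetic with optimized kernels.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    m, k = len(a), len(a[0])
    k2, n = len(b), len(b[0])
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    # out[i][j] is the dot product of row i of a with column j of b
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)] for i in range(m)]


x_small = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 samples, 2 features
w_small = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]   # 2 inputs -> 3 outputs
y_small = matmul(x_small, w_small)
print(len(y_small), len(y_small[0]))  # 3 3
print(y_small[0])  # [2.0, 4.0, 1.0]
```

Each output row depends only on the matching input row, which is why a batch of samples can be transformed in one call.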


In [None]:
# 2.3 PyTorch Linear Layers - Neural Network Building Blocks

print("=== PyTorch Linear Layer ===")

# Create linear layer and move to GPU
linear = torch.nn.Linear(5, 3, bias=False).to(device)
print(f"Linear layer weight shape: {linear.weight.shape}")
print(f"Linear layer on device: {linear.weight.device}")
print(f"Linear layer bias: {linear.bias}")

# Set the weights to match our manual example
with torch.no_grad():
    linear.weight.copy_(w.T)  # Note: nn.Linear stores weights as (out_features, in_features)
print(f"\nOriginal weight matrix w:\n{w}")
print(f"Linear layer weight (w^T):\n{linear.weight}")

# Apply the linear transformation
y_linear = linear(x)
print(f"\nLinear layer output shape: {y_linear.shape}")
print(f"First 3 outputs:\n{y_linear[:3]}")

# Verify equivalence
print(f"\nManual matmul vs Linear layer: {torch.allclose(y3, y_linear)}")

print("\nPyTorch Linear Layers are:")
print(" - The building blocks of neural networks")
print(" - Used in: MLPs, Transformers, LLM projection layers")
print(" - Efficiently implement: y = xW^T + b")
print(" - Handle batching automatically")

=== PyTorch Linear Layer ===
Linear layer weight shape: torch.Size([3, 5])
Linear layer on device: cuda:0
Linear layer bias: None

Original weight matrix w:
tensor([[-0.0205,  0.9367,  0.1821],
        [ 1.7163, -0.7320,  0.1128],
        [-0.2981,  0.9027,  1.7040],
        [-0.2520,  0.3341, -0.8225],
        [-0.8891, -0.8946,  0.4043]], device='cuda:0', requires_grad=True)
Linear layer weight (w^T):
Parameter containing:
tensor([[-0.0205,  1.7163, -0.2981, -0.2520, -0.8891],
        [ 0.9367, -0.7320,  0.9027,  0.3341, -0.8946],
        [ 0.1821,  0.1128,  1.7040, -0.8225,  0.4043]], device='cuda:0',
       requires_grad=True)

Linear layer output shape: torch.Size([10, 3])
First 3 outputs:
tensor([[-0.2464, -1.2575,  1.6425],
        [ 2.3457,  0.2325, -2.4134],
        [-1.0850,  0.3798,  2.8412]], device='cuda:0',
       grad_fn=<SliceBackward0>)

Manual matmul vs Linear layer: True

PyTorch Linear Layers are:
   • The building blocks of neural networks
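   • Used in: MLPs, Transformers, LLM projection layers
   • Efficiently implement: y = xW^T + b
   • Handle batching automatically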
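
The relationship `y = xW^T + b` can be checked directly on a small CPU example. This is a sketch under stated assumptions: the names `x_demo`, `w_demo`, and `layer` and the constant bias of 0.5 are made up for illustration. It copies a manual `(in, out)` weight into a layer that stores `(out, in)` and compares the two results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_demo = torch.randn(4, 5)  # 4 samples, 5 features
w_demo = torch.randn(5, 3)  # manual weight: 5 inputs -> 3 outputs

layer = nn.Linear(5, 3, bias=True)
with torch.no_grad():
    layer.weight.copy_(w_demo.T)  # stored as (out_features, in_features) = (3, 5)
    layer.bias.fill_(0.5)         # illustrative constant bias

manual = x_demo @ w_demo + 0.5    # the scalar broadcasts over every output entry
print(layer.weight.shape)
print(torch.allclose(layer(x_demo), manual, atol=1e-6))
```

Copying under `torch.no_grad()` keeps the layer's parameter independent of `w_demo` instead of sharing its storage.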

## 3. Automatic Differentiation - The Magic of Backpropagation

Automatic differentiation (autograd) is PyTorch's system for computing gradients automatically. This is essential for training neural networks through backpropagation.

### 3.1 Gradient Computation Example

In [None]:
print("=== Automatic Differentiation in Action ===")

# Simple example: computing gradients of a loss function
input_data = torch.randn(4, 5, device=device)
target = torch.randn(4, 3, device=device)

# Create learnable parameters
weight = torch.randn(5, 3, requires_grad=True, device=device)
bias = torch.randn(3, requires_grad=True, device=device)

print(f"Input shape: {input_data.shape}")
print(f"Target shape: {target.shape}")
print(f"Weight shape: {weight.shape}")
print(f"Bias shape: {bias.shape}")

# Forward pass
prediction = input_data @ weight + bias
loss = torch.mean((prediction - target) ** 2)  # Mean Squared Error

print("\nForward pass:")
print(f"  Prediction shape: {prediction.shape}")
print(f"  Loss value: {loss.item():.4f}")

# Backward pass - compute gradients
loss.backward()

print("\nGradients computed:")
print(f"  Weight gradient shape: {weight.grad.shape}")
print(f"  Bias gradient shape: {bias.grad.shape}")
print(f"  Weight gradient norm: {weight.grad.norm().item():.4f}")
print(f"  Bias gradient norm: {bias.grad.norm().item():.4f}")

print("\nThis is how neural networks learn:")
print(" - Forward pass: Compute predictions")
print(" - Loss computation: Measure prediction error")
print(" - Backward pass: Compute gradients (∂Loss/∂parameters)")
print(" - Parameter update: Adjust weights to reduce loss")

=== Automatic Differentiation in Action ===
Input shape: torch.Size([4, 5])
Target shape: torch.Size([4, 3])
Weight shape: torch.Size([5, 3])
Bias shape: torch.Size([3])

Forward pass:
  Prediction shape: torch.Size([4, 3])
  Loss value: 8.6003

Gradients computed:
  Weight gradient shape: torch.Size([5, 3])
  Bias gradient shape: torch.Size([3])
  Weight gradient norm: 4.5640
  Bias gradient norm: 1.2576

This is how neural networks learn:
   1. Forward pass: Compute predictions
   2. Loss computation: Measure prediction error
   3. Backward pass: Compute gradients (∂Loss/∂parameters)
   4. Parameter update: Adjust weights to reduce loss
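
What `loss.backward()` produces can be cross-checked by hand. For a mean squared error over an `n x m` prediction `x @ w + b`, the weight gradient is `2 / (n * m) * x^T (pred - target)`. The sketch below is pure Python with made-up numbers and hypothetical helper names; it compares that formula against central finite differences.

```python
def forward_loss(xs, ws, bs, ts):
    """Mean squared error of pred = xs @ ws + bs against ts (nested lists)."""
    n, m = len(xs), len(bs)
    total = 0.0
    for i in range(n):
        for j in range(m):
            pred = sum(xs[i][k] * ws[k][j] for k in range(len(ws))) + bs[j]
            total += (pred - ts[i][j]) ** 2
    return total / (n * m)


def analytic_grad_w(xs, ws, bs, ts):
    """dLoss/dws[k][j] = 2 / (n * m) * sum_i xs[i][k] * (pred[i][j] - ts[i][j])."""
    n, m, d = len(xs), len(bs), len(ws)
    grad = [[0.0] * m for _ in range(d)]
    for i in range(n):
        for j in range(m):
            err = sum(xs[i][k] * ws[k][j] for k in range(d)) + bs[j] - ts[i][j]
            for k in range(d):
                grad[k][j] += 2.0 * xs[i][k] * err / (n * m)
    return grad


def numeric_grad_w(xs, ws, bs, ts, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = [[0.0] * len(ws[0]) for _ in range(len(ws))]
    for k in range(len(ws)):
        for j in range(len(ws[0])):
            original = ws[k][j]
            ws[k][j] = original + eps
            up = forward_loss(xs, ws, bs, ts)
            ws[k][j] = original - eps
            down = forward_loss(xs, ws, bs, ts)
            ws[k][j] = original  # restore before moving on
            grad[k][j] = (up - down) / (2 * eps)
    return grad


xs = [[1.0, -2.0], [0.5, 3.0]]
ws = [[0.2, -0.4], [0.1, 0.3]]
bs = [0.0, 1.0]
ts = [[1.0, 0.0], [0.0, 1.0]]
ga = analytic_grad_w(xs, ws, bs, ts)
gn = numeric_grad_w(xs, ws, bs, ts)
max_diff = max(abs(ga[k][j] - gn[k][j]) for k in range(2) for j in range(2))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

Autograd arrives at the analytic result by applying the chain rule through the recorded operations, without the extra forward passes that finite differences need.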


### 3.2 Element-wise Multiplication

Element-wise multiplication applies the same operation independently to every position of two tensors with the **same shape** (or shapes that are **broadcastable**). In PyTorch you can use `*`, `.mul()`, or `torch.mul()`—they’re equivalent for element-wise multiply.

**Definition (Hadamard product):** for tensors `A, B ∈ R^{m×n}`,
`(A ⊙ B)_{ij} = A_{ij} × B_{ij}`.

**Why it matters**
- Used in gating (e.g., LSTM/GRU gates), attention masks, feature-wise scaling (LayerNorm/FiLM), and residual modulation.
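- The result stays on the inputs' device and has the (broadcast) shape of the inputs; if the dtypes differ, PyTorch's type promotion picks the result dtype. Gradients flow element-by-element via autograd.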
- Squaring a tensor is just element-wise multiplication with itself: `z = w * w` (same as `w.pow(2)`). The gradient is `∂z/∂w = 2w`.

**Element-wise vs. matrix multiply**
- `*` multiplies element-by-element and returns the same shape as the inputs (after broadcasting).
- `@` / `matmul` performs linear-algebraic matrix multiplication, e.g., `(m×k) @ (k×n) → (m×n)`.

Below, we print `w`, compute `z = w * w`, and print a single squared entry (`w[i][j] ** 2`) so you can compare it with the corresponding entry `z[i][j]` in the printed tensor.

In [6]:
# element wise multiplication

print(w)

z = w * w

print(z)

i = 3
j = 2
print(w[i][j] ** 2)

tensor([[-0.0205,  0.9367,  0.1821],
        [ 1.7163, -0.7320,  0.1128],
        [-0.2981,  0.9027,  1.7040],
        [-0.2520,  0.3341, -0.8225],
        [-0.8891, -0.8946,  0.4043]], device='cuda:0', requires_grad=True)
tensor([[4.1980e-04, 8.7742e-01, 3.3146e-02],
        [2.9456e+00, 5.3580e-01, 1.2734e-02],
        [8.8845e-02, 8.1490e-01, 2.9037e+00],
        [6.3523e-02, 1.1164e-01, 6.7657e-01],
        [7.9047e-01, 8.0027e-01, 1.6345e-01]], device='cuda:0',
       grad_fn=<MulBackward0>)
tensor(0.6766, device='cuda:0', grad_fn=<PowBackward0>)
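
The points above (three equivalent spellings, broadcasting, and the `2w` gradient) can be confirmed on a small CPU example. This is an illustrative sketch; the tensors `a`, `row`, and `w_demo` are made up and independent of the lab's `w`.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # shape (2, 3)
row = torch.tensor([10.0, 0.0, -1.0])                  # shape (3,), broadcast across rows

# The three spellings of element-wise multiply agree
assert torch.equal(a * row, a.mul(row))
assert torch.equal(a * row, torch.mul(a, row))
print((a * row).shape)  # torch.Size([2, 3])

# Gradient of sum(w * w) with respect to w is 2w
w_demo = torch.tensor([[0.5, -1.0], [2.0, 3.0]], requires_grad=True)
(w_demo * w_demo).sum().backward()
print(torch.allclose(w_demo.grad, 2 * w_demo.detach()))  # True
```

Summing before calling `backward()` is needed because autograd starts from a scalar unless an explicit upstream gradient is supplied.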


### 3.3 Tensor Dtypes & Devices (Precision Casting)

Modern deep learning performance depends heavily on **dtype** (numeric precision) and **device** (CPU/GPU). PyTorch tensors carry both:
- `dtype` (e.g., `torch.float32`, `torch.bfloat16`, `torch.float16`, `torch.int64`)
- `device` (e.g., `cpu`, `cuda:0`, `mps:0`)

**Why change dtype?**

- **bfloat16 (bf16)** keeps the same exponent range as fp32 with fewer mantissa bits -> great for stable mixed-precision training on modern GPUs (including AMD ROCm), while saving memory and boosting throughput.
- **float16 (fp16)** keeps more mantissa bits than bf16 (10 vs. 7) but has a smaller exponent range (5 bits vs. 8) -> values underflow/overflow more easily, so fp16 training typically relies on loss scaling.
- **float32 (fp32)** is the safe default for numerical stability, but uses more memory and compute.

Below, we print a tensor's dtype and device, cast it to bfloat16, and confirm the new dtype.


In [7]:
print(w, w.dtype, w.device)
w1 = w.bfloat16()
print(w1.dtype)

tensor([[-0.0205,  0.9367,  0.1821],
        [ 1.7163, -0.7320,  0.1128],
        [-0.2981,  0.9027,  1.7040],
        [-0.2520,  0.3341, -0.8225],
        [-0.8891, -0.8946,  0.4043]], device='cuda:0', requires_grad=True) torch.float32 cuda:0
torch.bfloat16
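
The range-versus-precision trade-off can be seen without a GPU. The sketch below uses only the standard library: it emulates bfloat16 by keeping the top 16 bits of a float32 bit pattern (with round-to-nearest-even) and uses `struct`'s half-precision format for fp16. The helper names are hypothetical, and NaN handling is deliberately left out.

```python
import struct


def to_bfloat16(value):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Simplified sketch: NaN and rounding up to infinity are not treated specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                      # round to nearest even
    bits &= 0xFFFF0000                                       # keep sign, 8 exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_float16(value):
    """Round-trip through IEEE half precision using struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(to_bfloat16(1.001))  # 1.0 (the small offset is rounded away)
print(to_float16(1.001))   # 1.0009765625 (10 mantissa bits keep more detail)
print(to_bfloat16(1e20))   # finite: bf16 shares fp32's exponent range
try:
    to_float16(1e20)
except OverflowError as exc:
    print("fp16 overflow:", exc)
```

The same value that fp16 cannot represent survives in bf16 with reduced precision, which is the main reason bf16 is usually easier to train with.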


## 4. Neural Network Architectures

Now let's explore how these fundamental operations combine to create neural networks, from simple multi-layer perceptrons to convolutional networks.

In [8]:
m1 = nn.Linear(20, 30)
m2 = nn.Linear(30, 40)
x = torch.randn(128, 20)
y1 = m1(x)
y2 = m2(y1)
print(y2.shape)

torch.Size([128, 40])


In [9]:
x_large = torch.randn(128, 4096, 30, 20)
y1_large = m1(x_large)
y2_large = m2(y1_large)
print(y2_large.shape)

torch.Size([128, 4096, 30, 40])
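
`nn.Linear` acts on the last dimension only and treats every leading dimension as a batch dimension, which is how transformer projections apply to `[batch, sequence, hidden]` inputs. A small CPU sketch, with made-up sizes and names (`proj`, `x_nd`) chosen to keep memory use low:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
proj = nn.Linear(20, 30)         # hypothetical layer, same sizes as m1
x_nd = torch.randn(2, 3, 5, 20)  # small stand-in with three leading dims

y_nd = proj(x_nd)
print(y_nd.shape)  # torch.Size([2, 3, 5, 30])

# Equivalent view: flatten the leading dims, apply the layer, restore the shape
y_flat = proj(x_nd.reshape(-1, 20)).reshape(2, 3, 5, 30)
print(torch.allclose(y_nd, y_flat, atol=1e-6))
```

Only the last dimension has to match `in_features`; the leading dimensions pass through unchanged.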


In [10]:
x = torch.ones((1, 1, 5, 5))
print(x)
x[:, :, :, 0] = 0
x[:, :, :, 1] = 0
x[:, :, 0, :] = 0
x[:, :, 1, :] = 0
print(x)
conv_w = torch.ones((1, 1, 3, 3))

y = nn.functional.conv2d(x, conv_w)
print(y)

tensor([[[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]]])
tensor([[[[0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0.],
          [0., 0., 1., 1., 1.],
          [0., 0., 1., 1., 1.],
          [0., 0., 1., 1., 1.]]]])
tensor([[[[1., 2., 3.],
          [2., 4., 6.],
          [3., 6., 9.]]]])
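
The output above can be reproduced by sliding the 3x3 kernel over the 5x5 input and summing the products at each position. The following pure-Python sketch (the helper `conv2d_valid` is illustrative, covering only stride 1 with no padding and a single channel) mirrors what `conv2d` computes here:

```python
def conv2d_valid(image, kernel):
    """Cross-correlate a 2-D image with a 2-D kernel (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Same 5x5 input as above: first two rows and columns zeroed, ones elsewhere
image = [[0.0 if r < 2 or c < 2 else 1.0 for c in range(5)] for r in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]
for out_row in conv2d_valid(image, kernel):
    print(out_row)
# [1.0, 2.0, 3.0]
# [2.0, 4.0, 6.0]
# [3.0, 6.0, 9.0]
```

With an all-ones kernel, each output entry simply counts how many ones fall inside the window, so the values grow toward the bottom-right corner.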


## 5. Complete Training Pipeline

Now let's put everything together into a complete deep learning training pipeline using a classic CNN architecture on a real dataset.

In [None]:
# 5.1 Dataset Loading and Preparation

print("=== Loading FashionMNIST Dataset ===")

from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Load FashionMNIST dataset
training_data = datasets.FashionMNIST(root="data", train=True, download=True, transform=ToTensor())

test_data = datasets.FashionMNIST(root="data", train=False, download=True, transform=ToTensor())

# Create data loaders for batched training
batch_size = 64
train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

print("Dataset Information:")
print(f"  Training samples: {len(training_data):,}")
print(f"  Test samples: {len(test_data):,}")
print(f"  Batch size: {batch_size}")
print(f"  Training batches: {len(train_dataloader):,}")
print(f"  Test batches: {len(test_dataloader):,}")

# Check data format
sample_batch = next(iter(train_dataloader))
images, labels = sample_batch
print("\nBatch Information:")
print(f"  Image batch shape: {images.shape}")  # [batch_size, channels, height, width]
print(f"  Label batch shape: {labels.shape}")  # [batch_size]
print(f"  Image range: [{images.min():.3f}, {images.max():.3f}]")
print(f"  Unique labels in batch: {torch.unique(labels).tolist()}")

# Fashion-MNIST class names
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
print(f"\nClass names: {class_names}")

=== Loading FashionMNIST Dataset ===


Dataset Information:
  Training samples: 60,000
  Test samples: 10,000
  Batch size: 64
  Training batches: 938
  Test batches: 157

Batch Information:
  Image batch shape: torch.Size([64, 1, 28, 28])
  Label batch shape: torch.Size([64])
  Image range: [0.000, 1.000]
  Unique labels in batch: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Class names: ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


In [None]:
# 5.2 LeNet Architecture - Classic Convolutional Neural Network

print("=== LeNet-5 Architecture ===")


class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)  # 28x28 -> 24x24
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)  # 12x12 -> 8x8

        # Fully connected layers
        self.fc1 = nn.Linear(16 * 4 * 4, 120)  # 16 channels * 4 * 4 = 256 -> 120
        self.fc2 = nn.Linear(120, 84)  # 120 -> 84
        self.fc3 = nn.Linear(84, num_classes)  # 84 -> 10 classes

    def forward(self, x):
        # Conv layer 1 + ReLU + MaxPool
        x = F.relu(self.conv1(x))  # [batch, 6, 24, 24]
        x = F.max_pool2d(x, 2, 2)  # [batch, 6, 12, 12]

        # Conv layer 2 + ReLU + MaxPool
        x = F.relu(self.conv2(x))  # [batch, 16, 8, 8]
        x = F.max_pool2d(x, 2, 2)  # [batch, 16, 4, 4]

        # Flatten for fully connected layers
        x = x.view(-1, 16 * 4 * 4)  # [batch, 256]

        # Fully connected layers
        x = F.relu(self.fc1(x))  # [batch, 120]
        x = F.relu(self.fc2(x))  # [batch, 84]
        x = self.fc3(x)  # [batch, 10]
        return x


# Create model and move to GPU
model = LeNet(num_classes=10).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("Model Architecture:")
print(model)
print("\nModel Statistics:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Model device: {next(model.parameters()).device}")

# Test forward pass
test_input = torch.randn(1, 1, 28, 28, device=device)
test_output = model(test_input)
print("\nTest forward pass:")
print(f"  Input shape: {test_input.shape}")
print(f"  Output shape: {test_output.shape}")
print(f"  Output probabilities: {F.softmax(test_output, dim=1)[0][:5]}")  # First 5 classes

=== LeNet-5 Architecture ===
Model Architecture:
LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=256, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

Model Statistics:
  Total parameters: 44,426
  Trainable parameters: 44,426
  Model device: cuda:0

Test forward pass:
  Input shape: torch.Size([1, 1, 28, 28])
  Output shape: torch.Size([1, 10])
  Output probabilities: tensor([0.1148, 0.0896, 0.0889, 0.0970, 0.0922], device='cuda:0',
       grad_fn=<SliceBackward0>)
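
The shape comments in the model and the reported parameter count can be derived with a few lines of arithmetic. This is a plain-Python sketch; the helper names (`conv_out`, `conv_params`, `linear_params`) are illustrative and only encode the standard output-size and parameter-count formulas.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28
size = conv_out(size, kernel=5)            # conv1: 28 -> 24
size = conv_out(size, kernel=2, stride=2)  # max pool: 24 -> 12
size = conv_out(size, kernel=5)            # conv2: 12 -> 8
size = conv_out(size, kernel=2, stride=2)  # max pool: 8 -> 4
flat_features = 16 * size * size
print(size, flat_features)  # 4 256


def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out  # weights + one bias per output channel


def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + one bias per output unit


total = (
    conv_params(1, 6, 5)
    + conv_params(6, 16, 5)
    + linear_params(flat_features, 120)
    + linear_params(120, 84)
    + linear_params(84, 10)
)
print(total)  # 44426
```

Most of the parameters sit in `fc1`, which is typical for small CNNs that flatten feature maps into a fully connected layer.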


In [None]:
# 5.3 Complete Training Pipeline

print("=== Training Setup ===")

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

print(f"Loss function: {criterion}")
print(f"Optimizer: {optimizer}")


def train_epoch(dataloader, model, loss_fn, optimizer, device):
    """Train the model for one epoch"""
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (X, y) in enumerate(dataloader):
        # Move data to GPU
        X, y = X.to(device), y.to(device)

        # Forward pass
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backward pass
        optimizer.zero_grad()  # Clear gradients
        loss.backward()  # Compute gradients
        optimizer.step()  # Update parameters

        # Statistics
        total_loss += loss.item()
        _, predicted = pred.max(1)
        total += y.size(0)
        correct += predicted.eq(y).sum().item()

        # Print progress
        if batch_idx % 200 == 0:
            print(
                f"  Batch {batch_idx:>3d}/{len(dataloader):>3d} | "
                f"Loss: {loss.item():>6.4f} | "
                f"Acc: {100.0 * correct / total:>5.1f}%"
            )

    avg_loss = total_loss / len(dataloader)
    accuracy = 100.0 * correct / total
    return avg_loss, accuracy


def test_epoch(dataloader, model, loss_fn, device):
    """Evaluate the model"""
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            loss = loss_fn(pred, y)

            total_loss += loss.item()
            _, predicted = pred.max(1)
            total += y.size(0)
            correct += predicted.eq(y).sum().item()

    avg_loss = total_loss / len(dataloader)
    accuracy = 100.0 * correct / total
    return avg_loss, accuracy


# Training loop
print("\n=== Training Started ===")
epochs = 5  # Reduced for demo
train_losses, train_accs = [], []
test_losses, test_accs = [], []

for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    print("-" * 50)

    # Train
    train_loss, train_acc = train_epoch(train_dataloader, model, criterion, optimizer, device)
    train_losses.append(train_loss)
    train_accs.append(train_acc)

    # Test
    test_loss, test_acc = test_epoch(test_dataloader, model, criterion, device)
    test_losses.append(test_loss)
    test_accs.append(test_acc)

    print(f"\nEpoch {epoch + 1} Summary:")
    print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%")
    print(f"  Test Loss:  {test_loss:.4f} | Test Acc:  {test_acc:.2f}%")

print("\nTraining Complete!")
print(f"Final Test Accuracy: {test_accs[-1]:.2f}%")

print("\nKey Learning Points:")
print(" - Forward pass: Data flows through network layers")
print(" - Loss computation: Measures prediction errors")
print(" - Backward pass: Computes gradients via chain rule")
print(" - Parameter update: Moves in direction of negative gradient")
print(" - This is the foundation of ALL deep learning!")
print(" - Same principles apply to LLMs, just different architectures")

=== Training Setup ===
Loss function: CrossEntropyLoss()
Optimizer: SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)

=== Training Started ===

Epoch 1/5
--------------------------------------------------
  Batch   0/938 | Loss: 0.3842 | Acc:  89.1%
  Batch 200/938 | Loss: 0.5591 | Acc:  80.3%
  Batch 400/938 | Loss: 0.5601 | Acc:  80.1%
  Batch 600/938 | Loss: 0.5557 | Acc:  80.3%
  Batch 800/938 | Loss: 0.5397 | Acc:  80.5%

Epoch 1 Summary:
  Train Loss: 0.5240 | Train Acc: 80.73%
  Test Loss:  0.5266 | Test Acc:  81.10%

Epoch 2/5
--------------------------------------------------
  Batch   0/938 | Loss: 0.4744 | Acc:  89.1%
  Batch 200/938 | Loss: 0.4347 | Acc:  81.5%
  Batch 400/938 | Loss: 0.3430 | Acc:  81.4%
  Batch 600/938 | Loss: 0.3447 | Acc:  81.6%
  Batch 800/938 | Loss: 0.4633 | Acc:  81.8%

Epoch 2 Summary:
  Train Loss: 0.496
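
The two ingredients of the training setup, cross-entropy on raw logits and SGD with momentum, can each be written in a few lines. The sketch below is pure Python with hypothetical helper names; the update follows the momentum form described in PyTorch's SGD documentation (no dampening, no Nesterov), and the quadratic objective is a made-up stand-in for a real loss.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def sgd_momentum_step(params, grads, velocity, lr=1e-3, momentum=0.9):
    """Update params in place: v = momentum * v + g, then p = p - lr * v."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] + g
        params[i] -= lr * velocity[i]


# Uniform logits over 10 classes give a loss of ln(10)
print(round(cross_entropy([0.0] * 10, target=3), 4))  # 2.3026

# Minimize f(p) = p0^2 + p1^2, whose gradient is 2p
params, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    grads = [2.0 * p for p in params]
    sgd_momentum_step(params, grads, velocity, lr=0.01)
print(max(abs(p) for p in params))
```

A freshly initialized 10-class classifier that predicts near-uniform probabilities therefore starts at a loss of about 2.30, a useful reference point when reading training logs.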

## Lab Summary

### Technical Concepts Learned
- **Tensor Fundamentals**: Creating tensors, understanding shapes, devices, and basic properties in PyTorch
- **Matrix Multiplication**: Core operation for neural networks using `@`, `torch.matmul`, and `nn.Linear` layers
- **Automatic Differentiation**: Introduction to requires_grad, backward(), and gradient computation
- **Element-wise Operations**: Hadamard product and basic tensor arithmetic
- **Data Types**: Understanding float32, float16, bfloat16 and their memory implications
- **Basic Neural Networks**: Stacking `nn.Linear` layers into simple MLPs and building the LeNet CNN architecture
- **Training Fundamentals**: Forward pass, loss computation, backward pass, and parameter updates

### Experiment Further
- Create tensors with different shapes and verify matrix multiplication dimensions
- Compute gradients manually and compare with autograd results
- Modify the sizes of LeNet's fully connected layers (`fc1`, `fc2`) and observe effects on accuracy
- Try different activation functions (ReLU, Tanh, Sigmoid) in the networks
- Change the number of training epochs and observe convergence
