# Introduction to PyTorch (2026)

Iona Biggart, Antigone Fogel, Anastasia Gailly de Taurine, Nan Fletcher-Lloyd, Payam Barnaghi

## What is the point of this notebook?

The goal of this notebook is to understand **how data flows through a PyTorch machine learning pipeline**.

We will:
- Turn raw data into **tensors**
- Load data in batches with **DataLoaders**
- Define a **model** that makes predictions
- Measure errors with a **loss function**
- Improve the model using **optimisation**
- Evaluate performance on **unseen data**

The focus is on **understanding the structure**, not on building a complex model.

> By the end, you should be able to read, write, and modify a basic PyTorch training loop.


In [None]:
# Core PyTorch library
import torch
import torch.nn as nn
import torch.optim as optim

# Utilities for datasets and batching
from torch.utils.data import DataLoader, Dataset, random_split

# Vision datasets (MNIST is included here)
import torchvision
import torchvision.transforms as transforms

# Visualisation
import matplotlib.pyplot as plt

# additional
import numpy as np
import os
import pandas as pd


## Tensors: The Building Blocks of PyTorch

Tensors are the main data structure used in PyTorch.  
They store numbers and have a **shape**, **data type**, and **device**.

Everything in PyTorch — inputs, outputs, and model parameters — is a tensor.
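
As a quick illustration of those three properties, the sketch below builds a small tensor and inspects it. The variable names `t` and `i` are only for illustration, and the values shown assume PyTorch's default settings have not been changed (32-bit floats, 64-bit integers, CPU as the default device).

```python
import torch

# A 2 x 3 tensor built from Python floats
t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

print(t.shape)   # rows and columns
print(t.dtype)   # Python floats become 32-bit floats by default
print(t.device)  # tensors are created on the CPU unless told otherwise

# Python integers are stored with a different default data type
i = torch.tensor([1, 2, 3])
print(i.dtype)
```

The data type matters later on: model inputs are usually floating point, while class labels are integers.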


### 1. Creating tensors from Python lists


In [None]:
# Scalar (0D tensor)
a = torch.tensor(3)

# Vector (1D tensor)
b = torch.tensor([1, 2, 3])

# Matrix (2D tensor)
c = torch.tensor([[1, 2], [3, 4]])

a, b, c


### 2. Checking tensor properties

In [None]:
print("Shape:", c.shape)
print("Data type:", c.dtype)
print("Device:", c.device) # CPU or GPU 


### 3. Creating tensors with built-in functions

In [None]:
zeros = torch.zeros(3, 4) # 3 rows, 4 columns of 0s
ones = torch.ones(2, 2) # 2 rows, 2 columns of 1s
random = torch.rand(2, 3) # 2 rows, 3 columns of random values between 0 and 1

zeros, ones, random


### 4. Basic tensor operations
#### 4a. Element-wise addition

In [None]:
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

print(x + y) # Element-wise addition


#### 4b. Matrix multiplication 


In [None]:
# Input matrix (e.g. a batch of 2 samples with 3 features)
X = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Weight matrix (maps 3 input features to 2 outputs)
W = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# Matrix multiplication
Y = X @ W

Y


### 5. Moving tensors to GPU (if available)

In [None]:
# CPU information
num_cpus = os.cpu_count()
print(f"Number of CPUs available: {num_cpus}")

# GPU information
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs available: {num_gpus}")

if num_gpus > 0:
    for i in range(num_gpus):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPU available. Using CPU.")


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)

x.device

### Side note: GPU vs CUDA (What’s the Difference?)

These two terms are often confused, but they are **not the same thing**.

### GPU (Graphics Processing Unit)
- A **piece of hardware**
- Designed to perform many calculations in parallel
- Used to speed up deep learning computations
- Can come from different vendors (NVIDIA, AMD, Apple)

### CUDA
- A **software platform** created by NVIDIA
- Allows programs (like PyTorch) to run code on **NVIDIA GPUs**
- Provides tools, drivers, and libraries for GPU acceleration
- Only works with **NVIDIA GPUs**

### How this relates to PyTorch
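- PyTorch can run on:
  - CPU
  - NVIDIA GPUs (via CUDA, `device="cuda"`)
  - AMD GPUs (via ROCm builds of PyTorch, which reuse the `"cuda"` device name)
  - Apple GPUs (via Metal / MPS, `device="mps"`)
- When you use `device="cuda"` on a standard PyTorch install, you are using **CUDA on an NVIDIA GPU**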

### Key takeaway
> *GPU is the hardware. CUDA is NVIDIA’s way of using it.*
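
A minimal sketch of how a script might choose between these backends is shown below. The helper name `pick_device` is hypothetical (it is not part of PyTorch), and the order of preference (CUDA, then MPS, then CPU) is just one reasonable choice.

```python
import torch

def pick_device():
    """Return a torch.device: CUDA if available, else Apple MPS, else CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps is only present in recent PyTorch releases, so guard the lookup
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(3).to(device)  # the returned tensor lives on the chosen device
print(device, x.device)
```

Because the rest of the code only refers to `device`, the same notebook can run unchanged on a laptop CPU or on a GPU machine.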


## Loading the MNIST Dataset

We now load a **real dataset** so we can train and evaluate a neural network.

**MNIST** is a classic dataset of handwritten digits:
- Each image is a **28 × 28 grayscale image**
- Each image shows a digit from **0 to 9**
- It is commonly used to learn and test deep learning pipelines

### Why we need preprocessing (transforms)

Raw images cannot be used directly by PyTorch models.
We apply **transforms** to prepare the data:

- Convert images into **PyTorch tensors**
- Scale pixel values to a **standard numerical range**
- Ensure the data has a consistent format for training
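
To make the scaling step concrete, here is a small worked example using plain tensors. It assumes 8-bit pixel values (0 to 255) and a mean and standard deviation of 0.5; the pixel values themselves are made up for illustration.

```python
import torch

# Hypothetical raw pixel intensities in the usual 0-255 range
raw = torch.tensor([0, 64, 128, 255], dtype=torch.uint8)

# Step 1: scale to [0, 1] (this mirrors what ToTensor does for 8-bit images)
scaled = raw.float() / 255.0

# Step 2: normalise with mean 0.5 and std 0.5, i.e. (x - 0.5) / 0.5
normalised = (scaled - 0.5) / 0.5

print(scaled)
print(normalised)  # black pixels map to -1, white pixels map to +1
```

Centring the inputs around zero like this tends to make gradient-based training better behaved than feeding in raw 0 to 255 values.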

### Train vs Validation vs Test data

- **Training set**
  - Used to fit the model's weights

- **Validation set**
  - Used to monitor performance during training
  - Helps detect overfitting
  - Used to tune hyperparameters

- **Test set**
  - Used only once, at the very end
  - Gives an unbiased estimate of real-world performance

> The test set should never influence training decisions.

This separation helps measure how well the model generalises.
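
The three-way split can be sketched on a toy dataset as follows. The 70/15/15 proportions, the random data, and the names `full`, `train_set`, `val_set`, and `test_set` are assumptions made for this illustration only.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset: 100 samples with 4 features each, labels 0 or 1
features = torch.randn(100, 4)
labels = torch.randint(0, 2, (100,))
full = TensorDataset(features, labels)

# 70% train, 15% validation, remainder test (integer arithmetic avoids rounding surprises)
n_train = (70 * len(full)) // 100
n_val = (15 * len(full)) // 100
n_test = len(full) - n_train - n_val

gen = torch.Generator().manual_seed(0)  # fixed seed so the split is repeatable
train_set, val_set, test_set = random_split(
    full, [n_train, n_val, n_test], generator=gen
)

# The subsets must not share any sample, otherwise evaluation would be biased
train_idx, val_idx, test_idx = (set(s.indices) for s in (train_set, val_set, test_set))
overlap = (train_idx & val_idx) | (train_idx & test_idx) | (val_idx & test_idx)

print(len(train_set), len(val_set), len(test_set), len(overlap))
```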

In the next cell, we:
- Define the preprocessing steps
- Download the MNIST dataset
- Create training, validation, and test datasets ready for batching


In [None]:
# Set random seeds for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

In [None]:
transform = transforms.Compose([
    transforms.ToTensor(),              # Convert image → PyTorch tensor
    transforms.Normalize((0.5,), (0.5,)) # Normalise values to ~[-1, 1]
])


train_dataset = torchvision.datasets.MNIST(
    root="./data",
    train=True,
    download=True,
    transform=transform
)

test_dataset = torchvision.datasets.MNIST(
    root="./data",
    train=False,
    download=True,
    transform=transform
)

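# Hold out 10% of the official training set for validation
train_size = int(0.9 * len(train_dataset))
val_size = len(train_dataset) - train_size

# Fixed generator so the split is identical on every run
generator = torch.Generator().manual_seed(seed)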

train_dataset, val_dataset = random_split(
    train_dataset,
    [train_size, val_size],
    generator=generator
)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")



## Creating DataLoaders

Datasets store the data, but models are trained using **batches of data**.  
A **DataLoader** controls how data is grouped, ordered, and fed to the model.

### What a DataLoader does
- Splits data into **mini-batches**
- Controls **shuffling** of samples
- Efficiently loads data during training and evaluation

### Shuffling: why it matters

- **Training (`shuffle=True`)**
  - Randomises the order of samples each epoch
  - Prevents the model from learning order-specific patterns
  - Improves generalisation

- **Validation & Test (`shuffle=False`)**
  - Keeps data order fixed
  - Ensures reproducible and comparable evaluation
  - Reflects real-world inference conditions

### Batch size

- `batch_size` controls how many samples the model sees at once
- Smaller batches → noisier updates, lower memory usage
- Larger batches → smoother updates, higher memory usage
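
The batching and shuffling behaviour described above can be checked on a tiny dataset. This is a sketch with made-up data (ten samples holding the numbers 0 to 9) and an arbitrary batch size of 4.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples, each a single number 0..9, with dummy labels
data = torch.arange(10).float().unsqueeze(1)   # shape (10, 1)
targets = torch.zeros(10, dtype=torch.long)
toy = TensorDataset(data, targets)

# No shuffling: batches come out in the stored order
loader = DataLoader(toy, batch_size=4, shuffle=False)
batch_sizes = [xb.shape[0] for xb, yb in loader]
print(batch_sizes)   # the last batch is smaller because 10 is not a multiple of 4
print(len(loader))   # number of batches per epoch

# With shuffling, the order changes but every sample still appears once per epoch
shuffled = DataLoader(toy, batch_size=4, shuffle=True)
seen = sorted(int(v) for xb, _ in shuffled for v in xb.flatten())
print(seen)
```

Note that `len(loader)` counts batches, not samples, which is why the training loop later divides the accumulated loss by `len(train_loader)`.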

In the next cell, we create DataLoaders for:
- Training
- Validation
- Testing


In [None]:
batch_size = 64

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True
)

val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False
)

test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False
)

In [None]:
# Visualise some training data

images, labels = next(iter(train_loader))

plt.figure(figsize=(8, 4))
for i in range(8):
    plt.subplot(2, 4, i+1)
    plt.imshow(images[i].squeeze(), cmap="gray")
    plt.title(f"Label: {labels[i].item()}")
    plt.axis("off")
plt.show()


## Defining the Model

We now define a **neural network** that maps input images to digit predictions.

### What this model does
- Takes a **28 × 28 image** as input
- Flattens it into a vector
- Learns intermediate representations
- Outputs **10 scores**, one for each digit (0–9)

### Key ideas
- Models are built by subclassing `nn.Module`
- Layers define **what can be learned**
- The `forward()` method defines **how data flows**
- Inputs and outputs are **tensors**
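
Before writing the full class, it can help to trace the tensor shapes through such a network. The sketch below uses `nn.Sequential` for brevity and assumes a hidden layer of 128 units, which is an illustrative choice; the name `sketch` and the dummy batch are not part of the notebook's model.

```python
import torch
import torch.nn as nn

sketch = nn.Sequential(
    nn.Flatten(),             # (batch, 1, 28, 28) -> (batch, 784)
    nn.Linear(28 * 28, 128),  # (batch, 784) -> (batch, 128)
    nn.ReLU(),                # shape unchanged
    nn.Linear(128, 10),       # (batch, 128) -> (batch, 10)
)

dummy = torch.zeros(4, 1, 28, 28)  # a fake batch of 4 blank images
out = sketch(dummy)
print(out.shape)                   # one row of 10 scores per image

# Learnable parameters: weights and biases of both Linear layers
n_params = sum(p.numel() for p in sketch.parameters())
print(n_params)                    # 784*128 + 128 + 128*10 + 10
```

Passing a dummy batch through a model like this is a cheap way to catch shape mismatches before any training starts.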


In [None]:
class SimpleNeuralNetwork(nn.Module):

    """
    This class defines a simple feedforward neural network for image classification.

    - Takes 28×28 images as input (MNIST)
    - Flattens each image into a vector of length 784
    - Uses fully connected (Linear) layers with a ReLU activation
    - Outputs 10 scores, one for each digit class (0–9)

    """

    def __init__(self): # Define layers and components 
        super().__init__() # Initialize the parent class (nn.Module)

        # Flatten 28x28 image → vector of length 784
        self.flatten = nn.Flatten()

        # Fully connected layers
        self.fc1 = nn.Linear(28 * 28, 128) # Hidden layer with 128 neurons, input size 784
        self.relu = nn.ReLU() # Activation function, introduces non-linearity
        self.fc2 = nn.Linear(128, 10)  # 10 classes (digits 0–9), output layer

    def forward(self, x):
        # Define how data flows through the network
        x = self.flatten(x)  # Flatten the input
        x = self.fc1(x)  # Apply first fully connected layer
        x = self.relu(x)  # Apply ReLU activation
        x = self.fc2(x) # Output layer
        return x


In [None]:
model = SimpleNeuralNetwork().to(device)  # Create the model and move all of its parameters to the device
print(model)

In [None]:
# CrossEntropyLoss: loss function for multi-class classification
criterion = nn.CrossEntropyLoss()

# Adam is a commonly used adaptive optimiser
optimizer = optim.Adam(model.parameters(), lr=0.001)
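
A short aside on the loss function: `nn.CrossEntropyLoss` expects the model's raw scores (logits) and integer class labels, and applies the softmax internally, which is why the model has no softmax layer. The sketch below checks this against a manual computation; the logits and targets are hypothetical values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical raw scores (logits) for 2 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # correct class index for each sample

loss = nn.CrossEntropyLoss()(logits, targets)

# Manual equivalent: log-softmax, pick the log-probability of the true class, average
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()

print(loss.item(), manual.item())  # the two values agree
```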


## Training and Validation Loop

We now train the model and evaluate it on **validation data** after each epoch.

### What happens during training
- The model sees batches of training data
- Predictions are compared to true labels
- Errors (loss) are backpropagated
- Model weights are updated
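
One detail behind the update step is that PyTorch accumulates gradients across calls to `backward()` unless they are reset, which is the job of `optimizer.zero_grad()` in a training loop. The sketch below shows the effect on a single scalar parameter `w` (a hypothetical stand-in for a model weight).

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without a reset: the new gradient is added to the old one
(w ** 2).backward()
accumulated = w.grad.item()

# Reset the gradient, then run a fresh backward pass
w.grad = None
(w ** 2).backward()
fresh = w.grad.item()

print(first, accumulated, fresh)
```

Without the reset, each weight update would mix in gradients from earlier batches.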

### What happens during validation
- The model is switched to **evaluation mode**
- No gradients are computed
- Performance is measured on unseen data
- We monitor generalisation, not learning
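
The two switches used during validation do different things, as the sketch below tries to show. `nn.Dropout` is used here purely as an example of a layer that behaves differently in evaluation mode; a network made only of `Linear` and `ReLU` layers gives the same outputs in both modes, but calling `eval()` is still a good habit.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 3)

# Evaluation mode: dropout becomes a no-op, so the output equals the input
drop.eval()
same = torch.equal(drop(x), x)

# Normal forward pass: the result is tracked so gradients can be computed later
tracked = layer(x).requires_grad

# Inside no_grad: nothing is tracked, which saves memory and time
with torch.no_grad():
    untracked = layer(x).requires_grad

print(same, tracked, untracked)
```

In short, `model.eval()` changes layer behaviour, while `torch.no_grad()` turns off gradient bookkeeping; validation code normally uses both.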


In [None]:
train_losses = []
val_losses = []

num_epochs = 5 # Number of times to iterate over the entire training dataset

for epoch in range(num_epochs):
    # ---- Training ----
    model.train()  # Set model to training mode
    running_loss = 0.0 # Initialise running loss for the epoch

    for images, labels in train_loader:
        # Move data to CPU/GPU
        images = images.to(device) # Important: move images to the same device as the model
        labels = labels.to(device) # Move labels to device

        # ---- Forward pass ----
        outputs = model(images) # Get model predictions
        loss = criterion(outputs, labels) # Compute loss

        # ---- Backward pass ----
        optimizer.zero_grad()  # Reset gradients (remove old gradients from previous step)
        loss.backward()        # Compute new gradients 
        optimizer.step()       # Update weights
        # These three steps ensure each weight update is based only on the current batch’s loss.

        running_loss += loss.item() # Accumulate loss over batches

    avg_loss = running_loss / len(train_loader) # Average loss for the epoch
    train_losses.append(avg_loss) # Store training loss
    print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {avg_loss:.4f}")


    # ---- Validation ----
    model.eval()  # Set model to evaluation mode
    val_loss = 0.0

    with torch.no_grad(): # Disable gradient computation
        for images, labels in val_loader:
            images = images.to(device)
            labels = labels.to(device)

            outputs = model(images)
            loss = criterion(outputs, labels)

            val_loss += loss.item() # accumulate validation loss over batches

    avg_val_loss = val_loss / len(val_loader) # average validation loss per epoch
    val_losses.append(avg_val_loss)
    print(f"Validation Loss: {avg_val_loss:.4f}")



## Identifying Overfitting

- Training loss usually **decreases steadily**
- Validation loss decreases at first, then may **stop improving**
- If validation loss increases while training loss keeps decreasing:
  → the model is **overfitting**
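
These rules can also be applied programmatically. The sketch below uses invented loss values (not results from this notebook) and hypothetical names `toy_train` and `toy_val` to find the epoch with the lowest validation loss and flag the epochs that follow the overfitting pattern.

```python
# Hypothetical per-epoch losses, invented purely for illustration
toy_train = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
toy_val = [0.95, 0.70, 0.58, 0.55, 0.59, 0.66]

# Epoch (1-indexed) with the lowest validation loss
best_epoch = min(range(len(toy_val)), key=lambda i: toy_val[i]) + 1

# Later epochs where validation loss rose while training loss kept falling
overfit_epochs = [
    i + 1
    for i in range(best_epoch, len(toy_val))
    if toy_val[i] > toy_val[i - 1] and toy_train[i] < toy_train[i - 1]
]

print(best_epoch, overfit_epochs)
```

In practice, a common response is to keep the model weights from the best validation epoch rather than the last one.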

In [None]:
import matplotlib.pyplot as plt

epochs = range(1, num_epochs + 1)

plt.plot(epochs, train_losses, label="Training Loss")
plt.plot(epochs, val_losses, label="Validation Loss")

plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.show()


## Final Evaluation on the Test Set

After training and validation, we evaluate the model **once** on the test dataset.

### Purpose of the test set
- Measures **true performance** on unseen data
- Must not influence training or model choices
- Used only after all training decisions are final

### What happens during testing
- The model is set to **evaluation mode**
- Gradient computation is disabled
- Predictions are compared to true labels
- Accuracy is computed over the entire test set

This gives an unbiased estimate of how the model would perform in the real world.
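
The accuracy calculation itself is simple enough to trace by hand. The sketch below uses hypothetical scores for four samples and three classes; the names `toy_outputs` and `toy_labels` are for illustration only.

```python
import torch

# Hypothetical model outputs (one row of scores per sample)
toy_outputs = torch.tensor([[2.0, 0.1, 0.3],
                            [0.2, 1.5, 0.1],
                            [0.3, 0.2, 0.9],
                            [1.2, 0.4, 0.8]])
toy_labels = torch.tensor([0, 1, 1, 0])

toy_predictions = toy_outputs.argmax(dim=1)                  # index of the highest score in each row
toy_correct = (toy_predictions == toy_labels).sum().item()   # number of matches
toy_accuracy = 100 * toy_correct / toy_labels.size(0)

print(toy_predictions.tolist(), toy_accuracy)
```

Here the third sample is misclassified (predicted class 2, true class 1), so three of four predictions are correct.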


In [None]:
model.eval()  # Evaluation mode
correct = 0
total = 0

with torch.no_grad():  # Disable gradient computation
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)
        predictions = outputs.argmax(dim=1)
 
        total += labels.size(0) # Total number of labels
        correct += (predictions == labels).sum().item() # Count correct predictions

accuracy = 100 * correct / total 
print(f"Test Accuracy: {accuracy:.2f}%")


## Example 2: Working with Tabular Data (CSV)

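So far, we used image data.  
Now we show the **same PyTorch pipeline** on tabular data, the kind usually stored in a **CSV file**, which is very common in science and industry. For convenience, we load the Iris dataset from scikit-learn as a pandas DataFrame instead of reading a file from disk; the resulting table has the same row-and-column layout.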

In this example:
- Rows = data samples
- Columns = features
- One column = target (label)

The goal is to show that **PyTorch works the same way regardless of data type**.
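
For data that really does live in a CSV file, the first step might look like the sketch below. The column names and values are made up, and an in-memory string stands in for a file on disk; with a real file, a path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd
import torch

# A tiny in-memory "file" standing in for a real CSV (made-up values)
csv_text = """height,width,target
5.1,3.5,0
4.9,3.0,0
6.3,3.3,1
5.8,2.7,1
"""

df_toy = pd.read_csv(io.StringIO(csv_text))

# Feature columns -> float tensor, target column -> integer class labels
X_toy = torch.tensor(df_toy.drop(columns="target").to_numpy(), dtype=torch.float32)
y_toy = torch.tensor(df_toy["target"].to_numpy(), dtype=torch.long)

print(X_toy.shape, y_toy.tolist())
```

From this point on, the tensors are handled the same way regardless of where the table came from.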


In [None]:
# Set random seeds for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

In [None]:
from sklearn.datasets import load_iris

# Load Iris as a pandas DataFrame (4 feature columns plus a "target" column)
iris = load_iris(as_frame=True)
df = iris.frame   # includes features + target

# Separate features and labels
X = df.drop(columns="target").values
y = df["target"].values  # 3 flower species (classes 0, 1, 2)

In [None]:
df.head()

In [None]:
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

In [None]:
# A custom Dataset tells PyTorch how many samples there are and how to fetch one by index

class IrisDataset(Dataset):
    def __init__(self, features, labels):
        self.X = features
        self.y = labels

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx] # return features and label for given index

dataset = IrisDataset(X, y)

In [None]:
train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size # Remaining samples go to test set

generator = torch.Generator().manual_seed(42) # For reproducibility

train_dataset, val_dataset, test_dataset = random_split(
    dataset,
    [train_size, val_size, test_size],
    generator=generator
)


In [None]:
batch_size = 16

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=batch_size, shuffle=False)
test_loader  = DataLoader(test_dataset,  batch_size=batch_size, shuffle=False)

In [None]:
class MLP(nn.Module):
    """
    A simple Multi-Layer Perceptron (MLP).

    - Takes feature vectors as input
    - Uses fully connected (Linear) layers with ReLU activations
    - Learns non-linear relationships in the data
    - Outputs 3 scores, one for each Iris class
    """

    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 3)  # 3 Iris classes
        )

    def forward(self, x):
        return self.net(x)

model = MLP(input_dim=X.shape[1]) # Input dimension based on features
print(model)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [None]:
num_epochs = 50
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    # ---- Training ----
    model.train()
    running_loss = 0.0

    for features, labels in train_loader:
        outputs = model(features)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    avg_train_loss = running_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # ---- Validation ----
    model.eval()
    val_loss = 0.0

    with torch.no_grad():
        for features, labels in val_loader:
            outputs = model(features)
            loss = criterion(outputs, labels)
            val_loss += loss.item()

    avg_val_loss = val_loss / len(val_loader)
    val_losses.append(avg_val_loss)

    print(
        f"Epoch [{epoch+1}/{num_epochs}] "
        f"Train Loss: {avg_train_loss:.4f} "
        f"Val Loss: {avg_val_loss:.4f}"
    )

In [None]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for features, labels in test_loader:
        outputs = model(features)
        predictions = outputs.argmax(dim=1)

        total += labels.size(0)
        correct += (predictions == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")

## PyTorch Beyond Scratch Models: Using Pretrained LLMs

So far, we built neural networks **from scratch** using PyTorch.

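In practice, many modern AI systems use:
- **Large pretrained models**
- Usually built and trained with **PyTorch**
- Loaded through high-level libraries such as Hugging Face Transformers (tools like Ollama serve similar models, but through their own runtime rather than PyTorch)

The key idea:
> PyTorch is the engine under the hood; libraries like Transformers add convenience on top.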


### Hugging Face Transformers

Hugging Face provides access to pretrained models for:
- Text generation
- Classification
- Translation
- Question answering

Most Hugging Face models are implemented in **PyTorch**.


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


model_name = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float32  # CPU-safe (`torch_dtype` is the deprecated name for this argument)
)

model.eval()


PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (rotary_emb): PhiRotaryEmbedding()
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (final_layernorm): LayerNorm((2560,), eps=1

In [2]:
prompt = "What is dementia?"

inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=40,
        temperature=0.7,
        do_sample=True
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is dementia?
Dementia is a syndrome that describes a group of symptoms, including memory loss, confusion and social withdrawal. The symptoms are caused by brain damage from Alzheimer’s, Parkinson’s
