<a href="https://colab.research.google.com/github/BedinEduardo/Colab_Repositories/blob/master/10_PyTorch_2_dot_0_ZTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What are we going to cover?

* New PyTorch 2.0 features - mainly torch.compile()
* Setting up a **series of experiments** to test PyTorch 2.0's improvements
* **Compare** the results of the experiments
* Discuss where to learn more

# PyTorch 2 Quick Intro

* PyTorch 2.0 release notes: https://pytorch.org/blog/pytorch-2.0-release/


In [None]:
import torch
print(torch.__version__)

2.5.1+cu124


## Quick code examples

## Before PyTorch 2.0

In [None]:
import torch
import torchvision
torch.multiprocessing.set_start_method('spawn')

model = torchvision.models.resnet50()

## After PyTorch 2.0

Note: some PyTorch 2.0 features may hinder the deployment of models: https://pytorch.org/get-started/pytorch-2.0/#inference-and-export

In [None]:
model = torchvision.models.resnet50()   # note: this could be any model
compiled_model = torch.compile(model)

#https://pytorch.org/get-started/pytorch-2.0/#inference-and-export
### Training Code


### Testing Code


## 0. Getting started

In [None]:
import torch

# Check PyTorch version
pt_version = torch.__version__
print(f"[INFO] Current PyTorch version: {pt_version} (should be 2.x+)")

# Install PyTorch 2.0 if necessary
if pt_version.split(".")[0] == "1": # Check if PyTorch version begins with 1
    !pip3 install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    print("[INFO] PyTorch 2.x installed, if you're on Google Colab, you may need to restart your runtime.\
          Though as of April 2023, Google Colab comes with PyTorch 2.0 pre-installed.")
    import torch
    pt_version = torch.__version__
    print(f"[INFO] Current PyTorch version: {pt_version} (should be 2.x+)")
else:
    print("[INFO] PyTorch 2.x installed, you'll be able to use the new features.")

[INFO] Current PyTorch version: 2.5.1+cu124 (should be 2.x+)
[INFO] PyTorch 2.x installed, you'll be able to use the new features.


## Get GPU info

Why get GPU info?

Because PyTorch 2.0 features such as `torch.compile()` work best on newer NVIDIA GPUs.

Well, what's a newer NVIDIA GPU?

To find out if your GPU is compatible, see NVIDIA's GPU compute capability scores: https://developer.nvidia.com/cuda-gpus

If your GPU has a score of 8.0+, it can leverage *most* if not *all* of the new PyTorch 2.0 features.

GPUs with a score under 8.0 can still leverage PyTorch 2.0, however, the improvements may not be as noticeable as on those with 8.0+.

**Note:** If you are wondering what GPU you should use for deep learning, check out Tim Dettmers' blog post "Which GPU for Deep Learning?"

In [None]:
# Make sure we're using a NVIDIA GPU
if torch.cuda.is_available():
  gpu_info = !nvidia-smi
  gpu_info = '\n'.join(gpu_info)
  if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

  # Get GPU name
  gpu_name = !nvidia-smi --query-gpu=gpu_name --format=csv
  gpu_name = gpu_name[1]
  GPU_NAME = gpu_name.replace(" ", "_") # replace spaces with underscores for easier saving
  print(f'GPU name: {GPU_NAME}')

  # Get GPU capability score
  GPU_SCORE = torch.cuda.get_device_capability()
  print(f"GPU capability score: {GPU_SCORE}")
  if GPU_SCORE >= (8, 0):
    print(f"GPU score higher than or equal to (8, 0), PyTorch 2.x speedup features available.")
  else:
    print(f"GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).")

  # Print GPU info
  print(f"GPU information:\n{gpu_info}")

else:
  print("PyTorch couldn't find a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

GPU name: Tesla_T4
GPU capability score: (7, 5)
GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).
GPU information:
Wed Mar  5 11:10:14 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   55C    P8             11W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |

## 1.1 Globally set devices

Previously, we have set the devices of our tensors/models using `.to(device)`.

* `tensor.to(device)`

* `model.to(device)`

In PyTorch 2.0, it is also possible to set the device with a context manager or to set a global default device: https://pytorch.org/blog/pytorch-2.0-release/#beta-gnn-inference-and-training-optimization-on-cpu

See the docs: https://pytorch.org/tutorials/recipes/recipes/changing_default_device.html

In [None]:
import torch

# set the device
device = "cuda" if torch.cuda.is_available() else "cpu"

device

'cuda'

In [None]:
# Set the device with context manager  - requires PyTorch 2.x+
with torch.device(device):
  # All tensors or PyTorch objects built inside the context manager will be on the target device - without using .to()
  print(f"device: {device}")
  layer = torch.nn.Linear(20,30)
  print(f"Layer weights on device {layer.weight.device}")
  print(f"Layer building data on device: {layer(torch.randn(128,20)).device}")

device: cuda
Layer weights on device cuda:0
Layer building data on device: cuda:0


In [None]:
import torch

# Set the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Set the device globally - requires PyTorch 2.x+
torch.set_default_device(device)

layer = torch.nn.Linear(20,30)
print(f"device: {device}")
print(f"Layer weights on device {layer.weight.device}")
print(f"Layer building data on device: {layer(torch.randn(128,20)).device}")

device: cuda
Layer weights on device cuda:0
Layer building data on device: cuda:0


## 2. Setup the experiments

Time to test speed!

To keep things simple, we will run 4 experiments using the following setup:

* Model: ResNet50 from torchvision
* Data: CIFAR10 from torchvision
* Epochs: 5 (single run) and 3x5 (multi run)
* Batch size: 128 - you may want to change this depending on the amount of memory your GPU has
* Image size: 224 - **Note:** you may also adjust this depending on the amount of GPU memory you have

In [None]:
import torch
import torchvision

print(f"PyTorch version: {torch.__version__}")
print(f"TorchVision version: {torchvision.__version__}")

# Set the target device
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

PyTorch version: 2.5.1+cu124
TorchVision version: 0.20.1+cu124
Using device: cuda


### 2.1 Build model and transforms

* ResNet50 from torchvision - https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html#torchvision.models.ResNet50_Weights

In [None]:
# Build model weights and Transforms
model_weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V2   #.DEFAULT - also works
transforms = model_weights.transforms()

transforms

ImageClassification(
    crop_size=[224]
    resize_size=[232]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)

In [None]:
# Build model
model = torchvision.models.resnet50(weights=model_weights)

model


Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 112MB/s]


ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

In [None]:
# Count the number of parameters in the model
total_params = sum(
    param.numel() for param in model.parameters()   # Count all parameters
    #param.numel() for param in model.parameters() if param.requires_grad   # Only count parameters that are trainable
)

total_params

25557032

**Note:** PyTorch 2.0 *relative* speedups will be most noticeable when as much of the GPU as possible is being used. This means a larger model (more trainable parameters) may take longer to train overall but will be relatively faster. E.g.: a model with 1M parameters may take ~10 minutes to train, but a model with 25M parameters may take only ~20 minutes: 25x the parameters for 2x the training time.
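
To make the note concrete, here is a small sketch that works through its illustrative numbers (1M parameters in ~10 minutes versus 25M parameters in ~20 minutes). The helper name `minutes_per_million_params` is hypothetical, and the figures are the example values from the note, not measurements.

```python
def minutes_per_million_params(num_params: int, train_minutes: float) -> float:
    """Return the training minutes spent per one million parameters."""
    return train_minutes / (num_params / 1_000_000)

# Illustrative values taken from the note above (not measured results)
small = minutes_per_million_params(num_params=1_000_000, train_minutes=10)
large = minutes_per_million_params(num_params=25_000_000, train_minutes=20)

print(f"Small model: {small:.2f} min per 1M parameters")
print(f"Large model: {large:.2f} min per 1M parameters")
print(f"Large model cost per parameter is {small / large:.1f}x lower")
```

The larger model takes longer in absolute terms, but each parameter is far cheaper to train, which is what "relatively faster" means here.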

In [None]:
def create_model(num_classes=10):
  """
    Builds a ResNet50 model and its transforms and returns them both.
  """
  model_weights = torchvision.models.ResNet50_Weights.DEFAULT
  transforms = model_weights.transforms()
  model = torchvision.models.resnet50(weights=model_weights)

  # Adjust the head layer to suit our number of classes
  model.fc = torch.nn.Linear(in_features=2048,
                             out_features=num_classes)

  return model, transforms

model, transforms = create_model()
transforms


ImageClassification(
    crop_size=[224]
    resize_size=[232]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)
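
Swapping the 1000-class head for a 10-class head also changes the parameter count. The sketch below works out the expected total by hand, assuming the head is a single `Linear(2048, ...)` layer as in `create_model()`. The helper name `linear_param_count` is hypothetical; you could confirm the result by re-running the parameter count cell on the new model.

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Parameters in a linear layer: one weight per input/output pair plus one bias per output."""
    return in_features * out_features + out_features

ORIGINAL_TOTAL = 25_557_032  # ResNet50 with the 1000-class ImageNet head (counted earlier)

imagenet_head = linear_param_count(in_features=2048, out_features=1000)
cifar10_head = linear_param_count(in_features=2048, out_features=10)

# Remove the old head's parameters and add the new head's parameters
expected_total = ORIGINAL_TOTAL - imagenet_head + cifar10_head

print(f"ImageNet head parameters: {imagenet_head:,}")
print(f"CIFAR10 head parameters: {cifar10_head:,}")
print(f"Expected total with 10-class head: {expected_total:,}")
```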

### 2.2 Speedups are most noticeable when a large portion of the GPU is being used

Since modern GPUs are so *fast* at performing operations, you will often notice the majority of *relative* speedups when as much data as possible is on the GPU.

In practice, you generally want to use as much of your GPU memory as possible.

* Increasing the batch size - we have been using a batch size of 32, but for GPUs with a larger memory capacity you generally want to use as large a batch size as possible, e.g. 128, 256, 512 ...
* Increasing the data size - for example, instead of using images that are 32x32, use 224x224 or 336x336; you could also increase the embedding size for your data
* Increasing the model size - for example, instead of using a model with 1M parameters, use a model with 10M parameters
* Decreasing data transfer - transferring data costs bandwidth and slows down a GPU, because it sits waiting for data when it wants to be computing on it

As a result of doing the above, your relative speedups should be better.

E.g. overall training time may take longer, but it will not grow linearly with the amount of data or the model size.

Resource for learning how to improve PyTorch model speed: https://sebastianraschka.com/blog/2023/pytorch-faster.html

**Note:** This concept of using as much data on the GPU as possible isn't restricted specifically to PyTorch 2.0, it applies to all versions of PyTorch and basically all models that train on a GPU.
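
As a rough illustration of how batch size and image size scale GPU memory use, the sketch below estimates the size of one float32 input batch. It is a lower bound only: it assumes 3-channel float32 images and ignores model weights, activations, gradients and optimizer state, which usually take far more memory. The helper name `batch_input_bytes` is hypothetical.

```python
def batch_input_bytes(batch_size: int,
                      image_size: int,
                      channels: int = 3,
                      bytes_per_value: int = 4) -> int:
    """Bytes needed for one batch of square images stored as float32 (4 bytes per value)."""
    return batch_size * channels * image_size * image_size * bytes_per_value

# Compare a small setup with the larger setups discussed above
for batch_size, image_size in [(32, 32), (32, 224), (128, 224)]:
    megabytes = batch_input_bytes(batch_size, image_size) / 1e6
    print(f"Batch size {batch_size:>3} | image size {image_size:>3} | input batch: {megabytes:.2f} MB")
```

Going from 32x32 to 224x224 images multiplies the input size by 49, and going from a batch size of 32 to 128 multiplies it by a further 4.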

### 2.3 Checking the memory limits of your GPU

You can do this using `torch.cuda.mem_get_info()`: https://pytorch.org/docs/stable/generated/torch.cuda.mem_get_info.html

In [None]:
# Check available GPU memory and total GPU memory
total_free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()

print(f"Total free GPU memory: {round(total_free_gpu_memory * 1e-9,3)} GB")
print(f"Total GPU memory: {round(total_gpu_memory * 1e-9,3)} GB")


Total free GPU memory: 15.446 GB
Total GPU memory: 15.828 GB


* If the GPU has 16GB+ of free memory, set the batch size to 128 and the image size to 224
* If the GPU has less than 16GB of free memory, set the batch size to 32 and the image size to 128

In [None]:
# Set batch size depending on amount of GPU memory
total_free_gpu_memory_gb = round(total_free_gpu_memory * 1e-9, 3)
if total_free_gpu_memory_gb >= 16:
  BATCH_SIZE = 128 # Note: you could experiment with higher values here if you like.
  IMAGE_SIZE = 224
  print(f"GPU memory available is {total_free_gpu_memory_gb} GB, using batch size of {BATCH_SIZE} and image size {IMAGE_SIZE}")
else:
  BATCH_SIZE = 32
  IMAGE_SIZE = 128
  print(f"GPU memory available is {total_free_gpu_memory_gb} GB, using batch size of {BATCH_SIZE} and image size {IMAGE_SIZE}")

GPU memory available is 15.446 GB, using batch size of 32 and image size 128


In [None]:
# Note: the sizes are hardcoded to 224 here rather than taken from IMAGE_SIZE
transforms.crop_size = 224
transforms.resize_size = 224

print(f"Updated data transforms: \n {transforms}")

Updated data transforms: 
 ImageClassification(
    crop_size=224
    resize_size=224
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)


### 2.4 More potential speedups with TF32

TF32 = TensorFloat32

TensorFloat32 = a datatype that bridges Float32 and Float16

Float32 = a number is represented by 32 bits

Float16 = a number is represented by 16 bits

See more on precision in computing:

What we want is:
1. Fast model training
2. Accurate model training

TensorFloat32 = a datatype from NVIDIA which combines the range of float32 (8 exponent bits) with the precision of float16 (10 mantissa bits).

TF32 is available on Ampere GPUs and newer (compute capability 8.0+).
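
To get a feel for the precision trade-off, the sketch below emulates TF32 on the CPU by zeroing the 13 lowest mantissa bits of a float32 value, leaving 10 mantissa bits. This is only an approximation for illustration: it assumes plain truncation, whereas real hardware applies TF32 inside matrix multiplications and may round instead. The helper name `truncate_to_tf32` is hypothetical.

```python
import struct

def truncate_to_tf32(x: float) -> float:
    """Emulate TF32 precision by keeping the sign, 8 exponent bits and top 10 mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 -> raw 32-bit integer
    bits &= 0xFFFFE000                                   # zero the 13 lowest mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for value in [1.0, 1.0 + 2**-10, 1.0 + 2**-11, 3.14159265, 1e30]:
    print(f"{value!r:>22} -> {truncate_to_tf32(value)!r}")
```

Steps smaller than 2^-10 relative to the value are lost (less precision than float32), but very large values such as 1e30 are still representable because the exponent range is the same as float32.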

In [None]:
GPU_SCORE


(7, 5)

In [None]:
if GPU_SCORE >= (8, 0):  # TF32 requires an Ampere or newer GPU - compute capability (8, 0)+
  print(f"[INFO] Using GPU with score: {GPU_SCORE}, enabling TensorFloat32 (TF32)")
  torch.backends.cuda.matmul.allow_tf32 = True

else:
  print(f"[INFO] Using GPU with score: {GPU_SCORE}, TensorFloat32 (TF32) not available")
  torch.backends.cuda.matmul.allow_tf32 = False

[INFO] Using GPU with score: (7, 5), TensorFloat32 (TF32) not available


## 2.5 Preparing Datasets

As discussed before, we are going to use CIFAR10.

Home of CIFAR10: https://www.cs.toronto.edu/~kriz/cifar.html

We can download the dataset from torchvision - https://pytorch.org/vision/main/generated/torchvision.datasets.CIFAR10.html

In [None]:
transforms

ImageClassification(
    crop_size=224
    resize_size=224
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)

In [None]:
# Build train and test datasets
import torchvision

train_dataset = torchvision.datasets.CIFAR10(root=".", # where to store data
                                             train=True, # training dataset
                                             download=True,
                                             transform=transforms)

test_dataset = torchvision.datasets.CIFAR10(root=".",
                                            train=False,
                                            download=True,
                                            transform=transforms)

# Get the len of the dataset
train_len = len(train_dataset)
test_len = len(test_dataset)

print(f"[INFO] Train dataset lenght: {train_len}")
print(f"[INFO] Test dataset lenght: {test_len}")

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./cifar-10-python.tar.gz


100%|██████████| 170M/170M [00:04<00:00, 36.8MB/s]


Extracting ./cifar-10-python.tar.gz to .
Files already downloaded and verified
[INFO] Train dataset lenght: 50000
[INFO] Test dataset lenght: 10000


In [None]:
train_dataset[0][0].shape

torch.Size([3, 224, 224])

In [None]:
train_dataset[0][1]

6

### 2.6 Build dataloaders

Next:
* Turn datasets into DataLoaders

In [None]:
from torch.utils.data import DataLoader

import os

NUMBER_WORKERS = os.cpu_count()  # Use all available CPU cores to load data for the GPU

#NUMBER_WORKERS
train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=NUMBER_WORKERS,
                              generator=torch.Generator(device='cuda:0')
)
test_dataloader = DataLoader(dataset=test_dataset,
                             batch_size=BATCH_SIZE,
                             shuffle=False,
                             num_workers=NUMBER_WORKERS,
                             generator=torch.Generator(device='cuda:0'))

# Print details
print(f"Train Dataloader numbatches: {len(train_dataloader)} of batch size: {BATCH_SIZE}")
print(f"Tes Dataloader numbatches: {len(test_dataloader)} of batch size: {BATCH_SIZE}")
print(f"Using num workers to load data - more is generally better: {NUMBER_WORKERS}")

Train Dataloader numbatches: 1563 of batch size: 32
Tes Dataloader numbatches: 313 of batch size: 32
Using num workers to load data - more is generally better: 2
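
The batch counts above follow directly from the dataset sizes. A minimal sketch of the arithmetic, assuming the DataLoader keeps the last partial batch (the default `drop_last=False`); the helper name `num_batches` is hypothetical:

```python
import math

def num_batches(dataset_len: int, batch_size: int) -> int:
    """Number of batches per epoch when the last, smaller batch is kept."""
    return math.ceil(dataset_len / batch_size)

# CIFAR10 has 50,000 training images and 10,000 test images
for batch_size in [32, 128]:
    print(f"Batch size {batch_size}: "
          f"{num_batches(50_000, batch_size)} train batches, "
          f"{num_batches(10_000, batch_size)} test batches")
```

With a batch size of 32 this gives the 1563 and 313 batches reported above.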


### 2.7 Building training and test loops

Want to build:
* Training and test loops + a timing step for each, so we know how long our model takes to train and test

We covered this functionality in previous sections, one is here: https://www.learnpytorch.io/05_pytorch_going_modular/#4-creating-train_step-and-test_step-functions-and-train-to-combine-them
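
The timing idea on its own can be sketched as a small context manager that appends the elapsed time of a block to a results dictionary. This is an illustrative alternative rather than the approach used in this notebook: the name `timer` is hypothetical and it uses `time.perf_counter()` instead of `time.time()`. On a GPU, keep in mind that CUDA work is asynchronous, so a timed block should end with something that waits for the GPU (for example a `.item()` call on a result tensor).

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(results: dict, key: str):
    """Time the enclosed block and append the elapsed seconds to results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results.setdefault(key, []).append(time.perf_counter() - start)

timings = {}
for epoch in range(2):
    with timer(timings, "train_epoch_time"):
        time.sleep(0.01)  # stand-in for one training epoch

print(timings)
```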

In [None]:
import time
from tqdm.auto import tqdm
from typing import Dict, List, Tuple

def train_step(epoch: int,
               model: torch.nn.Module,
               dataloader: torch.utils.data.DataLoader,
               loss_fn: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               device: torch.device,
               disable_progress_bar: bool = False) -> Tuple[float, float]:
  """Trains a PyTorch model for a single epoch.

  Turns a target PyTorch model to training mode and then
  runs through all of the required training steps (forward
  pass, loss calculation, optimizer step).

  Args:
    epoch: The current epoch index, shown in the progress bar description.
    model: A PyTorch model to be trained.
    dataloader: A DataLoader instance for the model to be trained on.
    loss_fn: A PyTorch loss function to minimize.
    optimizer: A PyTorch optimizer to help minimize the loss function.
    device: A target device to compute on (e.g. "cuda" or "cpu").
    disable_progress_bar: Whether to hide the tqdm progress bar.

  Returns:
    A tuple of training loss and training accuracy metrics.
    In the form (train_loss, train_accuracy). For example:

    (0.1112, 0.8743)
  """
  # Put model in train mode
  model.train()

  # Setup train loss and train accuracy values
  train_loss, train_acc = 0, 0

  # Loop through data loader data batches
  progress_bar = tqdm(
        enumerate(dataloader),
        desc=f"Training Epoch {epoch}",
        total=len(dataloader),
        disable=disable_progress_bar
    )

  for batch, (X, y) in progress_bar:
      # Send data to target device
      X, y = X.to(device), y.to(device)

      # 1. Forward pass
      y_pred = model(X)

      # 2. Calculate  and accumulate loss
      loss = loss_fn(y_pred, y)
      train_loss += loss.item()

      # 3. Optimizer zero grad
      optimizer.zero_grad()

      # 4. Loss backward
      loss.backward()

      # 5. Optimizer step
      optimizer.step()

      # Calculate and accumulate accuracy metrics across all batches
      y_pred_class = torch.argmax(torch.softmax(y_pred, dim=1), dim=1)
      train_acc += (y_pred_class == y).sum().item()/len(y_pred)

      # Update progress bar
      progress_bar.set_postfix(
            {
                "train_loss": train_loss / (batch + 1),
                "train_acc": train_acc / (batch + 1),
            }
        )


  # Adjust metrics to get average loss and accuracy per batch
  train_loss = train_loss / len(dataloader)
  train_acc = train_acc / len(dataloader)
  return train_loss, train_acc

def test_step(epoch: int,
              model: torch.nn.Module,
              dataloader: torch.utils.data.DataLoader,
              loss_fn: torch.nn.Module,
              device: torch.device,
              disable_progress_bar: bool = False) -> Tuple[float, float]:
  """Tests a PyTorch model for a single epoch.

  Turns a target PyTorch model to "eval" mode and then performs
  a forward pass on a testing dataset.

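  Args:
    epoch: The current epoch index, shown in the progress bar description.
    model: A PyTorch model to be tested.
    dataloader: A DataLoader instance for the model to be tested on.
    loss_fn: A PyTorch loss function to calculate loss on the test data.
    device: A target device to compute on (e.g. "cuda" or "cpu").
    disable_progress_bar: Whether to hide the tqdm progress bar.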

  Returns:
    A tuple of testing loss and testing accuracy metrics.
    In the form (test_loss, test_accuracy). For example:

    (0.0223, 0.8985)
  """
  # Put model in eval mode
  model.eval()

  # Setup test loss and test accuracy values
  test_loss, test_acc = 0, 0

  # Loop through data loader data batches
  progress_bar = tqdm(
      enumerate(dataloader),
      desc=f"Testing Epoch {epoch}",
      total=len(dataloader),
      disable=disable_progress_bar
  )

  # Turn on inference context manager
  with torch.no_grad(): # no_grad() required for PyTorch 2.0, I found some errors with `torch.inference_mode()`, please let me know if this is not the case
      # Loop through DataLoader batches
      for batch, (X, y) in progress_bar:
          # Send data to target device
          X, y = X.to(device), y.to(device)

          # 1. Forward pass
          test_pred_logits = model(X)

          # 2. Calculate and accumulate loss
          loss = loss_fn(test_pred_logits, y)
          test_loss += loss.item()

          # Calculate and accumulate accuracy
          test_pred_labels = test_pred_logits.argmax(dim=1)
          test_acc += ((test_pred_labels == y).sum().item()/len(test_pred_labels))

          # Update progress bar
          progress_bar.set_postfix(
              {
                  "test_loss": test_loss / (batch + 1),
                  "test_acc": test_acc / (batch + 1),
              }
          )

  # Adjust metrics to get average loss and accuracy per batch
  test_loss = test_loss / len(dataloader)
  test_acc = test_acc / len(dataloader)
  return test_loss, test_acc

def train(model: torch.nn.Module,
          train_dataloader: torch.utils.data.DataLoader,
          test_dataloader: torch.utils.data.DataLoader,
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device,
          disable_progress_bar: bool = False) -> Dict[str, List]:
  """Trains and tests a PyTorch model.

  Passes a target PyTorch models through train_step() and test_step()
  functions for a number of epochs, training and testing the model
  in the same epoch loop.

  Calculates, prints and stores evaluation metrics throughout.

  Args:
    model: A PyTorch model to be trained and tested.
    train_dataloader: A DataLoader instance for the model to be trained on.
    test_dataloader: A DataLoader instance for the model to be tested on.
    optimizer: A PyTorch optimizer to help minimize the loss function.
    loss_fn: A PyTorch loss function to calculate loss on both datasets.
    epochs: An integer indicating how many epochs to train for.
    device: A target device to compute on (e.g. "cuda" or "cpu").
    disable_progress_bar: Whether to hide the tqdm progress bars.

  Returns:
    A dictionary of training and testing loss as well as training and
    testing accuracy metrics. Each metric has a value in a list for
    each epoch.
    In the form: {train_loss: [...],
                  train_acc: [...],
                  test_loss: [...],
                  test_acc: [...]}
    For example if training for epochs=2:
                 {train_loss: [2.0616, 1.0537],
                  train_acc: [0.3945, 0.3945],
                  test_loss: [1.2641, 1.5706],
                  test_acc: [0.3400, 0.2973]}
  """
  # Create empty results dictionary
  results = {"train_loss": [],
      "train_acc": [],
      "test_loss": [],
      "test_acc": [],
      "train_epoch_time": [],
      "test_epoch_time": []
  }

  # Loop through training and testing steps for a number of epochs
  for epoch in tqdm(range(epochs), disable=disable_progress_bar):

      # Perform training step and time it
      train_epoch_start_time = time.time()
      train_loss, train_acc = train_step(epoch=epoch,
                                        model=model,
                                        dataloader=train_dataloader,
                                        loss_fn=loss_fn,
                                        optimizer=optimizer,
                                        device=device,
                                        disable_progress_bar=disable_progress_bar)
      train_epoch_end_time = time.time()
      train_epoch_time = train_epoch_end_time - train_epoch_start_time

      # Perform testing step and time it
      test_epoch_start_time = time.time()
      test_loss, test_acc = test_step(epoch=epoch,
                                      model=model,
                                      dataloader=test_dataloader,
                                      loss_fn=loss_fn,
                                      device=device,
                                      disable_progress_bar=disable_progress_bar)
      test_epoch_end_time = time.time()
      test_epoch_time = test_epoch_end_time - test_epoch_start_time

      # Print out what's happening
      print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f} | "
          f"train_epoch_time: {train_epoch_time:.4f} | "
          f"test_epoch_time: {test_epoch_time:.4f}"
      )

      # Update results dictionary
      results["train_loss"].append(train_loss)
      results["train_acc"].append(train_acc)
      results["test_loss"].append(test_loss)
      results["test_acc"].append(test_acc)
      results["train_epoch_time"].append(train_epoch_time)
      results["test_epoch_time"].append(test_epoch_time)

  # Return the filled results at the end of the epochs
  return results

## 3. Time model across a single run

Experiment 1: single run without `torch.compile()` and for 5 epochs

### 3.1 Experiment 1 - Single run, no compile

In [None]:
# Set number of epochs
NUM_EPOCHS = 5

# Set the learning rate as a constant
LEARNING_RATE = 0.003

**Note:** Depending on your GPU/machine, the following code may take a while to run. E.g.: in my experience on an A100, it takes about 7 minutes

In [None]:
# Build a model
model, _ = create_model()

#model

model.to(device)

# Build loss function and optimizer
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),
                                lr=LEARNING_RATE)

# Train the model and track the results
single_run_no_compile_results = train(model=model,
                                      train_dataloader=train_dataloader,
                                      test_dataloader=test_dataloader,
                                      loss_fn=loss_fn,
                                      optimizer=optimizer,
                                      epochs=NUM_EPOCHS,
                                      device=device)

  0%|          | 0/5 [00:00<?, ?it/s]

Training Epoch 0:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 0:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.2255 | train_acc: 0.5592 | test_loss: 1.0244 | test_acc: 0.6506 | train_epoch_time: 558.7603 | test_epoch_time: 44.4801


Training Epoch 1:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 1:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 2 | train_loss: 0.7520 | train_acc: 0.7372 | test_loss: 0.6567 | test_acc: 0.7761 | train_epoch_time: 551.3076 | test_epoch_time: 46.5078


Training Epoch 2:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 2:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 3 | train_loss: 0.5536 | train_acc: 0.8091 | test_loss: 0.7277 | test_acc: 0.7486 | train_epoch_time: 549.7059 | test_epoch_time: 44.8069


Training Epoch 3:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 3:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 4 | train_loss: 0.4337 | train_acc: 0.8482 | test_loss: 0.5169 | test_acc: 0.8227 | train_epoch_time: 542.0150 | test_epoch_time: 45.8811


Training Epoch 4:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 4:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 5 | train_loss: 0.3323 | train_acc: 0.8854 | test_loss: 0.4677 | test_acc: 0.8445 | train_epoch_time: 542.9109 | test_epoch_time: 46.6581


### 3.2 Experiment 2, single, using `torch.compile()`

Same setup as experiment 1, except with one new line that wraps the model in `torch.compile()`.

In [None]:
# Build model and transforms
model, _ = create_model()
model.to(device)

# Build the loss function and optimizer
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),
                             lr=LEARNING_RATE)

# Compile the model  (requires PyTorch 2.0+)
import time
compile_start_time = time.time()

### New in PyTorch 2.x ###
# Note: torch.compile() returns almost immediately, the actual compilation is triggered by the first forward pass
compiled_model = torch.compile(model)  # the model is already on the target device, so no extra .to(device) is needed

compile_end_time = time.time()
compile_time = compile_end_time - compile_start_time


print(f"Time to compile: {compile_time} | Note: the first time that you run the model, the first epoch may take longer due to optimization behind the scenes")

# Train the compiled model
single_run_compile_results = train(model=compiled_model,
                                   train_dataloader=train_dataloader,
                                   test_dataloader=test_dataloader,
                                   loss_fn=loss_fn,
                                   optimizer=optimizer,
                                   epochs=NUM_EPOCHS,
                                   device=device)

### 3.3 Compare the results of Experiments 01 and 02

In [None]:
# Turn experiments results into dataframes
import pandas as pd
single_run_no_compile_results_df = pd.DataFrame(single_run_no_compile_results)
single_run_compile_results_df = pd.DataFrame(single_run_compile_results)


In [None]:
single_run_no_compile_results_df

In [None]:
single_run_compile_results_df

In [None]:
# Create filename to save the results
DATASET_NAME = "CIFAR10"
MODEL_NAME = "ResNet50"

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_mean_epoch_times(non_compiled_results: pd.DataFrame,
                          compiled_results: pd.DataFrame,
                          multi_runs: bool=False,
                          num_runs: int=0,
                          save: bool=False,
                          save_path: str="",
                          dataset_name: str=DATASET_NAME,
                          model_name: str=MODEL_NAME,
                          num_epochs: int=NUM_EPOCHS,
                          image_size: int=IMAGE_SIZE,
                          batch_size: int=BATCH_SIZE) -> None:

    # Get the mean epoch times from the non-compiled models
    mean_train_epoch_time = non_compiled_results.train_epoch_time.mean()
    mean_test_epoch_time = non_compiled_results.test_epoch_time.mean()
    mean_results = [mean_train_epoch_time, mean_test_epoch_time]

    # Get the mean epoch times from the compiled models
    mean_compile_train_epoch_time = compiled_results.train_epoch_time.mean()
    mean_compile_test_epoch_time = compiled_results.test_epoch_time.mean()
    mean_compile_results = [mean_compile_train_epoch_time, mean_compile_test_epoch_time]

    # Calculate the percentage difference between the mean compile and non-compile train epoch times
    train_epoch_time_diff = mean_compile_train_epoch_time - mean_train_epoch_time
    train_epoch_time_diff_percent = (train_epoch_time_diff / mean_train_epoch_time) * 100

    # Calculate the percentage difference between the mean compile and non-compile test epoch times
    test_epoch_time_diff = mean_compile_test_epoch_time - mean_test_epoch_time
    test_epoch_time_diff_percent = (test_epoch_time_diff / mean_test_epoch_time) * 100

    # Print the mean difference percentages
    print(f"Mean train epoch time difference: {round(train_epoch_time_diff_percent, 3)}% (negative means faster)")
    print(f"Mean test epoch time difference: {round(test_epoch_time_diff_percent, 3)}% (negative means faster)")

    # Create a bar plot of the mean train and test epoch time for both compiled and non-compiled models
    plt.figure(figsize=(10, 7))
    width = 0.3
    x_indices = np.arange(len(mean_results))

    plt.bar(x=x_indices, height=mean_results, width=width, label="non_compiled_results")
    plt.bar(x=x_indices + width, height=mean_compile_results, width=width, label="compiled_results")
    plt.xticks(x_indices + width / 2, ("Train Epoch", "Test Epoch"))
    plt.ylabel("Mean epoch time (seconds, lower is better)")

    # Create the title based on the parameters passed to the function
    if multi_runs:
        plt.suptitle("Multiple run results")
        plt.title(f"GPU: {gpu_name} | Epochs: {num_epochs} ({num_runs} runs) | Data: {dataset_name} | Model: {model_name} | Image size: {image_size} | Batch size: {batch_size}")
    else:
        plt.suptitle("Single run results")
        plt.title(f"GPU: {gpu_name} | Epochs: {num_epochs} | Data: {dataset_name} | Model: {model_name} | Image size: {image_size} | Batch size: {batch_size}")
    plt.legend();

    # Save the figure
    if save:
        assert save_path != "", "Please specify a save path to save the model figure to via the save_path parameter."
        plt.savefig(save_path)
        print(f"[INFO] Plot saved to {save_path}")
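
To make the sign convention concrete, here is a minimal sketch of the percentage-difference calculation used in the function above. The epoch times are made-up numbers (not measured results) and `percent_difference` is a hypothetical helper name introduced only for this illustration.

```python
from statistics import mean

def percent_difference(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline in percent (negative means faster)."""
    return (candidate - baseline) / baseline * 100

# Made-up per-epoch training times in seconds (illustrative only)
non_compiled_train_times = [20.0, 20.0, 20.0]
compiled_train_times = [24.0, 15.0, 15.0]  # slower first epoch stands in for compile warm-up

diff = percent_difference(mean(non_compiled_train_times), mean(compiled_train_times))
print(f"Mean train epoch time difference: {round(diff, 3)}% (negative means faster)")
```

A negative value means the compiled model's mean epoch time is lower than the non-compiled baseline, which is why the plot function prints "negative means faster".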

In [None]:
# Build a directory for saving figures
import os
dir_to_save_figures_in = "pytorch_2_results/figures/"
os.makedirs(dir_to_save_figures_in, exist_ok=True)

# Build a save path for the single run results
save_path_multi_run = f"{dir_to_save_figures_in}single_run_{GPU_NAME}_{MODEL_NAME}"
print(f"[INFO] Save path for single run results: {save_path_multi_run}")

# Plot the results and save the figures
plot_mean_epoch_times(non_compiled_results=single_run_no_compile_results_df,
                      compiled_results=single_run_compile_results_df,
                      multi_runs=False,
                      save_path=save_path_multi_run,
                      save=True)

In [None]:
os.cpu_count()

### 3.4 Save single run results to file with GPU details

In [None]:
# Make a directory for single_run results
import os
pytorch_2_results_dir = "pytorch_2_results"
pytorch_2_single_run_results_dir = f"{pytorch_2_results_dir}/single_run_results"
os.makedirs(pytorch_2_single_run_results_dir, exist_ok=True)

# Create filenames for each of the dataframes
save_name_for_non_compiled_results = f"single_run_non_compiled_results_{DATASET_NAME}_{MODEL_NAME}_{GPU_NAME}.csv"
save_name_for_compiled_results = f"single_run_compiled_results_{DATASET_NAME}_{MODEL_NAME}_{GPU_NAME}.csv"

# Create filepaths to save the results to
single_run_no_compile_save_path = f"{pytorch_2_single_run_results_dir}/{save_name_for_non_compiled_results}"
single_run_compile_save_path = f"{pytorch_2_single_run_results_dir}/{save_name_for_compiled_results}"
print(f"[INFO] Saving non-compiled experiment 1 results to: {single_run_no_compile_save_path}")
print(f"[INFO] Saving compiled experiment 2 results to: {single_run_compile_save_path}")

# Save the results
single_run_no_compile_results_df.to_csv(single_run_no_compile_save_path)
single_run_compile_results_df.to_csv(single_run_compile_save_path)
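
Because `to_csv()` writes the DataFrame index as the first column by default, reading the files back later needs `index_col=0` to restore it. Below is a minimal sketch of that round trip using a toy DataFrame and a temporary directory; the values and the file name are placeholders, not real results.

```python
import os
import tempfile

import pandas as pd

# Toy results with the same timing columns used above (values are made up)
toy_results_df = pd.DataFrame({"train_epoch_time": [21.5, 20.1],
                               "test_epoch_time": [2.3, 2.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_single_run_results.csv")
    toy_results_df.to_csv(toy_path)                 # index is saved as the first column
    loaded_df = pd.read_csv(toy_path, index_col=0)  # restore the index on load

print(loaded_df)
```

Without `index_col=0`, the saved index would come back as an extra `Unnamed: 0` column.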

## 4. Time models across multiple runs

Time for multi-run experiments!

* Experiment 3 - 3x5 epochs without `torch.compile()`
* Experiment 4 - 3x5 epochs with `torch.compile()`

Before running experiment 3 and 4, let's build 3 functions:

1. **Experiment 3:** `create_and_train_no_compiled_model()` - Builds and trains a non-compiled model for a single run. Can be put in a loop for multiple runs.
2. **Experiment 4:** `create_compiled_model()` - Builds and compiles a model, then returns the compiled model.
3. **Experiment 4:** `train_compiled_model()` - Trains a compiled model for a single run. Can be put in a loop for multiple runs.
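
The "one function call per run, collected in a loop" pattern can be sketched without a GPU. In the sketch below, `fake_train_run` is a hypothetical stand-in for the real training functions; it only mimics an assumed return format (a dictionary of per-epoch lists) and does no actual training.

```python
import time

def fake_train_run(epochs: int) -> dict:
    """Hypothetical stand-in for a training run: returns per-epoch timings."""
    results = {"train_epoch_time": [], "test_epoch_time": []}
    for _ in range(epochs):
        start = time.perf_counter()
        sum(range(10_000))  # placeholder for a training epoch
        results["train_epoch_time"].append(time.perf_counter() - start)

        start = time.perf_counter()
        sum(range(1_000))  # placeholder for a test epoch
        results["test_epoch_time"].append(time.perf_counter() - start)
    return results

# Toy settings mirroring the 3x5 setup (3 runs of 5 epochs each)
toy_num_runs, toy_num_epochs = 3, 5
toy_multi_run_results = [fake_train_run(epochs=toy_num_epochs) for _ in range(toy_num_runs)]
print(len(toy_multi_run_results), len(toy_multi_run_results[0]["train_epoch_time"]))
```

Keeping each run's results as a separate list entry makes it straightforward to turn every run into its own DataFrame and average them afterwards.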

In [None]:
def create_and_train_no_compiled_model(epochs=NUM_EPOCHS,
                                       learning_rate=LEARNING_RATE,
                                       disable_progress_bar=False):
  """
    Build and train a non-compiled PyTorch Model.
  """
  model, _ = create_model()
  model.to(device)

  loss_fn = torch.nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(model.parameters(),
                               lr=learning_rate)

  results = train(model=model,
                  train_dataloader=train_dataloader,
                  test_dataloader=test_dataloader,
                  loss_fn=loss_fn,
                  optimizer=optimizer,
                  epochs=epochs,
                  device=device,
                  disable_progress_bar=disable_progress_bar)

  return results

def create_compiled_model():
  """
    Build a compiled PyTorch model and return it.
  """
  model, _ = create_model()
  model.to(device)

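  compile_start_time = time.time()

  ### New in PyTorch 2.0!!! ###
  # Note: torch.compile() returns a wrapped model quickly; most of the actual
  # compilation work happens on the first forward pass, so this timing mostly
  # captures the wrapping step.
  compiled_model = torch.compile(model)

  compile_end_time = time.time()

  compile_time = compile_end_time - compile_start_time
  print(f"[INFO] Model compile time: {compile_time}")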

  return compiled_model

def train_compiled_model(model=compiled_model,
                         epochs=NUM_EPOCHS,
                         learning_rate=LEARNING_RATE,
                         disable_progress_bar=False,
                         ):
  """
    Train a compiled model and return the results.
  """
  loss_fn = torch.nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(model.parameters(),  # optimize the model passed in, not the global
                               lr=learning_rate)

  compile_results = train(model=model,
                          train_dataloader=train_dataloader,
                          test_dataloader=test_dataloader,
                          loss_fn=loss_fn,
                          optimizer=optimizer,
                          epochs=epochs,
                          device=device,
                          disable_progress_bar=disable_progress_bar)

  return compile_results


### 4.1 Experiment 3 - Multiple-runs, no compile

**Note:** Because we are doing multiple runs, the code below may take a while to run. If a single run takes around 7 minutes on an A100, the following code could take around 20 minutes on the same GPU.

> One of the most painful things in machine learning is that models take a while to train, and one of the most beautiful things in ML is that models take a while to train. So there's plenty of time to go for a walk.

In [None]:
# Set the number of runs and the number of epochs per run
NUM_RUNS = 3
NUM_EPOCHS = 5

# Build an empty list to store multiple run results
non_compile_results_multiple_runs = []

# Run non-compiled model for multiple runs
for i in tqdm(range(NUM_RUNS)):
  print(f"[INFO] Run {i+1} of {NUM_RUNS} for non-compiled model")
  results = create_and_train_no_compiled_model(epochs=NUM_EPOCHS, disable_progress_bar=True)
  non_compile_results_multiple_runs.append(results)


In [None]:
non_compile_results_multiple_runs

In [None]:
# Go through non_compile_results_multiple_runs and build a DataFrame for each run's results
non_compile_results_dfs = []
for result in non_compile_results_multiple_runs:
  result_df = pd.DataFrame(result)
  non_compile_results_dfs.append(result_df)

non_compile_results_multiple_runs_df = pd.concat(non_compile_results_dfs)

# Inspect the combined results (one row per epoch per run)
non_compile_results_multiple_runs_df

In [None]:
# Get the average results for each epoch across all runs
non_compile_results_multiple_runs_df = non_compile_results_multiple_runs_df.groupby(non_compile_results_multiple_runs_df.index).mean()

In [None]:
non_compile_results_multiple_runs_df
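
Since `pd.concat()` keeps each run's original index (epoch 0, 1, 2, ... repeated once per run), grouping by the index averages the same epoch across runs. Here is a small sketch with two made-up runs of two epochs each; the numbers are illustrative only.

```python
import pandas as pd

# Two toy runs, two epochs each (values are made up for illustration)
run_1 = pd.DataFrame({"train_epoch_time": [10.0, 8.0]})
run_2 = pd.DataFrame({"train_epoch_time": [12.0, 6.0]})

stacked = pd.concat([run_1, run_2])               # index is 0, 1, 0, 1
averaged = stacked.groupby(stacked.index).mean()  # one row per epoch

print(averaged)
```

The result has one row per epoch, so it has the same shape as a single-run results DataFrame and can be passed to the same plotting function.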

### 4.2 Experiment 4 - Multiple runs, with compile

In [None]:
os.cpu_count()

In [None]:
# Build a compiled model
compiled_model = create_compiled_model()