# The Premise

PyTorch 2.0 was released on March 15, 2023, roughly a year ago as I write this

<img src="assets/Before.png" alt="The WorkFlow" width="500">

The largest difference is that everything can now run faster on PyTorch (especially on newer GPUs), with the help of one line, `torch.compile()`

<img src="assets/After.png" alt="The WorkFlow" width="800">

How much faster? Well, the PyTorch team found that `torch.compile()` provides an average speedup of 43% in training on an NVIDIA A100 GPU

Here are some of the comparisons they've made on training, using models from different places

<img src="assets/Speedup.png" alt="The WorkFlow" width="800">

So how does it work? (Uh, you don't expect me to know how GPUs, PyTorch, memory, and everything about programming fundamentally work under the hood, right?)

Two big things:

- Operator Fusion
- Graph Capture

But honestly... this is like "front of the line mathematics and computing", I'm nowhere close to that

## Operator Fusion

The big takeaway is that GPUs are fast at computing numbers. What limits them is often not the speed at which they compute, but the speed at which data gets moved around: between the GPU's own memory and its compute units, and between the CPU and the GPU.

This is known as bandwidth. Think of it as how many lanes there are on a highway: the more lanes, the more cars can drive in parallel, so more data can be sent

What it means in our case is that, instead of every single operation reading its input from memory and writing its result back, over and over again...

We fuse several operations together, so the GPU does as many operations as it can on the data in one go before writing the results back and moving on to the next set of instructions

<img src="assets/Fusion.gif" alt="The WorkFlow" width="800">
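To make the idea a bit more concrete, here is a toy sketch in plain Python. It is only an analogy: plain lists stand in for GPU memory, the function names `unfused` and `fused` are made up for illustration, and this is not how `torch.compile()` is implemented. The unfused version makes three passes over the data and writes two intermediate lists, while the fused version makes a single pass and keeps every intermediate value in a local expression, which is the kind of memory traffic fusion tries to avoid.

```python
# Toy illustration of operator fusion; plain Python lists stand in for GPU memory.

def unfused(xs):
    # Each step reads a full list and writes a full new list ("memory traffic").
    doubled = [x * 2.0 for x in xs]         # pass 1: read xs, write doubled
    shifted = [d + 1.0 for d in doubled]    # pass 2: read doubled, write shifted
    return [max(s, 0.0) for s in shifted]   # pass 3: read shifted, write result


def fused(xs):
    # One pass: multiply, add and clamp happen together for each element.
    return [max(x * 2.0 + 1.0, 0.0) for x in xs]


data = [-3.0, -0.5, 0.0, 1.5]
print(unfused(data))
print(fused(data))
```

Both functions compute the same result; the difference is only in how many times the data is read and written along the way.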

## Graph Capture

In the grand scheme of things, this is about understanding and optimizing the data flow and computation of a program, which is often represented as a computational graph

When we tell PyTorch to go through a series of operations (like forward propagation in a neural network), it has to look up what each operation does as it performs them

That creates overhead: little gaps of time where the GPU does nothing, because it doesn't know what the next operation is until PyTorch has looked it up and told the GPU

With graph capture, we capture the definitions of all the operations that need to happen ahead of time and send them to the GPU in one go

So there are no repeated lookups as we go, since the GPU has direct access to the definitions of all the operations that are going to happen

<img src="assets/Graph_Capture.gif" alt="The WorkFlow" width="800">

Capturing the definitions of all the operations does take a little bit of time, though, so startup takes a little longer, but every step afterwards is faster
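A quick way to reason about this trade-off is to ask after how many steps the one-time startup cost has paid for itself. The sketch below is just that arithmetic; the function name `break_even_steps` and all the numbers are hypothetical, not measurements from this notebook.

```python
import math


def break_even_steps(startup_seconds, eager_step_seconds, compiled_step_seconds):
    """Number of steps after which the one-time startup cost has paid for itself.

    Returns None if the compiled step is not faster than the eager one.
    """
    saved_per_step = eager_step_seconds - compiled_step_seconds
    if saved_per_step <= 0:
        return None
    return math.ceil(startup_seconds / saved_per_step)


# Hypothetical numbers: 60 s of extra startup time,
# 0.5 s per step without compiling and 0.375 s per step with it.
print(break_even_steps(60.0, 0.5, 0.375))
```

The shorter the training run, the less likely it is to get past this break-even point, which is why very short experiments can look no faster (or even slower) with compilation.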

## Additional Notes

For a deeper understanding of compute, memory bandwidth, and overhead for making models run faster, go read https://horace.io/brrr_intro.html

1. PyTorch 2.0 doesn't kill your old code, it's backwards compatible

2. `torch.inference_mode` can hit bugs with `torch.compile`, and is slower than `torch.no_grad` with `torch.compile`, so please use `torch.no_grad` when testing/using the model

3. You can now set a GLOBAL device with `torch.set_default_device()`, which newly created tensors and models will go to by default, so no more `.to(device)` on every line of code

4. There's also `torch.amp`, which provides convenience methods for mixed precision, where some operations use the `torch.float32` (float) datatype and other operations use the `torch.float16` (half) datatype

5. There are even additional features for newer GPUs for computing matrix multiplication faster! `torch.backends.cuda.matmul.allow_tf32`

The faster matrix multiplication only works on GPUs with a compute capability score of 8.0 or above, check out https://developer.nvidia.com/cuda-gpus to find your NVIDIA GPU's compute capability score

## What Are We Doing?

We will make and train 2 models, one with torch.compile and one without, then compare their training speeds (as that's where the improvement from torch.compile is supposed to show up)

Though, this computer only has an RTX 2080, which is considered a rather old GPU in the rapidly evolving AI field, and torch.compile works better on newer GPUs, so we might not see as noticeable a performance increase

In this case, we've chosen a simple setup:

- ResNet50
- CIFAR10

# Setup

## Get GPU Info

Many of the speedups PyTorch 2.0 offers are best experienced on newer NVIDIA GPUs, so let's see what our GPU is and how good it is

Again... I have no clue how to work with Nvidia-smi, so this is just copied code to see how good our GPU is

In [7]:
import torch

# Make sure we're using a NVIDIA GPU
if torch.cuda.is_available():
  gpu_info = !nvidia-smi
  gpu_info = '\n'.join(gpu_info)
  if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

  # Get GPU name
  gpu_name = !nvidia-smi --query-gpu=gpu_name --format=csv
  gpu_name = gpu_name[1]
  GPU_NAME = gpu_name.replace(" ", "_") # replace spaces with underscores for easier saving
  print(f'GPU name: {GPU_NAME}')

  # Get GPU capability score
  GPU_SCORE = torch.cuda.get_device_capability()
  print(f"GPU capability score: {GPU_SCORE}")
  if GPU_SCORE >= (8, 0):
    print(f"GPU score higher than or equal to (8, 0), PyTorch 2.x speedup features available.")
  else:
    print(f"GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).")
  
  # Print GPU info
  print(f"GPU information:\n{gpu_info}")

else:
  print("PyTorch couldn't find a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

GPU name: NVIDIA_GeForce_RTX_2080
GPU capability score: (7, 5)
GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).
GPU information:
Sat Apr 13 16:45:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.29                 Driver Version: 546.29       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 2080      WDDM  | 00000000:01:00.0  On |                  N/A |
|  8%   52C    P8              36W / 300W |   1289MiB /  8192MiB |      3%      Default |
|                                         | 

## GPU Speedup Aspects

We will see better speedups when as much data as possible is on the GPU being computed in parallel

This can be achieved by:

- increasing batch size
- increasing data size
- increasing model size
- decreasing data transfer (less talk between gpu and cpu)

Though that does raise a question: "But doesn't this mean that the GPU will be slower because it has to do more work?"

Yes, each individual step does take longer, but more work gets done in parallel, so the overall throughput goes up
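The trade-off is easier to see as throughput, meaning samples processed per second. Here is a minimal sketch of that arithmetic; the helper `samples_per_second` and the step times are hypothetical and were not measured on this GPU.

```python
def samples_per_second(batch_size, step_seconds):
    """Throughput of a training loop: samples processed per second."""
    return batch_size / step_seconds


# Hypothetical timings: the larger batch takes twice as long per step,
# but it carries four times as many samples.
small = samples_per_second(batch_size=32, step_seconds=0.125)
large = samples_per_second(batch_size=128, step_seconds=0.25)

print(f"batch 32:  {small:.0f} samples/s")
print(f"batch 128: {large:.0f} samples/s")
```

With these made-up numbers, each step is slower with the bigger batch, yet twice as many samples get through per second.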

## Checking GPU Memory Limit

But... our GPU is also limited by its own memory, which caps how much data it can hold and process at the same time for faster training/running. We can check it with `torch.cuda.mem_get_info()` (nvidia-smi shows it too)

In [8]:
total_free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
print(f"Total free GPU memory: {round(total_free_gpu_memory * 1e-9, 3)} GB")
print(f"Total GPU memory: {round(total_gpu_memory * 1e-9, 3)} GB")

Total free GPU memory: 7.227 GB
Total GPU memory: 8.59 GB


This is really not a lot in machine learning terms, heck, even I've got 32 GB of regular RAM on this computer
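As a side note, the 8192MiB that nvidia-smi reports and the 8.59 GB printed above are the same amount of memory in different units. The sketch below does that conversion and also gives a rough lower bound for the ResNet50 weights, using the parameter count printed later in this notebook (23,528,522) and assuming 4 bytes per float32 parameter.

```python
MIB = 1024 ** 2   # bytes in one mebibyte (the unit nvidia-smi reports)
GB = 1e9          # bytes in one gigabyte (the unit used in the cell above)

total_bytes = 8192 * MIB
print(f"8192 MiB = {total_bytes / GB:.2f} GB")

# Rough lower bound for the model weights alone, assuming float32 (4 bytes each).
# Gradients, optimizer state and activations all come on top of this.
num_params = 23_528_522
weight_bytes = num_params * 4
print(f"float32 weights: {weight_bytes / GB:.3f} GB")
```

The weights themselves are small compared to the total; in training, most of the memory tends to go to activations, which grow with batch size and image size.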

## Globally Setup Devices

Ah yes, all tensors created will be on the global device by default, so everything goes to the GPU

This is an amazing quality of life update, I'm always worried about setting devices wrongly

In [9]:
# Set the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Set the device globally
torch.set_default_device(device=device)

## Create Models and Transforms

Both experiments we will be running use the same setup:

- Model: ResNet50 (from TorchVision)
- Data: CIFAR10 (from TorchVision)
- Epochs: 5
- Batch size: 32 (the original tutorial tried 128, which is definitely too big for our GPU)
- Image size: 128 (the original tutorial tried 224, which is also too big for our GPU)

One experiment will be run without torch.compile() and the other with it

In [10]:
import torchvision


def create_resnet50(num_classes = 10):

    #import weights and model
    weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V2
    transforms = weights.transforms()

    #create model using weights
    model = torchvision.models.resnet50(weights=weights)

    #adjust classifier head
    model.fc = torch.nn.Linear(in_features=2048, out_features=num_classes)

    #see the number of parameters in model
    total_params = sum(param.numel() for param in model.parameters())
                       
    #print it out
    print(f"Total parameters of model: {total_params} (the more parameters, the more GPU memory the model will use, the more *relative* of a speedup you'll get)")
    print(f"Model transforms:\n{transforms}")

    return model, transforms

model, transforms = create_resnet50()

Total parameters of model: 23528522 (the more parameters, the more GPU memory the model will use, the more *relative* of a speedup you'll get)
Model transforms:
ImageClassification(
    crop_size=[224]
    resize_size=[232]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)


Adjust transforms to our image size

In [11]:
transforms.crop_size = 128
transforms.resize_size = 128
print(f"Updated data transforms:\n{transforms}")

Updated data transforms:
ImageClassification(
    crop_size=128
    resize_size=128
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)


## More Potential Speedup with TF32

TF32 stands for TensorFloat-32, a data format that sits between 16-bit and 32-bit floating point numbers: it keeps the range of 32-bit floats but only roughly the precision of 16-bit floats

Oh, floating point precision and the deep rabbit hole of how computers actually represent and calculate numbers... we aren't diving into that

The big thing is it can provide faster matrix multiplication on GPUs with Ampere architecture and above (a compute capability score of 8.0+)

In [12]:
# By default this is set to false
if GPU_SCORE >= (8, 0):
    print("GPU Score higher than 8.0, enabling TensorFloat32 computing")
    torch.backends.cuda.matmul.allow_tf32 = True
else:
    print("GPU Score lower than 8.0, not enabling TensorFloat32 computing")
    torch.backends.cuda.matmul.allow_tf32 = False

GPU Score lower than 8.0, not enabling TensorFloat32 computing
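To get a feel for what TF32 gives up, the sketch below mimics its bit layout in plain Python: TF32 keeps float32's sign bit and 8 exponent bits but only 10 of the 23 mantissa bits. The helper name `truncate_to_tf32_like` is made up, and it simply zeroes the 13 lowest mantissa bits; real hardware may round rather than truncate, so treat this as an illustration and not an emulator.

```python
import struct


def truncate_to_tf32_like(x):
    """Keep the sign, the 8 exponent bits and the top 10 mantissa bits of a float32.

    This only mimics the TF32 bit layout by truncation, it is not an exact
    model of what the GPU does.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 -> raw 32 bits
    bits &= 0xFFFFE000                                    # zero the lowest 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 3.14159265
print(value, "->", truncate_to_tf32_like(value))
```

Values that need only a few mantissa bits (like 1.0 or 0.5) survive unchanged, while others lose their last few digits, which is the precision traded away for faster matrix multiplication.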


# The Data

## Download Dataset

Downloading the raw data to our local computer

In [14]:
# Create train and test datasets
train_dataset = torchvision.datasets.CIFAR10(root='.', 
                                             train=True, 
                                             download=True, 
                                             transform=transforms)

test_dataset = torchvision.datasets.CIFAR10(root='.', 
                                            train=False, # want the test split
                                            download=True, 
                                            transform=transforms)

# Get the lengths of the datasets
train_len = len(train_dataset)
test_len = len(test_dataset)

print(f"[INFO] Train dataset length: {train_len}")
print(f"[INFO] Test dataset length: {test_len}")

Downloading https://www.cs.toronto.edu/~kriz/../data/cifar-10-python.tar.gz to .\../data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [58:55<00:00, 48221.94it/s]


Extracting .\../data/cifar-10-python.tar.gz to .
Files already downloaded and verified
[INFO] Train dataset length: 50000
[INFO] Test dataset length: 10000


## Dataloaders

Oh, and a reminder: getting data from the CPU to the GPU is often a bottleneck in machine learning, so we want as many CPU cores as possible loading data for the GPU

We can get the number of available cores with `os.cpu_count()` and pass it as `num_workers`

In [22]:
from torch.utils.data import DataLoader
import os

#Getting the Number of Available Cores
num_workers = os.cpu_count()
batch_size = 32

train_dataloader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)

test_dataloader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)

## Training/Testing Loop

This will be similar to the one in going modular, but we need a slight modification to measure how long each loop takes

- the start and end time of each training and testing epoch are measured with Python's time.time(), and the resulting epoch times are stored in the results dictionary

We will also not be using torch.inference_mode(), as there are compatibility issues with torch.compile(), so we use torch.no_grad() instead

In [17]:
import time
from tqdm.auto import tqdm
from typing import Dict, List, Tuple

def train_step(epoch: int,
               model: torch.nn.Module, 
               dataloader: torch.utils.data.DataLoader, 
               loss_fn: torch.nn.Module, 
               optimizer: torch.optim.Optimizer,
               device: torch.device,
               disable_progress_bar: bool = False) -> Tuple[float, float]:
  """Trains a PyTorch model for a single epoch.

  Turns a target PyTorch model to training mode and then
  runs through all of the required training steps (forward
  pass, loss calculation, optimizer step).

  Args:
    model: A PyTorch model to be trained.
    dataloader: A DataLoader instance for the model to be trained on.
    loss_fn: A PyTorch loss function to minimize.
    optimizer: A PyTorch optimizer to help minimize the loss function.
    device: A target device to compute on (e.g. "cuda" or "cpu").

  Returns:
    A tuple of training loss and training accuracy metrics.
    In the form (train_loss, train_accuracy). For example:

    (0.1112, 0.8743)
  """
  # Put model in train mode
  model.train()

  # Setup train loss and train accuracy values
  train_loss, train_acc = 0, 0

  # Loop through data loader data batches
  progress_bar = tqdm(
        enumerate(dataloader), 
        desc=f"Training Epoch {epoch}", 
        total=len(dataloader),
        disable=disable_progress_bar
    )

  for batch, (X, y) in progress_bar:
      # Send data to target device
      X, y = X.to(device), y.to(device)

      # 1. Forward pass
      y_pred = model(X)

      # 2. Calculate  and accumulate loss
      loss = loss_fn(y_pred, y)
      train_loss += loss.item() 

      # 3. Optimizer zero grad
      optimizer.zero_grad()

      # 4. Loss backward
      loss.backward()

      # 5. Optimizer step
      optimizer.step()

      # Calculate and accumulate accuracy metric across all batches
      y_pred_class = torch.argmax(torch.softmax(y_pred, dim=1), dim=1)
      train_acc += (y_pred_class == y).sum().item()/len(y_pred)

      # Update progress bar
      progress_bar.set_postfix(
            {
                "train_loss": train_loss / (batch + 1),
                "train_acc": train_acc / (batch + 1),
            }
        )


  # Adjust metrics to get average loss and accuracy per batch 
  train_loss = train_loss / len(dataloader)
  train_acc = train_acc / len(dataloader)
  return train_loss, train_acc

def test_step(epoch: int,
              model: torch.nn.Module, 
              dataloader: torch.utils.data.DataLoader, 
              loss_fn: torch.nn.Module,
              device: torch.device,
              disable_progress_bar: bool = False) -> Tuple[float, float]:
  """Tests a PyTorch model for a single epoch.

  Turns a target PyTorch model to "eval" mode and then performs
  a forward pass on a testing dataset.

  Args:
    model: A PyTorch model to be tested.
    dataloader: A DataLoader instance for the model to be tested on.
    loss_fn: A PyTorch loss function to calculate loss on the test data.
    device: A target device to compute on (e.g. "cuda" or "cpu").

  Returns:
    A tuple of testing loss and testing accuracy metrics.
    In the form (test_loss, test_accuracy). For example:

    (0.0223, 0.8985)
  """
  # Put model in eval mode
  model.eval() 

  # Setup test loss and test accuracy values
  test_loss, test_acc = 0, 0

  # Loop through data loader data batches
  progress_bar = tqdm(
      enumerate(dataloader), 
      desc=f"Testing Epoch {epoch}", 
      total=len(dataloader),
      disable=disable_progress_bar
  )

  # Turn off gradient tracking for evaluation
  with torch.no_grad(): # torch.no_grad() is used because torch.inference_mode() gave errors with PyTorch 2.0 here
      # Loop through DataLoader batches
      for batch, (X, y) in progress_bar:
          # Send data to target device
          X, y = X.to(device), y.to(device)

          # 1. Forward pass
          test_pred_logits = model(X)

          # 2. Calculate and accumulate loss
          loss = loss_fn(test_pred_logits, y)
          test_loss += loss.item()

          # Calculate and accumulate accuracy
          test_pred_labels = test_pred_logits.argmax(dim=1)
          test_acc += ((test_pred_labels == y).sum().item()/len(test_pred_labels))

          # Update progress bar
          progress_bar.set_postfix(
              {
                  "test_loss": test_loss / (batch + 1),
                  "test_acc": test_acc / (batch + 1),
              }
          )

  # Adjust metrics to get average loss and accuracy per batch 
  test_loss = test_loss / len(dataloader)
  test_acc = test_acc / len(dataloader)
  return test_loss, test_acc

def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device,
          disable_progress_bar: bool = False) -> Dict[str, List]:
  """Trains and tests a PyTorch model.

  Passes a target PyTorch models through train_step() and test_step()
  functions for a number of epochs, training and testing the model
  in the same epoch loop.

  Calculates, prints and stores evaluation metrics throughout.

  Args:
    model: A PyTorch model to be trained and tested.
    train_dataloader: A DataLoader instance for the model to be trained on.
    test_dataloader: A DataLoader instance for the model to be tested on.
    optimizer: A PyTorch optimizer to help minimize the loss function.
    loss_fn: A PyTorch loss function to calculate loss on both datasets.
    epochs: An integer indicating how many epochs to train for.
    device: A target device to compute on (e.g. "cuda" or "cpu").

  Returns:
    A dictionary of training and testing loss, training and testing
    accuracy, and training and testing epoch times (in seconds). Each
    metric has a value in a list for each epoch.
    In the form: {train_loss: [...],
                  train_acc: [...],
                  test_loss: [...],
                  test_acc: [...],
                  train_epoch_time: [...],
                  test_epoch_time: [...]} 
    For example if training for epochs=2: 
                 {train_loss: [2.0616, 1.0537],
                  train_acc: [0.3945, 0.3945],
                  test_loss: [1.2641, 1.5706],
                  test_acc: [0.3400, 0.2973]} 
  """
  # Create empty results dictionary
  results = {"train_loss": [],
      "train_acc": [],
      "test_loss": [],
      "test_acc": [],
      "train_epoch_time": [],
      "test_epoch_time": []
  }

  # Loop through training and testing steps for a number of epochs
  for epoch in tqdm(range(epochs), disable=disable_progress_bar):

      # Perform training step and time it
      train_epoch_start_time = time.time()
      train_loss, train_acc = train_step(epoch=epoch, 
                                        model=model,
                                        dataloader=train_dataloader,
                                        loss_fn=loss_fn,
                                        optimizer=optimizer,
                                        device=device,
                                        disable_progress_bar=disable_progress_bar)
      train_epoch_end_time = time.time()
      train_epoch_time = train_epoch_end_time - train_epoch_start_time
      
      # Perform testing step and time it
      test_epoch_start_time = time.time()
      test_loss, test_acc = test_step(epoch=epoch,
                                      model=model,
                                      dataloader=test_dataloader,
                                      loss_fn=loss_fn,
                                      device=device,
                                      disable_progress_bar=disable_progress_bar)
      test_epoch_end_time = time.time()
      test_epoch_time = test_epoch_end_time - test_epoch_start_time

      # Print out what's happening
      print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f} | "
          f"train_epoch_time: {train_epoch_time:.4f} | "
          f"test_epoch_time: {test_epoch_time:.4f}"
      )

      # Update results dictionary
      results["train_loss"].append(train_loss)
      results["train_acc"].append(train_acc)
      results["test_loss"].append(test_loss)
      results["test_acc"].append(test_acc)
      results["train_epoch_time"].append(train_epoch_time)
      results["test_epoch_time"].append(test_epoch_time)

  # Return the filled results at the end of the epochs
  return results
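The timing pattern used in `train()` can also be pulled out into a small helper. The sketch below is one possible way to do it, not what the code above does: `timed_call` is a hypothetical name, and it uses `time.perf_counter()` in place of `time.time()`, since it is a clock intended for measuring intervals.

```python
import time


def timed_call(fn, results, key):
    """Run fn(), append its wall-clock duration in seconds to results[key]."""
    start = time.perf_counter()
    output = fn()
    results.setdefault(key, []).append(time.perf_counter() - start)
    return output


results = {}
for epoch in range(3):
    # A cheap stand-in for a training epoch.
    timed_call(lambda: sum(range(100_000)), results, "train_epoch_time")

print(results)
```

The helper returns whatever the wrapped function returns, so a call such as a training step could be wrapped without changing how its results are used.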

## Experiment 1

no torch.compile()

In [23]:
# setting hyperparameters
epochs = 5
learning_rate = 0.005

# making model and transforms
model, transforms = create_resnet50()
model.to(device)

# defining loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# train the not compiled model
no_compile_results = train(model=model, train_dataloader=train_dataloader, test_dataloader=test_dataloader, loss_fn=loss_function, optimizer=optimizer, epochs=epochs, device=device)

Total parameters of model: 23528522 (the more parameters, the more GPU memory the model will use, the more *relative* of a speedup you'll get)
Model transforms:
ImageClassification(
    crop_size=[224]
    resize_size=[232]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)


  0%|          | 0/5 [00:00<?, ?it/s]

Training Epoch 0:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 0:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.4266 | train_acc: 0.4817 | test_loss: 1.0487 | test_acc: 0.6179 | train_epoch_time: 157.2176 | test_epoch_time: 27.6073


Training Epoch 1:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 1:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 2 | train_loss: 0.9465 | train_acc: 0.6635 | test_loss: 0.8630 | test_acc: 0.6989 | train_epoch_time: 156.4521 | test_epoch_time: 27.6936


Training Epoch 2:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 2:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 3 | train_loss: 0.7479 | train_acc: 0.7353 | test_loss: 0.7471 | test_acc: 0.7412 | train_epoch_time: 155.2478 | test_epoch_time: 27.0205


Training Epoch 3:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 3:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 4 | train_loss: 0.5938 | train_acc: 0.7916 | test_loss: 0.7337 | test_acc: 0.7516 | train_epoch_time: 150.4613 | test_epoch_time: 27.1190


Training Epoch 4:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 4:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 5 | train_loss: 0.4564 | train_acc: 0.8392 | test_loss: 0.7351 | test_acc: 0.7642 | train_epoch_time: 150.6046 | test_epoch_time: 27.0609


I kept running into the error "Expected a 'cuda' device type for generator but found 'cpu'", and apparently setting shuffle=False solves it

Or you can change `generator = torch.Generator()` to `generator = torch.Generator(device='cuda')` on line 115 of torch\utils\data\sampler.py, but I'm not yet willing to play with the source code

## Experiment 2

with torch.compile

In [None]:
# setting hyperparameters
epochs = 5
learning_rate = 0.005

# making model and transforms
model, transforms = create_resnet50()
model.to(device)

# create loss function and optimizer
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Compile the model and time how long it takes
compile_start_time = time.time()
compiled_model = torch.compile(model)
compile_end_time = time.time()
compile_time = compile_end_time - compile_start_time
print(f"Time to compile: {compile_time} | Note: The first time you compile your model, the first few epochs will be slower than subsequent runs.")

# Train the compiled model
compile_results = train(model=compiled_model, train_dataloader=train_dataloader, test_dataloader=test_dataloader, loss_fn=loss_fn, optimizer=optimizer, epochs=epochs, device=device)

`RuntimeError: Windows not yet supported for torch.compile`. Oof, I don't know how to feel about this, because it means I can't make the comparison using my own data and my own training

This is a truly sad moment
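One way to keep a notebook like this running on an unsupported platform is to fall back to the uncompiled model when compiling fails. The sketch below is only that idea: `compile_or_fallback` and `fake_compile` are hypothetical names, the compile function is passed in as an argument so the example runs without PyTorch, and it assumes the `RuntimeError` is raised by the compile call itself.

```python
def compile_or_fallback(model, compile_fn):
    """Try to compile the model; fall back to the original model on RuntimeError.

    Returns a (model, was_compiled) tuple.
    """
    try:
        return compile_fn(model), True
    except RuntimeError as error:
        print(f"Compile unavailable, using the uncompiled model: {error}")
        return model, False


# Stand-in for a compile function that is not supported on this platform.
def fake_compile(model):
    raise RuntimeError("Windows not yet supported for torch.compile")


model_stub = object()
used_model, was_compiled = compile_or_fallback(model_stub, fake_compile)
print(was_compiled)
```

In the notebook, the call would presumably look like `compile_or_fallback(model, torch.compile)`. It doesn't make the comparison possible on Windows, but it does let the rest of the cells run.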

# Comparing Results

Uh, we can't, because we can't run torch.compile on Windows

So I'll just grab the original tutorial's results and show them here

<img src="assets/Results.png" alt="The WorkFlow" width="800">

So, slightly better speeds, but keep in mind that the training here was really short, and the time saved only adds up as training gets longer
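If the compiled run had worked here, the comparison would come down to a little arithmetic on the epoch times. The sketch below shows one way to do it. The uncompiled times are the training epoch times measured in Experiment 1, but the compiled times are made up purely to show the calculation, and `time_saved_percent` is a hypothetical helper name.

```python
from statistics import mean


def time_saved_percent(baseline_times, new_times):
    """Average time saved by the new run, as a percentage of the baseline mean."""
    baseline = mean(baseline_times)
    return (baseline - mean(new_times)) / baseline * 100


# Training epoch times (seconds) measured above without torch.compile().
no_compile_times = [157.2176, 156.4521, 155.2478, 150.4613, 150.6046]

# Hypothetical compiled times, made up only to show the calculation.
hypothetical_compiled_times = [140.0, 139.0, 138.0, 137.0, 136.0]

saved = time_saved_percent(no_compile_times, hypothetical_compiled_times)
print(f"{saved:.1f}% less time per epoch (hypothetical)")
```

Note that "percent less time per epoch" is not the same number as a "speedup" figure quoted as throughput gain, so it is worth checking which one a benchmark reports before comparing.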

And, hah, that's the end of the road for this tutorial at the moment, we've come to the end of this course

Here's a cookie 🍪