# Setup:

## 1. Get GPU info

Why get GPU info?

Because PyTorch 2.0 features such as `torch.compile()` work best on newer NVIDIA GPUs.

Well, what's a newer NVIDIA GPU?

To find out if your GPU is compatible, see the [NVIDIA GPU compatibility score](https://developer.nvidia.com/cuda-gpus) (also called the compute capability).

If your GPU has a score of 8.0+, it can leverage *most* if not *all* of the new PyTorch 2.0 features.

GPUs under 8.0 can still leverage PyTorch 2.0, however, the improvements may not be as noticeable as on GPUs with 8.0+.

In [62]:
!nvidia-smi

Wed Feb 21 10:40:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0              32W / 250W |  16238MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [63]:
import torch
# Make sure we're using a NVIDIA GPU
if torch.cuda.is_available():
  gpu_info = !nvidia-smi
  gpu_info = '\n'.join(gpu_info)
  if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

  # Get GPU name
  gpu_name = !nvidia-smi --query-gpu=gpu_name --format=csv
  gpu_name = gpu_name[1]
  GPU_NAME = gpu_name.replace(" ", "_") # replace spaces with underscores for easier saving
  print(f'GPU name: {GPU_NAME}')

  # Get GPU capability score
  GPU_SCORE = torch.cuda.get_device_capability()
  print(f"GPU capability score: {GPU_SCORE}")
  if GPU_SCORE >= (8, 0):
    print(f"GPU score higher than or equal to (8, 0), PyTorch 2.x speedup features available.")
  else:
    print(f"GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).")
  
  # Print GPU info
  print(f"GPU information:\n{gpu_info}")

else:
  print("PyTorch couldn't find a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

GPU name: Tesla_P100-PCIE-16GB
GPU capability score: (6, 0)
GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).
GPU information:
Wed Feb 21 10:40:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0              32W / 250W |  16238MiB / 16384MiB |      0%      Default |
|                                         |    

### 1.1 Globally set devices


Previously, we've set the device of our tensors/models using `.to(device)`:

* `tensor.to(device)`
* `model.to(device)`

But in PyTorch 2.0, it's possible to set the device with a context manager, or to set a global default device.

In [64]:
import torch

# Set the device
device = "cuda" if torch.cuda.is_available() else "cpu"
# device

# Set the device with a context manager (requires PyTorch 2.x+)
with torch.device(device):
    # All tensors or PyTorch objects created in the context manager will be on the target device without using .to()
    layer = torch.nn.Linear(20,30)
    print(f"Layer weights are on device: {layer.weight.device}")
    print(f"Layer creating data on device: {layer(torch.randn(128,20)).device}")


Layer weights are on device: cuda:0
Layer creating data on device: cuda:0


In [65]:
import torch

# Set the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Set the device globally
torch.set_default_device(device)
# All tensors or PyTorch objects from here on out will be on the target device without using `.to()`
layer = torch.nn.Linear(20,30)
print(f"Layer weights are on device: {layer.weight.device}")
print(f"Layer creating data on device: {layer(torch.randn(128,20)).device}")

Layer weights are on device: cuda:0
Layer creating data on device: cuda:0


In [102]:
import torch

# Set the device globally
torch.set_default_device("cpu")
# All tensors or PyTorch objects from here on out will be on the target device without using `.to()`
layer = torch.nn.Linear(20,30)
print(f"Layer weights are on device: {layer.weight.device}")
print(f"Layer creating data on device: {layer(torch.randn(128,20)).device}")

Layer weights are on device: cpu
Layer creating data on device: cpu


## 2. Setting up the experiments

Time to test speed!

To keep things simple, we'll run 4 experiments, all with the following setup:

* Model: ResNet50 from torchvision
* Dataset: CIFAR10 from torchvision
* Epochs: 5 (single run) and 3x5 (multi run)
* Batch size: 128 (note: can be changed depending on the amount of GPU memory)
* Image size: 224 (note: can be changed depending on the amount of GPU memory)
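
One way to read the "4 experiments" is as the combinations of run type (single vs. multi) and compile setting (with vs. without `torch.compile()`). Here's a minimal sketch of that grid, assuming this interpretation. The names `RUN_TYPES`, `EPOCHS_PER_RUN`, `experiments` and the dictionary keys are hypothetical and aren't used elsewhere in this notebook:

```python
from itertools import product

# Assumption: 4 experiments = {single run, multi run} x {no compile, compile}
RUN_TYPES = {"single_run": 1, "multi_run": 3}  # run type -> number of repeated runs
EPOCHS_PER_RUN = 5

experiments = [
    {
        "name": f"{run_type}_{'compile' if use_compile else 'no_compile'}",
        "num_runs": num_runs,
        "epochs_per_run": EPOCHS_PER_RUN,
        "use_compile": use_compile,
    }
    for (run_type, num_runs), use_compile in product(RUN_TYPES.items(), [False, True])
]

for experiment in experiments:
    print(experiment)
```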


In [103]:
import torch
import torchvision

print(f"PyTorch version {torch.__version__}")
print(f"Torchvision version {torchvision.__version__}")

# Set a target device
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

PyTorch version 2.1.2
Torchvision version 0.16.2
Using device: cuda


### 2.1 Create model and transforms

* ResNet50 from PyTorch https://pytorch.org/vision/stable/models/generated/torchvision.models.resnet50.html#torchvision.models.ResNet50_Weights

In [104]:
# Create model weights and transforms

model_weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V2 # also can be DEFAULT
transforms = model_weights.transforms()

transforms

ImageClassification(
    crop_size=[224]
    resize_size=[232]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)
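
To make the printed `mean` and `std` values concrete, here's a rough sketch of what the normalisation step does to a single pixel value. It assumes the transform first scales pixel values from [0, 255] to [0, 1] and then applies `(value - mean) / std` per channel. `normalize_pixel` is a hypothetical helper for illustration, it isn't part of torchvision:

```python
# Per-channel statistics copied from the transforms printed above
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(value: int, channel: int) -> float:
    """Scale a 0-255 pixel value to [0, 1], then standardise it for one channel."""
    scaled = value / 255
    return (scaled - MEAN[channel]) / STD[channel]

# A mid-grey pixel (128) in each of the R, G, B channels
for channel, name in enumerate(["R", "G", "B"]):
    print(f"{name}: {normalize_pixel(128, channel):.3f}")
```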

In [105]:
# Create model
model = torchvision.models.resnet50(weights=model_weights)
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

In [106]:
# Count the number of parameters in the model
total_params = sum(
    param.numel() for param in model.parameters() # Count all parameters
    # param.numel() for param in model.parameters() if param.requires_grad # only count parameters that are trainable
)
total_params

25557032

**NOTE:** PyTorch 2.0 *relative* speedups will be most noticeable when as much of the GPU as possible is being used. This means a larger model (more trainable parameters) may take longer to train overall but will be relatively faster for its size. E.g. a model with 1M parameters may take ~10min to train, while a model with 25M parameters (25x larger) may take only ~20min (2x longer).
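
The example numbers in the note can be worked through directly. The figures below are the illustrative ones from the note, not measurements, and the variable names are only for this sketch:

```python
# Illustrative numbers from the note above (not measured results)
small_model = {"params": 1_000_000, "train_minutes": 10}
large_model = {"params": 25_000_000, "train_minutes": 20}

param_ratio = large_model["params"] / small_model["params"]                # how much bigger
time_ratio = large_model["train_minutes"] / small_model["train_minutes"]   # how much longer
relative_efficiency = param_ratio / time_ratio                             # size gained per unit of extra time

print(f"{param_ratio:.0f}x the parameters in {time_ratio:.0f}x the time")
print(f"-> {relative_efficiency:.1f}x more parameters trained per minute")
```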


In [107]:
def create_model(num_classes=10):
    """
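    Creates a ResNet50 model and its matching transforms and returns them both.

    Note: the weights object is only used to get the transforms, the model
    itself is created without pretrained weights (randomly initialised).
    """
    
    model_weights = torchvision.models.ResNet50_Weights.DEFAULT
    transforms = model_weights.transforms()
    model = torchvision.models.resnet50() # no weights passed, so parameters are randomly initialised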
    
    # Adjust the head layer to suit our number of classes
    model.fc = torch.nn.Linear(in_features=2048,
                              out_features=num_classes)
    
    return model, transforms

model, transforms = create_model()

transforms

ImageClassification(
    crop_size=[224]
    resize_size=[232]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)

### 2.2 Speedups are most noticeable when a large portion of the GPU(s) is being used

Since modern GPUs are *fast* at performing operations, you will often notice the majority of *relative* speedups when as much data as possible is on the GPU.

In practice, you generally want to use as much of the GPU memory as possible. Ways to do this include:

* Increasing the batch size
* Increasing the data size - e.g. instead of using images that are 32x32, use 224x224 or 336x336; you could also use an increased embedding size for your data
* Increasing the model size - e.g. instead of using a model with 1M parameters, use a model with 10M parameters
* Decreasing data transfer - bandwidth costs (transferring data) slow down a GPU, because it wants to be computing on data rather than waiting for it

As a result of doing the above, your relative speedups should be better.

E.g. overall training time will be longer, but it won't increase linearly.

Resource for learning how to improve PyTorch model speed: https://sebastianraschka.com/blog/2023/pytorch-faster.html

**NOTE:** This concept of using as much data on the GPU as possible isn't restricted specifically to PyTorch 2.0, it applies to all versions of PyTorch and basically all models that train on GPUs.
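
To get a feel for how batch size and image size affect how much data sits on the GPU, here's a back-of-the-envelope sketch. It only counts the raw input batch stored as float32 values (activations, gradients and model weights take up far more memory), and `batch_megabytes` is a hypothetical helper:

```python
def batch_megabytes(batch_size: int, image_size: int, channels: int = 3, bytes_per_value: int = 4) -> float:
    """Size of one input batch in MB, assuming square images stored as float32."""
    num_values = batch_size * channels * image_size * image_size
    return num_values * bytes_per_value / 1e6

for image_size in (32, 224, 336):
    print(f"batch_size=128, image_size={image_size}: {batch_megabytes(128, image_size):.1f} MB")
```

Going from 32x32 to 224x224 images means 49x more input values per batch for the same batch size.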

### 2.3 Checking the memory limits for our GPU

In [108]:
# Check available GPU memory and total GPU memory
total_free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()

print(f"Total free GPU memory: {round(total_free_gpu_memory * 1e-9, 3)} GB")
print(f"Total GPU memory: {round(total_gpu_memory * 1e-9, 3)} GB")

Total free GPU memory: 0.021 GB
Total GPU memory: 17.067 GB


    ---- P100 ----
    
    Total free GPU memory: 16.509 GB
    Total GPU memory: 17.067 GB
    
    ---- 2xT4 ----
    
    Total free GPU memory: 15.453 GB
    Total GPU memory: 15.836 GB
    
    

* If the GPU has 16+ GB of free memory, set batch size to 128 and image size to 224
* If the GPU has less than 16 GB of free memory, set batch size to 32 and image size to 128

In [109]:
# Set batch size depending on amount of GPU memory
total_free_gpu_memory_gb = round(total_free_gpu_memory * 1e-9, 3)
if total_free_gpu_memory_gb >= 16:
  BATCH_SIZE = 128 # Note: you could experiment with higher values here if you like.
  IMAGE_SIZE = 224
  print(f"GPU memory available is {total_free_gpu_memory_gb} GB, using batch size of {BATCH_SIZE} and image size {IMAGE_SIZE}")
else:
  BATCH_SIZE = 32
  IMAGE_SIZE = 128
  print(f"GPU memory available is {total_free_gpu_memory_gb} GB, using batch size of {BATCH_SIZE} and image size {IMAGE_SIZE}")

GPU memory available is 0.021 GB, using batch size of 32 and image size 128
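
Note that the values above use decimal gigabytes (bytes x 1e-9), whereas `nvidia-smi` reports MiB (1 MiB = 2**20 bytes). Here's a small sketch of the difference, using the 16384 MiB capacity from the `nvidia-smi` output earlier. The helper names are hypothetical:

```python
def bytes_to_gb(num_bytes: int) -> float:
    """Decimal gigabytes (1 GB = 1e9 bytes), as used in the cells above."""
    return num_bytes * 1e-9

def bytes_to_gib(num_bytes: int) -> float:
    """Binary gibibytes (1 GiB = 2**30 bytes), the unit family nvidia-smi uses."""
    return num_bytes / 2**30

capacity_bytes = 16384 * 2**20  # 16384 MiB from the nvidia-smi output

print(f"{bytes_to_gb(capacity_bytes):.3f} GB")
print(f"{bytes_to_gib(capacity_bytes):.3f} GiB")
```

This is why a "16GB" card can report more than 16 in decimal GB, and why the `>= 16` threshold is met by the P100 reading shown above (16.509 GB free) but not by the T4 reading (15.453 GB free).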


In [110]:
transforms

ImageClassification(
    crop_size=[224]
    resize_size=[232]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)

In [111]:
transforms.crop_size = 224
transforms.resize_size = 224
print(f"Updated data transforms:\n{transforms}")

Updated data transforms:
ImageClassification(
    crop_size=224
    resize_size=224
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)


### 2.4 More potential speedups with TF32

TF32 = TensorFloat32, a datatype from NVIDIA that bridges Float32 and Float16.

* Float32 = a number is represented by 32 bits
* Float16 = a number is represented by 16 bits
* TensorFloat32 = keeps the 8-bit exponent (range) of Float32 but uses the 10-bit mantissa (precision) of Float16

See more on precision in computing: https://en.wikipedia.org/wiki/Precision_(computer_science)

What we want is:
1. Fast model training
2. Accurate model training

TF32 aims for both: the range of Float32 with faster, lower-precision matrix multiplications.

TF32 is available on Ampere GPUs and newer (compute capability 8.0+) -> https://www.nvidia.com/en-us/data-center/ampere-architecture/
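
To see what a shorter mantissa means in practice, here's a rough pure-Python sketch that mimics TF32 precision by keeping only the top 10 mantissa bits of a float32 value. This is a simplification: it truncates rather than rounds, real TF32 is applied inside the GPU's matrix-multiply hardware, and `truncate_to_tf32` is a hypothetical helper:

```python
import struct

def truncate_to_tf32(x: float) -> float:
    """Approximate TF32 precision by zeroing the low 13 mantissa bits of a float32."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep sign (1 bit) + exponent (8 bits) + top 10 of the 23 mantissa bits
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for value in (1.0, 1.0 + 2**-10, 1.0 + 2**-11, 3.14159265):
    print(f"{value!r} -> {truncate_to_tf32(value)!r}")
```

Differences smaller than about 2**-10 relative to the value are lost, which is the precision traded away for speed.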



In [112]:
GPU_SCORE

(6, 0)

In [113]:
if GPU_SCORE >= (8, 0): # check if GPU is compatible with TF32
    print(f"[INFO] Using GPU with score: {GPU_SCORE}, enabling TensorFloat32")
    torch.backends.cuda.matmul.allow_tf32 = True
else:
    print(f"[INFO] Using GPU with score: {GPU_SCORE}, TensorFloat32 not available")
    torch.backends.cuda.matmul.allow_tf32 = False

[INFO] Using GPU with score: (6, 0), TensorFloat32 not available


### 2.5 Preparing datasets

As discussed before, we're going to use CIFAR10.

Homepage: https://www.cs.toronto.edu/~kriz/cifar.html

We can download the dataset from torchvision - https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR10.html

In [114]:
# Create train and test datasets
import torchvision
train_dataset = torchvision.datasets.CIFAR10(root=".", # where to store data
                                             train=True, # do we want training dataset?
                                             download=True,
                                             transform=transforms
                                             )

test_dataset = torchvision.datasets.CIFAR10(root=".",
                                            train=False,
                                            download=True,
                                            transform=transforms)

# Get the length of the datasets
train_len = len(train_dataset)
test_len = len(test_dataset)

print(f"[INFO] Train dataset length: {train_len}")
print(f"[INFO] Test dataset length: {test_len}")

Files already downloaded and verified
Files already downloaded and verified
[INFO] Train dataset length: 50000
[INFO] Test dataset length: 10000


In [115]:
train_dataset[0][0].shape

torch.Size([3, 224, 224])

In [116]:
train_dataset[0][1] # label (integer class index, see train_dataset.classes for the class names)

6

### 2.6 Create DataLoaders

Next:
* Turn datasets into DataLoaders

In [117]:
from torch.utils.data import DataLoader

import os
NUM_WORKERS = os.cpu_count() # use all available CPU cores to load data for the GPU

train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=NUM_WORKERS)

test_dataloader = DataLoader(dataset=test_dataset,
                             batch_size=BATCH_SIZE,
                             shuffle=False,
                             num_workers=NUM_WORKERS)

# Print details:
print(f"Train dataloader num batches: {len(train_dataloader)} of batch size: {BATCH_SIZE}")
print(f"Test dataloader num batches: {len(test_dataloader)} of batch size: {BATCH_SIZE}")
print(f"Using num workers to load data (more is generally better): {NUM_WORKERS}")

Train dataloader num batches: 1563 of batch size: 32
Test dataloader num batches: 313 of batch size: 32
Using num workers to load data (more is generally better): 4
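
The batch counts above follow from the dataset lengths and the batch size: with the default `drop_last=False`, a DataLoader yields `ceil(num_samples / batch_size)` batches, and the last batch is smaller if the division isn't exact. A quick check (`num_batches` is a hypothetical helper):

```python
import math

def num_batches(num_samples: int, batch_size: int) -> int:
    """Number of batches when the final partial batch is kept (drop_last=False)."""
    return math.ceil(num_samples / batch_size)

print(num_batches(50_000, 32))  # train dataloader: 1563
print(num_batches(10_000, 32))  # test dataloader: 313
print(50_000 % 32)              # samples in the final train batch: 16
```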


### 2.7 Creating training and test loops

Want to create:
* Training and test loops + timing step for each, so we know how long our models take to train/test

In [118]:
import time
from tqdm.auto import tqdm
from typing import Dict, List, Tuple

def train_step(epoch: int,
               model: torch.nn.Module, 
               dataloader: torch.utils.data.DataLoader, 
               loss_fn: torch.nn.Module, 
               optimizer: torch.optim.Optimizer,
               device: torch.device,
               disable_progress_bar: bool = False) -> Tuple[float, float]:
    # Put model in train mode
    model.train()
    
    # Setup train loss and train acc values
    train_loss, train_acc = 0, 0
    
    # Loop through data loader and data batches
    progress_bar=tqdm(
        enumerate(dataloader),
        desc=f"Training Epoch {epoch}",
        total=len(dataloader),
        disable=disable_progress_bar
        )
    
    for batch, (X, y) in progress_bar:
        # Send data to target device
        X, y = X.to(device), y.to(device)
        
        # 1. Forward pass
        y_pred = model(X)
        
        # 2. Calculate and accumulate loss
        loss = loss_fn(y_pred, y)
        train_loss += loss.item()
        
        # 3. Optimizer zero grad
        optimizer.zero_grad()
        
        # 4. Loss backward
        loss.backward()
        
        # 5. Optimizer step
        optimizer.step()
        
        # Calculate and accumulate accuracy metric across all batches
        y_pred_class = torch.argmax(torch.softmax(y_pred, dim=1), dim=1)
        train_acc += (y_pred_class == y).sum().item()/len(y_pred)
        
        # Update progress bar
        progress_bar.set_postfix(
            {
                "train_loss":train_loss/(batch+1),
                "train_acc":train_acc/(batch+1),
            }
        )
    
    # Adjust metrics to get average loss and accuracy per batch
    train_loss = train_loss / len(dataloader)
    train_acc = train_acc / len(dataloader)
    return train_loss, train_acc

def test_step(epoch: int,
             model: torch.nn.Module,
             dataloader: torch.utils.data.DataLoader,
             loss_fn: torch.nn.Module,
             device: torch.device,
             disable_progress_bar: bool = False) -> Tuple[float, float]:
    
    # Put model in eval mode
    model.eval()
    
    # Setup test loss and test accuracy values
    test_loss, test_acc = 0, 0
    
    # Loop through data loader data batches
    progress_bar = tqdm(
        enumerate(dataloader),
        desc=f"Testing Epoch {epoch}",
        total=len(dataloader),
        disable=disable_progress_bar
    )
    
    # Turn on inference context manager
    with torch.no_grad(): # no_grad() required for PyTorch 2.0
        # Loop through DataLoader batches
        for batch, (X, y) in progress_bar:
            # Send data to target device
            X, y = X.to(device), y.to(device)
            
            # 1. Forward pass
            test_pred_logits = model(X)
            
            # 2. Calculate and accumulate loss
            loss = loss_fn(test_pred_logits, y)
            test_loss += loss.item()
            
            # 3. Calculate and accumulate accuracy
            test_pred_labels = test_pred_logits.argmax(dim=1)
            test_acc += ((test_pred_labels == y).sum().item()/len(test_pred_labels))
            
            # Update progress bar
            progress_bar.set_postfix(
                {
                    "test_loss": test_loss / (batch + 1),
                    "test_acc": test_acc / (batch + 1),
                }
            )
            
    # Adjust metrics to get average loss and accuracy per batch
    test_loss = test_loss / len(dataloader)
    test_acc = test_acc / len(dataloader)
    return test_loss, test_acc

def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device,
          disable_progress_bar: bool = False) -> Dict[str, List]:
    
    # Create empty results dictionary
    results = {"train_loss": [],
        "train_acc": [],
        "test_loss": [],
        "test_acc": [],
        "train_epoch_time": [],
        "test_epoch_time": []
    }
    
    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs), disable=disable_progress_bar):

        # Perform training step and time it
        train_epoch_start_time = time.time()
        train_loss, train_acc = train_step(epoch=epoch, 
                                        model=model,
                                        dataloader=train_dataloader,
                                        loss_fn=loss_fn,
                                        optimizer=optimizer,
                                        device=device,
                                        disable_progress_bar=disable_progress_bar)
        train_epoch_end_time = time.time()
        train_epoch_time = train_epoch_end_time - train_epoch_start_time
      
        # Perform testing step and time it
        test_epoch_start_time = time.time()
        test_loss, test_acc = test_step(epoch=epoch,
                                      model=model,
                                      dataloader=test_dataloader,
                                      loss_fn=loss_fn,
                                      device=device,
                                      disable_progress_bar=disable_progress_bar)
        test_epoch_end_time = time.time()
        test_epoch_time = test_epoch_end_time - test_epoch_start_time
        
        # Print out what's happening
        print(
            f"Epoch: {epoch+1} | "
            f"train_loss: {train_loss:.4f} | "
            f"train_acc: {train_acc:.4f} | "
            f"test_loss: {test_loss:.4f} | "
            f"test_acc: {test_acc:.4f} | "
            f"train_epoch_time: {train_epoch_time:.4f} | "
            f"test_epoch_time: {test_epoch_time:.4f}"
        )
        
        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)
        results["train_epoch_time"].append(train_epoch_time)
        results["test_epoch_time"].append(test_epoch_time)

    # Return the filled results at the end of the epochs
    return results
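
One detail worth knowing about the loops above: accuracy is averaged *per batch*, so a smaller final batch counts as much as a full one. The difference from a per-sample average is usually tiny, but here's a small sketch with made-up numbers showing the two calculations (the batch sizes and correct counts are hypothetical):

```python
# Hypothetical (num_correct, batch_size) pairs: a full batch and a smaller final batch
batches = [(16, 32), (16, 16)]

# Per-batch average (what train_step/test_step above compute)
per_batch_acc = sum(correct / size for correct, size in batches) / len(batches)

# Per-sample average (weights every sample equally)
per_sample_acc = sum(correct for correct, _ in batches) / sum(size for _, size in batches)

print(f"Per-batch accuracy:  {per_batch_acc:.4f}")
print(f"Per-sample accuracy: {per_sample_acc:.4f}")
```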

## 3. Time models across a single run

Experiment 1: single run without `torch.compile()` for 5 epochs

### 3.1 Experiment 1 - single run, no compile

In [119]:
# Set the number of epochs
NUM_EPOCHS = 5

# Set the learning rate as a constant
LEARNING_RATE = 0.003

**NOTE:** Depending on your GPU/machine, the following code may take a while to run.

In [120]:
# Create model
model, transforms = create_model()
# model
model.to(device)

# Create loss function and optimizer
loss_fn = torch.nn.CrossEntropyLoss() # CrossEntropyLoss for multiclass classification, BCEWithLogitsLoss for binary classification
optimizer = torch.optim.Adam(model.parameters(),
                            lr=LEARNING_RATE)

# Train the model and track the results
single_run_no_compile_results = train(model=model,
                                     train_dataloader=train_dataloader,
                                     test_dataloader=test_dataloader,
                                     loss_fn=loss_fn,
                                     optimizer=optimizer,
                                     epochs=NUM_EPOCHS,
                                     device=device)

  0%|          | 0/5 [00:00<?, ?it/s]

Training Epoch 0:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 0:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.8324 | train_acc: 0.3217 | test_loss: 1.5045 | test_acc: 0.4408 | train_epoch_time: 281.1971 | test_epoch_time: 19.4332


Training Epoch 1:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 1:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 2 | train_loss: 1.3030 | train_acc: 0.5232 | test_loss: 1.1850 | test_acc: 0.5826 | train_epoch_time: 280.6531 | test_epoch_time: 19.5310


Training Epoch 2:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 2:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 3 | train_loss: 0.9953 | train_acc: 0.6473 | test_loss: 1.0610 | test_acc: 0.6323 | train_epoch_time: 280.1993 | test_epoch_time: 19.4261


Training Epoch 3:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 3:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 4 | train_loss: 0.8256 | train_acc: 0.7111 | test_loss: 0.7304 | test_acc: 0.7436 | train_epoch_time: 280.3763 | test_epoch_time: 19.5309


Training Epoch 4:   0%|          | 0/1563 [00:00<?, ?it/s]

Testing Epoch 4:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch: 5 | train_loss: 0.6583 | train_acc: 0.7709 | test_loss: 0.6439 | test_acc: 0.7756 | train_epoch_time: 280.1874 | test_epoch_time: 19.3941


In [121]:
single_run_no_compile_results

{'train_loss': [1.832363571177975,
  1.3029596147747735,
  0.9953359342201047,
  0.8255518803173963,
  0.658285133364257],
 'train_acc': [0.321717050543826,
  0.5232125719769674,
  0.6473328534868842,
  0.7111324376199616,
  0.7708933141394754],
 'test_loss': [1.504468339319808,
  1.1850133789613986,
  1.0609695787627857,
  0.7303736295562964,
  0.6439015637761869],
 'test_acc': [0.4407947284345048,
  0.5825678913738019,
  0.6322883386581469,
  0.7436102236421726,
  0.7755591054313099],
 'train_epoch_time': [281.1970942020416,
  280.6530992984772,
  280.1992573738098,
  280.3763077259064,
  280.1873722076416],
 'test_epoch_time': [19.43321681022644,
  19.531025648117065,
  19.42613410949707,
  19.530938386917114,
  19.394145727157593]}
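
The per-epoch times are easier to compare across experiments once averaged. Here's a small sketch using the values from the results dictionary above (rounded to 4 decimal places). `mean_epoch_times` is a hypothetical helper:

```python
from statistics import mean

# Epoch times (seconds) copied from single_run_no_compile_results above, rounded
results = {
    "train_epoch_time": [281.1971, 280.6531, 280.1993, 280.3763, 280.1874],
    "test_epoch_time": [19.4332, 19.5310, 19.4261, 19.5309, 19.3941],
}

def mean_epoch_times(results: dict) -> dict:
    """Return the mean of every '*_epoch_time' list in a results dictionary (seconds)."""
    return {key: mean(values) for key, values in results.items() if key.endswith("_epoch_time")}

for key, value in mean_epoch_times(results).items():
    print(f"Mean {key}: {value:.2f} seconds")
```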

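### 3.2 Experiment 2 - single run, using `torch.compile()`

Same setup as experiment 1 except with the new line `torch.compile()`. Note that `torch.compile()` itself returns almost immediately, the actual compilation happens lazily the first time the compiled model is called.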

In [123]:
# Create model and transforms
model, _ = create_model()
model.to(device)

# Create loss function and optimizer
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),
                            lr=LEARNING_RATE)

# Compile the model (requires PyTorch 2.0+)
import time
compile_start_time = time.time()

### New in PyTorch 2.x ###
compiled_model = torch.compile(model)

compile_end_time = time.time()
compile_time = compile_end_time - compile_start_time

print(f"Time to compile: {compile_time} | NOTE: the first time you compile a model/train a compiled model, the first epoch may take longer due to optimizations happening behind the scenes")

# Train the compiled model
single_run_compile_results = train(model=compiled_model,
                                  train_dataloader=train_dataloader,
                                  test_dataloader=test_dataloader,
                                  loss_fn=loss_fn,
                                  optimizer=optimizer,
                                  epochs=NUM_EPOCHS,
                                  device=device)

Time to compile: 0.0007686614990234375 | NOTE: the first time you compile a model/train a compiled model, the first epoch may take longer due to optimizations happening behind the scenes


  0%|          | 0/5 [00:00<?, ?it/s]

Training Epoch 0:   0%|          | 0/1563 [00:00<?, ?it/s]

BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found Tesla P100-PCIE-16GB which is too old to be supported by the triton GPU compiler, which is used as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 6.0

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
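
The error above says the Triton backend needs CUDA capability 7.0 or higher, while this GPU (Tesla P100) has 6.0. One option is to check the capability before compiling and otherwise fall back to the uncompiled (eager) model. Here's a minimal sketch of that check. `TRITON_MIN_CAPABILITY` and `can_use_inductor` are hypothetical names, and the threshold is taken from the error message rather than from any PyTorch API:

```python
from typing import Tuple

# Assumption: minimum capability copied from the error message above
TRITON_MIN_CAPABILITY = (7, 0)

def can_use_inductor(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) CUDA capability meets the Triton minimum."""
    # Tuples compare element by element, so (7, 5) >= (7, 0) and (6, 0) < (7, 0)
    return capability >= TRITON_MIN_CAPABILITY

for name, capability in [("Tesla P100", (6, 0)), ("capability 7.5 GPU", (7, 5)), ("capability 8.0 GPU", (8, 0))]:
    print(f"{name}: can use inductor backend = {can_use_inductor(capability)}")
```

In this notebook that could look like `model_to_train = torch.compile(model) if can_use_inductor(GPU_SCORE) else model`, with `GPU_SCORE` being the tuple from `torch.cuda.get_device_capability()`.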

