# **Part 1: Run MobileNet on GPU**

In this tutorial, we will explore how to train a neural network with PyTorch.

### Setup (5%)

We first install the packages used in this tutorial and register the path of the CUDA libraries:

In [None]:
!pip install torchprofile 1>/dev/null
!ldconfig /usr/lib64-nvidia 2>/dev/null
!pip install onnx 1>/dev/null
!pip install onnxruntime 1>/dev/null

We will then import a few libraries:

In [2]:
import random

import numpy as np
import torch
import torchvision
from torch import nn
from torch.optim import *
from torch.optim.lr_scheduler import *
from torch.utils.data import DataLoader
from torchprofile import profile_macs
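from torchvision.datasets import *
from torchvision.transforms import *
# Explicit import: the data cell below refers to transforms.Compose, transforms.Resize, etc.
from torchvision import transforms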
from tqdm.auto import tqdm

In [3]:
print(torch.__version__)
print(torchvision.__version__)

2.5.1+cu124
0.20.1+cu124


To ensure reproducibility, we fix the seeds of the random number generators:

In [4]:
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x787d247a2f30>
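
To see what seeding buys us, here is a minimal standard-library sketch. The helper `draw` is hypothetical and not part of the lab; it only illustrates that generators created with the same seed produce the same sequence, which is the property the cell above relies on.

```python
import random


def draw(seed, n=3):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    return [rng.random() for _ in range(n)]


first = draw(0)
second = draw(0)
other = draw(1)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```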

We must decide the hyperparameters before training the model:

In [5]:
NUM_CLASSES = 10

# TODO:
# Decide your own hyper-parameters
BATCH_SIZE = 32
LEARNING_RATE = 1e-5
NUM_EPOCH = 10

### Data  (5%)

In this lab, we will use CIFAR-10 as our target dataset. This dataset contains images from 10 classes, where each image is of size 3x32x32, i.e. a 3-channel color image of 32x32 pixels.

Before using the data as input, we pre-process it with a transform pipeline:

In [6]:
# TODO:
# Resize images to 224x224, i.e., the input image size of MobileNet,
# Convert images to PyTorch tensors, and
# Normalize the images with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
transform = transforms.Compose([
    transforms.Resize((224, 224)),                    # Resize to 224x224
    transforms.ToTensor(),                            # Convert to PyTorch Tensor
    transforms.Normalize(                             # Normalize with MobileNet's mean and std
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

dataset = {}
for split in ["train", "test"]:
  dataset[split] = CIFAR10(
    root="data/cifar10",
    train=(split == "train"),
    download=True,
    transform=transform,
  )

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar10/cifar-10-python.tar.gz


100%|██████████| 170M/170M [00:05<00:00, 30.0MB/s]


Extracting data/cifar10/cifar-10-python.tar.gz to data/cifar10
Files already downloaded and verified
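
The normalization step in the transform applies `(x - mean) / std` to each channel. The sketch below reproduces that arithmetic for a single RGB pixel in plain Python, assuming pixel values already scaled to [0, 1] as `ToTensor()` produces. The helper `normalize_pixel` is hypothetical and exists only for illustration.

```python
# Channel statistics used by the transform above
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (x - mean) / std per channel to one RGB pixel scaled to [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, MEAN, STD)]


black = normalize_pixel([0.0, 0.0, 0.0])
white = normalize_pixel([1.0, 1.0, 1.0])

print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

After normalization, the inputs span roughly -2.1 to 2.6 instead of 0 to 1, which is the range the ImageNet-pretrained weights were trained on.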


To train a neural network, we will need to feed data in batches.

We create data loaders with the batch size chosen in the Setup section:

In [7]:
dataflow = {}
for split in ['train', 'test']:
  dataflow[split] = DataLoader(
    dataset[split],
    batch_size=BATCH_SIZE,
    shuffle=(split == 'train'),
    num_workers=0,
    pin_memory=True,
    drop_last=True
  )

We can print the data type and shape from the training data loader:

In [8]:
for inputs, targets in dataflow["train"]:
  print(f"[inputs] dtype: {inputs.dtype}, shape: {inputs.shape}")
  print(f"[targets] dtype: {targets.dtype}, shape: {targets.shape}")
  break

[inputs] dtype: torch.float32, shape: torch.Size([32, 3, 224, 224])
[targets] dtype: torch.int64, shape: torch.Size([32])
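
Because the loaders use `drop_last=True`, an incomplete final batch is skipped. The following sketch works out how many batches each split yields, using the CIFAR-10 split sizes of 50,000 training and 10,000 test images. The helper `num_batches` is hypothetical.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a loader yields for a dataset of num_samples items."""
    if drop_last:
        return num_samples // batch_size        # incomplete last batch is skipped
    return math.ceil(num_samples / batch_size)  # incomplete last batch is kept


BATCH_SIZE = 32
TRAIN_SIZE, TEST_SIZE = 50_000, 10_000  # CIFAR-10 split sizes

train_batches = num_batches(TRAIN_SIZE, BATCH_SIZE, drop_last=True)
test_batches = num_batches(TEST_SIZE, BATCH_SIZE, drop_last=True)
skipped_test = TEST_SIZE - test_batches * BATCH_SIZE

print(train_batches, test_batches, skipped_test)
```

With a batch size of 32, the test loader leaves out 16 images, so the accuracy reported later is computed over 9,984 test images rather than the full 10,000.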


### Model (10%)

In this tutorial, we will import the MobileNetV2 model provided by torchvision and use its pre-trained weights:

In [9]:
# TODO:
# Load pre-trained MobileNetV2
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
model = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
print(model)

Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /root/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
100%|██████████| 13.6M/13.6M [00:00<00:00, 81.0MB/s]


MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

You should observe that the output dimension of the classifier does not match the number of classes in CIFAR-10.

Now change the output dimension of the classifier to the number of classes:

In [10]:
# TODO:
# Change the output dimension of the classifier to the number of classes
model.classifier[1] = nn.Linear(1280, NUM_CLASSES)
print(model)

# Send the model from cpu to gpu
model = model.cuda()

MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

Now the output dimension of the classifier matches the number of classes.

As this course focuses on efficiency, we will then inspect its model size and (theoretical) computation cost.


* The model size can be estimated by the number of trainable parameters:

In [11]:
num_params = 0
for param in model.parameters():
  if param.requires_grad:
    num_params += param.numel()
print("#Params:", num_params)

#Params: 2236682
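
As a sanity check on this count, the sketch below computes how many parameters the classifier head contributes. It assumes a fully connected layer with bias, as created by `nn.Linear(1280, NUM_CLASSES)`, and a 1000-class head for the original ImageNet model. The helper `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


TOTAL_PARAMS = 2_236_682              # printed by the cell above
new_head = linear_params(1280, 10)    # CIFAR-10 classifier
old_head = linear_params(1280, 1000)  # 1000-class ImageNet classifier (assumed)
backbone = TOTAL_PARAMS - new_head    # everything except the classifier

print(new_head, old_head, backbone, backbone + old_head)
```

Replacing the head removes about 1.27M parameters, which is why the modified model has 2.2M parameters instead of the roughly 3.5M implied by `backbone + old_head`.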


* The computation cost can be estimated by the number of [multiply–accumulate operations (MACs)](https://en.wikipedia.org/wiki/Multiply–accumulate_operation), measured here with [TorchProfile](https://github.com/zhijian-liu/torchprofile). We will use this profiling tool again in future labs.

In [12]:
num_macs = profile_macs(model, torch.zeros(1, 3, 224, 224).cuda())
print("#MACs:", num_macs)

#MACs: 306186464
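
To get a feel for where these MACs come from, the sketch below counts them by hand for three layers from the printed model. It assumes the common convention of one MAC per multiply-add in the convolution or linear layer and ignores BatchNorm and activations; whether a given profiler counts exactly this way is an assumption. The helpers `conv_out_size` and `conv_macs` are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(in_ch, out_ch, kernel, out_size, groups=1):
    """MACs of a square convolution: one k*k*(in_ch/groups) dot product per output value."""
    return out_ch * out_size * out_size * (in_ch // groups) * kernel * kernel


# features[0]: Conv2d(3, 32, kernel 3, stride 2, padding 1) on a 224x224 input
out0 = conv_out_size(224, kernel=3, stride=2, padding=1)
stem = conv_macs(3, 32, 3, out0)

# features[1]: depthwise Conv2d(32, 32, kernel 3, stride 1, padding 1, groups=32)
out1 = conv_out_size(out0, kernel=3, stride=1, padding=1)
depthwise = conv_macs(32, 32, 3, out1, groups=32)

# classifier: Linear(1280, 10)
head = 1280 * 10

print(out0, stem, depthwise, head)
```

The depthwise layer has more output channels than input channels of the stem, yet costs about a third of its MACs, which illustrates why MobileNet relies on grouped convolutions.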


This model has 2.2M parameters and requires 306M MACs for inference. We will work together in the next few labs to improve its efficiency.

### Optimization (10%)

As we are working on a classification problem, we will apply [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) as our loss function to optimize the model:

In [13]:
# TODO:
# Apply cross entropy as our loss function
criterion = nn.CrossEntropyLoss()
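
For a single sample, cross entropy on raw logits is the negative log-softmax probability of the target class. The plain-Python sketch below shows that computation; the function `cross_entropy` is a hypothetical stand-in for illustration, not the library implementation (which, by default, also averages over the batch).

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (one sample)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


confident = cross_entropy([2.0, 1.0, 0.1], target=0)  # target has the largest logit
uniform = cross_entropy([0.0] * 10, target=3)         # no preference among 10 classes

print(round(confident, 4), round(uniform, 4))
```

The uniform case equals ln(10), which is a useful reference point: a 10-class model whose loss is near 2.30 is doing no better than guessing.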

Next, we choose an optimizer for the model:

In [14]:
# TODO:
# Choose an optimizer.
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)

(Optional) We can apply a learning rate scheduler during the training:

In [15]:
# TODO(optional):
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)
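
A step schedule multiplies the learning rate by `gamma` every `step_size` scheduler steps. The sketch below evaluates the closed form for the settings above, assuming the scheduler is stepped once per epoch. The helper `step_lr` is hypothetical.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate after `epoch` scheduler steps of a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Settings from this notebook: LEARNING_RATE=1e-5, step_size=5, gamma=0.5, NUM_EPOCH=10
schedule = [step_lr(1e-5, epoch, step_size=5, gamma=0.5) for epoch in range(10)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Under that assumption, the first five epochs would train at 1e-5 and the last five at 5e-6.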

### Training (25%)

We first define the function that optimizes the model for one batch:

In [16]:
def train_one_batch(
  model: nn.Module,
  criterion: nn.Module,
  optimizer: Optimizer,
  inputs: torch.Tensor,
  targets: torch.Tensor,
  scheduler: LRScheduler
) -> None:

    # TODO:
    # Step 1: Reset the gradients (from the last iteration)
    optimizer.zero_grad()

    # Step 2: Forward inference
    outputs = model(inputs)

    # Step 3: Calculate the loss
    loss = criterion(outputs, targets)

    # Step 4: Backward propagation
    loss.backward()

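    # Step 5: Update the model parameters
    optimizer.step()

    # (Optional Step 6: scheduler)
    # Left disabled, so the learning rate stays constant in this notebook.
    # StepLR decays the rate every `step_size` calls to scheduler.step();
    # with step_size=5, stepping once per batch would halve it every 5 batches.
    # if scheduler:
    #     scheduler.step()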

    # Print the loss for monitoring (optional)
    # print(f"Loss: {loss.item():.4f}")
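
The same five steps can be traced by hand on a toy problem. The sketch below fits a single weight `w` in `y = w * x` with plain gradient descent on a mean squared error loss, computing the gradient analytically instead of calling `loss.backward()`. The data, learning rate, and step count are made up for illustration.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2 * x

w = 0.0    # single trainable parameter
lr = 0.05  # learning rate

for step in range(50):
    # Step 1: reset the gradient from the previous iteration
    grad = 0.0
    # Step 2: forward pass
    preds = [w * x for x in xs]
    # Step 3: mean squared error loss
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # Step 4: backward pass, d(loss)/dw computed by hand
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Step 5: parameter update
    w -= lr * grad

print(round(w, 4), loss)
```

In `train_one_batch`, autograd and the optimizer perform Steps 4 and 5 for every parameter of the network at once.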

We then define the training function:

In [17]:
def train(
    model: nn.Module,
    dataflow: DataLoader,
    criterion: nn.Module,
    optimizer: Optimizer,
    scheduler: LRScheduler
):

  model.train()

  for inputs, targets in tqdm(dataflow, desc='train', leave=False):
    # Move the data from CPU to GPU
    inputs = inputs.cuda()
    targets = targets.cuda()

    # Call train_one_batch function
    train_one_batch(model, criterion, optimizer, inputs, targets, scheduler)

Last, we define the evaluation function:

In [18]:
def evaluate(
  model: nn.Module,
  dataflow: DataLoader
) -> float:

    model.eval()
    num_samples = 0
    num_correct = 0

    with torch.no_grad():
        for inputs, targets in tqdm(dataflow, desc="eval", leave=False):
            # TODO:
            # Step 1: Move the data from CPU to GPU
            inputs = inputs.cuda()
            targets = targets.cuda()

            # Step 2: Forward inference
            outputs = model(inputs)

            # Step 3: Convert logits to class indices (predicted class)
            predicts = torch.argmax(outputs, dim=1)

            # Update metrics
            num_samples += targets.size(0)
            num_correct += (predicts == targets).sum()

    return (num_correct / num_samples * 100).item()
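
The metric itself is simple: take the index of the largest logit as the prediction and count matches. Here is a minimal plain-Python sketch on made-up logits; `accuracy_percent` is a hypothetical helper that mirrors the argmax and counting logic of `evaluate`.

```python
def accuracy_percent(logits_batch, targets):
    """Top-1 accuracy in percent for a batch of per-class scores."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits_batch]
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)


logits_batch = [
    [0.1, 2.3, -1.0],  # predicted class 1
    [1.5, 0.2, 0.3],   # predicted class 0
    [0.0, 0.1, 0.9],   # predicted class 2
    [0.4, 0.5, 0.2],   # predicted class 1
]
targets = [1, 0, 2, 0]  # the last sample is misclassified

acc = accuracy_percent(logits_batch, targets)
print(acc)
```

Like `evaluate`, the helper returns a percentage rather than a fraction.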

With training and evaluation functions, we can finally start training the model!

If the training is done properly, the accuracy should easily exceed 92.5% (the values printed below are in percent):

***Please screenshot the output model accuracy, hand in as YourID_acc_1.png***

In [19]:
for epoch_num in tqdm(range(1, NUM_EPOCH + 1)):
  train(model, dataflow["train"], criterion, optimizer, scheduler)
  acc = evaluate(model, dataflow["test"])
  print(f"epoch {epoch_num}:", acc)

print(f"final accuracy: {acc}")

  0%|          | 0/10 [00:00<?, ?it/s]

train:   0%|          | 0/1562 [00:00<?, ?it/s]

eval:   0%|          | 0/312 [00:00<?, ?it/s]

epoch 1: 87.3096923828125


epoch 2: 90.51482391357422
epoch 3: 91.82691955566406
epoch 4: 92.21754455566406
epoch 5: 92.9286880493164
epoch 6: 93.09896087646484
epoch 7: 93.33934020996094
epoch 8: 93.36939239501953
epoch 9: 93.27925109863281
epoch 10: 93.50961303710938
final accuracy: 93.50961303710938


Save the weights of the model as "model.pt":

In [20]:
# TODO:
# Save the model weight

torch.save(model.state_dict(), 'model.pt')

You will find "model.pt" in the current folder.

### Export Model (5%)

We can also export the model to the [ONNX format](https://pytorch.org/docs/stable/onnx_torchscript.html):

In [21]:
import torch.onnx

# TODO:
# Specify the input shape
# (the exporter traces the model with an example tensor of that shape)
dummy_input = torch.randn(1, 3, 224, 224).cuda()

onnx_path = 'model.onnx'

# TODO:
# Export the model to ONNX format
torch.onnx.export(model, dummy_input, onnx_path)

print(f"Model exported to {onnx_path}")

Model exported to model.onnx


Once the model is in ONNX format, we can inspect its structure using [Netron](https://netron.app/).

***Please download the model structure, hand in as YourID_onnx.png.***

### Inference (10%)

Load the saved model weight:



In [22]:
# TODO:
# Step 1: Get the model structure (mobilenet_v2 and the classifier)
loaded_model = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
loaded_model.classifier[1] = nn.Linear(1280, NUM_CLASSES)

# Step 2: Load the model weights from "model.pt".
# weights_only=True restricts unpickling to tensors and plain containers
loaded_model.load_state_dict(torch.load('model.pt', weights_only=True))

# Step 3: Send the model from cpu to gpu
loaded_model = loaded_model.cuda()


Run inference with the loaded model weights and check the accuracy:

***Please screenshot the output model accuracy, hand in as YourID_acc_2.png***

In [23]:
acc = evaluate(loaded_model, dataflow["test"])
print(f"accuracy: {acc}")

eval:   0%|          | 0/312 [00:00<?, ?it/s]

accuracy: 93.50961303710938


If the accuracy is the same as the accuracy before saving, you have completed Part 1.

Congratulations!

# **Part 2: LLM with torch.compile**

In Part 2, we will compare the inference speed of an LLM with and without ```torch.compile```.

```torch.compile``` was introduced in PyTorch 2.0.

The following tutorial will help you get to know the usage.

[Introduction to torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)

We will use ```Llama-3.2-1B-Instruct``` as our LLM.

Make sure you have been granted access to Llama before starting Part 2.

https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

### Loading LLM (20%)

We first install the Hugging Face Hub CLI and log in with your access token:

In [None]:
!pip install -U "huggingface_hub[cli]"
!huggingface-cli login

We choose Llama 3.2 1B Instruct as our LLM and load the pretrained model.

Model ID: **"meta-llama/Llama-3.2-1B-Instruct"**


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# TODO:
# Load the LLaMA 3.2 1B Instruct model
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

First, we need to decide on the prompt to feed into the LLM, as well as the maximum token length.

You can also change the number of timing iterations used in the following tests.

In [3]:
# TODO:
# Input prompt
# You can change the prompt to whatever you want, e.g. "How to learn a new language?", "What is Edge AI?"

prompt = "What is Edge AI?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
max_token_length = 1000
iter_times = 10

### Inference with torch.compile (10%)


Let's define a timer function to measure the speedup from ```torch.compile```:

In [4]:
def timed(fn):
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  start.record()
  result = fn()
  end.record()
  torch.cuda.synchronize()
  return result, start.elapsed_time(end) / 1000
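
The timer above uses CUDA events and a synchronize call because GPU work is launched asynchronously: a host-side clock could stop before the GPU has finished. For code that runs only on the CPU, a simpler analogue is enough. The sketch below is such an analogue built on `time.perf_counter`; the helper `timed_cpu` is hypothetical and is not a replacement for `timed` when measuring GPU inference.

```python
import time


def timed_cpu(fn):
    """Run fn() and return (result, elapsed seconds) using a host-side clock."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


total, seconds = timed_cpu(lambda: sum(range(1_000_000)))
print(total, f"{seconds:.6f} s")
```

Both helpers return seconds: `timed` divides by 1000 because `elapsed_time` reports milliseconds.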

After everything is set up, let's start!

We first run the inference without ```torch.compile```:


In [5]:
original_times = []

# Timing without torch.compile
for i in range(iter_times):
  with torch.no_grad():
    original_output, original_time = timed(lambda: model.generate(**inputs, max_length=max_token_length, pad_token_id=tokenizer.eos_token_id))
  original_times.append(original_time)
  print(f"Time taken without torch.compile: {original_time} seconds")

# Decode the output
output_text = tokenizer.decode(original_output[0], skip_special_tokens=True)
print(f"Output without torch.compile: {output_text}")

Time taken without torch.compile: 11.424921875 seconds
Time taken without torch.compile: 12.580337890625 seconds
Time taken without torch.compile: 9.7221005859375 seconds
Time taken without torch.compile: 11.7945615234375 seconds
Time taken without torch.compile: 9.242859375 seconds
Time taken without torch.compile: 10.7993837890625 seconds
Time taken without torch.compile: 8.480259765625 seconds
Time taken without torch.compile: 11.38494140625 seconds
Time taken without torch.compile: 7.37633837890625 seconds
Time taken without torch.compile: 10.5784658203125 seconds
Output without torch.compile: What is Edge AI? Edge AI is a type of artificial intelligence (AI) that operates at the edge of the network, closer to the data sources than traditional cloud-based AI systems. It enables real-time processing and analysis of data, reducing latency and improving the overall performance of applications.

Edge AI typically involves the deployment of AI models on edge devices, such as smartphones

Before using ```torch.compile```, we need to access the model's ```generation_config``` attribute and set ```cache_implementation``` to "static".

We then call ```torch.compile``` on the model's forward method to compile the forward pass with the static KV cache.

Reference: https://huggingface.co/docs/transformers/llm_optims?static-kv=basic+usage%3A+generation_config
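
A static KV cache is pre-allocated to a maximum length, so its tensors keep a fixed shape during generation. The sketch below estimates how much memory such a cache would take. The architecture numbers (16 layers, 8 key/value heads, head dimension 64) are assumptions about this model and should be checked against `model.config`; the helper `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_len, bytes_per_value, batch_size=1):
    """Size of a pre-allocated key/value cache in bytes."""
    # Factor 2: one tensor for keys and one for values in every layer
    return 2 * num_layers * num_kv_heads * head_dim * max_len * bytes_per_value * batch_size


size = kv_cache_bytes(
    num_layers=16,      # assumed, check model.config.num_hidden_layers
    num_kv_heads=8,     # assumed, check model.config.num_key_value_heads
    head_dim=64,        # assumed, check the model config
    max_len=1000,       # max_token_length from this notebook
    bytes_per_value=2,  # float16
)

print(size, f"{size / 2**20:.2f} MiB")
```

Under these assumptions the cache is small next to the 2.47 GB of model weights, so fixing its size up front costs little memory.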

In [8]:
compile_times = []

# Reminder: before compiling again, call torch._dynamo.reset() to clear all compilation caches and restore the system to its initial state.
import torch._dynamo
torch._dynamo.reset()

# TODO:
# Compile the model
compiled_model = model
compiled_model.generation_config.cache_implementation = "static"
compiled_model.forward = torch.compile(compiled_model.forward, mode="reduce-overhead", fullgraph=True)

# Timing with torch.compile
for i in range(iter_times):
  with torch.no_grad():
    compile_output, compile_time = timed(lambda: compiled_model.generate(**inputs, max_length=max_token_length, pad_token_id=tokenizer.eos_token_id))
  compile_times.append(compile_time)
  print(f"Time taken with torch.compile: {compile_time} seconds")

# Decode output
output_text = tokenizer.decode(compile_output[0], skip_special_tokens=True)
print(f"\nOutput with torch.compile: {output_text}")

Time taken with torch.compile: 42.86746875 seconds
Time taken with torch.compile: 6.3912197265625 seconds
Time taken with torch.compile: 6.1536083984375 seconds
Time taken with torch.compile: 7.27107080078125 seconds
Time taken with torch.compile: 9.065380859375 seconds
Time taken with torch.compile: 5.38320947265625 seconds
Time taken with torch.compile: 8.9884755859375 seconds
Time taken with torch.compile: 4.345921875 seconds
Time taken with torch.compile: 6.26843994140625 seconds
Time taken with torch.compile: 6.9929287109375 seconds

Output with torch.compile: What is Edge AI? Edge AI, or Edge Intelligence, refers to the processing and analysis of data at the edge of a network, such as a smartphone or a camera, rather than at a centralized server or cloud. This approach is gaining popularity due to the increasing availability of low-power, low-bandwidth devices and the need for real-time processing and analysis of data.

Edge AI involves the use of specialized hardware, such as ap

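We can observe that after the first call, which includes the one-time compilation cost (about 42.9 seconds here), the inference time drops to roughly 4.3 to 9.1 seconds per run.

The code below reports how much faster inference is with ```torch.compile```.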

***Please screenshot the inference time and speedup below, hand in as YourID_speedup.png***

In [9]:
import numpy as np
original_med = np.median(original_times)
compile_med = np.median(compile_times)
speedup = original_med / compile_med
print(f"Original median: {original_med},\nCompile median: {compile_med},\nSpeedup: {speedup}x")

Original median: 10.688924804687499,
Compile median: 6.69207421875,
Speedup: 1.5972513835454836x
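
The compiled median above still includes the first run, which pays for compilation. As a rough sketch of how the comparison changes when that warm-up run is excluded, the block below recomputes the speedup from the timings printed earlier, using only the standard library. Dropping the first run is an illustrative choice, not part of the assignment.

```python
from statistics import median

# Timings (seconds) copied from the outputs above
original_times = [
    11.424921875, 12.580337890625, 9.7221005859375, 11.7945615234375, 9.242859375,
    10.7993837890625, 8.480259765625, 11.38494140625, 7.37633837890625, 10.5784658203125,
]
compile_times = [
    42.86746875, 6.3912197265625, 6.1536083984375, 7.27107080078125, 9.065380859375,
    5.38320947265625, 8.9884755859375, 4.345921875, 6.26843994140625, 6.9929287109375,
]

speedup_all = median(original_times) / median(compile_times)
speedup_warm = median(original_times) / median(compile_times[1:])  # skip the compile run

print(f"all runs: {speedup_all:.3f}x, excluding first compiled run: {speedup_warm:.3f}x")
```

The median is already robust to the single 42.9-second outlier, so excluding it changes the estimate only modestly.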


You've finished part 2.

Congratulations!