# Lab5: Deep Learning on PYNQ
## Scope
In the previous lab, we learned how to map a traditional image-processing algorithm onto the FPGA using HLS.


In this lab, we will explore how to deploy a Quantised Neural Network (QNN) on our FPGA to perform keyword spotting (KWS).


We will finish this task with:
- Dataset: Google Speech V2 (preprocessed version, 12 classes, MFCC features extracted)
- Model:   QMLP (3-bit)
- Board:   PYNQ-Z2
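
To make "3-bit" concrete, the following minimal sketch maps a few floating-point weights onto a signed 3-bit grid. It is a simplified illustration (symmetric range, one scale per tensor taken from the largest absolute value), not necessarily the exact scheme Brevitas applies. The helper name `quantize_symmetric` and the example weights are hypothetical.

```python
def quantize_symmetric(values, bit_width):
    """Quantise floats to signed integers in [-(2**(b-1) - 1), 2**(b-1) - 1].

    Assumption: symmetric per-tensor scaling from the maximum absolute value.
    """
    q_max = 2 ** (bit_width - 1) - 1                 # 3 when bit_width == 3
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0  # float step between codes
    ints = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    dequant = [i * scale for i in ints]              # values the network "sees"
    return ints, scale, dequant


# Made-up example weights, for illustration only.
weights = [0.9, -0.5, 0.1, -0.9, 0.3]
ints, scale, dequant = quantize_symmetric(weights, bit_width=3)
print("integer codes:", ints)
print("scale:", scale)
print("dequantised:", dequant)
```

With only seven usable codes, several distinct float weights can collapse onto the same value, which is the source of the accuracy loss examined in Lab5 A.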


Lab5 consists of three parts:
- Lab5 A: Train a quantised model and compare the float NN with the QNN.
- Lab5 B (optional): Export the quantised model into a hardware design that can be executed on our PYNQ board.
- Lab5 C: Execute the model in a Jupyter notebook to benchmark its performance.

## Note
We encourage you to complete this lab in a FINN Docker environment, but given the limited time, you can also work in a regular conda/Python/Colab environment.


In Lab5 B, generating your own DNN IP must be done in the FINN Docker container. Alternatively, you can use the generated files provided on Blackboard to continue with Lab5 C, or ask the TA for a link to an online Jupyter server with a preconfigured environment to run your IP/overlay generation scripts.


For background on FINN and how to set up a FINN environment, the following links may be helpful:
- Environment setup: https://github.com/CNStanLee/start_with_finn.git
- FINN official docs: https://finn.readthedocs.io/en/latest/
- FINN github repo: https://github.com/Xilinx/finn
- FINN examples repo: https://github.com/Xilinx/finn-examples




# Lab5 A: Train A Quantised Model


## Set up the basic environment

In [79]:
! nvidia-smi

Fri Nov  7 21:39:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.03              Driver Version: 575.64.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   52C    P8              4W /   80W |      72MiB /   8188MiB |     26%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [80]:
! pip install brevitas==0.11.0
! pip install onnx
! pip install onnxscript
! pip install qonnx
! pip install onnxoptimizer

Defaulting to user installation because normal site-packages is not writeable


In [81]:
import os
from pathlib import Path
import urllib.request
import tarfile
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from brevitas.nn import QuantConv2d, QuantLinear, QuantReLU
import torch.nn as nn

In [82]:
root_path = Path("lab_new")  # replace with your root path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cpu


In [83]:
npz_path = root_path / "data" / "kws_12cls_mfcc_10x49_quant_flat.npz"
data_download_link = 'https://drive.google.com/file/d/1ndk0v3vCNPMWtzx9Kubg3jqYRRqDc-55/view?usp=sharing'
# Download from Google Drive to npz_path if the file does not exist yet

In [84]:
from pathlib import Path
import sys, subprocess

npz_path.parent.mkdir(parents=True, exist_ok=True)

def _ensure_gdown():
    try:
        import gdown
        return gdown
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "gdown"])
        import gdown
        return gdown

if not npz_path.exists():
    gdown = _ensure_gdown()
    out = gdown.download(url=data_download_link, output=str(npz_path), quiet=False, fuzzy=True)
    if not out or not Path(out).exists():
        raise RuntimeError(f"Download failed: {npz_path}")
    print(f"Downloaded to: {npz_path}")
else:
    print(f"File already exists: {npz_path}")


File already exists: lab_new/data/kws_12cls_mfcc_10x49_quant_flat.npz
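
If a download is interrupted or returns something other than the archive, the problem only surfaces later when `np.load` fails. Since `.npz` files are zip archives of `.npy` members, a quick structural check can catch this early. The sketch below uses only the standard library; the helper name `looks_like_npz` is hypothetical.

```python
import zipfile
from pathlib import Path


def looks_like_npz(path):
    """Return True if `path` is a zip archive with at least one .npy member."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        return any(name.endswith(".npy") for name in zf.namelist())
```

One possible use is to call `looks_like_npz(npz_path)` right after the download cell and delete the file and retry if it returns `False`.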


## Import the dataset

In [85]:
data = np.load(npz_path, allow_pickle=True)

X_train = data["X_train"]    # (N_train, 490), flattened 10x49 MFCC features
y_train = data["y_train"]    # (N_train,)
X_val   = data["X_valid"]    # (N_val, 490)
y_val   = data["y_valid"]
X_test  = data["X_test"]     # (N_test, 490)
y_test  = data["y_test"]
label_names = data["label_names"]  # ['yes','no',...,'silence','unknown']

print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("X_val  :", X_val.shape,   "y_val  :", y_val.shape)
print("X_test :", X_test.shape,  "y_test :", y_test.shape)
print("labels:", label_names)

def print_label_stats(name, y):
    uniq, cnt = np.unique(y, return_counts=True)
    print(f"\n{name} label stats:")
    for u, c in zip(uniq, cnt):
        print(f"  idx={u:2d} ({label_names[u]:8s}): {c:6d}")

print_label_stats("Train", y_train)
print_label_stats("Val",   y_val)
print_label_stats("Test",  y_test)


X_train: (36769, 490) y_train: (36769,)
X_val  : (4503, 490) y_val  : (4503,)
X_test : (4874, 490) y_test : (4874,)
labels: ['yes' 'no' 'up' 'down' 'left' 'right' 'on' 'off' 'stop' 'go' 'silence'
 'unknown']

Train label stats:
  idx= 0 (yes     ):   3228
  idx= 1 (no      ):   3130
  idx= 2 (up      ):   2948
  idx= 3 (down    ):   3134
  idx= 4 (left    ):   3037
  idx= 5 (right   ):   3019
  idx= 6 (on      ):   3086
  idx= 7 (off     ):   2970
  idx= 8 (stop    ):   3111
  idx= 9 (go      ):   3106
  idx=10 (silence ):   3000
  idx=11 (unknown ):   3000

Val label stats:
  idx= 0 (yes     ):    397
  idx= 1 (no      ):    406
  idx= 2 (up      ):    350
  idx= 3 (down    ):    377
  idx= 4 (left    ):    352
  idx= 5 (right   ):    363
  idx= 6 (on      ):    363
  idx= 7 (off     ):    373
  idx= 8 (stop    ):    350
  idx= 9 (go      ):    372
  idx=10 (silence ):    400
  idx=11 (unknown ):    400

Test label stats:
  idx= 0 (yes     ):    419
  idx= 1 (no      ):    405
  idx= 

In [86]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

npz_path = "lab_new/data/kws_12cls_mfcc_10x49_quant_flat.npz"


data = np.load(npz_path, allow_pickle=True)

X_train = data["X_train"]   # (N_train, 490), int8
y_train = data["y_train"]   # (N_train,)
X_val   = data["X_valid"]   # (N_val, 490)
y_val   = data["y_valid"]
X_test  = data["X_test"]    # (N_test, 490)
y_test  = data["y_test"]

print("X_train shape:", X_train.shape, "dtype:", X_train.dtype)
print("X_val   shape:", X_val.shape,   "dtype:", X_val.dtype)
print("X_test  shape:", X_test.shape,  "dtype:", X_test.dtype)


class KWSDataset(Dataset):
    def __init__(self, X, y):

        self.X = torch.from_numpy(X).float()        
        self.y = torch.from_numpy(y).long()  

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


train_ds = KWSDataset(X_train, y_train)
val_ds   = KWSDataset(X_val,   y_val)
test_ds  = KWSDataset(X_test,  y_test)

batch_size = 128

train_loader = DataLoader(train_ds, batch_size=batch_size,
                          shuffle=True,  drop_last=False)
val_loader   = DataLoader(val_ds,   batch_size=batch_size,
                          shuffle=False, drop_last=False)
test_loader  = DataLoader(test_ds,  batch_size=batch_size,
                           shuffle=False, drop_last=False)

print("len(train_ds) =", len(train_ds))
print("len(val_ds)   =", len(val_ds))
print("len(test_ds)  =", len(test_ds))

for xb, yb in train_loader:
    print("batch X:", xb.shape, xb.dtype)  # torch.Size([B, 490]) torch.float32 (cast in KWSDataset)
    print("batch y:", yb.shape, yb.dtype)  # torch.Size([B])     torch.int64
    break


X_train shape: (36769, 490) dtype: int8
X_val   shape: (4503, 490) dtype: int8
X_test  shape: (4874, 490) dtype: int8
len(train_ds) = 36769
len(val_ds)   = 4503
len(test_ds)  = 4874
batch X: torch.Size([128, 490]) torch.float32
batch y: torch.Size([128]) torch.int64


## Define the Float Model

In [87]:
class FloatMLP(nn.Module):
    def __init__(self, num_classes=12, hidden_dim=256, dropout_p=0.3):
        super().__init__()
        self.in_features = 1 * 10 * 49
        self.net = nn.Sequential(
            nn.Linear(self.in_features, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        #x = x.view(x.size(0), -1)
        return self.net(x)
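
Each hidden block above pairs a linear layer with `BatchNorm1d`. In eval mode, batch normalisation uses its stored running statistics, so for each feature it reduces to a fixed scale and shift. The sketch below checks that equivalence for a single feature in plain Python. The function names are hypothetical, the numeric inputs are made up, and `eps=1e-5` is assumed to match the PyTorch default.

```python
import math


def batchnorm_eval(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Batch normalisation as applied at inference time (single feature)."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta


def batchnorm_as_affine(gamma, beta, running_mean, running_var, eps=1e-5):
    """Rewrite the same operation as y = scale * x + shift."""
    scale = gamma / math.sqrt(running_var + eps)
    shift = beta - running_mean * scale
    return scale, shift


# Made-up statistics for one feature, for illustration only.
scale, shift = batchnorm_as_affine(gamma=1.5, beta=0.2, running_mean=0.4, running_var=2.0)
x = 3.0
print(batchnorm_eval(x, 1.5, 0.2, 0.4, 2.0), scale * x + shift)
```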


## Define the Quantised Model

In [88]:
from brevitas.nn import QuantIdentity

class QuantMLPKWS_Dropout(nn.Module):
    def __init__(self, num_classes=12, hidden_dim=256, dropout_p=0.2,
                 w_bit=3, a_bit=3, in_bit=8):
        super().__init__()
        self.in_features = 1 * 10 * 49

        # self.input_quant = QuantIdentity(
        #     bit_width=in_bit,        # 8
        #     return_quant_tensor=False
        # )
        self.output_quant = QuantIdentity(
            bit_width=in_bit,       
            return_quant_tensor=False
        )

        # Layer 1: 490 -> 256
        self.fc1 = QuantLinear(
            in_features=self.in_features,
            out_features=hidden_dim,
            weight_bit_width=w_bit,   # W3
            bias=True,
            return_quant_tensor=False
        )
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.act1 = QuantReLU(
            bit_width=a_bit,          # A3
            return_quant_tensor=False
        )
        self.drop1 = nn.Dropout(p=dropout_p)

        # Layer 2: 256 -> 256
        self.fc2 = QuantLinear(
            in_features=hidden_dim,
            out_features=hidden_dim,
            weight_bit_width=w_bit,
            bias=True,
            return_quant_tensor=False
        )
        self.bn2 = nn.BatchNorm1d(hidden_dim)
        self.act2 = QuantReLU(
            bit_width=a_bit,
            return_quant_tensor=False
        )
        self.drop2 = nn.Dropout(p=dropout_p)

        # Layer 3: 256 -> 256
        self.fc3 = QuantLinear(
            in_features=hidden_dim,
            out_features=hidden_dim,
            weight_bit_width=w_bit,
            bias=True,
            return_quant_tensor=False
        )
        self.bn3 = nn.BatchNorm1d(hidden_dim)
        self.act3 = QuantReLU(
            bit_width=a_bit,
            return_quant_tensor=False
        )
        self.drop3 = nn.Dropout(p=dropout_p)

        # Output layer: 256 -> num_classes
        self.fc_out = QuantLinear(
            in_features=hidden_dim,
            out_features=num_classes,
            weight_bit_width=w_bit,
            bias=True,
            return_quant_tensor=False
        )
        self.flatten = nn.Flatten(start_dim=1)

    def forward(self, x):
        # x: (B, 490), already flattened in the dataset
        # x = self.input_quant(x)
        # x = self.flatten(x)



        x = self.fc1(x)
        x = self.bn1(x)
        x = self.act1(x)
        x = self.drop1(x)

        x = self.fc2(x)
        x = self.bn2(x)
        x = self.act2(x)
        x = self.drop2(x)

        x = self.fc3(x)
        x = self.bn3(x)
        x = self.act3(x)
        x = self.drop3(x)

        x = self.fc_out(x)
        x = self.output_quant(x)
        return x
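
The `QuantReLU` layers above restrict activations to a small set of non-negative levels. As a rough mental model, the sketch below applies a ReLU, clips to a fixed range, and rounds to `2**bit_width` uniformly spaced levels. This is a simplification: the fixed `max_val` and the helper name `quant_relu` are assumptions for illustration, and Brevitas handles the scale differently (it is typically a learned parameter).

```python
def quant_relu(x, bit_width, max_val):
    """ReLU followed by uniform quantisation onto [0, max_val].

    Returns the integer code and the dequantised value.
    Assumption: fixed range `max_val` instead of a learned scale.
    """
    steps = 2 ** bit_width - 1            # 7 steps, i.e. 8 levels, for 3 bits
    scale = max_val / steps
    clipped = min(max(x, 0.0), max_val)   # ReLU, then clip at the top of the range
    code = round(clipped / scale)         # integer code in [0, steps]
    return code, code * scale


for x in (-1.0, 0.2, 2.4, 3.6, 100.0):
    print(x, "->", quant_relu(x, bit_width=3, max_val=7.0))
```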




## Training functions

In [89]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class Trainer:
    def __init__(
        self,
        model: nn.Module,
        train_loader: DataLoader,
        val_loader: DataLoader = None,
        test_loader: DataLoader = None,
        device: torch.device = None,
        # --- Hyperparameters ---
        lr: float = 3e-4,
        weight_decay: float = 1e-4,
        batch_size: int = 64,
        num_epochs: int = 100,
        scheduler_factor: float = 0.5,
        scheduler_patience: int = 3,
        optimizer_cls=torch.optim.Adam,
        criterion: nn.Module = None,
    ):
        """
        A simple training framework for classification tasks.

        Args:
            model: Neural network model (nn.Module)
            train_loader: DataLoader for training set
            val_loader: DataLoader for validation set
            test_loader: DataLoader for test set (optional)
            device: torch.device (if None, automatically selects cuda or cpu)
            lr: Learning rate
            weight_decay: Weight decay (L2 regularization)
            batch_size: Batch size (for reference or logging)
            num_epochs: Number of training epochs
            scheduler_factor: Factor by which LR is reduced (ReduceLROnPlateau)
            scheduler_patience: Number of epochs with no improvement before LR reduction
            optimizer_cls: Optimizer class (e.g., Adam, SGD)
            criterion: Loss function (default: CrossEntropyLoss)
        """
        self.device = device or torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = model.to(self.device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.test_loader = test_loader


        # --- Save hyperparameters ---
        self.lr = lr
        self.weight_decay = weight_decay
        self.batch_size = batch_size
        self.num_epochs = num_epochs
        self.scheduler_factor = scheduler_factor
        self.scheduler_patience = scheduler_patience

        # --- Training components ---
        self.criterion = criterion or nn.CrossEntropyLoss()
        self.optimizer = optimizer_cls(self.model.parameters(), lr=lr, weight_decay=weight_decay)

        # Scheduler triggered by validation accuracy
        if self.val_loader is not None:
            self.scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                self.optimizer,
                mode="max",
                factor=scheduler_factor,
                patience=scheduler_patience,
            )
        else:
            self.scheduler = None

        # --- Bookkeeping ---
        self.best_val_acc = 0.0
        self.best_state_dict = None
        self.history = {
            "train_loss": [],
            "train_acc": [],
            "val_loss": [],
            "val_acc": [],
        }




    def _run_one_epoch(self, loader, train: bool = True):
        """
        Run one epoch of training or evaluation.
        """
        if train:
            self.model.train()
        else:
            self.model.eval()

        total_loss = 0.0
        total_correct = 0
        total_samples = 0

        context = torch.enable_grad() if train else torch.no_grad()
        with context:
            for X, y in loader:
                X, y = X.to(self.device), y.to(self.device)
                if train:
                    self.optimizer.zero_grad()

                logits = self.model(X)
                loss = self.criterion(logits, y)

                if train:
                    loss.backward()
                    self.optimizer.step()

                total_loss += loss.item() * X.size(0)
                preds = logits.argmax(dim=1)
                total_correct += (preds == y).sum().item()
                total_samples += X.size(0)

        avg_loss = total_loss / total_samples
        acc = total_correct / total_samples
        return avg_loss, acc

    def train(self):
        """
        Main training loop.
        Tracks and reports both training and validation performance.
        """
        for epoch in range(1, self.num_epochs + 1):
            train_loss, train_acc = self._run_one_epoch(self.train_loader, train=True)

            if self.val_loader is not None:
                val_loss, val_acc = self._run_one_epoch(self.val_loader, train=False)

                # Step the LR scheduler based on validation accuracy
                if self.scheduler is not None:
                    self.scheduler.step(val_acc)

                # Track best model
                if val_acc > self.best_val_acc:
                    self.best_val_acc = val_acc
                    self.best_state_dict = {
                        k: v.cpu().clone() for k, v in self.model.state_dict().items()
                    }

                # Log metrics
                self.history["train_loss"].append(train_loss)
                self.history["train_acc"].append(train_acc)
                self.history["val_loss"].append(val_loss)
                self.history["val_acc"].append(val_acc)

                print(
                    f"Epoch {epoch:02d}/{self.num_epochs} | "
                    f"Train Loss={train_loss:.4f}, Train Acc={train_acc*100:5.2f}% | "
                    f"Val Loss={val_loss:.4f}, Val Acc={val_acc*100:5.2f}%"
                )
            else:
                # No validation set
                self.history["train_loss"].append(train_loss)
                self.history["train_acc"].append(train_acc)
                print(
                    f"Epoch {epoch:02d}/{self.num_epochs} | "
                    f"Train Loss={train_loss:.4f}, Train Acc={train_acc*100:5.2f}%"
                )

        if self.val_loader is not None:
            print(f"\n[INFO] Best Validation Accuracy = {self.best_val_acc*100:.2f}%")

    def load_best_model(self):
        """
        Restore the best-performing model parameters (based on validation accuracy).
        """
        if self.best_state_dict is not None:
            self.model.load_state_dict(self.best_state_dict)
            self.model.to(self.device)
        else:
            print("[WARN] No best_state_dict found. Ensure validation was used during training.")

    def test(self, test_loader: DataLoader = None):
        """
        Evaluate the model on the test set.
        Automatically loads the best checkpoint if available.
        """
        loader = test_loader or self.test_loader
        if loader is None:
            raise ValueError("No test_loader provided.")

        # Use the best model checkpoint if available
        if self.best_state_dict is not None:
            self.load_best_model()

        test_loss, test_acc = self._run_one_epoch(loader, train=False)
        print(f"[TEST] Loss={test_loss:.4f}, Accuracy={test_acc*100:5.2f}%")
        return test_loss, test_acc
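
The `Trainer` steps a `ReduceLROnPlateau` scheduler in `mode="max"` on the validation accuracy. To see when the learning rate actually changes, the following sketch reproduces the core rule in plain Python: the rate is multiplied by `factor` once the metric has failed to improve for more than `patience` consecutive epochs. It is a simplified model that ignores the scheduler's `threshold`, `cooldown`, and `min_lr` options; the function name and the accuracy sequence are made up for illustration.

```python
def simulate_plateau_lr(val_accs, lr=3e-4, factor=0.5, patience=3):
    """Return the learning rate in effect after each epoch (mode='max')."""
    best = float("-inf")
    bad_epochs = 0
    lrs = []
    for acc in val_accs:
        if acc > best:
            best = acc
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:   # more than `patience` epochs without improvement
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs


# Hypothetical validation accuracies: improvement stalls after epoch 2.
history = [0.70, 0.75, 0.74, 0.74, 0.73, 0.72, 0.76]
print(simulate_plateau_lr(history))
```

With `patience=3`, the reduction happens on the fourth non-improving epoch, not the third.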


## Train the float model

In [90]:
model = FloatMLP(num_classes=12, hidden_dim=256, dropout_p=0.3).to(device)

trainer = Trainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
    device=device,
    lr=3e-4,
    weight_decay=1e-4,
    batch_size=256,  # stored for reference only; the DataLoaders above use batch_size=128
    num_epochs=20,
    scheduler_factor=0.5,
    scheduler_patience=3,
)

# Train and evaluate
trainer.train()
trainer.test()


Epoch 01/20 | Train Loss=1.6770, Train Acc=43.44% | Val Loss=1.1412, Val Acc=64.11%
Epoch 02/20 | Train Loss=1.1349, Train Acc=61.52% | Val Loss=0.9226, Val Acc=69.53%
Epoch 03/20 | Train Loss=0.9879, Train Acc=66.53% | Val Loss=0.8114, Val Acc=72.66%
Epoch 04/20 | Train Loss=0.9042, Train Acc=69.29% | Val Loss=0.7529, Val Acc=74.33%
Epoch 05/20 | Train Loss=0.8477, Train Acc=71.42% | Val Loss=0.7280, Val Acc=75.02%
Epoch 06/20 | Train Loss=0.8011, Train Acc=72.83% | Val Loss=0.7397, Val Acc=75.53%
Epoch 07/20 | Train Loss=0.7701, Train Acc=73.78% | Val Loss=0.6482, Val Acc=78.04%
Epoch 08/20 | Train Loss=0.7330, Train Acc=75.19% | Val Loss=0.6610, Val Acc=77.39%
Epoch 09/20 | Train Loss=0.7151, Train Acc=75.57% | Val Loss=0.6254, Val Acc=78.97%
Epoch 10/20 | Train Loss=0.6865, Train Acc=76.58% | Val Loss=0.6245, Val Acc=78.68%
Epoch 11/20 | Train Loss=0.6680, Train Acc=77.21% | Val Loss=0.5929, Val Acc=79.68%
Epoch 12/20 | Train Loss=0.6510, Train Acc=77.82% | Val Loss=0.6061, Val Acc

(0.5603775709048128, 0.8057037340993024)
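
The tuple above is the `(test_loss, test_acc)` pair returned by `trainer.test()`. Both are per-sample averages: `_run_one_epoch` multiplies each batch loss by `X.size(0)` before summing, because the last batch is usually smaller than the others. The sketch below shows why that weighting matters, using made-up batch losses; the helper name is hypothetical.

```python
def sample_weighted_mean(batch_means, batch_sizes):
    """Average per-batch means, weighting each batch by its sample count."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# With 36769 training samples and batch_size=128, the last batch is smaller.
full_batches, last_batch = divmod(36769, 128)
print("full batches:", full_batches, "last batch size:", last_batch)

# Made-up losses: a large batch and a small final batch.
means, sizes = [0.5, 1.0], [128, 32]
print("weighted mean:", sample_weighted_mean(means, sizes))
print("naive mean   :", sum(means) / len(means))
```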

## Train the quantised model

In [91]:
model = QuantMLPKWS_Dropout(
    num_classes=12,
    hidden_dim=256,
    dropout_p=0.3,
    w_bit=3, a_bit=3
).to(device)

trainer = Trainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
    device=device,
    lr=3e-4,
    weight_decay=1e-4,
    batch_size=256,  # stored for reference only; the DataLoaders above use batch_size=128
    num_epochs=20,
    scheduler_factor=0.5,
    scheduler_patience=3,
)

# Train and evaluate
trainer.train()
trainer.test()

# save weights
weight_dir = root_path / "weights"
weight_dir.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), weight_dir / "mlpw3a3_model_weights.pth")


Epoch 01/20 | Train Loss=1.7366, Train Acc=41.27% | Val Loss=1.2414, Val Acc=59.92%
Epoch 02/20 | Train Loss=1.2003, Train Acc=58.95% | Val Loss=0.9809, Val Acc=67.82%
Epoch 03/20 | Train Loss=1.0486, Train Acc=64.15% | Val Loss=0.9842, Val Acc=67.09%
Epoch 04/20 | Train Loss=0.9683, Train Acc=67.09% | Val Loss=0.8583, Val Acc=70.89%
Epoch 05/20 | Train Loss=0.9207, Train Acc=68.47% | Val Loss=0.8006, Val Acc=73.77%
Epoch 06/20 | Train Loss=0.8783, Train Acc=70.15% | Val Loss=0.7681, Val Acc=74.08%
Epoch 07/20 | Train Loss=0.8397, Train Acc=71.62% | Val Loss=0.7925, Val Acc=72.88%
Epoch 08/20 | Train Loss=0.8119, Train Acc=72.45% | Val Loss=0.7400, Val Acc=75.35%
Epoch 09/20 | Train Loss=0.7967, Train Acc=73.03% | Val Loss=0.7306, Val Acc=75.37%
Epoch 10/20 | Train Loss=0.7737, Train Acc=73.69% | Val Loss=0.7371, Val Acc=74.99%
Epoch 11/20 | Train Loss=0.7536, Train Acc=74.29% | Val Loss=0.7023, Val Acc=76.37%
Epoch 12/20 | Train Loss=0.7309, Train Acc=75.05% | Val Loss=0.7559, Val Acc

## Export your weights

In [92]:
weight_dir = root_path / "weights"
weight_dir.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), weight_dir / "mlpw3a3_model_weights.pth")

## Now try a 4-bit model

In [93]:
model = QuantMLPKWS_Dropout(
    num_classes=12,
    hidden_dim=256,
    dropout_p=0.3,
    w_bit=4, a_bit=4
).to(device)

trainer = Trainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
    device=device,
    lr=3e-4,
    weight_decay=1e-4,
    batch_size=256,  # stored for reference only; the DataLoaders above use batch_size=128
    num_epochs=20,
    scheduler_factor=0.5,
    scheduler_patience=3,
)

# Train and evaluate
trainer.train()
trainer.test()

# save weights
weight_dir = root_path / "weights"
weight_dir.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), weight_dir / "mlpw4a4_model_weights.pth")


Epoch 01/20 | Train Loss=1.7023, Train Acc=42.18% | Val Loss=1.2132, Val Acc=59.63%
Epoch 02/20 | Train Loss=1.1455, Train Acc=61.32% | Val Loss=0.9586, Val Acc=67.95%
Epoch 03/20 | Train Loss=0.9955, Train Acc=66.24% | Val Loss=0.8188, Val Acc=72.73%
Epoch 04/20 | Train Loss=0.9169, Train Acc=68.84% | Val Loss=0.7961, Val Acc=73.68%
Epoch 05/20 | Train Loss=0.8714, Train Acc=70.32% | Val Loss=0.7226, Val Acc=75.35%
Epoch 06/20 | Train Loss=0.8242, Train Acc=72.20% | Val Loss=0.7400, Val Acc=74.28%
Epoch 07/20 | Train Loss=0.7923, Train Acc=73.11% | Val Loss=0.6918, Val Acc=76.24%
Epoch 08/20 | Train Loss=0.7597, Train Acc=74.35% | Val Loss=0.6856, Val Acc=76.59%
Epoch 09/20 | Train Loss=0.7314, Train Acc=75.32% | Val Loss=0.6569, Val Acc=77.95%
Epoch 10/20 | Train Loss=0.7196, Train Acc=75.53% | Val Loss=0.6616, Val Acc=77.13%
Epoch 11/20 | Train Loss=0.6933, Train Acc=76.42% | Val Loss=0.6608, Val Acc=77.22%
Epoch 12/20 | Train Loss=0.6808, Train Acc=76.97% | Val Loss=0.6615, Val Acc

In [94]:
weight_dir = root_path / "weights"
weight_dir.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), weight_dir / "mlpw4a4_model_weights.pth")

## Export your ONNX graph

In [95]:
import torch
from pathlib import Path
from brevitas.export import export_qonnx
from qonnx.util.cleanup import cleanup as qonnx_cleanup
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.core.datatype import DataType
model = QuantMLPKWS_Dropout(
    num_classes=12,
    hidden_dim=256,
    dropout_p=0.3,
    w_bit=3,
    a_bit=3
).cpu()

# --- 1. Load trained weights ---
weight_dir = root_path / "weights"
state_dict = torch.load(weight_dir / "mlpw3a3_model_weights.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # Always switch to eval mode before export

# --- 2. Prepare dummy input ---
# The dummy input shape must match the model's expected input (B, 490)
dummy_input = torch.randn(1, 490)

# --- 3. Export to QONNX (for FINN / FPGA deployment) ---
export_path = str(root_path / "exports" / "kws_mlp_w3a3_qonnx.onnx")

# Ensure the export directory exists
export_dir = Path(export_path).parent
export_dir.mkdir(parents=True, exist_ok=True)


with torch.no_grad():
    export_qonnx(
        model,
        args=dummy_input,  # sometimes use input_t=dummy_input depending on brevitas version
        export_path=export_path
    )

# clean-up
qonnx_cleanup(export_path, out_file=export_path)

# Setting the input datatype explicitly because it doesn't get derived from the export function
model = ModelWrapper(export_path)
model.set_tensor_datatype(model.graph.input[0].name, DataType["INT8"])
model.set_tensor_datatype(model.graph.output[0].name, DataType["INT8"])
model.save(export_path)

print("QONNX model successfully exported to:", export_path)


QONNX model successfully exported to: lab_new/exports/kws_mlp_w3a3_qonnx.onnx
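
The export cell annotates the graph input as `INT8`, so the features fed to the deployed model are expected to be whole numbers in the signed 8-bit range. A small sanity check on any input vector can be sketched as follows; the helper name `fits_signed_int` is hypothetical and the check is independent of FINN or QONNX.

```python
def fits_signed_int(values, bit_width=8):
    """Return True if every value is a whole number representable as a signed integer."""
    lo = -(2 ** (bit_width - 1))        # -128 for 8 bits
    hi = 2 ** (bit_width - 1) - 1       #  127 for 8 bits
    return all(float(v).is_integer() and lo <= v <= hi for v in values)


print(fits_signed_int([-128, 0, 127]))   # values inside the INT8 range
print(fits_signed_int([0.5, 3]))         # contains a fractional value
print(fits_signed_int([128]))            # one past the INT8 maximum
```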


## Export the 4-bit model

In [96]:
import torch
from pathlib import Path
from brevitas.export import export_qonnx
from qonnx.util.cleanup import cleanup as qonnx_cleanup
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.core.datatype import DataType
model = QuantMLPKWS_Dropout(
    num_classes=12,
    hidden_dim=256,
    dropout_p=0.3,
    w_bit=4,
    a_bit=4
).cpu()

# --- 1. Load trained weights ---
weight_dir = root_path / "weights"
state_dict = torch.load(weight_dir / "mlpw4a4_model_weights.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # Always switch to eval mode before export

# --- 2. Prepare dummy input ---
dummy_input = torch.randn(1, 490)

# --- 3. Export to QONNX (for FINN / FPGA deployment) ---
export_path = str(root_path / "exports" / "kws_mlp_w4a4_qonnx.onnx")

# Ensure the export directory exists
export_dir = Path(export_path).parent
export_dir.mkdir(parents=True, exist_ok=True)


with torch.no_grad():
    export_qonnx(
        model,
        args=dummy_input,  # sometimes use input_t=dummy_input depending on brevitas version
        export_path=export_path
    )

# clean-up
qonnx_cleanup(export_path, out_file=export_path)

# Setting the input datatype explicitly because it doesn't get derived from the export function
model = ModelWrapper(export_path)
model.set_tensor_datatype(model.graph.input[0].name, DataType["INT8"])
model.set_tensor_datatype(model.graph.output[0].name, DataType["INT8"])
model.save(export_path)

print("QONNX model successfully exported to:", export_path)


QONNX model successfully exported to: lab_new/exports/kws_mlp_w4a4_qonnx.onnx


## Try other bit widths and answer:
- Q1: What is the bit width of the float DNN model?
- Q2: What are the differences between the float model and the quantised model?
- Q3: Try different weight and activation bit widths. What do you find? Which matters more, the weight bit width or the activation bit width?
- Q4: What is the accuracy vs. bit-width trade-off here? In practice, how would you make the decision?
- Q5 (optional): Fine-tune the hyperparameters. Can you push beyond the accuracy vs. bit-width frontier (illustrate with an accuracy vs. bit-width curve)?
- Q6 (optional): Considering other model compression strategies, can you push beyond the accuracy vs. model-size frontier (illustrate with an accuracy vs. model-size curve)?
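
When reasoning about the model-size side of these trade-offs, a back-of-the-envelope estimate of weight storage can help. The sketch below counts only the weights of the four fully connected layers (490 -> 256 -> 256 -> 256 -> 12) and ignores biases, batch-norm parameters, and any packing overhead; the 32-bit row assumes weights stored as PyTorch's default `float32`. The helper name is hypothetical.

```python
def weight_storage_bytes(layer_sizes, bit_width):
    """Bytes needed to store all fully connected weights at a given bit width."""
    num_weights = sum(i * o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))
    return num_weights * bit_width / 8


layers = [490, 256, 256, 256, 12]
for bits in (32, 8, 4, 3):
    kib = weight_storage_bytes(layers, bits) / 1024
    print(f"{bits:2d}-bit weights: {kib:8.1f} KiB")
```

Plotting test accuracy against these sizes for each trained configuration gives one possible form of the accuracy vs. model-size curve asked for in Q6.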


In [97]:
your_weight_bitwidth = 4      # Example: 4-bit weights
your_activation_bitwidth = 2  # Example: 2-bit activations


model = QuantMLPKWS_Dropout(
    num_classes=12,
    hidden_dim=256,
    dropout_p=0.3,
    w_bit=your_weight_bitwidth, a_bit=your_activation_bitwidth
).to(device)

trainer = Trainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
    device=device,
    lr=3e-4,
    weight_decay=1e-4,
    batch_size=256,  # stored for reference only; the DataLoaders above use batch_size=128
    num_epochs=20,
    scheduler_factor=0.5,
    scheduler_patience=3,
)

# Train and evaluate
# trainer.train()
# trainer.test()
# Then reuse the export code above to export the model
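
To compare several configurations systematically, the experiment can be wrapped in a loop over weight and activation bit widths. The sketch below is one possible structure, not part of the lab code: `train_and_eval` is a hypothetical callable that you would write yourself (for example, building `QuantMLPKWS_Dropout`, running `Trainer.train()`, and returning the accuracy from `Trainer.test()`), and the default bit-width grids are arbitrary choices.

```python
from itertools import product


def sweep_bit_widths(train_and_eval, w_bits=(2, 3, 4, 8), a_bits=(2, 3, 4, 8)):
    """Run `train_and_eval(w_bit, a_bit)` for every combination.

    Returns a list of result dicts sorted by test accuracy, best first.
    """
    results = []
    for w_bit, a_bit in product(w_bits, a_bits):
        test_acc = train_and_eval(w_bit, a_bit)
        results.append({"w_bit": w_bit, "a_bit": a_bit, "test_acc": test_acc})
    return sorted(results, key=lambda r: r["test_acc"], reverse=True)
```

Keeping the results as a list of dicts makes it straightforward to tabulate them or plot accuracy against bit width afterwards.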


In [None]:
# Download the deployment files for Lab5 C
link = 'https://drive.google.com/file/d/12297dQhuYW0E9R6JS-q2FL9bTC5eJYez/view?usp=sharing'

from pathlib import Path
import sys, subprocess
deploy_path = root_path / "deploy" / "Lab5C_onboard.zip"
deploy_path.parent.mkdir(parents=True, exist_ok=True)

def _ensure_gdown():
    try:
        import gdown
        return gdown
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "gdown"])
        import gdown
        return gdown

if not deploy_path.exists():
    gdown = _ensure_gdown()
    out = gdown.download(url=link, output=str(deploy_path), quiet=False, fuzzy=True)
    if not out or not Path(out).exists():
        raise RuntimeError(f"Download failed: {deploy_path}")
    print(f"Downloaded to: {deploy_path}")
else:
    print(f"File already exists: {deploy_path}")

Defaulting to user installation because normal site-packages is not writeable


KeyboardInterrupt: 
