# 1. Data & Model Loading

This notebook prepares the data and models used by the subsequent optimisation pipeline. It emulates the training and evaluation of an uncompressed model: the model is adapted to a specific dataset, then exported for later compression and embedded deployment.

The process is as follows:
* A Torch dataset (already split into train and val) and a model are loaded. Both must target classification tasks, but they are agnostic of the modality.
* The model's classification head is adapted to the number of classes in the dataset and trained on the training set with the backbone frozen, then evaluated on the validation set.
* The whole model (backbone + classification head) is then fine-tuned on the training set with all layers unfrozen, and evaluated on the validation set again.
* The adapted model is then exported as a Torch model for later use in the optimisation pipeline.
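
The two training phases described above can be sketched in plain PyTorch. This is a minimal illustration, not the implementation of `adapt_model_head_to_dataset`: `TinyClassifier`, `replace_head` and `set_backbone_trainable` are hypothetical names, and the tiny model is only a stand-in that mirrors the `features` / `classifier` layout of torchvision's MobileNetV2.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in with a `features` backbone and a `classifier` head."""

    def __init__(self, num_classes: int = 1000) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(8),
            nn.ReLU6(inplace=True),
        )
        self.classifier = nn.Sequential(nn.Dropout(0.2), nn.Linear(8, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = x.mean(dim=(2, 3))  # global average pooling over height and width
        return self.classifier(x)


def replace_head(model: nn.Module, num_classes: int) -> None:
    """Swap the final linear layer for one sized to the target dataset."""
    old_head = model.classifier[1]
    model.classifier[1] = nn.Linear(old_head.in_features, num_classes)


def set_backbone_trainable(model: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every backbone parameter."""
    for param in model.features.parameters():
        param.requires_grad = trainable


def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = TinyClassifier()
replace_head(model, num_classes=10)

# Phase 1: train the new head only, with the backbone frozen.
set_backbone_trainable(model, False)
head_optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
phase1_trainable = count_trainable(model)

# Phase 2: unfreeze everything and fine-tune at a lower learning rate.
set_backbone_trainable(model, True)
full_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
phase2_trainable = count_trainable(model)

logits = model(torch.randn(2, 3, 32, 32))
print(phase1_trainable, phase2_trainable, tuple(logits.shape))
```

Setting `requires_grad = False` only stops gradient updates. Batch-norm running statistics in the backbone still change while the model is in training mode, so a stricter freeze would also put those layers in eval mode.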

A MobileNetV2 image model, with its classification head adapted to the CIFAR-10 dataset, is used as the example in this notebook.

## Setup

In [1]:
import torch
import torchvision

from nnopt.model.train import adapt_model_head_to_dataset
from nnopt.model.eval import eval_model
from nnopt.model.const import DEVICE, DTYPE
from nnopt.recipes.mobilenetv2_cifar10 import get_cifar10_datasets, save_mobilenetv2_cifar10_model

2025-06-11 14:08:11,693 - nnopt.recipes.mobilenetv2_cifar10 - INFO - Using device: cuda, dtype: torch.bfloat16


# MobileNetV2 and CIFAR-10 adaptation

In [2]:
mobilenetv2 = torchvision.models.mobilenet_v2(
    weights=torchvision.models.MobileNet_V2_Weights.DEFAULT
)
cifar10_train_dataset, cifar10_val_dataset, cifar10_test_dataset = get_cifar10_datasets()

# Adapt the MobileNetV2 model to CIFAR-10 dataset
mobilenetv2_cifar10_baseline = adapt_model_head_to_dataset(
    model=mobilenetv2,
    num_classes=10,  # CIFAR-10 has 10 classes
    train_dataset=cifar10_train_dataset,
    val_dataset=cifar10_val_dataset,
    batch_size=64,  # Adjust batch size as needed
    head_train_epochs=5,  # Train head for fewer epochs
    fine_tune_epochs=3,  # Fine-tune for fewer epochs
    optimizer_cls=torch.optim.Adam,  # Use Adam optimizer
    head_train_lr=0.001,  # Learning rate for head training
    fine_tune_lr=0.0001,  # Learning rate for fine-tuning
    use_amp=True,  # Use mixed precision training
    device=DEVICE,
    dtype=DTYPE
)

2025-06-11 14:08:11,958 - nnopt.recipes.mobilenetv2_cifar10 - INFO - Loading existing training and validation datasets...
2025-06-11 14:08:16,984 - nnopt.recipes.mobilenetv2_cifar10 - INFO - Loading existing test dataset...
2025-06-11 14:08:17,552 - nnopt.model.train - INFO - Training head of the model with backbone frozen...
Epoch 1/5 [Training]: 100%|██████████| 704/704 [00:38<00:00, 18.39it/s, acc=0.4703, cpu=4.7%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=39.0%, loss=1.4604, ram=10.6/30.9GB (46.2%), samples/s=343.6]  
Epoch 1/5 [Validation]: 100%|██████████| 79/79 [00:01<00:00, 41.51it/s, acc=0.6534, cpu=3.7%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=37.0%, loss=1.2511, ram=10.6/30.9GB (46.0%), samples/s=1364.8]  


Epoch 1/5, Train Loss: 1.5474, Train Acc: 0.4703, Train Throughput: 3663.66 samples/s | Val Loss: 1.0272, Val Acc: 0.6534, Val Throughput: 8299.74 samples/s | CPU Usage: 10.90% | RAM Usage: 10.4/30.9GB (45.4%) | GPU 0 Util: 37.00% | GPU 0 Mem: 16.0/24.0GB (66.7%)


Epoch 2/5 [Training]: 100%|██████████| 704/704 [00:35<00:00, 19.86it/s, acc=0.5236, cpu=3.1%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=42.0%, loss=1.0861, ram=10.6/30.9GB (46.1%), samples/s=1045.5] 
Epoch 2/5 [Validation]: 100%|██████████| 79/79 [00:01<00:00, 40.70it/s, acc=0.6734, cpu=0.0%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=39.0%, loss=1.3651, ram=10.6/30.9GB (46.2%), samples/s=1357.5]  


Epoch 2/5, Train Loss: 1.3628, Train Acc: 0.5236, Train Throughput: 3947.83 samples/s | Val Loss: 0.9437, Val Acc: 0.6734, Val Throughput: 8042.15 samples/s | CPU Usage: 10.20% | RAM Usage: 10.4/30.9GB (45.4%) | GPU 0 Util: 39.00% | GPU 0 Mem: 16.0/24.0GB (66.7%)


Epoch 3/5 [Training]: 100%|██████████| 704/704 [00:35<00:00, 19.75it/s, acc=0.5315, cpu=2.7%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=41.0%, loss=1.0953, ram=10.6/30.9GB (46.1%), samples/s=948.5]  
Epoch 3/5 [Validation]: 100%|██████████| 79/79 [00:01<00:00, 41.51it/s, acc=0.6868, cpu=7.4%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=36.0%, loss=1.0769, ram=10.6/30.9GB (46.2%), samples/s=1398.0]  


Epoch 3/5, Train Loss: 1.3422, Train Acc: 0.5315, Train Throughput: 3870.97 samples/s | Val Loss: 0.9093, Val Acc: 0.6868, Val Throughput: 8096.54 samples/s | CPU Usage: 12.00% | RAM Usage: 10.4/30.9GB (45.4%) | GPU 0 Util: 36.00% | GPU 0 Mem: 16.0/24.0GB (66.7%)


Epoch 4/5 [Training]: 100%|██████████| 704/704 [00:35<00:00, 19.73it/s, acc=0.5299, cpu=3.0%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=38.0%, loss=1.5663, ram=10.6/30.9GB (46.1%), samples/s=983.1]  
Epoch 4/5 [Validation]: 100%|██████████| 79/79 [00:01<00:00, 40.36it/s, acc=0.6782, cpu=0.0%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=39.0%, loss=1.0909, ram=10.6/30.9GB (46.2%), samples/s=1322.9] 


Epoch 4/5, Train Loss: 1.3366, Train Acc: 0.5299, Train Throughput: 3882.57 samples/s | Val Loss: 0.9252, Val Acc: 0.6782, Val Throughput: 7854.60 samples/s | CPU Usage: 11.30% | RAM Usage: 10.4/30.9GB (45.3%) | GPU 0 Util: 39.00% | GPU 0 Mem: 16.0/24.0GB (66.7%)


Epoch 5/5 [Training]: 100%|██████████| 704/704 [00:35<00:00, 19.80it/s, acc=0.5319, cpu=3.0%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=36.0%, loss=1.3116, ram=10.6/30.9GB (46.2%), samples/s=1015.0] 
Epoch 5/5 [Validation]: 100%|██████████| 79/79 [00:01<00:00, 40.55it/s, acc=0.7010, cpu=6.7%, gpu_mem=16.0/24.0GB (66.7%), gpu_util=42.0%, loss=1.0541, ram=10.6/30.9GB (46.2%), samples/s=1338.8] 
2025-06-11 14:11:27,873 - nnopt.model.train - INFO - Fine-tuning full model...


Epoch 5/5, Train Loss: 1.3340, Train Acc: 0.5319, Train Throughput: 3791.25 samples/s | Val Loss: 0.8776, Val Acc: 0.7010, Val Throughput: 7838.73 samples/s | CPU Usage: 11.00% | RAM Usage: 10.4/30.9GB (45.4%) | GPU 0 Util: 42.00% | GPU 0 Mem: 16.0/24.0GB (66.7%)


Epoch 1/3 [Training]: 100%|██████████| 704/704 [00:36<00:00, 19.24it/s, acc=0.6450, cpu=4.6%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=67.0%, loss=1.3441, ram=10.7/30.9GB (46.3%), samples/s=158.5]  
Epoch 1/3 [Validation]: 100%|██████████| 79/79 [00:02<00:00, 39.26it/s, acc=0.8524, cpu=3.8%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=36.0%, loss=0.6023, ram=10.7/30.9GB (46.5%), samples/s=1416.9] 


Epoch 1/3, Train Loss: 1.0127, Train Acc: 0.6450, Train Throughput: 2024.18 samples/s | Val Loss: 0.4233, Val Acc: 0.8524, Val Throughput: 8589.38 samples/s | CPU Usage: 10.70% | RAM Usage: 10.4/30.9GB (45.6%) | GPU 0 Util: 36.00% | GPU 0 Mem: 18.5/24.0GB (77.0%)


Epoch 2/3 [Training]: 100%|██████████| 704/704 [00:36<00:00, 19.33it/s, acc=0.7209, cpu=3.2%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=46.0%, loss=0.9418, ram=10.7/30.9GB (46.5%), samples/s=474.8]  
Epoch 2/3 [Validation]: 100%|██████████| 79/79 [00:01<00:00, 40.72it/s, acc=0.8832, cpu=3.4%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=35.0%, loss=0.4247, ram=10.7/30.9GB (46.4%), samples/s=1326.5] 


Epoch 2/3, Train Loss: 0.7925, Train Acc: 0.7209, Train Throughput: 2029.49 samples/s | Val Loss: 0.3330, Val Acc: 0.8832, Val Throughput: 8807.78 samples/s | CPU Usage: 10.60% | RAM Usage: 10.4/30.9GB (45.6%) | GPU 0 Util: 34.00% | GPU 0 Mem: 18.5/24.0GB (77.0%)


Epoch 3/3 [Training]: 100%|██████████| 704/704 [00:35<00:00, 19.63it/s, acc=0.7544, cpu=3.2%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=62.0%, loss=1.2642, ram=10.7/30.9GB (46.3%), samples/s=476.2]  
Epoch 3/3 [Validation]: 100%|██████████| 79/79 [00:01<00:00, 40.39it/s, acc=0.8988, cpu=9.4%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=35.0%, loss=0.3132, ram=10.7/30.9GB (46.5%), samples/s=1263.4]  

Epoch 3/3, Train Loss: 0.7020, Train Acc: 0.7544, Train Throughput: 2041.14 samples/s | Val Loss: 0.2952, Val Acc: 0.8988, Val Throughput: 8442.85 samples/s | CPU Usage: 12.30% | RAM Usage: 10.4/30.9GB (45.6%) | GPU 0 Util: 35.00% | GPU 0 Mem: 18.5/24.0GB (77.0%)





In [3]:
# Evaluate the adapted model on the validation and test sets
val_metrics = eval_model(
    model=mobilenetv2_cifar10_baseline,
    test_dataset=cifar10_val_dataset,
    batch_size=64,  # Adjust batch size as needed
    device=DEVICE,
    use_amp=True,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
)

test_metrics = eval_model(
    model=mobilenetv2_cifar10_baseline,
    test_dataset=cifar10_test_dataset,
    batch_size=64,  # Adjust batch size as needed
    device=DEVICE,
    use_amp=True,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
)
print(f"Validation accuracy of the adapted MobileNetV2 on CIFAR-10: {val_metrics['accuracy']:.2f}")
print(f"Test accuracy of the adapted MobileNetV2 on CIFAR-10: {test_metrics['accuracy']:.2f}")

2025-06-11 14:13:22,689 - nnopt.model.eval - INFO - Starting evaluation on device: cuda, dtype: torch.bfloat16, batch size: 64
2025-06-11 14:13:22,693 - nnopt.model.eval - INFO - Starting warmup for 5 batches...
[Warmup]: 100%|██████████| 5/5 [00:00<00:00, 12.82it/s]
2025-06-11 14:13:23,170 - nnopt.model.eval - INFO - Warmup complete.
[Evaluation]: 100%|██████████| 79/79 [00:01<00:00, 40.67it/s, acc=0.8988, cpu=0.0%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=38.0%, loss=0.3132, ram=10.7/30.9GB (46.5%), samples/s=1359.7] 
2025-06-11 14:13:25,118 - nnopt.model.eval - INFO - Starting evaluation on device: cuda, dtype: torch.bfloat16, batch size: 64
2025-06-11 14:13:25,121 - nnopt.model.eval - INFO - Starting warmup for 5 batches...


Evaluation Complete: Avg Loss: 0.2952, Accuracy: 0.8988
Throughput: 8822.51 samples/sec | Avg Batch Time: 7.17 ms | Avg Sample Time: 0.11 ms
System Stats: CPU Usage: 10.90% | RAM Usage: 10.4/30.9GB (45.6%) | GPU 0 Util: 38.00% | GPU 0 Mem: 18.5/24.0GB (77.0%)


[Warmup]: 100%|██████████| 5/5 [00:00<00:00, 13.22it/s]
2025-06-11 14:13:25,585 - nnopt.model.eval - INFO - Warmup complete.
[Evaluation]: 100%|██████████| 157/157 [00:03<00:00, 42.22it/s, acc=0.9004, cpu=3.3%, gpu_mem=18.5/24.0GB (77.0%), gpu_util=40.0%, loss=0.0883, ram=10.7/30.9GB (46.6%), samples/s=629.3]  

Evaluation Complete: Avg Loss: 0.2923, Accuracy: 0.9004
Throughput: 8450.56 samples/sec | Avg Batch Time: 7.54 ms | Avg Sample Time: 0.12 ms
System Stats: CPU Usage: 12.10% | RAM Usage: 10.5/30.9GB (45.7%) | GPU 0 Util: 26.00% | GPU 0 Mem: 18.5/24.0GB (77.0%)
Validation accuracy of the adapted MobileNetV2 on CIFAR-10: 0.90
Test accuracy of the adapted MobileNetV2 on CIFAR-10: 0.90





In [4]:
# Export the adapted model
save_mobilenetv2_cifar10_model(
    model=mobilenetv2_cifar10_baseline,
    metrics_values={
        "val_metrics": val_metrics,
        "test_metrics": test_metrics,
    },
    version="mobilenetv2_cifar10/fp32/baseline",
)

2025-06-11 14:13:29,367 - nnopt.recipes.mobilenetv2_cifar10 - INFO - Metadata saved to /home/pbeuran/repos/nnopt/models/mobilenetv2_cifar10/fp32/baseline/metadata.json
2025-06-11 14:13:29,368 - nnopt.recipes.mobilenetv2_cifar10 - INFO - Model saved to /home/pbeuran/repos/nnopt/models/mobilenetv2_cifar10/fp32/baseline/model.pt


# Analysis

## GPU FP32

In [5]:
# Evaluate the adapted model on the validation and test sets on GPU
val_metrics = eval_model(
    model=mobilenetv2_cifar10_baseline,
    test_dataset=cifar10_val_dataset,
    batch_size=64,  # Adjust batch size as needed
    device="cuda",
    use_amp=False,
    dtype=torch.float32
)

test_metrics = eval_model(
    model=mobilenetv2_cifar10_baseline,
    test_dataset=cifar10_test_dataset,
    batch_size=64,  # Adjust batch size as needed
    device="cuda",
    use_amp=False,
    dtype=torch.float32
)

2025-06-11 14:13:29,373 - nnopt.model.eval - INFO - Starting evaluation on device: cuda, dtype: torch.float32, batch size: 64
2025-06-11 14:13:29,377 - nnopt.model.eval - INFO - Starting warmup for 5 batches...
[Warmup]: 100%|██████████| 5/5 [00:00<00:00, 10.75it/s]
2025-06-11 14:13:29,933 - nnopt.model.eval - INFO - Warmup complete.
[Evaluation]: 100%|██████████| 79/79 [00:01<00:00, 40.64it/s, acc=0.8976, cpu=6.2%, gpu_mem=19.1/24.0GB (79.4%), gpu_util=56.0%, loss=0.3326, ram=10.8/30.9GB (46.7%), samples/s=477.1]  
2025-06-11 14:13:31,882 - nnopt.model.eval - INFO - Starting evaluation on device: cuda, dtype: torch.float32, batch size: 64
2025-06-11 14:13:31,885 - nnopt.model.eval - INFO - Starting warmup for 5 batches...


Evaluation Complete: Avg Loss: 0.2991, Accuracy: 0.8976
Throughput: 5888.79 samples/sec | Avg Batch Time: 10.75 ms | Avg Sample Time: 0.17 ms
System Stats: CPU Usage: 11.40% | RAM Usage: 10.6/30.9GB (46.1%) | GPU 0 Util: 52.00% | GPU 0 Mem: 19.1/24.0GB (79.4%)


[Warmup]: 100%|██████████| 5/5 [00:00<00:00, 13.16it/s]
2025-06-11 14:13:32,356 - nnopt.model.eval - INFO - Warmup complete.
[Evaluation]: 100%|██████████| 157/157 [00:03<00:00, 42.65it/s, acc=0.8981, cpu=3.4%, gpu_mem=19.0/24.0GB (79.4%), gpu_util=53.0%, loss=0.0930, ram=10.8/30.9GB (46.7%), samples/s=1008.3] 

Evaluation Complete: Avg Loss: 0.2943, Accuracy: 0.8981
Throughput: 6029.95 samples/sec | Avg Batch Time: 10.56 ms | Avg Sample Time: 0.17 ms
System Stats: CPU Usage: 11.20% | RAM Usage: 10.6/30.9GB (46.1%) | GPU 0 Util: 53.00% | GPU 0 Mem: 19.0/24.0GB (79.4%)





In [6]:
# Print the val metrics
import yaml
print("- Validation Metrics:")
yaml_str = yaml.dump(val_metrics, default_flow_style=False)
print(yaml_str)

# Print the test metrics
print("- Test Metrics:")
yaml_str = yaml.dump(test_metrics, default_flow_style=False)
print(yaml_str)

- Validation Metrics:
accuracy: 0.8976
avg_loss: 0.2991045247554779
avg_time_per_batch: 0.010747729316652204
avg_time_per_sample: 0.00016981412320310482
params_stats:
  approx_memory_mb_for_params: 8.532264709472656
  bn_param_params: 34112
  float_bias_params: 10
  float_weight_params: 2202560
  int_weight_params: 0
  other_float_params: 0
  total_params: 2236682
samples_per_second: 5888.791704350515

- Test Metrics:
accuracy: 0.8981
avg_loss: 0.2943320981144905
avg_time_per_batch: 0.010562983795822151
avg_time_per_sample: 0.00016583884559440775
params_stats:
  approx_memory_mb_for_params: 8.532264709472656
  bn_param_params: 34112
  float_bias_params: 10
  float_weight_params: 2202560
  int_weight_params: 0
  other_float_params: 0
  total_params: 2236682
samples_per_second: 6029.950319635612
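
The reported figures can be cross-checked with a little arithmetic. The sketch below copies values from the validation metrics printed above. It assumes that `approx_memory_mb_for_params` counts 4 bytes per FP32 parameter and divides by 1024², and that `samples_per_second` is the reciprocal of `avg_time_per_sample`. Both are inferences from the numbers, not confirmed behaviour of `eval_model`.

```python
# Values copied from the GPU FP32 validation metrics above.
float_weight_params = 2_202_560
float_bias_params = 10
bn_param_params = 34_112
avg_time_per_sample = 0.00016981412320310482  # seconds

# The three parameter groups should add up to total_params.
total_params = float_weight_params + float_bias_params + bn_param_params

# Assumption: 4 bytes per FP32 parameter, reported in MiB (1024 * 1024 bytes).
BYTES_PER_FP32 = 4
memory_mb = total_params * BYTES_PER_FP32 / 1024**2

# Assumption: throughput is the reciprocal of the average time per sample.
samples_per_second = 1.0 / avg_time_per_sample

print(f"total_params:       {total_params}")
print(f"memory (MiB):       {memory_mb:.6f}")
print(f"samples per second: {samples_per_second:.2f}")
```

Under these assumptions the results agree with the reported `total_params` (2236682), `approx_memory_mb_for_params` (about 8.53) and `samples_per_second` (about 5888.79). The single group of 10 bias parameters matches a 10-class linear head.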



## CPU FP32

In [7]:
# Evaluate the adapted model on the validation and test sets on CPU
val_metrics = eval_model(
    model=mobilenetv2_cifar10_baseline,
    test_dataset=cifar10_val_dataset,
    batch_size=32,  # Adjust batch size as needed
    device="cpu",
    use_amp=False,
    dtype=torch.float32
)

test_metrics = eval_model(
    model=mobilenetv2_cifar10_baseline,
    test_dataset=cifar10_test_dataset,
    batch_size=32,  # Adjust batch size as needed
    device="cpu",
    use_amp=False,
    dtype=torch.float32
)

2025-06-11 14:13:36,081 - nnopt.model.eval - INFO - Starting evaluation on device: cpu, dtype: torch.float32, batch size: 32
2025-06-11 14:13:36,116 - nnopt.model.eval - INFO - Starting warmup for 5 batches...
[Warmup]: 100%|██████████| 5/5 [00:02<00:00,  2.02it/s]
2025-06-11 14:13:38,684 - nnopt.model.eval - INFO - Warmup complete.
[Evaluation]: 100%|██████████| 157/157 [01:06<00:00,  2.35it/s, acc=0.8970, cpu=47.4%, loss=0.3332, ram=10.9/30.9GB (48.1%), samples/s=169.1]
2025-06-11 14:14:45,357 - nnopt.model.eval - INFO - Starting evaluation on device: cpu, dtype: torch.float32, batch size: 32
2025-06-11 14:14:45,360 - nnopt.model.eval - INFO - Starting warmup for 5 batches...


Evaluation Complete: Avg Loss: 0.2994, Accuracy: 0.8970
Throughput: 77.05 samples/sec | Avg Batch Time: 413.32 ms | Avg Sample Time: 12.98 ms
System Stats: CPU Usage: 15.50% | RAM Usage: 10.7/30.9GB (47.5%)


[Warmup]: 100%|██████████| 5/5 [00:02<00:00,  1.75it/s]
2025-06-11 14:14:48,317 - nnopt.model.eval - INFO - Warmup complete.
[Evaluation]: 100%|██████████| 313/313 [02:30<00:00,  2.08it/s, acc=0.8979, cpu=48.7%, loss=0.0928, ram=10.8/30.9GB (47.5%), samples/s=68.0]

Evaluation Complete: Avg Loss: 0.2946, Accuracy: 0.8979
Throughput: 67.92 samples/sec | Avg Batch Time: 470.42 ms | Avg Sample Time: 14.72 ms
System Stats: CPU Usage: 34.50% | RAM Usage: 10.6/30.9GB (46.8%)





In [8]:
# Print the val metrics
import yaml
print("- Validation Metrics:")
yaml_str = yaml.dump(val_metrics, default_flow_style=False)
print(yaml_str)

# Print the test metrics
print("- Test Metrics:")
yaml_str = yaml.dump(test_metrics, default_flow_style=False)
print(yaml_str)

- Validation Metrics:
accuracy: 0.897
avg_loss: 0.2993665291070938
avg_time_per_batch: 0.41331771221001684
avg_time_per_sample: 0.012978176163394528
params_stats:
  approx_memory_mb_for_params: 8.532264709472656
  bn_param_params: 34112
  float_bias_params: 10
  float_weight_params: 2202560
  int_weight_params: 0
  other_float_params: 0
  total_params: 2236682
samples_per_second: 77.0524292019198

- Test Metrics:
accuracy: 0.8979
avg_loss: 0.2945516872644424
avg_time_per_batch: 0.4704234021278193
avg_time_per_sample: 0.014724252486600744
params_stats:
  approx_memory_mb_for_params: 8.532264709472656
  bn_param_params: 34112
  float_bias_params: 10
  float_weight_params: 2202560
  int_weight_params: 0
  other_float_params: 0
  total_params: 2236682
samples_per_second: 67.91516247836776



## Conclusions

* Accuracy reaches ~90% on CIFAR-10 with MobileNetV2 after only 5 head-training epochs and 3 fine-tuning epochs.
* For FP32 evaluation, the GPU is roughly 75x to 90x faster than the CPU (about 5,900 to 6,000 samples/s versus 68 to 77 samples/s), which is expected given the architectural differences. The CPU runs used a batch size of 32 rather than 64, and training was not measured on the CPU.
* Running the model on a CPU for embedded use cases, with high inference throughput and little-to-no accuracy loss, therefore requires optimising it for the CPU. Candidate techniques are pruning, quantization, and knowledge distillation.
* Pruning and quantization are explored in the next notebooks. Knowledge distillation is not, because MobileNetV2 is already an efficient architecture.
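
The speedup figures can be recomputed from the throughputs logged in this notebook. The sketch below is only arithmetic on those logged values. The dictionary layout and the `speedup` helper are illustrative and not part of `nnopt`.

```python
# Throughputs in samples/s, copied from the "Evaluation Complete" logs above.
throughput = {
    "gpu_bf16_amp": {"val": 8822.51, "test": 8450.56},
    "gpu_fp32": {"val": 5888.79, "test": 6029.95},
    "cpu_fp32": {"val": 77.05, "test": 67.92},
}


def speedup(fast: str, slow: str, split: str) -> float:
    """Ratio of throughputs between two configurations on one split."""
    return throughput[fast][split] / throughput[slow][split]


for split in ("val", "test"):
    gpu_vs_cpu = speedup("gpu_fp32", "cpu_fp32", split)
    amp_vs_fp32 = speedup("gpu_bf16_amp", "gpu_fp32", split)
    print(f"{split}: GPU FP32 vs CPU FP32 = {gpu_vs_cpu:.1f}x, "
          f"GPU bf16 AMP vs GPU FP32 = {amp_vs_fp32:.2f}x")
```

The GPU-to-CPU ratio comes out at roughly 76x on the validation set and 89x on the test set, while bf16 mixed precision adds a further 1.4x to 1.5x on the GPU. The comparison is not strictly like-for-like, since the CPU runs used a batch size of 32 and the GPU runs used 64.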