---
title: Deploy SqueezeNet 1.0 INT8 model with ONNX Runtime on Azure Cobalt 100

draft: true
cascade:
draft: true


minutes_to_complete: 60

who_is_this_for: This Learning Path is for developers deploying ONNX-based applications on Arm-based machines.

learning_objectives:
- Provision an Azure Arm64 virtual machine using the Azure console, with Ubuntu Pro 24.04 LTS as the base image
- Deploy ONNX on the Ubuntu Pro virtual machine
- Perform ONNX baseline testing and benchmarking on Arm64 virtual machines

prerequisites:
- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6)
- Basic understanding of Python and machine learning concepts
- Familiarity with [ONNX Runtime](https://onnxruntime.ai/docs/) and Azure cloud services

author: Pareena Verma


## Azure Cobalt 100 Arm-based processor

Azure’s Cobalt 100 virtual machines are built on Microsoft's first-generation, in-house Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, it is a 64-bit CPU that delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads.

You can use Cobalt 100 for:

- Web and application servers
- Data analytics
- Open-source databases
- Caching systems
- Many other scale-out workloads

Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance. You can learn more about Cobalt 100 in the Microsoft blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

## ONNX
ONNX (Open Neural Network Exchange) is an open-source format designed for representing machine learning models.

You can use ONNX to:

- Move models between different deep learning frameworks, such as PyTorch and TensorFlow
- Deploy models trained in one framework to run in another
- Build flexible, portable, and production-ready AI workflows

ONNX models are serialized into a standardized format that you can execute with ONNX Runtime - a high-performance inference engine optimized for CPU, GPU, and specialized hardware accelerators. This separation of model training and inference lets you deploy models efficiently across cloud, edge, and mobile environments.
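
As a brief illustration of this separation, the following minimal Python sketch loads a serialized model and runs a single inference with ONNX Runtime. It assumes the `onnxruntime` and `numpy` packages are installed and that a SqueezeNet INT8 model file named `squeezenet-int8.onnx` (the model used later in this Learning Path) is in the current directory:

```python
# Minimal sketch: load a serialized ONNX model and run one inference with ONNX Runtime.
# Assumes onnxruntime and numpy are installed and squeezenet-int8.onnx is present locally.
import numpy as np
import onnxruntime as ort

# Create an inference session that runs on the CPU execution provider
session = ort.InferenceSession("squeezenet-int8.onnx", providers=["CPUExecutionProvider"])

# Build a random input matching the model's expected shape (1, 3, 224, 224)
input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Run a single forward pass and print the shape of the first output
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```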

To learn more, see the [ONNX official website](https://onnx.ai/) and the [ONNX Runtime documentation](https://onnxruntime.ai/docs/).

## Next steps for ONNX on Azure Cobalt 100

Now that you understand the basics of Azure Cobalt 100 and ONNX Runtime, you are ready to deploy and benchmark ONNX models on Arm-based Azure virtual machines. This Learning Path will guide you step by step through setting up an Azure Cobalt 100 VM, installing ONNX Runtime, and running machine learning inference on Arm64 infrastructure.
Run the baseline script with `python3 baseline.py`.
You should see output similar to:
```output
Inference time: 0.0026061534881591797
```
{{% notice Note %}}
Inference time is how long it takes for a trained machine learning model to make a prediction after it receives input data.

The input tensor shape `(1, 3, 224, 224)` means:
- `1`: One image is processed at a time (batch size)
- `3`: Three color channels (red, green, blue)
- `224 x 224`: Each image is 224 pixels wide and 224 pixels tall (standard for SqueezeNet)
{{% /notice %}}

This indicates the model successfully executed a single forward pass through the SqueezeNet INT8 ONNX model and returned results.

## Output summary

Single inference latency (0.00260 seconds) is the time required for the model to process one input image and produce an output. The first run includes graph loading, memory allocation, and model initialization overhead. Subsequent inferences are usually faster due to caching and optimized execution.

This demonstrates that the setup is fully working, and ONNX Runtime efficiently executes quantized models on Arm64.
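
To see the difference between the first run and warm runs for yourself, you can extend the timing logic along these lines. This is an illustrative sketch rather than the exact `baseline.py` script, and it assumes the same `squeezenet-int8.onnx` model and packages as before:

```python
# Illustrative timing sketch (not the exact baseline.py script): compare the
# first (warm-up) inference with the average of later runs.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("squeezenet-int8.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# First run includes graph loading, memory allocation, and initialization overhead
start = time.time()
session.run(None, {input_name: data})
print(f"Warm-up inference: {time.time() - start:.6f} s")

# Subsequent runs benefit from caching and optimized execution
times = []
for _ in range(20):
    start = time.time()
    session.run(None, {input_name: data})
    times.append(time.time() - start)
print(f"Average of 20 warm runs: {sum(times) / len(times):.6f} s")
```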

Great job! You've completed your first ONNX Runtime inference on Arm-based Azure infrastructure. This baseline test confirms your environment is set up correctly and ready for more advanced benchmarking.

Next, you'll use a dedicated benchmarking tool to capture more detailed performance statistics and further optimize your deployment.
---
title: Benchmark ONNX Runtime performance with onnxruntime_perf_test
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Benchmark ONNX model inference on Azure Cobalt 100
Now that you have validated ONNX Runtime with Python-based timing (for example, the SqueezeNet baseline test), you can move to using a dedicated benchmarking utility called `onnxruntime_perf_test`. This tool is designed for systematic performance evaluation of ONNX models, allowing you to capture more detailed statistics than simple Python timing.

This approach helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances and compare results with other architectures if needed.

You are now ready to run benchmarks, a key skill for optimizing real-world deployments.


## Run the performance tests using onnxruntime_perf_test
The `onnxruntime_perf_test` tool is included in the ONNX Runtime source code. You can use it to measure the inference performance of ONNX models and compare different execution providers (such as CPU or GPU). On Arm64 VMs, CPU execution is the focus.
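
Before running the benchmark tool, you can optionally confirm which execution providers your ONNX Runtime installation exposes. The short Python check below is a sketch that assumes the `onnxruntime` package from the earlier baseline test is still installed:

```python
# Optional check: list the execution providers available in this ONNX Runtime build.
# On a CPU-only Arm64 VM, expect CPUExecutionProvider to appear in the list.
import onnxruntime as ort

print(ort.get_available_providers())
```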


## Install required build tools
Before building or running `onnxruntime_perf_test`, you need to install a set of development tools and libraries. These packages are required for compiling ONNX Runtime and handling model serialization via Protocol Buffers.

```console
sudo apt update
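# Install a typical set of build tools and Protocol Buffers packages; adjust as needed for your setup
sudo apt install -y build-essential cmake git python3-pip protobuf-compiler libprotobuf-dev
# Verify the Protocol Buffers compiler installation
protoc --version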
```

You should see output similar to:
```output
libprotoc 3.21.12
```
## Build ONNX Runtime from source

The benchmarking tool `onnxruntime_perf_test` isn’t available as a pre-built binary for any platform, so you will need to build it from source. This process can take up to 40 minutes.

Clone the ONNX Runtime repository:
```console
git clone --recursive https://github.com/microsoft/onnxruntime
cd onnxruntime
```

Now, build the benchmark tool:

```console
./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel --build --update --skip_tests
```
If the build completes successfully, you should see the executable at:
```output
./build/Linux/Release/onnxruntime_perf_test
```


## Run the benchmark
Now that you have built the benchmarking tool, you can run inference benchmarks on the SqueezeNet INT8 model:

```console
./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I ../squeezenet-int8.onnx
```

Breakdown of the flags:

- `-e cpu`: use the CPU execution provider.
- `-r 100`: run 100 inference passes for statistical reliability.
- `-m times`: run in “repeat N times” mode for latency-focused measurement.
- `-s`: show detailed per-run statistics, including the latency distribution.
- `-Z`: disable intra-op thread spinning, which reduces wasted CPU cycles between runs, especially on high-core-count systems like Cobalt 100.
- `-I`: generate the model inputs automatically, so you can point the tool at the ONNX model without pre-generated test data.
- `../squeezenet-int8.onnx`: the path to the ONNX model file to benchmark.

You should see output with latency and throughput statistics. If you encounter build errors, check that you have enough memory (at least 8 GB recommended) and all dependencies are installed. For missing dependencies, review the installation steps above.

If the benchmark runs successfully, you are ready to analyze and optimize your ONNX model performance on Arm-based Azure infrastructure.

Well done! You have completed a full benchmarking workflow. Continue to the next section to explore further optimizations or advanced deployment scenarios.
The benchmark output includes latency percentiles similar to:
```output
P95 Latency: 0.00187393 s
P99 Latency: 0.00190312 s
P999 Latency: 0.00190312 s
```
## Benchmark metrics explained

* Average inference time: the mean time taken to process a single inference request across all runs. Lower values indicate faster model execution.
* Throughput: the number of inference requests processed per second. Higher throughput reflects the model’s ability to handle larger workloads efficiently.
* CPU utilization: the percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking.
* Peak memory usage: the maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments.
* P50 latency (median latency): the time below which 50% of inference requests complete. Represents typical latency under normal load.
* Latency consistency: describes the stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter.
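
If you also collect raw per-run latencies yourself (for example, with a Python timing loop like the baseline test), you can derive similar statistics with `numpy`. The sketch below uses placeholder latency values, and its standard percentile definitions may differ slightly from the tool's exact calculations:

```python
# Illustrative sketch: derive summary statistics from a list of per-run
# latencies (in seconds), similar in spirit to the onnxruntime_perf_test report.
import numpy as np

latencies = np.array([0.00186, 0.00184, 0.00187, 0.00185, 0.00190])  # placeholder example values

avg = latencies.mean()
throughput = 1.0 / avg                      # inferences per second at this average latency
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])

print(f"Average inference time: {avg:.6f} s")
print(f"Throughput:             {throughput:.1f} inferences/s")
print(f"P50 / P95 / P99:        {p50:.6f} / {p95:.6f} / {p99:.6f} s")
```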

## Benchmark summary on Arm64
Here is a summary of benchmark results collected on an Arm64 D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine.

| **Metric** | **Value** |
|----------------------------|-------------------------------|
| **Latency Consistency** | Consistent |


## Highlights from benchmarking on Azure Cobalt 100 Arm64 VMs


These results on Arm64 virtual machines demonstrate low-latency inference, with consistent average inference times of approximately 1.86 ms. Throughput remains strong and stable, sustaining over 538 inferences per second using the `squeezenet-int8.onnx` model on D4ps_v6 instances. The resource footprint is lightweight, as peak memory usage stays below 37 MB and CPU utilization is around 96%, making this setup ideal for efficient edge or cloud inference. Performance is also consistent, with P50, P95, and maximum latency values tightly grouped, showcasing reliable results on Azure Cobalt 100 Arm-based infrastructure.

You have now successfully benchmarked the inference time of ONNX models on an Azure Cobalt 100 Arm64 virtual machine.