Merged

22 commits
396358d
Modify environment setup instructions in 01-env-setup.md
pareenaverma Nov 20, 2025
c15c0f8
Revise environment setup instructions for ExecuTorch
pareenaverma Nov 20, 2025
2f78c11
Enhance environment setup instructions
pareenaverma Nov 20, 2025
5d06a6d
Update 02-cross-compile.md
pareenaverma Nov 20, 2025
c8963d8
Update cross-compilation installation command
pareenaverma Nov 20, 2025
aa9fb4a
Fix command for installing cross-compilation tools
pareenaverma Nov 20, 2025
09516cc
Update export-conv2d.py
pareenaverma Nov 20, 2025
f61aeb4
Update _index.md
pareenaverma Nov 20, 2025
2f1d089
Enhance Python environment setup and installation instructions
pareenaverma Nov 20, 2025
0ab4278
Add success confirmation for Executorch installation
pareenaverma Nov 20, 2025
4fc1d5b
Refine cross-compilation instructions for AArch64
pareenaverma Nov 20, 2025
8601f02
Update 03-executorch-node-kai-kernel.md
pareenaverma Nov 20, 2025
b3aadca
Update 04-create-fc-model.md
pareenaverma Nov 20, 2025
f610d5f
Refine language for clarity in Conv2d model documentation
pareenaverma Nov 20, 2025
c14d2fe
Include script for exporting conv2D benchmark models
pareenaverma Nov 20, 2025
45b0c1d
Update 05-create-conv2d-model.md
pareenaverma Nov 20, 2025
b47b9c4
Clarify Batch Matrix Multiply operator usage
pareenaverma Nov 20, 2025
9d23c22
Add benchmark model script instructions
pareenaverma Nov 20, 2025
9c42626
Update 07-run-model.md
pareenaverma Nov 21, 2025
379b062
Update 07-run-model.md
pareenaverma Nov 21, 2025
75c98ee
Update 08-analyze-etdump.md
pareenaverma Nov 21, 2025
56c20fd
Update wording for clarity in performance analysis
pareenaverma Nov 21, 2025
@@ -10,45 +10,50 @@ layout: learningpathall
### Python Environment Setup

Before building ExecuTorch, it is highly recommended to create an isolated Python environment.
This prevents dependency conflicts with your system Python installation and ensures a clean build environment.
This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs.

```bash
cd $WORKSPACE
sudo apt update
sudo apt install -y python3 python3.12-dev python3-venv build-essential cmake
python3 -m venv pyenv
source pyenv/bin/activate

```
All subsequent steps should be executed within this Python virtual environment.
Once activated, all subsequent steps should be executed within this Python virtual environment.
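
Once the environment is active, the interpreter on your PATH should come from the virtual environment. A quick optional check (a sketch; the path assumes the venv was created as pyenv, as shown above):

```bash
# Confirm the active interpreter lives inside the virtual environment
which python      # expected: .../pyenv/bin/python
python --version
pip --version
```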

### Download the ExecuTorch Source Code

Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched.

```bash
export WORKSPACE=$HOME
cd $WORKSPACE
git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.git

```

> **Note:**
> The instructions in this guide are based on **ExecuTorch v1.0.0**.
> Commands or configuration options may differ in later releases.
{{% notice Note %}}
The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases.
{{% /notice %}}
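
If you want to confirm the checkout before building, a quick optional sanity check (assuming the clone above completed) shows the tag and the submodule state:

```bash
cd $WORKSPACE/executorch
git describe --tags           # should print v1.0.0
git submodule status | head   # entries starting with '-' are not yet initialized
```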

### Build and Install the ExecuTorch Python Components

Next, build the Python bindings and install them into your environment. The following command uses the provided installation script to configure, compile, and install ExecuTorch with developer tools enabled.
Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment.
This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.

Run the following command from your ExecuTorch workspace:
```bash
cd $WORKSPACE/executorch
CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh

```
This will build ExecuTorch and its dependencies using cmake, enabling optional developer utilities such as ETDump and Inspector.

This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.

After installation completes successfully, you can verify the environment by running:
### Verify the Installation
After the build completes successfully, verify that ExecuTorch was installed into your current Python environment:

```bash
python -c "import executorch; print('ExecuTorch built and installed successfully.')"
```

If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels.
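
For a slightly deeper check, you can also query the installed package itself. This assumes the install script registered ExecuTorch with pip, which is its normal behavior:

```bash
pip show executorch           # version and install location
python -c "import executorch, inspect; print(inspect.getfile(executorch))"
```
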
@@ -1,19 +1,26 @@
---
title: Cross-Compile ExecuTorch for the Aarch64 platform
title: Cross-Compile ExecuTorch for the AArch64 platform
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---


This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled.
All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc).
In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled.
Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.

### Install the Cross-Compilation Toolchain
On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake:
```bash
sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y
```
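
Before configuring the build, you can optionally verify that the cross-compilers and Ninja are on your PATH:

```bash
aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-g++ --version
ninja --version
```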

### Run CMake Configuration

Use CMake to configure the ExecuTorch build for Aarch64. The example below enables key extensions, developer tools, and XNNPACK with KleidiAI acceleration:
Use CMake to configure the ExecuTorch build for the AArch64 target.

The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI.

```bash

@@ -61,18 +68,19 @@ cmake -GNinja \


### Build ExecuTorch
Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:

```bash
cmake --build . -j$(nproc)

```
CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.

If the build completes successfully, you should find the executor_runner binary under the directory:
### Locate the executor_runner Binary
If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:

```bash
```output
build-arm64/executor_runner

```

You will use executor_runner in later sections on your Arm64 target as a standalone binary to execute and profile ExecuTorch models directly from the command line.
This binary can be used to run ExecuTorch models on the Arm64 target device using the XNNPACK backend with KleidiAI acceleration.
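
A quick way to confirm the binary really targets AArch64 is to inspect it on the build host with file (the exact wording of the output varies by distribution):

```bash
file build-arm64/executor_runner
# expected to mention: ELF 64-bit LSB executable (or pie executable), ARM aarch64
```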

@@ -5,9 +5,9 @@ weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---
ExecuTorch uses XNNPACK as its primary CPU backend for operator execution and performance optimization.
ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers.

Within this architecture, only a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.

These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.
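
Whether the SME/SME2 micro-kernels can actually be dispatched depends on the target CPU and kernel. On the Arm device, you can list the relevant CPU feature flags; the names vary with kernel version, and sme/sme2 appear only on hardware and kernels that support them:

```bash
# Show matrix/vector related feature flags reported by the kernel
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sve|sve2|sme|sme2|i8mm|asimddp)$' || echo "no matching features reported"
```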

@@ -6,14 +6,16 @@ weight: 5
layout: learningpathall
---

In the previous section, we discussed that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.

To evaluate the performance of these variants across different hardware platforms, we will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.

These models will be used later with executor_runner to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels.

### Fully connected benchmark model
### Define a Simple Linear Benchmark Model

In the following example model, we use a simple model to generate nodes that can be accelerated by KleidiAI.
The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer.
This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels.

By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.

@@ -34,8 +36,9 @@ class DemoLinearModel(torch.nn.Module):
return (torch.randn(1, 256, dtype=dtype),)

```
This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants.

### Export FP16/FP32 model for pf16_gemm/pf32_gemm Variants
### Export FP16/FP32 model for pf16_gemm and pf32_gemm

| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType |
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
@@ -86,15 +89,16 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm")

```
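
Before moving on to the quantized variants, the short sketch below (independent of the export script above) shows how the same 256×256 layer is simply held in FP32 or FP16, which is what steers it to the pf32_gemm or pf16_gemm path:

```python
import torch

# The same layer in the two floating-point dtypes used by the benchmarks above
fp32_layer = torch.nn.Linear(256, 256)          # maps to the pf32_gemm variant
fp16_layer = torch.nn.Linear(256, 256).half()   # maps to the pf16_gemm variant

x = torch.randn(1, 256)
print(fp32_layer(x).dtype)          # torch.float32
print(fp16_layer(x.half()).dtype)   # torch.float16
```
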

### Export int8 quantized model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variant
### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm
INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy.

| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType |
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
| qp8_f32_qc8w_gemm | Asymmetric INT8 per-row quantization | Per-channel symmetric INT8 quantization | FP32 |
| pqs8_qc8w_gemm | Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
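
To make the weights column of this table concrete, here is a small numerical sketch (illustrative only, not the code path XNNPACK uses) of per-channel symmetric INT8 quantization:

```python
import torch

# One scale per output channel, zero-point fixed at 0, values in [-127, 127]
weight = torch.randn(256, 256)                        # [out_channels, in_channels]
scales = weight.abs().amax(dim=1) / 127.0             # per-channel scales
q_weight = torch.clamp(torch.round(weight / scales[:, None]), -127, 127).to(torch.int8)

# Dequantizing recovers an approximation of the original weights
dequant = q_weight.float() * scales[:, None]
print((weight - dequant).abs().max())                 # quantization error
```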


The following code demonstrates how to quantize a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variant to accelerate computation:
The following code demonstrates how to quantize a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variants to accelerate computation:

```python

@@ -148,7 +152,9 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm");

```

### Export int4 quantized model for qp8_f32_qb4w_gemm variant
### Export INT4 quantized model for qp8_f32_qb4w_gemm
This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels.

| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType |
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
| qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 |
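
The weight format here is blockwise: each fixed-size block along the input dimension shares one scale, and values use the signed INT4 range [-8, 7]. A minimal numerical sketch (illustrative only; the packing actually used by the backend differs):

```python
import torch

block_size = 32
weight = torch.randn(256, 256)                           # in_features divisible by block_size
blocks = weight.view(256, 256 // block_size, block_size)
scales = blocks.abs().amax(dim=-1, keepdim=True) / 7.0   # one scale per block
q_blocks = torch.clamp(torch.round(blocks / scales), -8, 7).to(torch.int8)

dequant = (q_blocks.float() * scales).view(256, 256)
print((weight - dequant).abs().max())                    # blockwise quantization error
```
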
@@ -200,17 +206,26 @@ def export_int4_quantize_model(dynamic: bool, model_name: str):
etrecord.save(etr_file)

export_int4_quantize_model(False,"linear_model_qp8_f32_qb4w_gemm");


```

**NOTE:**

{{%notice Note%}}
When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file.
These ETRecord files are essential for subsequent model inspection and performance analysis using the ExecuTorch Inspector API.
{{%/notice%}}


After running this script, both the PTE model file and the etrecord file are generated.
### Run the Complete Benchmark Model Export Script
Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4).
This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format.

```bash
wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/arm-learning-paths/refs/heads/main/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-linear-model.py
chmod +x export-linear-model.py
python3 ./export-linear-model.py
```

### Verify the Generated Files
After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory:

``` bash
$ ls model/ -1
@@ -225,5 +240,4 @@ linear_model_qp8_f32_qb4w_gemm.pte
linear_model_qp8_f32_qc8w_gemm.etrecord
linear_model_qp8_f32_qc8w_gemm.pte
```

The complete source code is available [here](../export-linear-model.py).
At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels.
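
A quick optional check is to compare the on-disk sizes of the exported variants; the INT8 and INT4 models should be noticeably smaller than their FP32/FP16 counterparts:

```bash
ls -lh model/*.pte
```
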
@@ -6,22 +6,22 @@ weight: 6
layout: learningpathall
---

In the previous section, we discussed that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.
In the previous section, you saw that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.


| XNNPACK GEMM Variant | Input DataType| Filter DataType | Output DataType |
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
| pqs8_qc8w_gemm | Asymmetric INT8 quantization(NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization(NHWC) |
| pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 |
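
The reason a pointwise (1×1) Conv2d can take the pf32_gemm path is that, over NHWC data, it is arithmetically a matrix multiplication across the channel dimension. A small sketch (independent of the benchmark models below) makes the equivalence explicit:

```python
import torch

n, c_in, c_out, h, w = 1, 64, 128, 56, 56
x = torch.randn(n, c_in, h, w)
conv = torch.nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

y_conv = conv(x)                                     # 1x1 convolution

x_mat = x.permute(0, 2, 3, 1).reshape(-1, c_in)      # [N*H*W, C_in]
w_mat = conv.weight.view(c_out, c_in).t()            # [C_in, C_out]
y_gemm = (x_mat @ w_mat).reshape(n, h, w, c_out).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_gemm, atol=1e-4))     # True
```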

To evaluate the performance of Conv2d operators across multiple hardware platforms, we create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis.
To evaluate the performance of Conv2d operators across multiple hardware platforms, you will create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis.


### INT8-quantized Conv2d benchmark model
### INT8-Quantized Conv2d benchmark model

The following example defines a simple model to generate INT8-quantized Conv2d nodes that can be accelerated by KleidiAI.

By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.
By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models.


```python
@@ -100,7 +100,7 @@ export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm");

### PointwiseConv2d benchmark model

In the following example model, we use a simple model to generate pointwise Conv2d nodes that can be accelerated by KleidiAI.
In the following example model, you will use a simple model to generate pointwise Conv2d nodes that can be accelerated by KleidiAI.

As before, input parameters can be adjusted to simulate real-world model behavior.

@@ -158,10 +158,21 @@ export_pointwise_model("pointwise_conv2d_pf32_gemm")

```

**NOTES:**

{{%notice Note%}}
When exporting models, the generate_etrecord option is enabled to produce the .etrecord file alongside the .pte model file.
These ETRecord files are essential for subsequent model analysis and performance evaluation.
{{%/notice%}}


### Run the Complete Benchmark Model Script
Rather than executing each block by hand, download and run the full export script. It will generate both Conv2d variants, run quantization (INT8) where applicable, partition to XNNPACK, lower, and export to ExecuTorch .pte together with .etrecord metadata.

```bash
wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py
chmod +x export-conv2d.py
python3 ./export-conv2d.py
```
### Validate Outputs

After running this script, both the PTE model file and the etrecord file are generated.

@@ -173,4 +184,3 @@ pointwise_conv2d_pf32_gemm.etrecord
pointwise_conv2d_pf32_gemm.pte
```

The complete source code is available [here](../export-conv2d.py).
@@ -6,9 +6,9 @@ weight: 7
layout: learningpathall
---

In the previous section, we discussed that the Batch Matrix Multiply operator supports multiple GEMM (General Matrix Multiplication) variants.
The Batch Matrix Multiply operator (torch.bmm) under XNNPACK lowers to GEMM and, when shapes and dtypes match supported patterns, can dispatch to KleidiAI micro-kernels on Arm.

To evaluate the performance of these variants across different hardware platforms, we construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis.
To evaluate the performance of these variants across different hardware platforms, you will construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis.
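
As a reminder of the shapes involved, torch.bmm multiplies two batches of matrices whose inner dimensions match:

```python
import torch

a = torch.randn(8, 64, 128)   # [batch, M, K]
b = torch.randn(8, 128, 32)   # [batch, K, N]
c = torch.bmm(a, b)
print(c.shape)                # torch.Size([8, 64, 32])
```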


### Matrix multiply benchmark model
@@ -72,11 +72,22 @@ export_mutrix_mul_model(torch.float32,"matrix_mul_pf32_gemm")

```

**NOTE:**

{{%notice Note%}}
When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file.
These ETRecord files are essential for subsequent model analysis and performance evaluation.
{{%/notice%}}

### Run the Complete Benchmark Model Script
Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script.
This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation.

```bash
wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-matrix-mul.py
chmod +x export-matrix-mul.py
python3 ./export-matrix-mul.py
```

### Verify the Output

After running this script, both the PTE model file and the etrecord file are generated.

@@ -87,5 +98,6 @@ model/matrix_mul_pf16_gemm.pte
model/matrix_mul_pf32_gemm.etrecord
model/matrix_mul_pf32_gemm.pte
```
These files are the inputs for upcoming executor_runner benchmarks, where you’ll measure and compare KleidiAI micro-kernel performance.

The complete source code is available [here](../export-matrix-mul.py).
@@ -1,25 +1,35 @@
---
title: Run model and generate the etdump
title: Run model and generate the ETDump
weight: 8

### FIXED, DO NOT MODIFY
layout: learningpathall
---

After generating the model, we can now run it on an ARM64 platform using the following command:
### Copy artifacts to your Arm64 target
From your x86_64 host (where you cross-compiled), copy the runner and exported models to the Arm device:

```bash
scp $WORKSPACE/build-arm64/executor_runner <arm_user>@<arm_host>:~/bench/
scp -r model/ <arm_user>@<arm_host>:~/bench/
```

### Run a model and emit ETDump
Use one of the models you exported earlier (e.g., FP32 linear: linear_model_pf32_gemm.pte).
The flags below tell executor_runner where to write the ETDump and how many times to execute.

```bash
cd $WORKSPACE
/build-arm64/executor_runner -etdump_path model/linear_model_f32.etdump -model_path model/linear_model_f32.pte -num_executions=1 -cpu_threads 1
cd ~/bench
./executor_runner -etdump_path model/linear_model_pf32_gemm.etdump -model_path model/linear_model_pf32_gemm.pte -num_executions=1 -cpu_threads 1

```

You can adjust the number of execution threads and the number of times the model is invoked.


You should see output similar to the example below.
You should see logs like:

```bash
```output
D 00:00:00.015988 executorch:XNNPACKBackend.cpp:57] Creating XNN workspace
D 00:00:00.018719 executorch:XNNPACKBackend.cpp:69] Created XNN workspace: 0xaff21c2323e0
D 00:00:00.027595 executorch:operator_registry.cpp:96] Successfully registered all kernels from shared library: NOT_SUPPORTED
@@ -42,6 +52,6 @@ OutputX 0: tensor(sizes=[1, 256], [
I 00:00:00.093912 executorch:executor_runner.cpp:125] ETDump written to file 'model/linear_model_f32.etdump'.

```
If execution succeeds, an ETDump file is created next to your model. You will load the .etdump in the next section and analyze which operators dispatched to KleidiAI and how each micro-kernel performed.

If the execution is successful, an etdump file will also be generated.
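
To compare micro-kernel behavior under different levels of parallelism, you can repeat the run with a higher iteration count and several thread counts. The sketch below reuses the flags shown above and writes one ETDump per configuration (the model name and thread values are only examples):

```bash
cd ~/bench
for t in 1 2 4; do
  ./executor_runner \
    -model_path model/linear_model_pf32_gemm.pte \
    -etdump_path model/linear_model_pf32_gemm_t${t}.etdump \
    -num_executions=100 -cpu_threads ${t}
done
```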
