From 396358dcb52620c2b5e66265ae1c70c2c01283df Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 14:05:15 -0500 Subject: [PATCH 01/22] Modify environment setup instructions in 01-env-setup.md Updated directory navigation and installation commands for Python environment setup. --- .../01-env-setup.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md index 1e26ff1d6..e7b6471c5 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md @@ -13,7 +13,9 @@ Before building ExecuTorch, it is highly recommended to create an isolated Pytho This prevents dependency conflicts with your system Python installation and ensures a clean build environment. ```bash -cd $WORKSPACE +cd $HOME +sudo apt update +sudo apt install -y python3 python3-venv python3 -m venv pyenv source pyenv/bin/activate From c15c0f8505448ff802ef25af1bf939366909e22d Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 14:06:18 -0500 Subject: [PATCH 02/22] Revise environment setup instructions for ExecuTorch Updated instructions for setting up the environment for ExecuTorch. --- .../01-env-setup.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md index e7b6471c5..2b66803b6 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md @@ -13,7 +13,6 @@ Before building ExecuTorch, it is highly recommended to create an isolated Pytho This prevents dependency conflicts with your system Python installation and ensures a clean build environment. ```bash -cd $HOME sudo apt update sudo apt install -y python3 python3-venv python3 -m venv pyenv @@ -26,7 +25,8 @@ All subsequent steps should be executed within this Python virtual environment. Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched. -```bash +```bash +export WORKSPACE=$HOME cd $WORKSPACE git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.git From 2f78c1181104211b1c50b7b5c807e3a955c95884 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 14:16:56 -0500 Subject: [PATCH 03/22] Enhance environment setup instructions Updated the installation command to include python3.12-dev, build-essential, and cmake for a complete environment setup. 
--- .../01-env-setup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md index 2b66803b6..86a1a2041 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md @@ -14,7 +14,7 @@ This prevents dependency conflicts with your system Python installation and ensu ```bash sudo apt update -sudo apt install -y python3 python3-venv +sudo apt install -y python3 python3.12-dev python3-venv build-essential cmake python3 -m venv pyenv source pyenv/bin/activate From 5d06a6d3c0a5e9343667c9579524627eb5b77050 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 14:37:09 -0500 Subject: [PATCH 04/22] Update 02-cross-compile.md --- .../02-cross-compile.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md index 0b0386694..f0e7bf43c 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md @@ -10,6 +10,10 @@ layout: learningpathall This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled. All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc). +```bash +sudo apt install gcc-aarch64-linux-gnu -y +``` + ### Run CMake Configuration From c8963d858f8f6d114ddc49eb14840c046e2ebd9a Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 14:39:00 -0500 Subject: [PATCH 05/22] Update cross-compilation installation command Added ninja-build to the installation command for cross-compilation. --- .../02-cross-compile.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md index f0e7bf43c..5475d531b 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md @@ -11,7 +11,7 @@ This section describes how to cross-compile ExecuTorch for an AArch64 target pla All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc). 
```bash -sudo apt install gcc-aarch64-linux-gnu -y +sudo apt install gcc-aarch64-linux-gnu ninja-build -y ``` From aa9fb4ad53f8ab7a8039c94525245b3af276eb0e Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 14:40:53 -0500 Subject: [PATCH 06/22] Fix command for installing cross-compilation tools --- .../02-cross-compile.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md index 5475d531b..3aaaedde3 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md @@ -11,7 +11,7 @@ This section describes how to cross-compile ExecuTorch for an AArch64 target pla All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc). ```bash -sudo apt install gcc-aarch64-linux-gnu ninja-build -y +sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y ``` From 09516cce3b37e342f2ba529c09602acc0d544f2a Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 15:07:37 -0500 Subject: [PATCH 07/22] Update export-conv2d.py --- .../export-conv2d.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py index b976be70c..0e1765436 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py @@ -74,7 +74,7 @@ def export_int8_quantize_conv2d_model(model_name: str): etrecord = et_program.get_etrecord() etrecord.save(etr_file) -export_int8_quantize_depthwise_model("qint8_conv2d_pqs8_qc8w_gemm"); +export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm"); From f61aeb4a879b382a6fbaa6e01b8bb0aa077fc005 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 15:28:21 -0500 Subject: [PATCH 08/22] Update _index.md --- .../_index.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md index 6f7bffc8c..14b6f0776 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md @@ -1,5 +1,5 @@ --- -title: How to Benchmark a Single KleidiAI Micro-kernel in ExecuTorch +title: How to Benchmark a KleidiAI Micro-kernel in ExecuTorch draft: true cascade: @@ -7,17 +7,17 @@ cascade: minutes_to_complete: 30 -who_is_this_for: This article is intended for advanced developers who want to leverage KleidiAI to accelerate ExecuTorch model inference on the AArch64 platform. 
+who_is_this_for: This is an advanced topic intended for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 (AArch64) platforms supporting SME/SME2 instructions. learning_objectives: - - Cross-compile ExecuTorch for the ARM64 platform with XNNPACK and KleidiAI enabled, including SME/SME2 support. + - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions. - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions. - - Use the `executor_runner` tool to collect ETDump profiling data. - - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API. + - Use the executor_runner tool to run kernel workloads and collect ETDump profiling data. + - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior. prerequisites: - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space. - - An Arm64 target system with support for SME or SME2. + - An Arm64 target system with support for SME or SME2. Refer to [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support). author: Qixiang Xu @@ -26,13 +26,12 @@ skilllevels: Advanced subjects: ML armips: - Cortex-A - - SME - - Kleidai tools_software_languages: - Python - - cmake + - ExecuTorch - XNNPACK + - KleidiAI operatingsystems: - Linux From 2f1d089d737938614a313aa73dd2654569198019 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 15:33:36 -0500 Subject: [PATCH 09/22] Enhance Python environment setup and installation instructions Updated Python environment setup instructions and improved clarity on installation steps. --- .../01-env-setup.md | 20 ++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md index 86a1a2041..4e2c7767e 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md @@ -10,7 +10,7 @@ layout: learningpathall ### Python Environment Setup Before building ExecuTorch, it is highly recommended to create an isolated Python environment. -This prevents dependency conflicts with your system Python installation and ensures a clean build environment. +This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs. ```bash sudo apt update @@ -19,7 +19,7 @@ python3 -m venv pyenv source pyenv/bin/activate ``` -All subsequent steps should be executed within this Python virtual environment. +Once activated, all subsequent steps should be executed within this Python virtual environment. ### Download the ExecuTorch Source Code @@ -32,23 +32,25 @@ git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.g ``` - > **Note:** - > The instructions in this guide are based on **ExecuTorch v1.0.0**. 
- > Commands or configuration options may differ in later releases. + {{% notice Note %}} + The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases. + {{% /notice %}} ### Build and Install the ExecuTorch Python Components -Next, build the Python bindings and install them into your environment. The following command uses the provided installation script to configure, compile, and install ExecuTorch with developer tools enabled. +Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment. +This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling. +Run the following command from your ExecuTorch workspace: ```bash cd $WORKSPACE/executorch CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh ``` +This will build ExecuTorch and its dependencies using cmake, enabling optional developer utilities such as ETDump and Inspector. -This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector. - -After installation completes successfully, you can verify the environment by running: +### Verify the Installation +After the build completes successfully, verify that ExecuTorch was installed into your current Python environment: ```bash python -c "import executorch; print('Executorch build and install successfully.')" From 0ab4278263edb7c6800b21f25dfff7b0de88a7ec Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 15:34:17 -0500 Subject: [PATCH 10/22] Add success confirmation for Executorch installation Added confirmation step for successful installation. --- .../01-env-setup.md | 1 + 1 file changed, 1 insertion(+) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md index 4e2c7767e..9a0ec2c0b 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md @@ -56,3 +56,4 @@ After the build completes successfully, verify that ExecuTorch was installed int python -c "import executorch; print('Executorch build and install successfully.')" ``` +If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels. From 4fc1d5b8b94c57c4b1d2bfacd26ec6baa52538a2 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 15:43:12 -0500 Subject: [PATCH 11/22] Refine cross-compilation instructions for AArch64 Updated the title and improved clarity in the instructions for cross-compiling ExecuTorch for the AArch64 platform. Added details about the cross-compilation toolchain and clarified the purpose of the executor_runner binary. 
---
 .../02-cross-compile.md | 24 +++++++++++--------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md
index 3aaaedde3..dcd8ed07e 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md
@@ -1,5 +1,5 @@
 ---
-title: Cross-Compile ExecuTorch for the Aarch64 platform
+title: Cross-Compile ExecuTorch for the AArch64 platform
 weight: 3
 
 ### FIXED, DO NOT MODIFY
@@ -7,17 +7,20 @@ layout: learningpathall
 ---
 
 
-This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled.
-All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc).
+In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled.
+Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
 
+### Install the Cross-Compilation Toolchain
+On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake:
 ```bash
 sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y
 ```
 
-
 ### Run CMake Configuration
 
-Use CMake to configure the ExecuTorch build for Aarch64. The example below enables key extensions, developer tools, and XNNPACK with KleidiAI acceleration:
+Use CMake to configure the ExecuTorch build for the AArch64 target.
+
+The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI.
 
 
 ```bash
@@ -65,18 +68,19 @@ cmake -GNinja \
 
 ### Build ExecuTorch
 
+Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:
 
 ```bash
 cmake --build . -j$(nproc)
-
 ```
+CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.
 
 
-If the build completes successfully, you should find the executor_runner binary under the directory:
+### Locate the executor_runner Binary
+If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:
 
-```bash
+```output
 build-arm64/executor_runner
-
 ```
-
+You will use executor_runner in the later sections on your Arm64 target as a standalone binary to execute and profile ExecuTorch models directly from the command line.
 This binary can be used to run ExecuTorch models on the ARM64 target device using the XNNPACK backend with KleidiAI acceleration.
From 8601f0232fb67417c3dfa1b95de0f260ce4e18c1 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 15:45:38 -0500 Subject: [PATCH 12/22] Update 03-executorch-node-kai-kernel.md --- .../03-executorch-node-kai-kernel.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md index 7bb5ffffd..5f8cac6fd 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md @@ -5,9 +5,9 @@ weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -ExecuTorch uses XNNPACK as its primary CPU backend for operator execution and performance optimization. +ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers. -Within this architecture, only a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms. +Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms. These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models. From b3aadcad49a77eb5309b92ed483e7db18debf6c5 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 16:03:09 -0500 Subject: [PATCH 13/22] Update 04-create-fc-model.md --- .../04-create-fc-model.md | 44 ++++++++++++------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md index 7be11240d..7683bbdd4 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md @@ -6,14 +6,16 @@ weight: 5 layout: learningpathall --- -In the previous section, we discussed that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants. +In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants. -To evaluate the performance of these variants across different hardware platforms, we will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis. +To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis. +These models will be used later with executor_runner to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels. 
-### Fully connected benchmark model +### Define a Simple Linear Benchmark Model -In the following example model, we use simple model to generate nodes that can be accelerated by Kleidiai. +The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer. +This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels. By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models. @@ -34,8 +36,9 @@ class DemoLinearModel(torch.nn.Module): return (torch.randn(1, 256, dtype=dtype),) ``` +This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants. -### Export FP16/FP32 model for pf16_gemm/pf32_gemm Variants +### Export FP16/FP32 model for pf16_gemm and pf32_gemm | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | @@ -86,7 +89,8 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm") ``` -### Export int8 quantized model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variant +### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm +INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy. | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | @@ -94,7 +98,7 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm") | pqs8_qc8w_gemm | Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization | -The following code demonstrates how to quantized a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variant to accelerate computation: +The following code demonstrates how to quantized a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variants to accelerate computation: ```python @@ -148,7 +152,9 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm"); ``` -### Export int4 quantized model for qp8_f32_qb4w_gemm variant +### Export INT4 quantized model for qp8_f32_qb4w_gemm +This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels. + | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | | qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 | @@ -200,17 +206,26 @@ def export_int4_quantize_model(dynamic: bool, model_name: str): etrecord.save(etr_file) export_int4_quantize_model(False,"linear_model_qp8_f32_qb4w_gemm"); - - ``` -**NOTE:** - +{{%notice Note%}} When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file. These ETRecord files are essential for subsequent model inspection and performance analysis using the ExecuTorch Inspector API. +{{%/notice%}} -After running this script, both the PTE model file and the etrecord file are generated. 
+### Run the Complete Benchmark Model Export Script +Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4). +This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format. + +```bash +wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/arm-learning-paths/refs/heads/main/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-linear-model.py +chmod +x export-linear-model.py +python3 ./export-linear-model.py +``` + +### Verify the Generated Files +After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory: ``` bash $ ls model/ -1 @@ -225,5 +240,4 @@ linear_model_qp8_f32_qb4w_gemm.pte linear_model_qp8_f32_qc8w_gemm.etrecord linear_model_qp8_f32_qc8w_gemm.pte ``` - -The complete source code is available [here](../export-linear-model.py). +At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels. From f610d5f4cf94329b2b2f385b18377682e57836ab Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 16:10:18 -0500 Subject: [PATCH 14/22] Refine language for clarity in Conv2d model documentation Updated language for clarity and consistency throughout the document, changing phrases to be more instructive. --- .../05-create-conv2d-model.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md index 685a7ce39..2bf932a7b 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md @@ -6,7 +6,7 @@ weight: 6 layout: learningpathall --- -In the previous section, we discussed that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels. +In the previous section, you saw that that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels. | XNNPACK GEMM Variant | Input DataType| Filter DataType | Output DataType | @@ -14,14 +14,14 @@ In the previous section, we discussed that both INT8-quantized Conv2d and pointw | pqs8_qc8w_gemm | Asymmetric INT8 quantization(NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization(NHWC) | | pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 | -To evaluate the performance of Conv2d operators across multiple hardware platforms, we create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis. +To evaluate the performance of Conv2d operators across multiple hardware platforms, you will create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis. 
-### INT8-quantized Conv2d benchmark model +### INT8-Quantized Conv2d benchmark model The following example defines a simple model to generate INT8-quantized Conv2d nodes that can be accelerated by KleidiAI. -By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models. +By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models. ```python @@ -100,7 +100,7 @@ export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm"); ### PointwiseConv2d benchmark model -In the following example model, we use simple model to generate pointwise Conv2d nodes that can be accelerated by Kleidiai. +In the following example model, you will use simple model to generate pointwise Conv2d nodes that can be accelerated by Kleidiai. As before, input parameters can be adjusted to simulate real-world model behavior. @@ -158,10 +158,12 @@ export_pointwise_model("pointwise_conv2d_pf32_gemm") ``` -**NOTES:** - +{{%notice Note%}} When exporting models, the generate_etrecord option is enabled to produce the .etrecord file alongside the .pte model file. These ETRecord files are essential for subsequent model analysis and performance evaluation. +{{%/notice%}} + +### Validate Outputs After running this script, both the PTE model file and the etrecord file are generated. From c14d2fe69bf1593b029d5281413fd29e0c58f47d Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 16:12:59 -0500 Subject: [PATCH 15/22] Include script for exporting conv2D benchmark models Added instructions to run the complete benchmark model script for exporting conv2D models. --- .../05-create-conv2d-model.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md index 2bf932a7b..04dbd1e87 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md @@ -163,6 +163,16 @@ When exporting models, the generate_etrecord option is enabled to produce the .e These ETRecord files are essential for subsequent model analysis and performance evaluation. {{%/notice%}} + +### Run the Complete Benchmark Model Script +Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports the conv2D benchmark models. +This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format. + +```bash +wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/arm-learning-paths/refs/heads/main/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d-model.py +chmod +x export-conv2d-model.py +python3 ./export-conv2d-model.py +``` ### Validate Outputs After running this script, both the PTE model file and the etrecord file are generated. 
From 45b0c1d55e51e3d96c4a7de1eb63c2650df2c3b4 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 16:16:39 -0500 Subject: [PATCH 16/22] Update 05-create-conv2d-model.md --- .../05-create-conv2d-model.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md index 04dbd1e87..666ac7285 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md @@ -165,13 +165,12 @@ These ETRecord files are essential for subsequent model analysis and performance ### Run the Complete Benchmark Model Script -Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports the conv2D benchmark models. -This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format. +Rather than executing each block by hand, download and run the full export script. It will generate both Conv2d variants, run quantization (INT8) where applicable, partition to XNNPACK, lower, and export to ExecuTorch .pte together with .etrecord metadata. ```bash -wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/arm-learning-paths/refs/heads/main/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d-model.py -chmod +x export-conv2d-model.py -python3 ./export-conv2d-model.py +wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py +chmod +x export-conv2d.py +python3 ./export-conv2d.py ``` ### Validate Outputs @@ -185,4 +184,3 @@ pointwise_conv2d_pf32_gemm.etrecord pointwise_conv2d_pf32_gemm.pte ``` -The complete source code is available [here](../export-conv2d.py). From b47b9c42d71bff38181a5598581d990eb84a917e Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 16:21:39 -0500 Subject: [PATCH 17/22] Clarify Batch Matrix Multiply operator usage Updated the explanation of the Batch Matrix Multiply operator and clarified the instructions for constructing benchmark models. --- .../06-create-matrix-mul-model.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md index 901e2a888..61625f10e 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md @@ -6,9 +6,9 @@ weight: 7 layout: learningpathall --- -In the previous section, we discussed that the Batch Matrix Multiply operator supports multiple GEMM (General Matrix Multiplication) variants. 
+The Batch Matrix Multiply operator (torch.bmm) under XNNPACK lowers to GEMM and, when shapes and dtypes match supported patterns, can dispatch to KleidiAI micro-kernels on Arm. -To evaluate the performance of these variants across different hardware platforms, we construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis. +To evaluate the performance of these variants across different hardware platforms, you will construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis. ### Matrix multiply benchmark model @@ -72,11 +72,10 @@ export_mutrix_mul_model(torch.float32,"matrix_mul_pf32_gemm") ``` -**NOTE:** - +{{%notice Note%}} When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file. These ETRecord files are essential for subsequent model analysis and performance evaluation. - +{{%/notice%}} After running this script, both the PTE model file and the etrecord file are generated. From 9d23c22e282316412f327e0ea51ffc8f310043fe Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 16:24:26 -0500 Subject: [PATCH 18/22] Add benchmark model script instructions Added instructions for running the complete benchmark model script and verifying output files. --- .../06-create-matrix-mul-model.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md index 61625f10e..bc23b68a1 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md @@ -77,6 +77,18 @@ When exporting models, the **generate_etrecord** option is enabled to produce th These ETRecord files are essential for subsequent model analysis and performance evaluation. {{%/notice%}} +### Run the Complete Benchmark Model Script +Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script. +This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation. + +```bash +wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-matrix-mul.py +chmod +x export-matrix-mul.py +python3 ./export-matrix-mul.py +``` + +### Verify the output + After running this script, both the PTE model file and the etrecord file are generated. ``` bash @@ -86,5 +98,6 @@ model/matrix_mul_pf16_gemm.pte model/matrix_mul_pf32_gemm.etrecord model/matrix_mul_pf32_gemm.pte ``` +These files are the inputs for upcoming executor_runner benchmarks, where you’ll measure and compare KleidiAI micro-kernel performance. The complete source code is available [here](../export-matrix-mul.py). 
From 9c42626227aabd943533110d1e9f8a3f882ad4fd Mon Sep 17 00:00:00 2001
From: pareenaverma
Date: Thu, 20 Nov 2025 19:40:08 -0500
Subject: [PATCH 19/22] Update 07-run-model.md

---
 .../07-run-model.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md
index 2831a1cd9..fda5ebfa9 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md
@@ -1,16 +1,26 @@
 ---
-title: Run model and generate the etdump
+title: Run model and generate the ETDump
 weight: 8
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-After generating the model, we can now run it on an ARM64 platform using the following command:
+### Copy artifacts to your Arm64 target
+From your x86_64 host (where you cross-compiled), copy the runner and exported models to the Arm device:
+
+```bash
+scp $WORKSPACE/build-arm64/executor_runner <user>@<target-ip>:~/bench/
+scp -r model/ <user>@<target-ip>:~/bench/
+```
+
+### Run a model and emit ETDump
+Use one of the models you exported earlier (e.g., FP32 linear: linear_model_pf32_gemm.pte).
+The flags below tell executor_runner where to write the ETDump and how many times to execute.
 
 ```bash
-cd $WORKSPACE
-/build-arm64/executor_runner -etdump_path model/linear_model_f32.etdump -model_path model/linear_model_f32.pte -num_executions=1 -cpu_threads 1
+cd ~/bench
+./executor_runner -etdump_path model/linear_model_f32.etdump -model_path model/linear_model_f32.pte -num_executions=1 -cpu_threads 1
 
 ```
 

From 379b062e59c320bc333e8a1cc1710d54d70cab45 Mon Sep 17 00:00:00 2001
From: pareenaverma
Date: Thu, 20 Nov 2025 19:45:21 -0500
Subject: [PATCH 20/22] Update 07-run-model.md

---
 .../07-run-model.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md
index fda5ebfa9..e3f65b19f 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md
@@ -20,16 +20,16 @@ The flags below tell executor_runner where to write the ETDump and how many time
 
 ```bash
 cd ~/bench
-./executor_runner -etdump_path model/linear_model_f32.etdump -model_path model/linear_model_f32.pte -num_executions=1 -cpu_threads 1
+./executor_runner -etdump_path model/linear_model_pf32_gemm.etdump -model_path model/linear_model_pf32_gemm.pte -num_executions=1 -cpu_threads 1
 
 ```
 
 
 You can adjust the number of execution threads and the number of times the model is invoked.
 
 
-You should see output similar to the example below. 
+You should see logs like: -```bash +```output D 00:00:00.015988 executorch:XNNPACKBackend.cpp:57] Creating XNN workspace D 00:00:00.018719 executorch:XNNPACKBackend.cpp:69] Created XNN workspace: 0xaff21c2323e0 D 00:00:00.027595 executorch:operator_registry.cpp:96] Successfully registered all kernels from shared library: NOT_SUPPORTED @@ -52,6 +52,6 @@ OutputX 0: tensor(sizes=[1, 256], [ I 00:00:00.093912 executorch:executor_runner.cpp:125] ETDump written to file 'model/linear_model_f32.etdump'. ``` +If execution succeeds, an ETDump file is created next to your model. You will load the .etdump in the next section and analyze which operators dispatched to KleidiAI and how each micro-kernel performed. -If the execution is successful, an etdump file will also be generated. From 75c98ee27b9688100e730567324f62eded8337ec Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 19:55:15 -0500 Subject: [PATCH 21/22] Update 08-analyze-etdump.md --- .../08-analyze-etdump.md | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md index d5e684553..c18b7b87e 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md @@ -6,11 +6,13 @@ weight: 9 layout: learningpathall --- -In the final step, we create an Inspector instance by providing the paths to the generated ETDump and ETRecord. +You will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and when eligible it was accelerated by KleidiAI micro-kernels. + The Inspector analyzes the runtime data from the ETDump file and maps it to the corresponding operators in the Edge Dialect Graph. +### Inspector script -To visualize all runtime events in a tabular format, simply call: +Save the following code in a file named `inspect.py` and run it with the path to a .pte model. The script auto-derives .etrecord, .etdump, and an output .csv next to it. ```python @@ -38,6 +40,14 @@ with open(csvfile, "w", encoding="utf-8") as f: ``` +### Run the script + +Run the script, for example with the linear_model_pf32_gemm.pte model : + +```bash +python3 inspect.py model/linear_model_pf32_gemm.pte +``` + Next, you can examine the generated CSV file to view the execution time information for each node in the model. Below is an example showing the runtime data corresponding to the Fully Connected node. @@ -51,5 +61,6 @@ Below is an example showing the runtime data corresponding to the Fully Connecte | Execute | DELEGATE_CALL | 0.04136 | 0.04464 | 0.04792 | 0.046082053 | 0.03372 | 4.390585 | ['aten.linear.default'] | FALSE | XnnpackBackend | | Execute | Method::execute | 0.04848 | 0.0525595 | 0.05756 | 0.0540658046 | 0.03944 | 4.404385 | [] | FALSE | | +You can now iterate over FP32 vs FP16 vs INT8 vs INT4 models, confirm the exact GEMM variant used, and quantify the latency savings attributable to KleidiAI micro-kernels on your Arm device. You can experiment with different models and matrix sizes to obtain various performance results. 
From 56c20fd92feb9c8dc21db62db631cb092d06b798 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Thu, 20 Nov 2025 19:55:43 -0500 Subject: [PATCH 22/22] Update wording for clarity in performance analysis --- .../08-analyze-etdump.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md index c18b7b87e..c0f171454 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md @@ -63,4 +63,4 @@ Below is an example showing the runtime data corresponding to the Fully Connecte You can now iterate over FP32 vs FP16 vs INT8 vs INT4 models, confirm the exact GEMM variant used, and quantify the latency savings attributable to KleidiAI micro-kernels on your Arm device. -You can experiment with different models and matrix sizes to obtain various performance results. +You can experiment with different models and matrix sizes to analyze various performance results.