diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md index 1e26ff1d6..9a0ec2c0b 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md @@ -10,45 +10,50 @@ layout: learningpathall ### Python Environment Setup Before building ExecuTorch, it is highly recommended to create an isolated Python environment. -This prevents dependency conflicts with your system Python installation and ensures a clean build environment. +This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs. ```bash -cd $WORKSPACE +sudo apt update +sudo apt install -y python3 python3.12-dev python3-venv build-essential cmake python3 -m venv pyenv source pyenv/bin/activate ``` -All subsequent steps should be executed within this Python virtual environment. +Once activated, all subsequent steps should be executed within this Python virtual environment. ### Download the ExecuTorch Source Code Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched. -```bash +```bash +export WORKSPACE=$HOME cd $WORKSPACE git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.git ``` - > **Note:** - > The instructions in this guide are based on **ExecuTorch v1.0.0**. - > Commands or configuration options may differ in later releases. + {{% notice Note %}} + The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases. + {{% /notice %}} ### Build and Install the ExecuTorch Python Components -Next, build the Python bindings and install them into your environment. The following command uses the provided installation script to configure, compile, and install ExecuTorch with developer tools enabled. +Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment. +This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling. +Run the following command from your ExecuTorch workspace: ```bash cd $WORKSPACE/executorch CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh ``` +This will build ExecuTorch and its dependencies using cmake, enabling optional developer utilities such as ETDump and Inspector. -This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector. - -After installation completes successfully, you can verify the environment by running: +### Verify the Installation +After the build completes successfully, verify that ExecuTorch was installed into your current Python environment: ```bash python -c "import executorch; print('Executorch build and install successfully.')" ``` +If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels. 
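+As an optional sanity check (an addition to the original steps, not part of the guide), you can also confirm which interpreter is active and whether the executorch package is visible to pip. This assumes install_executorch.sh performed a pip install of the executorch package into the virtual environment, which is its default behavior:
+
+```bash
+# Optional check: confirm the active interpreter and the installed executorch package.
+which python          # should resolve to a path inside pyenv/bin
+pip show executorch   # prints name, version, and install location if the package is registered
+```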
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md index 0b0386694..dcd8ed07e 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md @@ -1,5 +1,5 @@ --- -title: Cross-Compile ExecuTorch for the Aarch64 platform +title: Cross-Compile ExecuTorch for the AArch64 platform weight: 3

### FIXED, DO NOT MODIFY

@@ -7,13 +7,20 @@ layout: learningpathall ---

-This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled. -All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc). +In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled. +Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
+### Install the Cross-Compilation Toolchain
+On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake:
+```bash
+sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y
+```

### Run CMake Configuration
-Use CMake to configure the ExecuTorch build for Aarch64. The example below enables key extensions, developer tools, and XNNPACK with KleidiAI acceleration: +Use CMake to configure the ExecuTorch build for the AArch64 target.
+
+The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI.

```bash

@@ -61,18 +68,19 @@ cmake -GNinja \

### Build ExecuTorch
+Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:

```bash cmake --build . -j$(nproc) - ```
+CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.

-If the build completes successfully, you should find the executor_runner binary under the directory: +### Locate the executor_runner Binary
+If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:

-```bash +```output build-arm64/executor_runner - ``` - +You will use executor_runner in later sections on your Arm64 target as a standalone binary to execute and profile ExecuTorch models directly from the command line. This binary can be used to run ExecuTorch models on the ARM64 target device using the XNNPACK backend with KleidiAI acceleration. 
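+Before copying anything to the device, you can optionally confirm that the runner was really built for AArch64 rather than for your x86_64 host. This quick check uses the standard file utility and is an addition to the original steps:
+
+```bash
+# Optional: inspect the target architecture of the cross-compiled binary on the host.
+# Expect the output to mention an ELF 64-bit executable for ARM aarch64.
+file build-arm64/executor_runner
+```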
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md index 7bb5ffffd..5f8cac6fd 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md @@ -5,9 +5,9 @@ weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -ExecuTorch uses XNNPACK as its primary CPU backend for operator execution and performance optimization. +ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers. -Within this architecture, only a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms. +Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms. These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models. diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md index 7be11240d..7683bbdd4 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md @@ -6,14 +6,16 @@ weight: 5 layout: learningpathall --- -In the previous section, we discussed that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants. +In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants. -To evaluate the performance of these variants across different hardware platforms, we will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis. +To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis. +These models will be used later with executor_runner to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels. -### Fully connected benchmark model +### Define a Simple Linear Benchmark Model -In the following example model, we use simple model to generate nodes that can be accelerated by Kleidiai. +The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer. +This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels. By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models. 
@@ -34,8 +36,9 @@ class DemoLinearModel(torch.nn.Module): return (torch.randn(1, 256, dtype=dtype),) ```

+This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants.

-### Export FP16/FP32 model for pf16_gemm/pf32_gemm Variants +### Export FP16/FP32 model for pf16_gemm and pf32_gemm

| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |

@@ -86,7 +89,8 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm") ```

-### Export int8 quantized model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variant +### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm
+INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy.

| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | | pqs8_qc8w_gemm | Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |

-The following code demonstrates how to quantized a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variant to accelerate computation: +The following code demonstrates how to quantize a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variants to accelerate computation:

```python

@@ -148,7 +152,9 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm"); ```

-### Export int4 quantized model for qp8_f32_qb4w_gemm variant +### Export INT4 Quantized Model for qp8_f32_qb4w_gemm
+This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels.
+
| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | | qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 | @@ -200,17 +206,26 @@ def export_int4_quantize_model(dynamic: bool, model_name: str): etrecord.save(etr_file) export_int4_quantize_model(False,"linear_model_qp8_f32_qb4w_gemm"); - - ```
-**NOTE:**
-
+{{%notice Note%}} When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file. These ETRecord files are essential for subsequent model inspection and performance analysis using the ExecuTorch Inspector API. +{{%/notice%}}

-After running this script, both the PTE model file and the etrecord file are generated. +### Run the Complete Benchmark Model Export Script
+Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4). +This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format. 
+
+```bash
+wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/arm-learning-paths/refs/heads/main/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-linear-model.py
+chmod +x export-linear-model.py
+python3 ./export-linear-model.py
+```
+
+### Verify the Generated Files
+After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory: ``` bash $ ls model/ -1 @@ -225,5 +240,4 @@ linear_model_qp8_f32_qb4w_gemm.pte linear_model_qp8_f32_qc8w_gemm.etrecord linear_model_qp8_f32_qc8w_gemm.pte ``` - -The complete source code is available [here](../export-linear-model.py). +At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels. diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md index 685a7ce39..666ac7285 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md @@ -6,7 +6,7 @@ weight: 6 layout: learningpathall ---

-In the previous section, we discussed that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels. +In the previous section, you saw that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.

| XNNPACK GEMM Variant | Input DataType| Filter DataType | Output DataType | @@ -14,14 +14,14 @@ In the previous section, we discussed that both INT8-quantized Conv2d and pointw | pqs8_qc8w_gemm | Asymmetric INT8 quantization(NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization(NHWC) | | pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 |

-To evaluate the performance of Conv2d operators across multiple hardware platforms, we create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis. +To evaluate the performance of Conv2d operators across multiple hardware platforms, you will create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis.

-### INT8-quantized Conv2d benchmark model +### INT8-Quantized Conv2d benchmark model

The following example defines a simple model to generate INT8-quantized Conv2d nodes that can be accelerated by KleidiAI.

-By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models. +By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models.

```python @@ -100,7 +100,7 @@ export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm");

### PointwiseConv2d benchmark model

-In the following example model, we use simple model to generate pointwise Conv2d nodes that can be accelerated by Kleidiai. +In the following example model, you will use a simple model to generate pointwise Conv2d nodes that can be accelerated by KleidiAI. 
As before, input parameters can be adjusted to simulate real-world model behavior. @@ -158,10 +158,21 @@ export_pointwise_model("pointwise_conv2d_pf32_gemm") ``` -**NOTES:** - +{{%notice Note%}} When exporting models, the generate_etrecord option is enabled to produce the .etrecord file alongside the .pte model file. These ETRecord files are essential for subsequent model analysis and performance evaluation. +{{%/notice%}} + + +### Run the Complete Benchmark Model Script +Rather than executing each block by hand, download and run the full export script. It will generate both Conv2d variants, run quantization (INT8) where applicable, partition to XNNPACK, lower, and export to ExecuTorch .pte together with .etrecord metadata. + +```bash +wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py +chmod +x export-conv2d.py +python3 ./export-conv2d.py +``` +### Validate Outputs After running this script, both the PTE model file and the etrecord file are generated. @@ -173,4 +184,3 @@ pointwise_conv2d_pf32_gemm.etrecord pointwise_conv2d_pf32_gemm.pte ``` -The complete source code is available [here](../export-conv2d.py). diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md index 901e2a888..bc23b68a1 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md @@ -6,9 +6,9 @@ weight: 7 layout: learningpathall --- -In the previous section, we discussed that the Batch Matrix Multiply operator supports multiple GEMM (General Matrix Multiplication) variants. +The Batch Matrix Multiply operator (torch.bmm) under XNNPACK lowers to GEMM and, when shapes and dtypes match supported patterns, can dispatch to KleidiAI micro-kernels on Arm. -To evaluate the performance of these variants across different hardware platforms, we construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis. +To evaluate the performance of these variants across different hardware platforms, you will construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis. ### Matrix multiply benchmark model @@ -72,11 +72,22 @@ export_mutrix_mul_model(torch.float32,"matrix_mul_pf32_gemm") ``` -**NOTE:** - +{{%notice Note%}} When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file. These ETRecord files are essential for subsequent model analysis and performance evaluation. +{{%/notice%}} + +### Run the Complete Benchmark Model Script +Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script. +This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation. 
+
+```bash
+wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-matrix-mul.py
+chmod +x export-matrix-mul.py
+python3 ./export-matrix-mul.py
+```
+### Verify the Output After running this script, both the PTE model file and the etrecord file are generated. @@ -87,5 +98,6 @@ model/matrix_mul_pf16_gemm.pte model/matrix_mul_pf32_gemm.etrecord model/matrix_mul_pf32_gemm.pte ``` +These files are the inputs for upcoming executor_runner benchmarks, where you’ll measure and compare KleidiAI micro-kernel performance. The complete source code is available [here](../export-matrix-mul.py). diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md index 2831a1cd9..e3f65b19f 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md @@ -1,25 +1,35 @@ --- -title: Run model and generate the etdump +title: Run model and generate the ETDump weight: 8 ### FIXED, DO NOT MODIFY layout: learningpathall ---

-After generating the model, we can now run it on an ARM64 platform using the following command: +### Copy artifacts to your Arm64 target
+From your x86_64 host (where you cross-compiled), copy the runner and exported models to the Arm device:
+
+```bash
+scp $WORKSPACE/build-arm64/executor_runner <user>@<target-ip>:~/bench/
+scp -r model/ <user>@<target-ip>:~/bench/
+```
+
+### Run a model and emit ETDump
+Use one of the models you exported earlier (e.g., FP32 linear: linear_model_pf32_gemm.pte). +The flags below tell executor_runner where to write the ETDump and how many times to execute.

```bash -cd $WORKSPACE -/build-arm64/executor_runner -etdump_path model/linear_model_f32.etdump -model_path model/linear_model_f32.pte -num_executions=1 -cpu_threads 1 +cd ~/bench +./executor_runner -etdump_path model/linear_model_pf32_gemm.etdump -model_path model/linear_model_pf32_gemm.pte -num_executions=1 -cpu_threads 1 ```

You can adjust the number of execution threads and the number of times the model is invoked.

-You should see output similar to the example below. +You should see logs like:

-```bash +```output D 00:00:00.015988 executorch:XNNPACKBackend.cpp:57] Creating XNN workspace D 00:00:00.018719 executorch:XNNPACKBackend.cpp:69] Created XNN workspace: 0xaff21c2323e0 D 00:00:00.027595 executorch:operator_registry.cpp:96] Successfully registered all kernels from shared library: NOT_SUPPORTED @@ -42,6 +52,6 @@ OutputX 0: tensor(sizes=[1, 256], [ I 00:00:00.093912 executorch:executor_runner.cpp:125] ETDump written to file 'model/linear_model_f32.etdump'. ``` +If execution succeeds, an ETDump file is created next to your model. You will load the .etdump in the next section and analyze which operators dispatched to KleidiAI and how each micro-kernel performed. -If the execution is successful, an etdump file will also be generated. 
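+If you want to profile every exported model in one pass, a small shell loop such as the sketch below reuses the same flags shown above. The loop itself is an illustrative addition (not part of the original guide) and assumes all .pte files were copied into ~/bench/model/ on the target:
+
+```bash
+# Optional: run each exported model once and write a matching ETDump next to it.
+cd ~/bench
+for pte in model/*.pte; do
+    ./executor_runner -model_path "$pte" -etdump_path "${pte%.pte}.etdump" -num_executions=1 -cpu_threads 1
+done
+```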
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md index d5e684553..c0f171454 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md @@ -6,11 +6,13 @@ weight: 9 layout: learningpathall ---

-In the final step, we create an Inspector instance by providing the paths to the generated ETDump and ETRecord. +You will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and, when eligible, accelerated by KleidiAI micro-kernels.
+
The Inspector analyzes the runtime data from the ETDump file and maps it to the corresponding operators in the Edge Dialect Graph.

+### Inspector Script

-To visualize all runtime events in a tabular format, simply call: +Save the following code in a file named `inspect.py` and run it with the path to a .pte model. The script automatically derives the matching .etrecord, .etdump, and output .csv paths next to the model.

```python @@ -38,6 +40,14 @@ with open(csvfile, "w", encoding="utf-8") as f: ```

+### Run the Script
+
+Run the script, for example with the linear_model_pf32_gemm.pte model:
+
+```bash
+python3 inspect.py model/linear_model_pf32_gemm.pte
+```
+
Next, you can examine the generated CSV file to view the execution time information for each node in the model.

Below is an example showing the runtime data corresponding to the Fully Connected node.

@@ -51,5 +61,6 @@ Below is an example showing the runtime data corresponding to the Fully Connecte | Execute | DELEGATE_CALL | 0.04136 | 0.04464 | 0.04792 | 0.046082053 | 0.03372 | 4.390585 | ['aten.linear.default'] | FALSE | XnnpackBackend | | Execute | Method::execute | 0.04848 | 0.0525595 | 0.05756 | 0.0540658046 | 0.03944 | 4.404385 | [] | FALSE | |

+You can now iterate across the FP32, FP16, INT8, and INT4 models, confirm the exact GEMM variant used, and quantify the latency savings attributable to KleidiAI micro-kernels on your Arm device.

-You can experiment with different models and matrix sizes to obtain various performance results. +You can experiment with different models and matrix sizes to analyze various performance results. diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md index 6f7bffc8c..14b6f0776 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md @@ -1,5 +1,5 @@ --- -title: How to Benchmark a Single KleidiAI Micro-kernel in ExecuTorch +title: How to Benchmark a KleidiAI Micro-kernel in ExecuTorch draft: true cascade: @@ -7,17 +7,17 @@ cascade: minutes_to_complete: 30

-who_is_this_for: This article is intended for advanced developers who want to leverage KleidiAI to accelerate ExecuTorch model inference on the AArch64 platform. 
+who_is_this_for: This is an advanced topic intended for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 (AArch64) platforms supporting SME/SME2 instructions. learning_objectives: - - Cross-compile ExecuTorch for the ARM64 platform with XNNPACK and KleidiAI enabled, including SME/SME2 support. + - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions. - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions. - - Use the `executor_runner` tool to collect ETDump profiling data. - - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API. + - Use the executor_runner tool to run kernel workloads and collect ETDump profiling data. + - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior. prerequisites: - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space. - - An Arm64 target system with support for SME or SME2. + - An Arm64 target system with support for SME or SME2. Refer to [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support). author: Qixiang Xu @@ -26,13 +26,12 @@ skilllevels: Advanced subjects: ML armips: - Cortex-A - - SME - - Kleidai tools_software_languages: - Python - - cmake + - ExecuTorch - XNNPACK + - KleidiAI operatingsystems: - Linux diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py index b976be70c..0e1765436 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py @@ -74,7 +74,7 @@ def export_int8_quantize_conv2d_model(model_name: str): etrecord = et_program.get_etrecord() etrecord.save(etr_file) -export_int8_quantize_depthwise_model("qint8_conv2d_pqs8_qc8w_gemm"); +export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm");