diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md index 9a0ec2c0b..ed041afd5 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md @@ -1,5 +1,5 @@ --- -title: Environment setup +title: Set up your environment weight: 2 ### FIXED, DO NOT MODIFY @@ -7,10 +7,9 @@ layout: learningpathall --- -### Python Environment Setup +## Set up your Python environment -Before building ExecuTorch, it is highly recommended to create an isolated Python environment. -This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs. +Before building ExecuTorch, it is highly recommended to create an isolated Python environment. This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs: ```bash sudo apt update @@ -19,11 +18,11 @@ python3 -m venv pyenv source pyenv/bin/activate ``` -Once activated, all subsequent steps should be executed within this Python virtual environment. +Keep your Python virtual environment activated while you complete the next steps. This ensures all dependencies install in the correct location. -### Download the ExecuTorch Source Code +## Download the ExecuTorch source code -Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched. +Clone the ExecuTorch repository from GitHub. 
The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched: ```bash export WORKSPACE=$HOME @@ -33,13 +32,12 @@ git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.g ``` {{% notice Note %}} - The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases. + The instructions in this Learning Path were tested on ExecuTorch v1.0.0. Commands or configuration options might differ in later releases. {{% /notice %}} -### Build and Install the ExecuTorch Python Components +## Build and install the ExecuTorch Python components -Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment. -This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling. +Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment. This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling. Run the following command from your ExecuTorch workspace: ```bash @@ -47,13 +45,14 @@ cd $WORKSPACE/executorch CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh ``` -This will build ExecuTorch and its dependencies using cmake, enabling optional developer utilities such as ETDump and Inspector. +This builds ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector. + +## Verify the installation +After the build completes, check that ExecuTorch is installed in your active Python environment. 
Run the following command: -### Verify the Installation -After the build completes successfully, verify that ExecuTorch was installed into your current Python environment: ```bash python -c "import executorch; print('Executorch build and install successfully.')" ``` -If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels. +If you see the success message, your environment is ready. You can now move on to cross-compiling and preparing to profile KleidiAI micro-kernels. diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md index dcd8ed07e..b1d3c5a78 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md @@ -6,21 +6,21 @@ weight: 3 layout: learningpathall --- +## Overview -In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled. -Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine. +In this section, you'll cross-compile ExecuTorch for an Arm64 (AArch64) target with XNNPACK and KleidiAI support. Cross-compiling builds all binaries and libraries for your Arm device, even if your development system uses x86_64. This process lets you run and test ExecuTorch on Arm hardware, taking advantage of Arm-optimized performance features. 
-### Install the Cross-Compilation Toolchain -On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake: +## Install the cross-compilation toolchain +On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, which is a fast build backend commonly used by CMake: ```bash sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y ``` -### Run CMake Configuration +## Run CMake configuration Use CMake to configure the ExecuTorch build for the AArch64 target. -The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI. +The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI: ```bash @@ -53,21 +53,21 @@ cmake -GNinja \ ``` -#### Key Build Options +## Key Build Options | **CMake Option** | **Description** | | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `EXECUTORCH_BUILD_XNNPACK` | Builds the **XNNPACK backend**, which provides highly optimized CPU operators (GEMM, convolution, etc.) for Arm64 platforms. | -| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables **Arm KleidiAI** acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. | -| `EXECUTORCH_BUILD_DEVTOOLS` | Builds **developer tools** such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. | -| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the **Module API** extension, which provides a high-level abstraction for model loading and execution using `Module` objects. 
| -| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the **Tensor API** extension, providing convenience functions for creating, manipulating, and managing tensors in C++ runtime. | -| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building **optimized kernel implementations** for better performance on supported architectures. | -| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the **event tracing** feature, which records performance and operator timing information for runtime analysis. | +| `EXECUTORCH_BUILD_XNNPACK` | Builds the XNNPACK backend, which provides highly optimized CPU operators (such as GEMM and convolution) for Arm64 platforms. | +| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables Arm KleidiAI acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. | +| `EXECUTORCH_BUILD_DEVTOOLS` | Builds developer tools such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. | +| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the Module API extension, which provides a high-level abstraction for model loading and execution using `Module` objects. | +| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the Tensor API extension, providing convenience functions for creating, manipulating, and managing tensors in C++ runtime. | +| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building optimized kernel implementations for better performance on supported architectures. | +| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the event tracing feature, which records performance and operator timing information for runtime analysis. | -### Build ExecuTorch +## Build ExecuTorch Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools: ```bash @@ -75,12 +75,10 @@ cmake --build . -j$(nproc) ``` CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target. 
-### Locate the executor_runner Binary +## Locate the executor_runner binary If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under: ```output build-arm64/executor_runner -``` -You will use executor_runner in the later sections on your Arm64 target as standalone binary used to execute and profile ExecuTorch models directly from the command line. -This binary can be used to run ExecuTorch models on the ARM64 target device using the XNNPACK backend with KleidiAI acceleration. +You’ll use `executor_runner` in later sections to execute and profile ExecuTorch models directly from the command line on your Arm64 target. This standalone binary lets you run models using the XNNPACK backend with KleidiAI acceleration, making it easy to benchmark and analyze performance on Arm devices. diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md index 5f8cac6fd..6a97f9d58 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md @@ -1,19 +1,19 @@ --- -title: KleidiAI micro-kernels support in ExecuTorch +title: Accelerate ExecuTorch operators with KleidiAI micro-kernels weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers. 
+## Understand how KleidiAI micro-kernels integrate with ExecuTorch -Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms. +ExecuTorch uses XNNPACK as its main CPU backend to run and optimize operators like convolutions, matrix multiplications, and fully connected layers. -These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models. +KleidiAI SME (Scalable Matrix Extension) micro-kernels are integrated into XNNPACK to boost performance on supported Arm platforms. These micro-kernels accelerate operators that use specific data types and quantization settings in ExecuTorch models. -When an operator matches one of the supported configurations, ExecuTorch automatically dispatches it through the KleidiAI-optimized path. +When an operator matches a supported configuration, ExecuTorch automatically uses the KleidiAI-optimized path for faster execution. If an operator is not supported by KleidiAI, ExecuTorch falls back to the standard XNNPACK implementation. This ensures your models always run correctly, even if they do not use KleidiAI acceleration. -Operators that are not covered by KleidiAI fall back to the standard XNNPACK implementations during inference, ensuring functional correctness across all models. +## Operators that can benefit from KleidiAI acceleration In ExecuTorch v1.0.0, the following operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration: - XNNFullyConnected – Fully connected (dense) layers @@ -23,15 +23,15 @@ In ExecuTorch v1.0.0, the following operator types are implemented through the X However, not all instances of these operators are accelerated by KleidiAI. 
Acceleration eligibility depends on several operator attributes and backend support, including: -- Data types (e.g., float32, int8, int4) -- Quantization schemes (e.g., symmetric/asymmetric, per-tensor/per-channel) +- Data types (for example, float32, int8, int4) +- Quantization schemes (for example, symmetric/asymmetric, per-tensor/per-channel) - Tensor memory layout and alignment - Kernel dimensions and stride settings The following section provides detailed information on which operator configurations can benefit from KleidiAI acceleration, along with their corresponding data type and quantization support. -### XNNFullyConnected +## XNNFullyConnected | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | @@ -42,14 +42,14 @@ The following section provides detailed information on which operator configurat | qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 | -### XNNConv2d +## XNNConv2d | XNNPACK GEMM Variant | Input DataType| Filter DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | | pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 | | pqs8_qc8w_gemm | Asymmetric INT8 quantization (NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization(NHWC) | -### XNNBatchMatrixMultiply +## XNNBatchMatrixMultiply | XNNPACK GEMM Variant | Input A DataType| Input B DataType |Output DataType | | ------------------ | ---------------------------- | --------------------------------------- |--------------------------------------- | | pf32_gemm | FP32 | FP32 | FP32 | diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md 
b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md index 7683bbdd4..dc26e235e 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md @@ -6,18 +6,16 @@ weight: 5 layout: learningpathall --- +## Overview + In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants. To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis. These models will be used later with executor_runner to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels. -### Define a Simple Linear Benchmark Model - -The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer. -This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels. - -By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models. +## Define a linear benchmark model with PyTorch for ExecuTorch +This step can be confusing at first, but building a minimal model helps you focus on the core operator performance. You’ll be able to quickly test different GEMM implementations and see how each one performs on Arm-based hardware. If you run into errors, check that your PyTorch and ExecuTorch versions are up to date and that you’re using the correct data types for your target GEMM variant. By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models. 
```python @@ -38,7 +36,7 @@ class DemoLinearModel(torch.nn.Module): ``` This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants. -### Export FP16/FP32 model for pf16_gemm and pf32_gemm +### Export FP16 and FP32 models for pf16_gemm and pf32_gemm variants | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | | ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- | @@ -89,7 +87,7 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm") ``` -### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm +### Export INT8 quantized models for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variants INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy. | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | @@ -152,7 +150,7 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm"); ``` -### Export INT4 quantized model for qp8_f32_qb4w_gemm +## Export INT4 quantized model for qp8_f32_qb4w_gemm variant This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels. | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType | @@ -214,7 +212,7 @@ These ETRecord files are essential for subsequent model inspection and performan {{%/notice%}} -### Run the Complete Benchmark Model Export Script +## Run the benchmark model export script for ExecuTorch Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4). This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format. 
@@ -224,7 +222,7 @@ chmod +x export-linear-model.py python3 ./export-linear-model.py ``` -### Verify the Generated Files +## Verify exported ExecuTorch and KleidiAI model files After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory: ``` bash @@ -240,4 +238,4 @@ linear_model_qp8_f32_qb4w_gemm.pte linear_model_qp8_f32_qc8w_gemm.etrecord linear_model_qp8_f32_qc8w_gemm.pte ``` -At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels. +Great job! You now have a complete set of benchmark models exported for multiple GEMM variants and quantization levels. You’re ready to move on and measure performance using ExecuTorch and KleidiAI micro-kernels on Arm-based hardware. diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md index 666ac7285..ec3bdaabc 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md @@ -6,6 +6,8 @@ weight: 6 layout: learningpathall --- +## Understand Conv2d benchmark variants and KleidiAI acceleration + In the previous section, you saw that that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels. 
@@ -17,7 +19,7 @@ In the previous section, you saw that that both INT8-quantized Conv2d and pointw To evaluate the performance of Conv2d operators across multiple hardware platforms, you will create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis. -### INT8-Quantized Conv2d benchmark model +## Create an INT8-quantized Conv2d benchmark model with KleidiAI The following example defines a simple model to generate INT8-quantized Conv2d nodes that can be accelerated by KleidiAI. @@ -98,7 +100,7 @@ export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm"); ``` -### PointwiseConv2d benchmark model +## Create a PointwiseConv2d benchmark model with KleidiAI In the following example model, you will use simple model to generate pointwise Conv2d nodes that can be accelerated by Kleidiai. @@ -164,7 +166,7 @@ These ETRecord files are essential for subsequent model analysis and performance {{%/notice%}} -### Run the Complete Benchmark Model Script +## Run the benchmark model export script for ExecuTorch and KleidiAI Rather than executing each block by hand, download and run the full export script. It will generate both Conv2d variants, run quantization (INT8) where applicable, partition to XNNPACK, lower, and export to ExecuTorch .pte together with .etrecord metadata. ```bash @@ -172,7 +174,7 @@ wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/head chmod +x export-conv2d.py python3 ./export-conv2d.py ``` -### Validate Outputs +## Validate exported model files for ExecuTorch and KleidiAI After running this script, both the PTE model file and the etrecord file are generated. 
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md index bc23b68a1..925d1fd41 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md @@ -6,16 +6,18 @@ weight: 7 layout: learningpathall --- -The Batch Matrix Multiply operator (torch.bmm) under XNNPACK lowers to GEMM and, when shapes and dtypes match supported patterns, can dispatch to KleidiAI micro-kernels on Arm. +## Learn how batch matrix multiply accelerates deep learning on Arm -To evaluate the performance of these variants across different hardware platforms, you will construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis. +The batch matrix multiply operator (`torch.bmm`) is commonly used for efficient matrix operations in deep learning models. When running on Arm systems with XNNPACK, this operator is lowered to a general matrix multiplication (GEMM) implementation. If your input shapes and data types match supported patterns, XNNPACK can automatically dispatch these operations to KleidiAI micro-kernels, which are optimized for Arm hardware. +To compare the performance of different GEMM variants on various Arm platforms, you'll build a set of benchmark models. These models use the batch matrix multiply operator and allow you to evaluate how each GEMM implementation performs, helping you identify the best configuration for your workload. 
-### Matrix multiply benchmark model + +## Define a matrix multiply benchmark model for KleidiAI and ExecuTorch The following example defines a simple model to generate nodes that can be accelerated by KleidiAI. -By adjusting the input parameters, this model can also simulate the behavior of nodes commonly found in real-world models. +By adjusting the input parameters, this model can also simulate the behavior of nodes commonly found in real-world models: ```python @@ -28,7 +30,7 @@ class DemoBatchMatMulModel(nn.Module): ``` -### Export FP16/FP32 model for pf16_gemm/pf32_gemm variant +## Export FP16 and FP32 models for pf16_gemm and pf32_gemm variants | XNNPACK GEMM Variant | Input A DataType| Input B DataType |Output DataType | | ------------------ | ---------------------------- | --------------------------------------- |--------------------------------------- | @@ -77,9 +79,9 @@ When exporting models, the **generate_etrecord** option is enabled to produce th These ETRecord files are essential for subsequent model analysis and performance evaluation. {{%/notice%}} -### Run the Complete Benchmark Model Script +## Run the complete benchmark model script Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script. -This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation. 
+This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation: ```bash wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-matrix-mul.py @@ -87,7 +89,7 @@ chmod +x export-matrix-mul.py python3 ./export-matrix-mul.py ``` -### Verify the output +## Verify the output After running this script, both the PTE model file and the etrecord file are generated. diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md index e3f65b19f..63bde34ff 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md @@ -6,7 +6,7 @@ weight: 8 layout: learningpathall --- -### Copy artifacts to your Arm64 target +## Copy artifacts to your Arm64 target From your x86_64 host (where you cross-compiled), copy the runner and exported models to the Arm device: ```bash @@ -14,7 +14,7 @@ scp $WORKSPACE/build-arm64/executor_runner @:~/bench/ scp -r model/ @:~/bench/ ``` -### Run a model and emit ETDump +## Run a model and emit ETDump Use one of the models you exported earlier (e.g., FP32 linear: linear_model_pf32_gemm.pte). The flags below tell executor_runner where to write the ETDump and how many times to execute. 
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md index c0f171454..2014a30dc 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md @@ -1,16 +1,18 @@ --- -title: Analyzing ETRecord and ETDump +title: Analyze ETRecord and ETDump weight: 9 ### FIXED, DO NOT MODIFY layout: learningpathall --- -You will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and when eligible it was accelerated by KleidiAI micro-kernels. +## Overview + +In this section you will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and when eligible it was accelerated by KleidiAI micro-kernels. The Inspector analyzes the runtime data from the ETDump file and maps it to the corresponding operators in the Edge Dialect Graph. -### Inspector script +## Analyze ETDump and ETRecord files with the Inspector script Save the following code in a file named `inspect.py` and run it with the path to a .pte model. The script auto-derives .etrecord, .etdump, and an output .csv next to it. 
@@ -40,7 +42,7 @@ with open(csvfile, "w", encoding="utf-8") as f: ``` -### Run the script +## Run the Inspector script and review performance results Run the script, for example with the linear_model_pf32_gemm.pte model : diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md index 14b6f0776..5cbda2705 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md @@ -1,23 +1,19 @@ --- -title: How to Benchmark a KleidiAI Micro-kernel in ExecuTorch - -draft: true -cascade: - draft: true +title: Benchmark a KleidiAI micro-kernel in ExecuTorch minutes_to_complete: 30 -who_is_this_for: This is an advanced topic intended for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 (AArch64) platforms supporting SME/SME2 instructions. +who_is_this_for: This is an advanced topic for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 platforms supporting SME/SME2 instructions. learning_objectives: - - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions. - - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions. 
+ - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions + - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions - Use the executor_runner tool to run kernel workloads and collect ETDump profiling data. - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior. prerequisites: - - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space. - - An Arm64 target system with support for SME or SME2. Refer to [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support). + - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space + - An Arm64 target system with support for SME or SME2 - see the Learning Path [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support) author: Qixiang Xu diff --git a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/_index.md b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/_index.md index 5b4f2f0c0..8a1292870 100644 --- a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/_index.md @@ -1,19 +1,15 @@ --- -title: Deploy TypeScript on Google Cloud C4A (Arm-based Axion VMs) - -draft: true -cascade: - draft: true +title: Deploy TypeScript on Google Cloud C4A virtual machines minutes_to_complete: 30 -who_is_this_for: This is an introductory topic for software developers deploying and optimizing TypeScript workloads on Arm64 Linux environments, specifically using Google Cloud C4A virtual machines powered by Axion processors. 
+who_is_this_for: This is an introductory topic for developers deploying and optimizing TypeScript workloads on Arm64 Linux environments, specifically using Google Cloud C4A virtual machines powered by Axion processors. learning_objectives: - - Provision an Arm-based SUSE SLES virtual machine on Google Cloud (C4A with Axion processors) - - Install TypeScript on a SUSE Arm64 (C4A) instance - - Validate TypeScript functionality by creating, compiling, and running a simple TypeScript script on the Arm64 VM - - Benchmark TypeScript performance using a JMH-style custom benchmark with perf_hooks on Arm64 architecture + - Provision an Arm-based SUSE Linux Enterprise Server (SLES) virtual machine (VM) on Google Cloud + - Install TypeScript on a SUSE Arm64 C4A instance + - Validate TypeScript functionality by creating, compiling, and running a simple TypeScript script on an Arm64 VM + - Benchmark TypeScript performance using a JMH-style custom benchmark with the perf_hooks module on Arm64 architecture prerequisites: - A [Google Cloud Platform (GCP)](https://cloud.google.com/free) account with billing enabled diff --git a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/background.md b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/background.md index 8c07d0201..e4b887666 100644 --- a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/background.md +++ b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/background.md @@ -1,23 +1,28 @@ --- -title: Getting started with TypeScript on Google Axion C4A (Arm Neoverse-V2) +title: Get started with TypeScript on Google Axion C4A instances weight: 2 layout: "learningpathall" --- +## Introduction + +In this Learning Path, you'll deploy and benchmark TypeScript applications on Arm-based Google Cloud C4A instances powered by Axion processors.
You'll provision a SUSE Linux Enterprise Server (SLES) virtual machine (VM), install and configure TypeScript, and measure performance using a JMH-style custom benchmark. This process shows you how to use TypeScript with Arm-based cloud infrastructure and helps you evaluate performance and compatibility for cloud-native workloads. + + ## Google Axion C4A Arm instances in Google Cloud -Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse-V2 cores. Designed for high-performance and energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications. +Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse-V2 cores. Designed for high-performance and energy-efficient computing, they offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications. -The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud. +The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture on Google Cloud. -To learn more about Google Axion, refer to the [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu) blog. +To learn more about Google Axion, see the Google blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu). ## TypeScript -TypeScript is an open-source, strongly typed programming language developed and maintained by Microsoft. 
+TypeScript is an open-source, strongly typed programming language developed and maintained by Microsoft. -It is a superset of JavaScript, which means all valid JavaScript code is also valid TypeScript, but TypeScript adds static typing, interfaces, and advanced tooling to help developers write more reliable and maintainable code. +TypeScript builds on JavaScript by adding features like static typing and interfaces. Any valid JavaScript code works in TypeScript, but TypeScript gives you extra tools to write code that is easier to maintain and less prone to errors. -TypeScript is widely used for web applications, server-side development (Node.js), and large-scale JavaScript projects** where type safety and code quality are important. Learn more from the [TypeScript official website](https://www.typescriptlang.org/) and its [handbook and documentation](https://www.typescriptlang.org/docs/). +TypeScript is widely used for web applications, server-side development (Node.js), and large-scale JavaScript projects where type safety and code quality are important. Learn more by visiting the [TypeScript official website](https://www.typescriptlang.org/) and the [TypeScript handbook and documentation](https://www.typescriptlang.org/docs/).
diff --git a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/baseline.md b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/baseline.md index dbc57adcd..a3b94d7b1 100644 --- a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/baseline.md +++ b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/baseline.md @@ -1,19 +1,17 @@ --- -title: TypeScript Baseline Testing on Google Axion C4A Arm Virtual Machine +title: Establish a TypeScript performance baseline weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Baseline Setup for TypeScript -This section walks you through the baseline setup and validation of TypeScript on a Google Cloud C4A (Axion Arm64) virtual machine running SUSE Linux. -The goal is to confirm that your TypeScript environment is functioning correctly, from initializing a project to compiling and executing a simple TypeScript file, ensuring a solid foundation before performance or benchmarking steps. +## Overview +This section walks you through the baseline setup and validation of TypeScript on a Google Cloud C4A (Axion Arm64) virtual machine running SUSE Linux. The goal is to confirm that your TypeScript environment is functioning correctly, from initializing a project to compiling and executing a simple TypeScript file, ensuring a solid foundation before performance or benchmarking steps. -### Set Up a TypeScript Project -Before running any tests, you’ll create a dedicated project directory and initialize a minimal TypeScript environment. +## Create project folder -1. Create project folder +Before running any tests, you’ll create a dedicated project directory and initialize a minimal TypeScript environment. 
Start by creating a new folder to hold your TypeScript project files: @@ -23,15 +21,15 @@ cd ~/typescript-benchmark ``` This creates a workspace named `typescript-benchmark` in your home directory, ensuring all TypeScript configuration and source files are organized separately from system files and global modules. -2. Initialize npm project +## Initialize npm project -Next, initialize a new Node.js project. This creates a `package.json` file that defines your project metadata, dependencies, and scripts. +Next, initialize a new Node.js project. This creates a `package.json` file that defines your project metadata, dependencies, and scripts: ```console npm init -y ``` -3. Install Node.js type definitions +## Install Node.js type definitions To enable TypeScript to properly recognize Node.js built-in APIs (like fs, path, and process), install the Node.js type definitions package: @@ -55,10 +53,10 @@ You should see output similar to: } ``` -### Baseline Testing +## Perform baseline testing With the TypeScript environment configured, you’ll now perform a baseline functionality test to confirm that TypeScript compilation and execution work correctly on your Google Cloud SUSE Arm64 VM. -1. Create a Simple TypeScript File +## Create a simple TypeScript file Create a file named `hello.ts` with the following content: @@ -71,7 +69,7 @@ console.log(greet("GCP SUSE ARM64")); ``` This simple function demonstrates TypeScript syntax, type annotations, and basic console output. -2. Compile TypeScript +## Compile TypeScript Use the TypeScript compiler (tsc) to transpile the .ts file into JavaScript: @@ -80,7 +78,7 @@ tsc hello.ts ``` This generates a new file named `hello.js` in the same directory. -3. Run compiled JavaScript +## Run compiled JavaScript Now, execute the compiled JavaScript using Node.js. 
This step verifies that: diff --git a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/benchmarking.md index 567f95984..cdc7c0252 100644 --- a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/benchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/benchmarking.md @@ -1,5 +1,5 @@ --- -title: TypeScript Benchmarking +title: Benchmark TypeScript performance weight: 6 ### FIXED, DO NOT MODIFY @@ -7,12 +7,12 @@ layout: learningpathall --- -## JMH-style Custom Benchmarking +## Create a custom JMH-style benchmark for TypeScript on Arm This section demonstrates how to benchmark TypeScript functions using a JMH-style (Java Microbenchmark Harness) methodology implemented with Node.js's built-in `perf_hooks` module. Unlike basic `console.time()` measurements, this approach executes multiple iterations, computes the average runtime, and produces stable and repeatable performance data, useful for evaluating workloads on your Google Cloud C4A (Axion Arm64) VM running SUSE Linux. -### Create the Benchmark Script +## Implement benchmarking with Node.js perf_hooks on Arm Create a file named `benchmark_jmh.ts` inside your project directory with the content below: ```typescript @@ -56,7 +56,7 @@ Code explanation: This JMH-style benchmarking approach provides more accurate and repeatable performance metrics than a single execution, making it ideal for performance testing on Arm-based systems. -### Compile the TypeScript Benchmark +## Compile the TypeScript benchmark First, compile the benchmark file from TypeScript to JavaScript using the TypeScript compiler (tsc): ```console @@ -65,7 +65,7 @@ tsc benchmark_jmh.ts This command transpiles your TypeScript code into standard JavaScript, generating a file named `benchmark_jmh.js` in the same directory.
The resulting JavaScript can be executed by Node.js, allowing you to measure performance on your Google Cloud C4A (Arm64) virtual machine. -### Run the Benchmark +## Run the benchmark Now, execute the compiled JavaScript file with Node.js: ```console @@ -87,33 +87,24 @@ Iteration 10: 0.673 ms Average execution time over 10 iterations: 0.888 ms ``` +## Interpret your TypeScript performance data -### Benchmark Metrics Explained +Each iteration measures how long it takes to run the benchmarked function once, while the average execution time is calculated by dividing the total time for all runs by the number of iterations. Running the benchmark multiple times helps smooth out fluctuations caused by factors like CPU scheduling, garbage collection, or memory caching. This approach produces more consistent and meaningful performance data, similar to the methodology used by Java’s JMH benchmarking framework. - * Iteration times → Each iteration represents the time taken for one complete execution of the benchmarked function. - * Average execution time → Calculated as the total of all iteration times divided by the number of iterations. This gives a stable measure of real-world performance. - * Why multiple iterations? - A single run can be affected by transient factors such as CPU scheduling, garbage collection, or memory caching. - Running multiple iterations and averaging the results smooths out variability, producing more repeatable and statistically meaningful data, similar to Java’s JMH benchmarking methodology. - -### Interpretation +The average execution time reflects how efficiently the function executes under steady-state conditions. The first iteration often shows higher latency because Node.js is performing initial JIT (Just-In-Time) compilation and optimization, a common warm-up behavior in JavaScript/TypeScript benchmarks. -The average execution time reflects how efficiently the function executes under steady-state conditions.
-The first iteration often shows higher latency because Node.js performing initial JIT (Just-In-Time) compilation and optimization, a common warm-up behavior in JavaScript/TypeScript benchmarks. - -### Benchmark summary on Arm64 +## Benchmark summary Results from the earlier run on the `c4a-standard-4` (4 vCPU, 16 GB memory) Arm64 VM in GCP (SUSE): | Iteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average | |-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|---------| | Time (ms) | 2.286 | 0.749 | 1.145 | 0.674 | 0.671 | 0.671 | 0.672 | 0.667 | 0.667 | 0.673 | 0.888 | +## Summarize TypeScript benchmarking results on Arm64 -### TypeScript performance benchmarking summary on Arm64 - -When you look at the benchmarking results, you will notice that on the Google Axion C4A Arm-based instances: +Here’s what the benchmark results show for Google Axion C4A Arm-based instances: -- The average execution time on Arm64 (~0.888 ms) shows that CPU-bound TypeScript operations run efficiently on Arm-based VMs. -- Initial iterations may show slightly higher times due to runtime warm-up and optimization overhead, which is common across architectures. -- Arm64 demonstrates stable iteration times after the first run, indicating consistent performance for repeated workloads. +- The average execution time on Arm64 is about 0.888 ms, which means TypeScript code runs efficiently on Arm-based VMs. +- The first run is usually a bit slower because Node.js is warming up and optimizing the code. This is normal for all architectures. +- After the first run, the times are very consistent, showing that Arm64 delivers stable performance for repeated tasks. -This demonstrates that Google Cloud C4A Arm64 virtual machines provide production-grade stability and throughput for TypeScript workloads, whether used for application logic, scripting, or performance-critical services. 
+These results confirm that Google Cloud C4A Arm64 virtual machines are reliable and fast for running TypeScript workloads, whether you’re building application logic, scripts, or performance-sensitive services. diff --git a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/installation.md b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/installation.md index a48a49cbf..28880c735 100644 --- a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/installation.md +++ b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/installation.md @@ -6,22 +6,22 @@ weight: 4 layout: learningpathall --- -## Install TypeScript on GCP VM +## Overview This section walks you through installing TypeScript and its dependencies on a Google Cloud Platform (GCP) SUSE Arm64 virtual machine. You’ll install Node.js, npm, TypeScript, and ts-node, and verify that everything works correctly. -Running TypeScript on Google Cloud C4A instances, powered by Axion Arm64 processors, provides a high-performance and energy-efficient platform for Node.js-based workloads. +Running TypeScript on Google Cloud C4A instances, powered by Axion Arm64 processors, provides a high-performance and energy-efficient platform for Node.js-based workloads. -### Update SUSE System +## Update SUSE system Before installing new packages, refresh the repositories and update existing ones to ensure your environment is current and secure: ```console sudo zypper refresh sudo zypper update -y ``` -Keeping your system up to date ensures that dependencies, libraries, and compilers required for Node.js and TypeScript work seamlessly on the Arm64 architecture. +Updating your system helps make sure all the tools and libraries you need for Node.js and TypeScript work smoothly on Arm64.
-### Install Node.js and npm -Node.js provides the JavaScript runtime that powers TypeScript execution, while npm (Node Package Manager) manages project dependencies and global tools. +## Install Node.js and npm +Node.js is the JavaScript runtime that runs your TypeScript code. npm is the tool you use to install and manage packages and tools for your projects. Install both packages using SUSE’s repositories: @@ -30,7 +30,7 @@ sudo zypper install -y nodejs npm ``` This command installs the Node.js runtime and npm package manager on your Google Cloud SUSE Arm64 VM. -### Install TypeScript globally +## Install TypeScript globally TypeScript (tsc) is the compiler that converts .ts files into JavaScript. `ts-node` lets you run TypeScript files directly without pre-compiling them. It is useful for testing, scripting, and lightweight development workflows. @@ -43,7 +43,7 @@ The `-g` flag installs packages globally, making tsc and ts-node available syste This approach simplifies workflows for developers running multiple TypeScript projects on the same VM. -### Verify installations +## Verify installation Check that Node.js, npm, TypeScript, and ts-node are all installed correctly: ```console @@ -66,5 +66,4 @@ Version 5.9.3 v10.9.2 ``` -Node.js, npm, and TypeScript are now successfully installed and verified on your Google Cloud C4A (Arm64) virtual machine. -You’re ready to create and execute TypeScript scripts for testing, deployment, or performance benchmarking. +You’ve now installed and verified Node.js, npm, and TypeScript on your Google Cloud C4A (Arm64) virtual machine. You’re ready to start creating and running TypeScript scripts for testing, deployment, or performance checks. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/instance.md b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/instance.md index 2b93bc950..9a2aa5bf4 100644 --- a/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/instance.md +++ b/content/learning-paths/servers-and-cloud-computing/typescript-on-gcp/instance.md @@ -8,15 +8,15 @@ layout: learningpathall ## Overview -In this section, you will learn how to provision a Google Axion C4A Arm virtual machine on Google Cloud Platform (GCP) using the `c4a-standard-4` (4 vCPUs, 16 GB memory) machine type in the Google Cloud Console. +In this section, you'll set up a Google Axion C4A Arm virtual machine on Google Cloud Platform (GCP) using the `c4a-standard-4` machine type. This instance gives you four virtual CPUs and 16 GB of memory. You'll use the Google Cloud Console to complete each step. {{% notice Note %}} For support on GCP setup, see the Learning Path [Getting started with Google Cloud Platform](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/google/). {{% /notice %}} -## Provision a Google Axion C4A Arm VM in Google Cloud Console +## Create the virtual machine -To create a virtual machine based on the C4A instance type: +To create the virtual machine, follow these steps: - Navigate to the [Google Cloud Console](https://console.cloud.google.com/). - Go to **Compute Engine > VM Instances** and select **Create Instance**. - Under **Machine configuration**: @@ -26,6 +26,6 @@ To create a virtual machine based on the C4A instance type: ![Create a Google Axion C4A Arm virtual machine in the Google Cloud Console with c4a-standard-4 selected alt-text#center](images/gcp-vm.png "Creating a Google Axion C4A Arm virtual machine in Google Cloud Console") -- Under **OS and Storage**, select **Change**, then choose an Arm64-based OS image. For this Learning Path, use **SUSE Linux Enterprise Server**. 
Pick the preferred version for your Operating System. Ensure you select the **Arm image** variant. Click **Select**. +- Under **OS and Storage**, select **Change**, then choose an Arm64-based OS image. For this Learning Path, use **SUSE Linux Enterprise Server**. Pick the preferred version for your operating system. Ensure you select the **Arm image** variant. Click **Select**. - Under **Networking**, enable **Allow HTTP traffic**. - Click **Create** to launch the instance.