From 0286a2c96e4eed483e24fab0ec29d28a0013c2cf Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Wed, 2 Jul 2025 21:27:09 +0000 Subject: [PATCH 01/29] Starting SME2 review --- .../1-get-started.md | 10 +++++----- .../10-going-further.md | 3 +-- .../6-sme2-matmul-asm.md | 2 +- .../8-benchmarking.md | 4 ++-- .../multiplying-matrices-with-sme2/_index.md | 16 ++++++---------- 5 files changed, 15 insertions(+), 20 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 562ba90582..760d4bc540 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -7,19 +7,19 @@ layout: learningpathall --- To follow this Learning Path, you will need to set up an environment to develop -with SME2 and download the code examples. This learning path assumes two +with SME2 and download the code examples. This Learning Path assumes two different ways of working, and you will need to select the one appropriate for your machine: - Case #1: Your machine has native SME2 support --- check the [list of devices with SME2 support](#devices-with-sme2-support). -- Case #2: Your machine does not have native SME2 support. This learning path +- Case #2: Your machine does not have native SME2 support. This Learning Path supports this use case by enabling you to run code with SME2 instructions in an emulator in bare metal mode, i.e., the emulator runs the SME2 code *without* an operating system. 
## Code examples -[Download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2) for this learning path, expand the archive, and change your current directory to: +[Download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2) for this Learning Path, expand the archive, and change your current directory to: ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2`` : ```BASH @@ -59,7 +59,7 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/ ``` It contains: -- The code examples that will be used throughout this learning path. +- The code examples that will be used throughout this Learning Path. - A ``Makefile`` that builds the code examples. - A shell script called ``run-fvp.sh`` that runs the FVP (used in the emulated SME2 case). @@ -124,7 +124,7 @@ the necessary tools you require without cluttering your machine. The containers run independently, meaning they do not interfere with other containers on the same machine or server. -This learning path provides a Docker image that has a compiler and [Arm's Fixed +This Learning Path provides a Docker image that has a compiler and [Arm's Fixed Virtual Platform (FVP) model](https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms) for emulating code with SME2 instructions. 
The Docker image recipe is provided
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md
index 441c3be98f..3964ca0acb 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md
@@ -11,8 +11,7 @@ available to you.

## Generalize the algorithms

-In this Learning Path, you focused on using SME2 for matrix multiplication with
-floating point numbers. However in practice, any library or framework supporting
+In this Learning Path, you focused on using SME2 for matrix multiplication with floating-point numbers. However, in practice, any library or framework supporting
matrix multiplication should also handle various integer types.

You can see that the algorithm structure for matrix preprocessing as well as
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md
index 61c5821116..461058887d 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md
@@ -348,7 +348,7 @@ assembly- or intrinsic-based matrix multiplication) `I` times and then compute
and report the minimum, maximum and average execution times.

{{% notice Note %}}
-Benchmarking and profiling are not simple tasks. The purpose of this learning path
+Benchmarking and profiling are not simple tasks. The purpose of this Learning Path
is to provide some basic guidelines on the performance improvement that can be
obtained with SME2.
{{% /notice %}} diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md index c38a7d3a94..ba750528d5 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md @@ -25,7 +25,7 @@ known to be a poor proxy for execution time comparisons. ## Benchmarking on platform with native SME2 support {{% notice Note %}} -Benchmarking and profiling are not simple tasks. The purpose of this learning path +Benchmarking and profiling are not simple tasks. The purpose of this Learning Path is to provide some basic guidelines on the performance improvement that can be obtained with SME2. {{% /notice %}} @@ -86,7 +86,7 @@ here is far from being an apples-to-apples comparison: - Firstly, the assembly version has some requirements on the `K` parameter that the intrinsic version does not have. - Second, the assembly version has an optimization that the intrinsic version, - for the sake of readability in this learning path, does not have (see the + for the sake of readability in this Learning Path, does not have (see the [Going further section](/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further/) to know more). 
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md index 3533edb21f..57cb6644a0 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md @@ -1,20 +1,16 @@ --- title: Accelerate Matrix Multiplication Performance with SME2 -draft: true -cascade: - draft: true - minutes_to_complete: 30 who_is_this_for: This Learning Path is an advanced topic for developers who want to learn about accelerating the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2). learning_objectives: - - Implement a reference matrix multiplication without using SME2. - - Use SME2 assembly instructions to improve the matrix multiplication performance. - - Use SME2 intrinsics to improve the matrix multiplication performance using the C programming language. + - Implement a baseline matrix multiplication kernel in C without SME2. + - Use SME2 assembly instructions to accelerate the matrix multiplication performance. + - Use SME2 intrinsics to vectorize and optimize matrix multiplication in C. - Compile code with SME2 instructions. - - Run code with SME2 instructions, on a platform with SME2 support or with an emulator. + - Compile and run SME2-enabled code on Arm hardware or in emulation. prerequisites: - Basic knowledge of Arm's Scalable Matrix Extension (SME). @@ -22,8 +18,8 @@ prerequisites: - An intermediate understanding of C programming language and assembly language. - A computer running Linux, macOS, or Windows. - Installations of Git and Docker. - - A platform that support SME2 (see the [list of devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions. 
- - A compiler with support for SME2 instructions. + - A platform that supports SME2 (see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions. + - A compiler with support for SME2 instructions. author: Arnaud de Grandmaison From 46b527b270460f95ed53896fac49219337f4a6a1 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Wed, 2 Jul 2025 21:46:55 +0000 Subject: [PATCH 02/29] Updates --- .../1-get-started.md | 8 ++++---- .../2-check-your-environment.md | 20 ++++++++----------- .../multiplying-matrices-with-sme2/_index.md | 8 ++++---- 3 files changed, 16 insertions(+), 20 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 760d4bc540..cd2b67e3da 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -1,5 +1,5 @@ --- -title: Set up your Environment +title: Set up your environment weight: 3 ### FIXED, DO NOT MODIFY @@ -61,8 +61,8 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/ It contains: - The code examples that will be used throughout this Learning Path. - A ``Makefile`` that builds the code examples. -- A shell script called ``run-fvp.sh`` that runs the FVP (used in the emulated - SME2 case). +- A shell script called ``run-fvp.sh`` that runs the FVP model (used for emulated + SME2 execution). - A directory called ``docker`` that contains materials related to Docker, which are: - A script called ``assets.source_me`` that provides the FVP and compiler @@ -267,7 +267,7 @@ Then select the **Reopen in Container** menu entry as Figure 1 shows. 
It automatically finds and uses ``.devcontainer/devcontainer.json``:

-![example image alt-text#center](VSCode.png "Figure 1: Setting up the Docker Container.")
+![VSCode Docker alt-text#center](VSCode.png "Figure 1: Setting up the Docker container.")

All your commands now run within the container, so there is no need to prepend
them with a Docker invocation, as VS Code handles all this seamlessly for you.
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md
index 0e824002e2..c5a454597f 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md
@@ -1,5 +1,5 @@
---
-title: Test your environment
+title: Test your SME2 development environment
weight: 4

### FIXED, DO NOT MODIFY
@@ -81,7 +81,7 @@ by invoking `make clean`:
{{< /tab >}}
{{< /tabpane >}}

-## Basic Checks
+## Run a Hello World program

The very first program that you should run is the famous "Hello, world!" example
that will tell you if your environment is set up correctly.
@@ -114,15 +114,11 @@ Run the `hello` program with:
{{< /tab >}}
{{< /tabpane >}}

-In the emulated case, there are extra lines that are printed out by the FVP, but
-the important line here is "Hello, world!": it demonstrates that the generic
-code can be compiled and executed.
+In the emulated case, you may see that the FVP prints out extra lines. The key confirmation is the presence of "Hello, world!" in the output. It demonstrates that the generic code can be compiled and executed.

-## SME2 checks
+## Check SME2 availability

-You will now run the `sme2_check` program, which verifies that SME2 works as
-expected. This checks both the compiler and the CPU (or the emulated CPU) are
-properly supporting SME2.
+You will now run the `sme2_check` program, which verifies that SME2 works as expected. This checks that both the compiler and the CPU (or the emulated CPU) properly support SME2.
+You will now run the `sme2_check` program, which verifies that SME2 works as expected. This checks both the compiler and the CPU (or the emulated CPU) are properly supporting SME2. The source code is found in `sme2_check.c`: @@ -186,15 +182,15 @@ emulated SME2 support), where no operating system has done the setup of the processor for the user land programs, an additional step is required to turn SME2 on. This is the purpose of the ``setup_sme_baremetal()`` call at line 21. In environments where SME2 is natively supported, nothing needs to be done, -which is why the execution of this function is condionned by the ``BAREMETAl`` +which is why the execution of this function is conditioned by the ``BAREMETAL`` macro. ``BAREMETAL`` is set to 1 in the ``Makefile`` when the FVP is targeted, and set to 0 otherwise. The body of the ``setup_sme_baremetal`` function is defined in ``misc.c``. The ``sme2_check`` program then displays whether SVE, SME and SME2 are supported at line 24. The checking of SVE, SME and SME2 is done differently depending on -``BAREMETAL``. This platform specific behaviour is abstract by the -``display_cpu_features()`` : +``BAREMETAL``. This platform specific behaviour is abstracted by the +``display_cpu_features()``: - In baremetal mode, our program has access to system registers and can thus do some low level peek at what the silicon actually supports. 
The program will print the SVE field of the ``ID_AA64PFR0_EL1`` system register and the SME diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md index 57cb6644a0..a5de86d516 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md @@ -3,14 +3,14 @@ title: Accelerate Matrix Multiplication Performance with SME2 minutes_to_complete: 30 -who_is_this_for: This Learning Path is an advanced topic for developers who want to learn about accelerating the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2). +who_is_this_for: This Learning Path is an advanced topic for developers who want to accelerate the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2). learning_objectives: - Implement a baseline matrix multiplication kernel in C without SME2. - Use SME2 assembly instructions to accelerate the matrix multiplication performance. - Use SME2 intrinsics to vectorize and optimize matrix multiplication in C. - - Compile code with SME2 instructions. - - Compile and run SME2-enabled code on Arm hardware or in emulation. + - Compile code with SME2 intrinsics and assembly. + - Run SME2-enabled code on Arm hardware or through emulation. prerequisites: - Basic knowledge of Arm's Scalable Matrix Extension (SME). @@ -19,7 +19,7 @@ prerequisites: - A computer running Linux, macOS, or Windows. - Installations of Git and Docker. - A platform that supports SME2 (see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions. - - A compiler with support for SME2 instructions. 
+ - A compiler with support for SME2 instructions (for example, LLVM 17+ with SME2 backend support). author: Arnaud de Grandmaison From 96a1b8c23aa93f4bb5602959711a588dac04d2ef Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 3 Jul 2025 09:03:24 +0000 Subject: [PATCH 03/29] updates --- .../multiplying-matrices-with-sme2/_index.md | 25 ++++++++++--------- .../overview.md | 10 +++++--- 2 files changed, 19 insertions(+), 16 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md index a5de86d516..eef21ac84f 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md @@ -6,20 +6,20 @@ minutes_to_complete: 30 who_is_this_for: This Learning Path is an advanced topic for developers who want to accelerate the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2). learning_objectives: - - Implement a baseline matrix multiplication kernel in C without SME2. - - Use SME2 assembly instructions to accelerate the matrix multiplication performance. - - Use SME2 intrinsics to vectorize and optimize matrix multiplication in C. - - Compile code with SME2 intrinsics and assembly. - - Run SME2-enabled code on Arm hardware or through emulation. 
+ - Implement a baseline matrix multiplication kernel in C without SME2 + - Use SME2 assembly instructions to accelerate the matrix multiplication performance + - Use SME2 intrinsics to vectorize and optimize matrix multiplication in C + - Compile code with SME2 intrinsics and assembly + - Benchmark and validate SME2-accelerated matrix multiplication on Arm hardware or in a Linux-based emulation environment + - Compare performance metrics between baseline and SME2-optimized implementations prerequisites: - - Basic knowledge of Arm's Scalable Matrix Extension (SME). - - Basic knowledge of Arm's Scalable Vector Extension (SVE). - - An intermediate understanding of C programming language and assembly language. - - A computer running Linux, macOS, or Windows. - - Installations of Git and Docker. - - A platform that supports SME2 (see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions. - - A compiler with support for SME2 instructions (for example, LLVM 17+ with SME2 backend support). 
+ - Working knowledge of Arm’s SVE and SME instruction sets + - Intermediate proficiency with C and Armv9-A assembly language + - A computer running Linux, macOS, or Windows + - Installations of Git and Docker for project setup and emulation + - A platform that supports SME2 (see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions + - Compiler support for SME2 instructions (for example, LLVM 17+ with SME2 backend support) author: Arnaud de Grandmaison @@ -33,6 +33,7 @@ tools_software_languages: - C - Clang - Runbook + - LLVM operatingsystems: - Linux diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md index 008b274479..6d6d6bb7f0 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md @@ -6,15 +6,17 @@ weight: 2 layout: learningpathall --- -# Overview of Arm's Scalable Matrix Extension Version 2 +## Arm's Scalable Matrix Extension Version 2 (SME2) -### What is SME2? +## What is SME2? -The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture. The Scalable Matrix Extension Version 2 (SME2) extends the SME architecture by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication. +The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture designed to accelerate matrix-heavy computations, such as outer products and matrix-matrix multiplications. 
SME Version 2 (SME2) extends the SME architecture by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication. SME2 extends SME by introducing multi-vector data-processing instructions, load to and store from multi-vectors, and a multi-vector predication mechanism. -Additional architectural features of SME2 include: +## Key architectural features of SME2 + +The key architectural features of SME2 include: * Multi-vector multiply-accumulate instructions, with Z vectors as multiplier and multiplicand inputs and accumulating results into ZA array vectors, including widening multiplies that accumulate into more vectors than they read. From 10306940e88fd758f53172be545b6578fe4b6b1f Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 3 Jul 2025 09:28:10 +0000 Subject: [PATCH 04/29] Optimizing headings --- .../1-get-started.md | 18 ++++++------ .../overview.md | 29 +++++++++++-------- 2 files changed, 26 insertions(+), 21 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index cd2b67e3da..59d5f83e18 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -1,5 +1,5 @@ --- -title: Set up your environment +title: Prepare your development environment for SME2 weight: 3 ### FIXED, DO NOT MODIFY @@ -17,7 +17,7 @@ your machine: an emulator in bare metal mode, i.e., the emulator runs the SME2 code *without* an operating system. 
-## Code examples +## Download and inspect the code examples [Download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2) for this Learning Path, expand the archive, and change your current directory to: ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2`` : @@ -84,7 +84,7 @@ directory is ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. {{% /notice %}} -## Platforms with native SME2 support +## Setup on a system with native SME2 support If your machine has native support for SME2, then you only need to ensure that you have a compiler with support for SME2 instructions. @@ -115,7 +115,7 @@ which forces us to use the version from ``homebrew`` (which has version You are now all set to start hacking with SME2! -## Platforms with emulated SME2 support +## Setup on a system using SME2 emulation If your machine does not have SME2 support or if you want to run SME2 with an emulator, you will need to install Docker. Docker containers provide @@ -133,7 +133,7 @@ also decide not to use the Docker image and follow the ``sme2-environment.docker`` Docker file instructions to install the tools on your machine. -### Docker +### Install and verify Docker {{% notice Note %}} This Learning Path works without ``docker``, but the compiler and the FVP must @@ -207,7 +207,7 @@ You can use Docker in the following ways: commands inside a Docker container, allowing you to work seamlessly within the Docker environment. 
-### Working with Docker from a terminal +### Run commands from a terminal using Docker When a command is executed in the Docker container environment, you must prepend it with instructions on the command line so that your shell executes it within @@ -233,7 +233,7 @@ For example, to run ``make``, you need to enter: docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 make ``` -### Working within the Docker container from the terminal +### Use an interactive Docker shell The above commands are long and error-prone, so you can instead choose to work interactively within the terminal, which would save you from prepending the @@ -257,7 +257,7 @@ persistent (it was invoked with ``--rm``), so each invocation will use a container freshly built from the image. All the files reside outside the container, so changes you make to them will be persistent. -### Working within the Docker container with VSCode +### Develop with Docker in Visual Studio Code If you are using Visual Studio Code as your IDE, it can use the container as is. @@ -279,7 +279,7 @@ However, if you are using VS Code, you only need to use the `COMMAND ARGUMENTS` part. {{% /notice %}} -### Devices with SME2 support +### Devices with native SME2 support By chip: diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md index 6d6d6bb7f0..7fab5db789 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md @@ -8,25 +8,30 @@ layout: learningpathall ## Arm's Scalable Matrix Extension Version 2 (SME2) -## What is SME2? +Arm’s Scalable Matrix Extension Version 2 (SME2) is a hardware feature designed to accelerate dense linear algebra operations, enabling high-throughput execution of matrix-based workloads. 
Whether you're building for AI inference, HPC, or scientific computing, SME2 provides fine-grained control and high-performance vector processing on Armv9-A systems. -The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture designed to accelerate matrix-heavy computations, such as outer products and matrix-matrix multiplications. SME Version 2 (SME2) extends the SME architecture by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication. -SME2 extends SME by introducing multi-vector data-processing instructions, load to and store from multi-vectors, and a multi-vector predication mechanism. +### How SME2 extends the SME Architecture -## Key architectural features of SME2 +The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture and is designed to accelerate matrix-heavy computations, such as outer products and matrix-matrix multiplications. -The key architectural features of SME2 include: +SME2 builds on SME by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication. -* Multi-vector multiply-accumulate instructions, with Z vectors as multiplier and multiplicand inputs and accumulating results into ZA array vectors, including widening multiplies that accumulate into more vectors than they read. +SME2 introduces multi-vector processing, new memory instructions, and enhanced predication to improve throughput and flexibility in compute-intensive applications. -* Multi-vector load, store, move, permute, and convert instructions, that use multiple SVE Z vectors as source and destination registers to pre-process inputs and post-process outputs of the ZA-targeting SME2 instructions. 
+### Key architectural features of SME2 -* *Predicate-as-counter*, which is an alternative predication mechanism that is added to the original SVE predication mechanism, to control operations performed on multiple vector registers. +SME2 adds several capabilities to the original SME architecture:: -* Compressed neural network capability using dedicated lookup table instructions and outer product instructions that support binary neural networks. +* **Multi-vector multiply-accumulate instructions**, with Z vectors as multiplier and multiplicand inputs and accumulating results into ZA array vectors, including widening multiplies that accumulate into more vectors than they read. -* A 512-bit architectural register ZT0, that supports the lookup table feature. +* **Multi-vector load, store, move, permute, and convert instructions**, that use multiple SVE Z vectors as source and destination registers to pre-process inputs and post-process outputs of the ZA-targeting SME2 instructions. + +* **Predicate-as-counter mechanism**, which is an alternative predication mechanism that is added to the original SVE predication mechanism, to control operations performed on multiple vector registers. + +* **Compressed neural network support** using dedicated lookup table instructions and outer product instructions that support binary neural networks. + +* **A 512-bit architectural register ZT0**, that supports the lookup table feature. ### Suggested reading @@ -36,5 +41,5 @@ This Learning Path assumes some basic understanding of SVE and SME. If you are n - [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture). - [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction). 
- [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2).
- [Part 3: Matrix-matrix multiplication. Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared).
- [Build adaptive libraries with multiversioning](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/).
\ No newline at end of file

From ab6a6f4114690345c7805601ca50456415dd7992 Mon Sep 17 00:00:00 2001
From: Maddy Underwood
Date: Thu, 3 Jul 2025 10:30:10 +0000
Subject: [PATCH 05/29] overview done!
--- .../multiplying-matrices-with-sme2/_index.md | 2 +- .../overview.md | 31 +++++++++++-------- 2 files changed, 19 insertions(+), 14 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md index eef21ac84f..d1d6436350 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md @@ -7,7 +7,7 @@ who_is_this_for: This Learning Path is an advanced topic for developers who want learning_objectives: - Implement a baseline matrix multiplication kernel in C without SME2 - - Use SME2 assembly instructions to accelerate the matrix multiplication performance + - Use SME2 assembly instructions to accelerate matrix multiplication performance - Use SME2 intrinsics to vectorize and optimize matrix multiplication in C - Compile code with SME2 intrinsics and assembly - Benchmark and validate SME2-accelerated matrix multiplication on Arm hardware or in a Linux-based emulation environment diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md index 7fab5db789..5284ae9fc6 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md @@ -21,25 +21,30 @@ SME2 introduces multi-vector processing, new memory instructions, and enhanced p ### Key architectural features of SME2 -SME2 adds several capabilities to the original SME architecture:: +SME2 adds several capabilities to the original SME architecture: -* **Multi-vector multiply-accumulate instructions**, with Z vectors as multiplier and multiplicand inputs and accumulating results into ZA array vectors, including widening multiplies that accumulate into 
more vectors than they read. +* **Multi-vector multiply-accumulate instructions**, that use Z vectors as multiplier and multiplicand inputs, and accumulate results into ZA array vectors. This includes widening multiplies that write to more vectors than they read from. -* **Multi-vector load, store, move, permute, and convert instructions**, that use multiple SVE Z vectors as source and destination registers to pre-process inputs and post-process outputs of the ZA-targeting SME2 instructions. +* **Multi-vector load, store, move, permute, and convert instructions**, that use multiple SVE Z vectors as source and destination registers to efficiently pre-process inputs and post-process outputs of the ZA-targeting SME2 instructions. -* **Predicate-as-counter mechanism**, which is an alternative predication mechanism that is added to the original SVE predication mechanism, to control operations performed on multiple vector registers. +* A **predicate-as-counter mechanism**, which is a new predication mechanism that is added alongside the original SVE approach to enable fine-grained control over operations across multiple vector registers. -* **Compressed neural network support** using dedicated lookup table instructions and outer product instructions that support binary neural networks. +* **Compressed neural network support** using dedicated lookup table and outer product instructions that support binary neural network workloads. -* **A 512-bit architectural register ZT0**, that supports the lookup table feature. +* A **512-bit architectural register ZT0**, which is a dedicated register that enables fast, table-driven data transformations. ### Suggested reading -If you are not familiar with matrix multiplication, or would benefit from refreshing your knowledge, this [Wikipedia article on Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) is a good start. +This Learning Path assumes some basic understanding of SVE, SME, and matrix multiplication. 
If you do want to refresh or grow your knowledge however, these are some useful resources that you might find helpful:

-This Learning Path assumes some basic understanding of SVE and SME. If you are not familiar with SVE or SME, these are some useful resources that you can read first:
-- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture).
-- [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction).
-- [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2).
-- [Part 3: Matrix-matrix multiplication. Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/.matrix-matrix-multiplication-neon-sve-and-sme-compared).
-- [Build adaptive libraries with multiversioning](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/).

+#### Matrix multiplication
+
+- This [Wikipedia article on Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication)
+
+#### SVE and SME
+
+- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture)
+- [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction)
+- [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2)
+- [Part 3: Matrix-matrix multiplication. Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared)
+- [Build adaptive libraries with multiversioning](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/)
\ No newline at end of file +#### Matrix multiplication + +- This [Wikipedia article on Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) + +#### SVE and SME + +- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture) +- [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction) +- [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2) +- [Part 3: Matrix-matrix multiplication. Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/.matrix-matrix-multiplication-neon-sve-and-sme-compared) +- [Build adaptive libraries with multiversioning](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file From 745211193214666bdff6d11c715b6b4756685100 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 3 Jul 2025 11:22:37 +0000 Subject: [PATCH 06/29] updates --- .../1-get-started.md | 31 +++++++++---------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 59d5f83e18..f02281eb77 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -1,25 +1,24 @@ --- -title: Prepare your development environment for SME2 +title: Set up your SME2 development environment weight: 3 ### FIXED, DO NOT MODIFY 
layout: learningpathall --- -To follow this Learning Path, you will need to set up an environment to develop -with SME2 and download the code examples. This Learning Path assumes two -different ways of working, and you will need to select the one appropriate for -your machine: -- Case #1: Your machine has native SME2 support --- check the [list of devices - with SME2 support](#devices-with-sme2-support). -- Case #2: Your machine does not have native SME2 support. This Learning Path - supports this use case by enabling you to run code with SME2 instructions in - an emulator in bare metal mode, i.e., the emulator runs the SME2 code - *without* an operating system. +## Choose your SME2 setup: native or emulated + +Before you can build or run any SME2-accelerated code, you need to set up your development environment. + +This section walks you through the required tools, code examples, and two supported execution options: + +* Running on native SME2-enabled hardware - for further information on devices with SME2 support, see this [list of devices](#devices-with-sme2-support). + +* Emulating SME2 using a Docker-based virtual platform - this Learning Path supports this use case by enabling you to run code with SME2 instructions in an emulator in bare metal mode, that is, the emulator runs the SME2 code *without* an operating system. 
## Download and inspect the code examples

-[Download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2) for this Learning Path, expand the archive, and change your current directory to:
+To get started, [download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2), expand the archive, and change your current directory to:
``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2`` :

```BASH
tar xfz code-examples-main-learning-paths-cross-platform-multiplying-matrices-with-sme2.tar.gz -s /code-examples-main-learning-paths-cross-platform-multiplying-matrices-with-sme2/code-examples/
cd code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2
```

@@ -59,7 +58,7 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/
```

It contains:
-- The code examples that will be used throughout this Learning Path.
+- The code examples that you will use.
- A ``Makefile`` that builds the code examples.
- A shell script called ``run-fvp.sh`` that runs the FVP model (used for emulated
  SME2 execution).
@@ -70,8 +69,8 @@ It contains:
  - A Docker recipe called ``sme2-environment.docker`` to build the container
    that you will use.
  - A shell script called ``build-my-container.sh`` that you can use if you want
-    to build the Docker container. This is not essential; ready-made images are
-    available for you.
+    to build the Docker container. (This is not essential; ready-made images are
+    available for you.)
  - A script called ``build-all-containers.sh`` that was used to create the
    image for you to download to provide multi-architecture support for both
    x86_64 and AArch64.
@@ -274,7 +273,7 @@ them with a Docker invocation, as VS Code handles all this seamlessly for you.

{{% notice Note %}}
For the rest of this Learning Path, shell commands include the full Docker
-invocation so that users not using VS Code can copy the complete command line.
+invocation so that if you are not using VS Code you can copy the complete command line.
However, if you are using VS Code, you only need to use the `COMMAND ARGUMENTS` part.
{{% /notice %}}

From 71784bb1b634f0dd01394857da6bbdaad7ed5d38 Mon Sep 17 00:00:00 2001
From: Maddy Underwood
Date: Thu, 3 Jul 2025 15:50:37 +0000
Subject: [PATCH 07/29] Updates

---
 .../1-get-started.md                          | 44 ++++++++-----------
 1 file changed, 19 insertions(+), 25 deletions(-)

diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
index f02281eb77..16e92e31db 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
@@ -12,21 +12,23 @@ Before you can build or run any SME2-accelerated code, you need to set up your d

This section walks you through the required tools, code examples, and two supported execution options:

-* Running on native SME2-enabled hardware - for further information on devices with SME2 support, see this [list of devices](#devices-with-sme2-support).
+* **Native SME2 hardware** - build and run directly on a system with SME2 support. For supported devices, see the [list of devices](#devices-with-sme2-support).

-* Emulating SME2 using a Docker-based virtual platform - this Learning Path supports this use case by enabling you to run code with SME2 instructions in an emulator in bare metal mode, that is, the emulator runs the SME2 code *without* an operating system.
+* **Docker-based emulation** - use a container to emulate SME2 in bare metal mode (without an OS).
-## Download and inspect the code examples
+## Download and explore the code examples

-To get started, [download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2), expand the archive, and change your current directory to:
-``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2`` :
+To get started, [download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2).
+
+Now expand the archive, and change your current directory to:
+``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``.

```BASH
tar xfz code-examples-main-learning-paths-cross-platform-multiplying-matrices-with-sme2.tar.gz -s /code-examples-main-learning-paths-cross-platform-multiplying-matrices-with-sme2/code-examples/
cd code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2
```

-The listing of the content of this directory should look like this:
+The directory structure looks like this:

```TXT
code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/
@@ -57,23 +59,15 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/
 └── sme2_check.c
```

-It contains:
-- The code examples that you will use.
-- A ``Makefile`` that builds the code examples.
-- A shell script called ``run-fvp.sh`` that runs the FVP model (used for emulated
-  SME2 execution).
-- A directory called ``docker`` that contains materials related to Docker, which
-  are:
-  - A script called ``assets.source_me`` that provides the FVP and compiler
-    toolchain references.
-  - A Docker recipe called ``sme2-environment.docker`` to build the container
-    that you will use.
-  - A shell script called ``build-my-container.sh`` that you can use if you want
-    to build the Docker container. (This is not essential; ready-made images are
-    available for you).
-  - A script called ``build-all-containers.sh`` that was used to create the
-    image for you to download to provide multi-architecture support for both
-    x86_64 and AArch64.
+The directory structure includes:
+- Code examples.
+- A ``Makefile`` to build the code.
+- A shell script called ``run-fvp.sh`` to run the FVP model.
+- A `docker` directory containing:
+  - ``assets.source_me`` to provide toolchain paths.
+  - `sme2-environment.docker`, a Dockerfile to build the image.
+  - ``build-my-container.sh``, a script to build the Docker image yourself (optional; ready-made images are available).
+  - ``build-all-containers.sh`` used to build multi-architecture images.
- A configuration script for VS Code to be able to use the container from the
  IDE called ``.devcontainer/devcontainer.json``.

@@ -83,7 +77,7 @@ directory is
``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``.
{{% /notice %}}

-## Setup on a system with native SME2 support
+## Set up a system with native SME2 support

If your machine has native support for SME2, then you only need to ensure that
you have a compiler with support for SME2 instructions.
@@ -114,7 +108,7 @@ which forces us to use the version from ``homebrew`` (which has version

You are now all set to start hacking with SME2!

-## Setup on a system using SME2 emulation
+## Set up a system using SME2 emulation

If your machine does not have SME2 support or if you want to run SME2 with an
emulator, you will need to install Docker.
Docker containers provide From f545d3cacc20ea45e32fbd05252880a367d65c8f Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 3 Jul 2025 16:02:49 +0000 Subject: [PATCH 08/29] Updates --- .../1-get-started.md | 20 +++++++------------ 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 16e92e31db..eeb876910b 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -68,8 +68,7 @@ It directory structure includes: - `sme2-environment.docker`, a Dockerfile to build the image. - ``build-my-container.sh`` `sme2-environment.docker`, a Dockerfile to build the image. - ``build-all-containers.sh`` used to build multi-architecture images. -- A configuration script for VS Code to be able to use the container from the - IDE called ``.devcontainer/devcontainer.json``. +- ``.devcontainer/devcontainer.json`` for VS Code container support. {{% notice Note %}} From this point in the Learning Path, all instructions assume that your current @@ -79,8 +78,7 @@ directory is ## Set up a system with native SME2 support -If your machine has native support for SME2, then you only need to ensure that -you have a compiler with support for SME2 instructions. +To run SME2 code natively, ensure your system includes SME2 hardware and uses a compiler version that supports SME2. A recent enough version of the compiler is required because SME2 is a recent addition to the Arm instruction set. Compiler versions that are too old will @@ -91,8 +89,9 @@ uses ``clang``. At the time of writing, the ``clang`` version shipped with macOS is ``17.0.0``, which forces us to use the version from ``homebrew`` (which has version -``20.1.7``). 
Ensure the ``clang`` compiler you are using is recent enough with -``clang --version``: +``20.1.7``). + +To check your compiler version:``clang --version`` {{< tabpane code=true >}} @@ -108,14 +107,9 @@ which forces us to use the version from ``homebrew`` (which has version You are now all set to start hacking with SME2! -## Set up a system using SME2 emulation +## Set up a system using SME2 emulation with Docker -If your machine does not have SME2 support or if you want to run SME2 with an -emulator, you will need to install Docker. Docker containers provide -functionality to execute commands in an isolated environment, where you have all -the necessary tools you require without cluttering your machine. The containers -run independently, meaning they do not interfere with other containers on the -same machine or server. +If your machine doesn't support SME2, or you want to emulate it, you can use the Docker-based environment provided in this Learning Path. This Learning Path provides a Docker image that has a compiler and [Arm's Fixed Virtual Platform (FVP) From df11573fc2dc0459cc9755331514efc47bc15a36 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 3 Jul 2025 22:14:07 +0000 Subject: [PATCH 09/29] Increased tim; further improvements. --- .../1-get-started.md | 65 +++++++------------ .../multiplying-matrices-with-sme2/_index.md | 2 +- 2 files changed, 24 insertions(+), 43 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index eeb876910b..3f022b2d7c 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -10,17 +10,17 @@ layout: learningpathall Before you can build or run any SME2-accelerated code, you need to set up your development environment. 
-This section walks you through the required tools, code examples, and two supported execution options: +This section walks you through the required tools and two supported execution options: -* **Native SME2 hardware** - build and run directly on a system with SME2 support. For supported devices, se [list of devices](#devices-with-sme2-support). +* **Native SME2 hardware** - build and run directly on a system with SME2 support. For supported devices, see [Devices with SME2 support](#devices-with-sme2-support). * **Docker-based emulation** - use a container to emulate SME2 in bare metal mode (without an OS). ## Download and explore the code examples -To get started, [download the code examples](https://gitlab.arm.com/learning-cde-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2). +To get started, begin by [downloading the code examples](https://gitlab.arm.com/learning-cde-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2). -Now expand the archive, and change your current directory to: +Extract the archive and change to the target directory: ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2.`` ```BASH @@ -59,10 +59,10 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/ └── sme2_check.c ``` -It directory structure includes: +It includes: - Code examples. - A ``Makefile`` to build the code. -- A shell script called ``run-fvp.sh`` to run the FVP model. +- ``run-fvp.sh`` to run the FVP model. - A `docker` directory containing: - ``assets.source_me`` to provide toolchain paths. - `sme2-environment.docker`, a Dockerfile to build the image. @@ -71,8 +71,7 @@ It directory structure includes: - ``.devcontainer/devcontainer.json`` for VS Code container support. 
{{% notice Note %}}
-From this point in the Learning Path, all instructions assume that your current
-directory is
+From this point, all instructions assume that your current directory is
``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``.
{{% /notice %}}

## Set up a system with native SME2 support

To run SME2 code natively, ensure your system includes SME2 hardware and uses a compiler version that supports SME2.

-A recent enough version of the compiler is required because SME2 is a recent
-addition to the Arm instruction set. Compiler versions that are too old will
-have incomplete or no SME2 support, leading to compilation errors or
-non-functional code. You can use [Clang](https://www.llvm.org/) version 18 or
-later, or [GCC](https://gcc.gnu.org/) version 14 or later. This Learning Path
-uses ``clang``.
+Use [Clang](https://www.llvm.org/) version 18 or later, or [GCC](https://gcc.gnu.org/) version 14 or later. This Learning Path uses ``clang``.
+
+{{% notice Note %}}
+At the time of writing, macOS ships with `clang` version 17.0.0, which doesn't support SME2. Use a newer version, such as 20.1.7, available through Homebrew.{{% /notice %}}

-At the time of writing, the ``clang`` version shipped with macOS is ``17.0.0``,
-which forces us to use the version from ``homebrew`` (which has version
-``20.1.7``).

To check your compiler version, run ``clang --version``:

+### Install Clang
+
{{< tabpane code=true >}}

{{< tab header="Linux/Ubuntu" language="bash">}}
@@ -108,14 +107,9 @@

You are now all set to start hacking with SME2!

## Set up a system using SME2 emulation with Docker

-If your machine doesn't support SME2, or you want to emulate it, you can use the Docker-based environment provided in this Learning Path.
+If your machine doesn't support SME2, or you want to emulate it, you can use the Docker-based environment that this Learning Path provides.
-This Learning Path provides a Docker image that has a compiler and [Arm's Fixed
-Virtual Platform (FVP)
+The Docker container includes a compiler and [Arm's Fixed Virtual Platform (FVP)
model](https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms)
-for emulating code with SME2 instructions. The Docker image recipe is provided
-(with the code examples) so you can study it and build it yourself. You could
-also decide not to use the Docker image and follow the
-``sme2-environment.docker`` Docker file instructions to install the tools on
-your machine.
+for emulating code with SME2 instructions. You can run the provided image, or build it yourself using the included ``sme2-environment.docker`` Dockerfile; alternatively, follow that file's instructions to install the tools directly on your machine.

### Install and verify Docker

{{% notice Note %}}
-This Learning Path works without ``docker``, but the compiler and the FVP must
-be available in your search path.
+Docker is optional, but if you don’t use it, you must manually install the compiler and FVP, and ensure they’re in your path.
{{% /notice %}}

Start by checking that ``docker`` is installed on your machine by typing the

```BASH
docker --version
Docker version 27.3.1, build ce12230
```

-If the above command fails with a message similar to "``docker: command not
-found``" then follow the steps from the [Docker Install
-Guide](https://learn.arm.com/install-guides/docker/).
+If the above command fails with a message similar to "``docker: command not found``" then follow the steps from the [Docker Install Guide](https://learn.arm.com/install-guides/docker/).

{{% notice Note %}}
-You might need to log in again or restart your machine for the changes to take
+You might need to log out and back in again or restart your machine for the changes to take
{{% /notice %}} Once you have confirmed that Docker is installed on your machine, you can check -that it is operating normally with the following: +that it is working with the following: ```BASH { output_lines="2-27" } docker run hello-world @@ -222,11 +210,7 @@ docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-enviro ### Use an interactive Docker shell -The above commands are long and error-prone, so you can instead choose to work -interactively within the terminal, which would save you from prepending the -``docker run ...`` magic before each command you want to execute. To work in -this mode, run Docker without any command (note the ``-it`` command line -argument to the Docker invocation): +The above commands are long and error-prone, so you can instead choose to work interactively within the terminal, which would save you from prepending the ``docker run ...`` magic before each command you want to execute. To work in this mode, run Docker without any command (note the ``-it`` command line argument to the Docker invocation). Start an interactive session to avoid repeating the docker run prefix: ```BASH docker run --rm -it -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 @@ -239,14 +223,11 @@ example, the ``make`` command can now be simply invoked with: make ``` -To exit the container, simply hit CTRL+D. Note that the container is not -persistent (it was invoked with ``--rm``), so each invocation will use a -container freshly built from the image. All the files reside outside the -container, so changes you make to them will be persistent. +To exit the container, simply hit CTRL+D. Note that the container is not persistent (it was invoked with ``--rm``), so each invocation will use a container freshly built from the image. All the files reside outside the container, so changes you make to them will be persistent. 
### Develop with Docker in Visual Studio Code

-If you are using Visual Studio Code as your IDE, it can use the container as is.
+If you are using Visual Studio Code as your IDE, the container setup is already configured with `.devcontainer/devcontainer.json`.

Make sure you have the [Microsoft Dev
Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
extension installed.

diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md
index d1d6436350..227820eace 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md
@@ -1,7 +1,7 @@
---
title: Accelerate Matrix Multiplication Performance with SME2

-minutes_to_complete: 30
+minutes_to_complete: 60

who_is_this_for: This Learning Path is an advanced topic for developers who want to accelerate the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2).
From d50c823f693e6a26fe412e71fc2a5b78eba3cd42 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 4 Jul 2025 06:36:48 +0000 Subject: [PATCH 10/29] Updates --- .../1-get-started.md | 20 +++++++++---------- .../overview.md | 16 +++++++-------- 2 files changed, 17 insertions(+), 19 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 3f022b2d7c..41ec7c1030 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -20,7 +20,7 @@ This section walks you through the required tools and two supported execution op To get started, begin by [downloading the code examples](https://gitlab.arm.com/learning-cde-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2). -Extract the archive and change to the target directory: +Now extract the archive and change to the target directory: ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2.`` ```BASH @@ -28,7 +28,7 @@ tar xfz code-examples-main-learning-paths-cross-platform-multiplying-matrices-wi cd code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2 ``` -The directory structure looks like this: +The directory structure should look like this: ```TXT code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/ @@ -61,18 +61,18 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/ It includes: - Code examples. -- A ``Makefile`` to build the code. -- ``run-fvp.sh`` to run the FVP model. +- A `Makefile` to build the code. +- `run-fvp.sh` to run the FVP model. - A `docker` directory containing: - - ``assets.source_me`` to provide toolchain paths. 
-  - `sme2-environment.docker`, a Dockerfile to build the image.
-  - ``build-my-container.sh`` `sme2-environment.docker`, a Dockerfile to build the image.
-  - ``build-all-containers.sh`` used to build multi-architecture images.
-- ``.devcontainer/devcontainer.json`` for VS Code container support.
+  - `assets.source_me` to provide toolchain paths.
+  - `build-my-container.sh`, a script that automates building the Docker image from the sme2-environment.docker file. It runs the docker build command with the right arguments so you don’t have to remember them.
+  - `sme2-environment.docker`, a Dockerfile that defines the steps to build the SME2 container image. It installs all necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator.
+  - `build-all-containers.sh` used to build multi-architecture images.
+- `.devcontainer/devcontainer.json` for VS Code container support.

{{% notice Note %}}
From this point, all instructions assume that your current directory is
-``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``.
+``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. So to follow along, ensure that you are in the correct place before proceeding.
{{% /notice %}}

diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md
index 5284ae9fc6..f6201a1c0b 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md
@@ -8,18 +8,16 @@ layout: learningpathall

## Arm's Scalable Matrix Extension Version 2 (SME2)

-Arm’s Scalable Matrix Extension Version 2 (SME2) is a hardware feature designed to accelerate dense linear algebra operations, enabling high-throughput execution of matrix-based workloads.
Whether you're building for AI inference, HPC, or scientific computing, SME2 provides fine-grained control and high-performance vector processing on Armv9-A systems. +Arm’s Scalable Matrix Extension Version 2 (SME2) is a hardware feature designed to accelerate dense linear algebra operations, enabling high-throughput execution of matrix-based workloads. Whether you're building for AI inference, HPC, or scientific computing, SME2 provides fine-grained control and high-performance vector processing. -### How SME2 extends the SME Architecture +## Extending the SME Architecture The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture and is designed to accelerate matrix-heavy computations, such as outer products and matrix-matrix multiplications. SME2 builds on SME by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication. -SME2 introduces multi-vector processing, new memory instructions, and enhanced predication to improve throughput and flexibility in compute-intensive applications. - -### Key architectural features of SME2 +## Key architectural features of SME2 SME2 adds several capabilities to the original SME architecture: @@ -33,15 +31,15 @@ SME2 adds several capabilities to the original SME architecture: * A **512-bit architectural register ZT0**, which is a dedicated register that enables fast, table-driven data transformations. -### Suggested reading +## Further information -This Learning Path assumes some basic understanding of SVE, SME, and matrix multiplication. 
If you do want to refresh or grow your knowledge however, these are some useful resources that you might find helpful:
+This Learning Path assumes some basic understanding of SVE, SME, and matrix multiplication. However, if you want to refresh or grow your knowledge, these are some useful resources that you might find helpful:

-#### Matrix multiplication
+### On matrix multiplication

- This [Wikipedia article on Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication)

-#### SVE and SME
+### On SVE and SME

- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture)
- [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction)
- [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2)
- [Part 3: Matrix-matrix multiplication. Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared)
- [Build adaptive libraries with multiversioning](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/)
\ No newline at end of file

From d2f20506c4335fb75fa228476d4e9c8edc0472fd Mon Sep 17 00:00:00 2001
From: Maddy Underwood
Date: Fri, 4 Jul 2025 11:19:13 +0000
Subject: [PATCH 11/29] Streaming mode + ZA state updates

---
 .../3-streaming-mode.md                       | 89 ++++++++-----------
 1 file changed, 38 insertions(+), 51 deletions(-)

diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md
index 0240b6efd8..5c33457777 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md
@@ -1,55 +1,41 @@
---
-title: Streaming mode
+title: Streaming mode and ZA State in SME
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

-In real-world large-scale software, a program moves back and forth from
-streaming mode, and some streaming mode routines call other streaming mode
-routines, which means that some state needs to
be saved and restored. This -includes the ZA storage. This is defined in the ACLE and supported by the -compiler: the programmer *just* has to annotate the functions with some keywords -and let the compiler automatically perform the low-level tasks of managing the -streaming mode. This frees the developer from a tedious and error-prone task. -See [Introduction to streaming and non-streaming -mode](https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode) -for further information. The rest of this section references information from -the ACLE. - -## About streaming mode - -The AArch64 architecture defines a concept called *streaming mode*, controlled -by a processor state bit called `PSTATE.SM`. At any given point in time, the -processor is either in streaming mode (`PSTATE.SM==1`) or in non-streaming mode -(`PSTATE.SM==0`). There is an instruction called `SMSTART` to enter streaming mode -and an instruction called `SMSTOP` to return to non-streaming mode. - -Streaming mode has three main effects on C and C++ code: - -- It can change the length of SVE vectors and predicates: the length of an SVE - vector in streaming mode is called the “streaming vector length” (SVL), which - might be different from the normal non-streaming vector length. See - [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl) - for more details. -- Some instructions can only be executed in streaming mode, which means that - their associated ACLE intrinsics can only be used in streaming mode. These - intrinsics are called “streaming intrinsics”. -- Some other instructions can only be executed in non-streaming mode, which - means that their associated ACLE intrinsics can only be used in non-streaming - mode. These intrinsics are called “non-streaming intrinsics”. - -The C and C++ standards define the behavior of programs in terms of an *abstract -machine*. 
As an extension, the ACLE specification applies the distinction -between streaming mode and non-streaming mode to this abstract machine: at any -given point in time, the abstract machine is either in streaming mode or in -non-streaming mode. - -This distinction between processor mode and abstract machine mode is mostly just -a specification detail. However, the usual “as if” rule applies: the -processor's actual mode at runtime can be different from the abstract machine's -mode, provided that this does not alter the behavior of the program. One +## Understanding streaming mode + +In large-scale software, programs often switch between streaming and non-streaming mode. Some streaming-mode functions may call others, requiring portions of processor state, such as the ZA storage, to be saved and restored. This behavior is defined in the Arm C Language Extensions (ACLE) and is supported by the compiler. + +To use streaming mode, you simply annotate the relevant functions with the appropriate keywords. The compiler handles the low-level mechanics of streaming mode management, removing the need for error-prone, manual work. + +{{% notice Note %}} +For more information, see the [Introduction to streaming and non-streaming mode](https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode). The rest of this section references content from the ACLE specification. +{{% /notice %}} + +## Streaming mode behavior and compiler handling + +* The AArch64 architecture defines a concept called *streaming mode*, controlled +by a processor state bit `PSTATE.SM`. + +* At any given point in time, the processor is either in streaming mode (`PSTATE.SM==1`) or in non-streaming mode (`PSTATE.SM==0`). + +* To enter streaming mode, there is the instruction `SMSTART`, and to return to non-streaming mode, the instruction is `SMSTOP`. + +* Streaming mode affects C and C++ code in the following ways: + + - It can change the length of SVE vectors and predicates. 
The length of an SVE vector in streaming mode is called the *Streaming Vector Length* (SVL), which might differ from the non-streaming vector length. See [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl) for further information. + - Some instructions, and their associated ACLE intrinsics, can only be executed in streaming mode. These intrinsics are called *streaming intrinsics*. + - Other instructions are restricted to non-streaming mode, and their intrinsics are called *non-streaming intrinsics*. + +The ACLE specification extends the C and C++ abstract machine model to include streaming mode. At any given time, the abstract machine is either in streaming or non-streaming mode. + +This distinction between abstract machine mode and processor mode is mostly a specification detail. At runtime, the processor’s mode may differ from the abstract machine’s mode - as long as the observable program behavior remains consistent (as per the "as-if" rule). + +One practical consequence of this is that C and C++ code does not specify the exact placement of `SMSTART` and `SMSTOP` instructions; the source code simply places limits on where such instructions go. For example, when stepping through a @@ -62,17 +48,18 @@ ACLE provides attributes that specify whether the abstract machine executes stat - In streaming mode, in which case they are called *streaming statements*. - In either mode, in which case they are called *streaming-compatible statements*. -SME provides an area of storage called ZA, of size `SVL.B` x `SVL.B` bytes. It +## Working with ZA state + +SME also introduces a matrix storage area called ZA, sized `SVL.B` × `SVL.B` bytes. It also provides a processor state bit called `PSTATE.ZA` to control whether ZA is enabled. -In C and C++ code, access to ZA is controlled at function granularity: a -function either uses ZA or it does not.
Another way to say this is that a -function either “has ZA state” or it does not. +In C and C++, ZA usage is specified at the function level: a function either uses ZA or it doesn't. That is, a function either has ZA state or it does not. If a function does have ZA state, the function can either share that ZA state -with the function's caller or create new ZA state “from scratch”. In the latter +with the function's caller or create new ZA state. In the latter case, it is the compiler's responsibility to free up ZA so that the function can -use it; see the description of the lazy saving scheme in +use it - see the description of the lazy saving scheme in [AAPCS64](https://arm-software.github.io/acle/main/acle.html#AAPCS64) for details about how the compiler does this. + \ No newline at end of file From 2c28c20c413a92c7277a9ded8e8837362a8c87a6 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 4 Jul 2025 11:30:33 +0000 Subject: [PATCH 12/29] Fix figure label --- .../multiplying-matrices-with-sme2/4-vanilla-matmul.md | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md index c91630a18a..0622ac7ef3 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md @@ -6,17 +6,13 @@ weight: 6 layout: learningpathall --- -In this section, you will learn about an example of standard matrix multiplication in C. +In this section, you'll implement a basic matrix multiplication algorithm in C, using a row-major memory layout. This version serves as a reference implementation for validating optimized versions later in the Learning Path. 
## Vanilla matrix multiplication algorithm -The vanilla matrix multiplication operation takes two input matrices, A [Ar -rows x Ac columns] and B [Br rows x Bc columns], to produce an output matrix C -[Cr rows x Cc columns]. The operation consists of iterating on each row of A -and each column of B, multiplying each element of the A row with its corresponding -element in the B column then summing all these products, as Figure 2 shows. +The vanilla matrix multiplication operation takes two input matrices, A [Arrows x Ac columns] and B [Br rows x Bc columns], to produce an output matrix C [Cr rows x Cc columns]. The algorithm consists of iterating on each row of A and each column of B, multiplying each element of the A row with its corresponding element in the B column then summing all these products, as Figure 2 shows. -![example image alt-text#center](matmul.png "Figure 2: Standard Matrix Multiplication.") +![Standard Matrix Multiplication alt-text#center](matmul.png "Figure 2: Standard Matrix Multiplication.") This implies that the A, B, and C matrices have some constraints on their dimensions: From 1b51ac00710cfa9bbae24999a34d954f5a622f72 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 4 Jul 2025 15:15:01 +0000 Subject: [PATCH 13/29] Update van matmul --- .../4-vanilla-matmul.md | 48 ++++++++++--------- 1 file changed, 26 insertions(+), 22 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md index 0622ac7ef3..8089bce757 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md @@ -6,24 +6,36 @@ weight: 6 layout: learningpathall --- +## Overview + In this section, you'll implement a basic matrix multiplication algorithm in C, using a row-major memory layout. 
This version serves as a reference implementation for validating optimized versions later in the Learning Path. ## Vanilla matrix multiplication algorithm -The vanilla matrix multiplication operation takes two input matrices, A [Arrows x Ac columns] and B [Br rows x Bc columns], to produce an output matrix C [Cr rows x Cc columns]. The algorithm consists of iterating on each row of A and each column of B, multiplying each element of the A row with its corresponding element in the B column then summing all these products, as Figure 2 shows. +The vanilla matrix multiplication operation takes two input matrices: + +* Matrix A [`Ar` rows x `Ac` columns] +* Matrix B [`Br` rows x `Bc` columns] + +It produces an output matrix C [`Cr` rows x `Cc` columns]. + +The algorithm works by iterating over each row of A and each column of B. It multiplies the corresponding elements and sums the products to generate each element of matrix C, as shown in the figure below. -![Standard Matrix Multiplication alt-text#center](matmul.png "Figure 2: Standard Matrix Multiplication.") +![Standard Matrix Multiplication alt-text#center](matmul.png "Figure 2: Standard matrix multiplication.") This implies that the A, B, and C matrices have some constraints on their dimensions: -- A's number of columns must match B's number of rows: Ac == Br. -- C has the dimensions Cr == Ar and Cc == Bc. +- The number of columns in A must equal the number of rows in B: `Ac == Br`. +- Matrix C must have the dimensions `Cr == Ar` and `Cc == Bc`. + +For more information about matrix multiplication, including its history, properties and use, see this [Wikipedia article on Matrix Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication).
+ +## Variable mappings in this Learning Path + +In this Learning Path, you'll use the following variable names: -In this Learning Path, you will see the following variable names: - `matLeft` corresponds to the left-hand side argument of the matrix multiplication. - `matRight` corresponds to the right-hand side of the matrix multiplication. @@ -35,8 +47,7 @@ In this Learning Path, you will see the following variable names: ## C implementation -A literal implementation of the textbook matrix multiplication algorithm, as -described above, can be found in file `matmul_vanilla.c`: +The file `matmul_vanilla.c` contains a reference implementation of the algorithm: ```C { line_numbers="true" } void matmul(uint64_t M, uint64_t K, uint64_t N, @@ -56,17 +67,10 @@ void matmul(uint64_t M, uint64_t K, uint64_t N, } ``` -In this Learning Path, the matrices are laid out in memory as contiguous -sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). -The `matmul` function performs the algorithm described above. +## Memory layout and pointer annotations + +In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above. -The pointers to `matLeft`, `matRight` and `matResult` have been annotated -as `restrict`, which informs the compiler that the memory areas designated by -those pointers do not alias. This means that they do not overlap in any way, so -that the compiler does not need to insert extra instructions to deal with these -cases. The pointers to `matLeft` and `matRight` are marked as `const` as -neither of these two matrices are modified by `matmul`. +The pointers to `matLeft`, `matRight` and `matResult` have been annotated as `restrict`, which informs the compiler that the memory areas designated by those pointers do not alias.
This means that they do not overlap in any way, so that the compiler does not need to insert extra instructions to deal with these cases. The pointers to `matLeft` and `matRight` are marked as `const` as neither of these two matrices are modified by `matmul`. -You now have a reference standard matrix multiplication function. You will use -it later on in this Learning Path to ensure that the assembly version and the -intrinsics version of the multiplication algorithm do not contain errors. \ No newline at end of file +You now have a working baseline for the matrix multiplication function. You'll use it later on in this Learning Path to ensure that the assembly version and the intrinsics version of the multiplication algorithm do not contain errors. \ No newline at end of file From 24ebc7e6d2b1dce6bc3e7a24196829123e3b9990 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 4 Jul 2025 16:52:28 +0000 Subject: [PATCH 14/29] Updates --- .../5-outer-product.md | 41 +++++-------------- 1 file changed, 10 insertions(+), 31 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md index b49a5d2ba8..d7a7bc4591 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md @@ -6,12 +6,11 @@ weight: 7 layout: learningpathall --- -In this section, you will learn how to use the outer product with the SME engine -to improve matrix multiplication execution performances. +In this section, you'll learn how to improve matrix multiplication performance using the SME engine and outer product operations. 
-## Matrix multiplication with the outer product +## Improve performance with the outer product -In the vanilla matrix multiplication example, the core of the computation is: +In the vanilla implementation, the core multiply-accumulate step looks like this: ```C acc += matLeft[m * K + k] * matRight[k * N + n]; @@ -19,21 +18,15 @@ In the vanilla matrix multiplication example, the core of the computation is: This translates to one multiply-accumulate operation, known as `macc`, for two loads (`matLeft[m * K + k]` and `matRight[k * N + n]`). It therefore has a 1:2 -`macc` to `load` ratio. +`macc` to `load` ratio of multiply-accumulate operations (MACCs) to memory loads - one multiply-accumulate and two loads per iteration. This ratio limits efficiency, especially in triple-nested loops where memory bandwidth becomes a bottleneck. -From a memory system perspective, this is not efficient, especially since this -computation is done within a triple-nested loop, repeatedly loading data from -memory. - -To make matters worse, large matrices might not fit in cache. To improve matrix -multiplication efficiency, the goal is to increase the `macc` to `load` ratio, -which means increasing the number of multiply-accumulate operations per load. +To make matters worse, large matrices might not fit in cache. To improve matrix multiplication efficiency, the goal is to increase the `macc` to `load` ratio, which means increasing the number of multiply-accumulate operations per load - you can express matrix multiplication as a sum of column-by-row outer products. 
Figure 3 below illustrates how the matrix multiplication of `matLeft` (3 rows, 2 columns) by `matRight` (2 rows, 3 columns) can be decomposed as the sum of outer products: -![example image alt-text#center](outer_product.png "Figure 3: Outer Product-based Matrix Multiplication.") +![example image alt-text#center](outer_product.png "Figure 3: Outer product-based matrix multiplication.") The SME engine builds on the [Outer Product](https://en.wikipedia.org/wiki/Outer_product) because matrix @@ -42,33 +35,19 @@ products](https://en.wikipedia.org/wiki/Outer_product#Connection_with_the_matrix ## About transposition -From the previous page, you will recall that matrices are laid out in row-major -order. This means that loading row-data from memory is efficient as the memory -system operates efficiently with contiguous data. An example of this is where -caches are loaded row by row, and data prefetching is simple - just load the -data from `current address + sizeof(data)`. This is not the case for loading +From the previous page, you will recall that matrices are laid out in row-major order. This means that loading row-data from memory is efficient as the memory-system operates efficiently with contiguous data. An example of this is where caches are loaded row by row, and data prefetching is simple - just load the data from `current address + sizeof(data)`. This is not the case for loading column-data from memory though, as it requires more work from the memory system. -To further improve matrix multiplication effectiveness, it is therefore -desirable to change the layout in memory of the left-hand side matrix, called -`matLeft` in the code examples in this Learning Path. The improved layout would -ensure that elements from the same column are located next to each other in -memory. 
This is essentially a matrix transposition, which changes `matLeft` from +To further improve matrix multiplication effectiveness, it is therefore desirable to change the layout in memory of the left-hand side matrix, called `matLeft` in the code examples in this Learning Path. The improved layout would ensure that elements from the same column are located next to each other in memory. This is essentially a matrix transposition, which changes `matLeft` from row-major order to column-major order. {{% notice Important %}} -It is important to note here that this reorganizes the layout of the matrix in -memory to make the algorithm implementation more efficient. The transposition -affects only the memory layout. `matLeft` is transformed to column-major order, -but from a mathematical perspective, `matLeft` is *not* transposed. +It is important to note here that this reorganizes the layout of the matrix in memory to make the algorithm implementation more efficient. The transposition affects only the memory layout. `matLeft` is transformed to column-major order, but from a mathematical perspective, `matLeft` is *not* transposed. {{% /notice %}} ### Transposition in the real world -Just as trees don't reach the sky, the SME engine has physical implementation -limits. It operates with tiles in the ZA storage. Tiles are 2D portions of the -matrices being processed. SME has dedicated instructions to load and store data -from tiles efficiently, as well as instructions to operate with and on tiles. +Just as trees don't reach the sky, the SME engine has physical implementation limits. It operates on *tiles* in the ZA storage. Tiles are 2D portions of the matrices being processed. SME has dedicated instructions to load, store and compute on these tiles efficiently. 
For example, the [fmopa](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en) instruction takes two vectors as inputs and accumulates all the outer products From a35dc97a03a9a59312addd07696541636d5793b4 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 4 Jul 2025 19:31:04 +0000 Subject: [PATCH 15/29] Corrected reference title --- .../multiplying-matrices-with-sme2/_index.md | 2 +- .../multiplying-matrices-with-sme2/overview.md | 14 ++++++++------ 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md index 227820eace..42bc39f111 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md @@ -8,7 +8,7 @@ who_is_this_for: This Learning Path is an advanced topic for developers who want learning_objectives: - Implement a baseline matrix multiplication kernel in C without SME2 - Use SME2 assembly instructions to accelerate matrix multiplication performance - - Use SME2 intrinsics to vectorize and optimize matrix multiplication in C + - Use SME2 intrinsics to vectorize and optimize matrix multiplication - Compile code with SME2 intrinsics and assembly - Benchmark and validate SME2-accelerated matrix multiplication on Arm hardware or in a Linux-based emulation environment - Compare performance metrics between baseline and SME2-optimized implementations diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md index f6201a1c0b..20b29a6f46 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md +++ 
b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md @@ -8,12 +8,14 @@ layout: learningpathall ## Arm's Scalable Matrix Extension Version 2 (SME2) -Arm’s Scalable Matrix Extension Version 2 (SME2) is a hardware feature designed to accelerate dense linear algebra operations, enabling high-throughput execution of matrix-based workloads. Whether you're building for AI inference, HPC, or scientific computing, SME2 provides fine-grained control and high-performance vector processing. +Arm’s Scalable Matrix Extension Version 2 (SME2) is a hardware feature designed to accelerate dense linear algebra operations, enabling high-throughput execution of matrix-based workloads. +Whether you're building for AI inference, HPC, or scientific computing, SME2 provides fine-grained control and high-performance vector processing. -## Extending the SME Architecture -The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture and is designed to accelerate matrix-heavy computations, such as outer products and matrix-matrix multiplications. +## Extending the SME architecture + +SME is an extension to the Armv9-A architecture and is designed to accelerate matrix-heavy computations, such as outer products and matrix-matrix multiplications. SME2 builds on SME by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication. 
@@ -35,14 +37,14 @@ SME2 adds several capabilities to the original SME architecture: This Learning Path does assume some basic understanding of SVE, SME, and matrix multiplication, however if you do want to refresh or grow your knowledge, these are some useful resources that you might find helpful: -### On matrix multiplication +On matrix multiplication: - This [Wikipedia article on Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) -### On SVE and SME +On SVE and SME: - [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture) - [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction) - [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2) - [Part 3: Matrix-matrix multiplication. 
Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/.matrix-matrix-multiplication-neon-sve-and-sme-compared) -- [Build adaptive libraries with multiversioning](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file +- [Learn about function multiversioning, Alexandros Lamprineas, Arm](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file From 5e008b0d7f60c78f46fc73b1b15031f220198434 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 4 Jul 2025 20:07:24 +0000 Subject: [PATCH 16/29] outer product improvements --- .../5-outer-product.md | 24 +++++++------------ .../overview.md | 4 ++-- 2 files changed, 11 insertions(+), 17 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md index d7a7bc4591..7389b33021 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md @@ -8,7 +8,7 @@ layout: learningpathall In this section, you'll learn how to improve matrix multiplication performance using the SME engine and outer product operations. -## Improve performance with the outer product +## Increase MACC efficiency using outer products In the vanilla implementation, the core multiply-accumulate step looks like this: @@ -33,21 +33,21 @@ Product](https://en.wikipedia.org/wiki/Outer_product) because matrix multiplication can be expressed as the [sum of column-by-row outer products](https://en.wikipedia.org/wiki/Outer_product#Connection_with_the_matrix_product). -## About transposition +## Optimize memory layout with transposition From the previous page, you will recall that matrices are laid out in row-major order. 
This means that loading row-data from memory is efficient as the memory-system operates efficiently with contiguous data. An example of this is where caches are loaded row by row, and data prefetching is simple - just load the data from `current address + sizeof(data)`. This is not the case for loading column-data from memory though, as it requires more work from the memory system. -To further improve matrix multiplication effectiveness, it is therefore desirable to change the layout in memory of the left-hand side matrix, called `matLeft` in the code examples in this Learning Path. The improved layout would ensure that elements from the same column are located next to each other in memory. This is essentially a matrix transposition, which changes `matLeft` from +To further improve matrix multiplication effectiveness, it is desirable to change the layout in memory of the left-hand side matrix, called `matLeft` in the code examples in this Learning Path. The improved layout ensures that elements from the same column are located next to each other in memory. This is essentially a matrix transposition, which changes `matLeft` from row-major order to column-major order. {{% notice Important %}} -It is important to note here that this reorganizes the layout of the matrix in memory to make the algorithm implementation more efficient. The transposition affects only the memory layout. `matLeft` is transformed to column-major order, but from a mathematical perspective, `matLeft` is *not* transposed. +This transformation affects only the memory layout. From a mathematical perspective, `matLeft` is not transposed. It is reorganized for better data locality. {{% /notice %}} ### Transposition in the real world -Just as trees don't reach the sky, the SME engine has physical implementation limits. It operates on *tiles* in the ZA storage. Tiles are 2D portions of the matrices being processed. SME has dedicated instructions to load, store and compute on these tiles efficiently. 
+Just as trees don't reach the sky, the SME engine has physical implementation limits. It operates on *tiles* - 2D blocks of data stored in the ZA storage. SME has dedicated instructions to load, store, and compute on these tiles efficiently. For example, the [fmopa](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en) instruction takes two vectors as inputs and accumulates all the outer products @@ -55,13 +55,9 @@ into a 2D tile. The tile in ZA storage allows SME to increase the `macc` to `load` ratio by loading all the tile elements to be used with the SME outer product instructions. -Considering that ZA storage is finite, the desired transposition of the -`matLeft` matrix discussed in the previous section needs to be adapted to the -tile dimensions, so that a tile is easy to access. The `matLeft` preprocessing -thus involves some aspects of transposition but also takes into account tiling, -referred to in the code as `preprocess`. +But since ZA storage is finite, you need to you need to preprocess `matLeft` to fit tile dimensions - this includes transposing portions of the matrix and padding where needed. -Here is what `preprocess_l` does in practice, at the algorithmic level: +The following function shows how `preprocess_l` transforms the matrix at the algorithmic level: ```C { line_numbers = "true" } void preprocess_l(uint64_t nbr, uint64_t nbc, uint64_t SVL, @@ -87,12 +83,10 @@ void preprocess_l(uint64_t nbr, uint64_t nbc, uint64_t SVL, } ``` -`preprocess_l` will be used to check that the assembly and intrinsic versions of -the matrix multiplication perform the preprocessing step correctly. This code is -located in the file `preprocess_vanilla.c`. +This routine is defined in `preprocess_vanilla.c.` It's used to ensure the assembly and intrinsics-based matrix multiplication routines work with the expected input format. 
{{% notice Note %}} -In real-world applications, it might be possible to arrange for `matLeft` to be +In production environments, it might be possible to arrange for `matLeft` to be stored in column-major order, eliminating the need for transposition and making the preprocessing step unnecessary. Matrix processing frameworks and libraries often have attributes within the matrix object to track if it is in row- or diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md index 20b29a6f46..f9535b213f 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md @@ -43,8 +43,8 @@ On matrix multiplication: On SVE and SME: -- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture) +- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture - Martin Weidmann, Arm](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture) - [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction) - [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2) - [Part 3: Matrix-matrix multiplication. 
Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/.matrix-matrix-multiplication-neon-sve-and-sme-compared) -- [Learn about function multiversioning, Alexandros Lamprineas, Arm](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file +- [Learn about function multiversioning - Alexandros Lamprineas, Arm](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file From bfabf61142fdea1dc478c89f290d58936789a3ed Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 4 Jul 2025 20:20:10 +0000 Subject: [PATCH 17/29] Heading change --- .../multiplying-matrices-with-sme2/6-sme2-matmul-asm.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md index 461058887d..3b1f278896 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md @@ -9,7 +9,7 @@ layout: learningpathall In this chapter, you will use an SME2-optimized matrix multiplication written directly in assembly. 
-## Matrix multiplication with SME2 in assembly +## About the SME2 assembly implementation ### Description From bda0828b66e1f88b415dd91748d9b3f87dceae23 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Sat, 5 Jul 2025 04:22:51 +0000 Subject: [PATCH 18/29] Clarifying language --- .../8-benchmarking.md | 30 ++++------ .../9-debugging.md | 60 +++++++------------ 2 files changed, 34 insertions(+), 56 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md index ba750528d5..097320cb04 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md @@ -6,36 +6,28 @@ weight: 10 layout: learningpathall --- -In this section, if your machine supports native execution of SME2 instructions, -you will perform benchmarking of the matrix multiplication improvement thanks to -SME2. +In this section, you'll benchmark matrix multiplication performance using SME2, if your machine supports native execution of SME2 instructions. ## About benchmarking and emulation Emulation is generally not the best way to assess the performance of a piece of -code. Emulation focuses on correctly simulating instructions and leaves out many -details necessary for precise execution time measurement. For example, as -explained in the section on the outer product, the goal was to increase the -`macc` to `load` ratio. Emulators, including the FVP, do not model in detail the -cache effects or the timing effects of the memory accesses. At best, an emulator -can provide an instruction count for the vanilla reference implementation versus -the assembly-/intrinsic-based versions of the matrix multiplication, but this is -known to be a poor proxy for execution time comparisons. +code. 
Emulation focuses on correctly simulating instructions and not accurate execution timing. For example, as explained in the [outer product section](../5-outer-product/), improving performance involves increasing the `macc`-to-`load` ratio. -## Benchmarking on platform with native SME2 support +Emulators, including the FVP, do not model in detail memory bandwidth, cache behavior, or latency. At best, an emulator provides an instruction count for the vanilla reference implementation versus the assembly-/intrinsic-based versions of the matrix multiplication, which is useful for functional validation but not for precise benchmarking. + +## Benchmarking on a platform with native SME2 support {{% notice Note %}} -Benchmarking and profiling are not simple tasks. The purpose of this Learning Path -is to provide some basic guidelines on the performance improvement that can be -obtained with SME2. +Benchmarking and profiling are complex tasks. This Learning Path provides a *simplified* framework for observing SME2-related performance improvements. {{% /notice %}} -If your machine natively supports SME2, then benchmarking becomes possible. When +If your machine natively supports SME2, then benchmarking is possible. When `sme2_matmul_asm` and `sme2_matmul_intr` were compiled with `BAREMETAL=0`, the -*benchmarking mode* becomes available. +*benchmarking mode* is available. + +*Benchmarking mode* is enabled by prepending the `M`, `K`, `N` optional parameters with an iteration count (`I`). -*Benchmarking mode* is enabled by prepending the `M`, `K`, `N` optional -parameters with an iteration count (`I`). 
+## Run the intrinsic version Now measure the execution time of `sme2_matmul_intr` for 1000 multiplications of matrices with the default sizes: diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md index a6debde6f4..081b75a3e7 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md @@ -6,19 +6,15 @@ weight: 11 layout: learningpathall --- -In practice, writing code can be complex and debugging code is required. +Debugging is an essential part of development, especially when working close to the hardware. -In this section, you will learn about the different ways to debug SME2 code. +In this section, you will learn about the different ways to debug and troubleshoot SME2 code. -## Looking at the generated code +## Inspect the generated assembly -In some cases, it is useful to look at the code generated by the compiler. In -this Learning Path, the assembly listings have been produced and you can inspect -them. +Sometimes it's helpful to review the assembly code generated by the compiler. In this Learning Path, listings have already been generated for you. You can inspect these files to verify that SME2 instructions were emitted correctly. -For example, the inner loop with the outer product and the accumulation of the -matrix multiplication with intrinsics from the listing file -`sme2_matmul_intr.lst` looks like this: +For example, here’s a snippet from `sme2_matmul_intr.lst`, showing the inner loop of the matrix multiplication using intrinsics: ```TXT ... @@ -31,27 +27,21 @@ matrix multiplication with intrinsics from the listing file 8000186c: 54ffff41 b.ne 0x80001854 ... ``` +This sequence shows how `ld1w` loads vector registers, followed by the `fmopa` outer product operation. 
-### With debuggers
+### Debug with gdb or lldb
 
-Both of the main debuggers, `gdb` and `lldb`, have some support for
-debugging SME2 code. Their usage is not shown in this Learning Path though.
+Both of the main debuggers, `gdb` and `lldb`, have some support for debugging SME2 code. Their usage is not shown in this Learning Path though.
 
-Note that debugging on the emulator might require some more steps as this is a
-simplistic, and minimalistic environment, without an operating system, for
-example. Debug mode requires a debug monitor to interface between the debugger,
-the program, and the CPU.
+{{% notice Note %}}
+If you're using the FVP emulator, debugging is more complex. Because there's no operating system, you'll need a debug monitor to interface between your program, the CPU, and your debugger.
+{{% /notice %}}
 
-### With trace
+### Analyze instruction trace with Tarmac
 
-The FVP can emit an instruction trace file in text format, known as the Tarmac
-trace. This provides a convenient way for you to understand what the program is
-doing.
+The FVP can emit an instruction trace file in text format, known as the Tarmac trace. This trace shows instruction-by-instruction execution and register contents, which is helpful for low-level debugging.
 
-In the excerpt shown below, you can see that the SVE register `z0` has been
-loaded with 16 values, as predicate `p0` was true, with an `LD1W`
-instruction, whereas `z1` was loaded with only two values, as `p1`. `z0`,
-and `z1` are later used by the `fmopa` instruction to compute the outer
+In the excerpt shown below, you can see that the SVE register `z0` has been loaded with 16 values, as predicate `p0` was true, with an `LD1W` instruction, whereas `z1` was loaded with only two values, as governed by predicate `p1`. `z0` and `z1` are later used by the `fmopa` instruction to compute the outer
 product, and the trace displays the content of the ZA storage.
 
 ```TXT
@@ -103,21 +93,17 @@
923580000 ps R ZA0H_S_15 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4479e70a_44f4223e
 ```
 
-You can get a Tarmac trace when invoking `run-fvp.sh` by adding the
-`--trace` option as the *first* argument, for example:
+You can get a Tarmac trace when invoking `run-fvp.sh` by adding the `--trace` option as the *first* argument, for example:
 
 ```BASH
 docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 ./run-fvp.sh --trace sme2_matmul_asm
 ```
 
-Tracing is not enabled by default. It slows down the simulation significantly and the trace file can become very large for programs with large matrices.
-
-{{% notice Debugging tip %}}
-It can be helpful when debugging to understand where an element in the
-Tile is coming from. The current code base allows you to do that in `debug`
-mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you
-look into `main.c`, you will notice that the matrix initialization is no
-longer random, but instead initializes each element with its linear
-index. This makes it *easier* to find where the matrix elements are loaded in
-the tile in tarmac trace, for example.
-{{% /notice %}} \ No newline at end of file
+{{% notice Tip %}}
+Tracing is disabled by default because it significantly slows down simulation and generates large files for big matrices.
+{{% /notice %}}
+
+## Use debug mode for matrix inspection
+
+It can be helpful when debugging to understand where an element in the tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you look into `main.c`, you will notice that the matrix initialization is no
+longer random, but instead initializes each element with its linear index. This makes it *easier* to find where the matrix elements are loaded in the tile in the Tarmac trace, for example.
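As an illustration of the linear-index idea, here is a sketch of such an initialization, together with a helper that maps a value spotted in the trace back to its coordinates. The function names are illustrative, not the actual `main.c` code:

```c
#include <stdint.h>

// Debug-mode style initialization: each element holds its own linear
// index, so a value seen in the Tarmac trace identifies exactly which
// matrix element was loaded into the tile.
void debug_init(float *m, uint64_t rows, uint64_t cols) {
    for (uint64_t i = 0; i < rows; i++)
        for (uint64_t j = 0; j < cols; j++)
            m[i * cols + j] = (float)(i * cols + j);  // value == linear index
}

// Recover the (row, col) coordinates from a value seen in the trace.
void locate(float value, uint64_t cols, uint64_t *row, uint64_t *col) {
    uint64_t idx = (uint64_t)value;
    *row = idx / cols;
    *col = idx % cols;
}
```

For example, seeing the value `7` in a trace of a matrix with 4 columns tells you the element came from row 1, column 3.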
From c7043e604e50556330009f47611baa0252c3ca61 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Sat, 5 Jul 2025 04:46:28 +0000 Subject: [PATCH 19/29] Updates --- .../10-going-further.md | 85 ++++++++++++++----- 1 file changed, 65 insertions(+), 20 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md index 3964ca0acb..c5b3d20d81 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md @@ -6,35 +6,72 @@ weight: 12 layout: learningpathall --- -In this section, you will learn about the many different optimizations that are -available to you. +In this section, you'll explore ways to optimize and extend the matrix multiplication algorithm beyond the current SME2 implementation. These improvements include generalization, loop unrolling, and strategic use of matrix properties. -## Generalize the algorithms +## Generalize the algorithm for other data types -In this Learning Path, you focused on using SME2 for matrix multiplication with floating-point numbers. However in practice, any library or framework supporting -matrix multiplication should also handle various integer types. +So far, this Learning Path has focused on multiplying floating-point matrices. In practice, matrix operations are also performed on various integer types. -You can see that the algorithm structure for matrix preprocessing as well as -multiplication with the outer product does not change at all for other data -types - they only need to be adapted. +The overall structure of the algorithm - preprocessing with tiling and outer product–based multiplication - remains the same across data types. You only need to adapt how the data is loaded, stored, and accumulated. 
-This is suitable for languages with [generic -programming](https://en.wikipedia.org/wiki/Generic_programming) like C++ with -templates. You can even make the template manage a case where the value -accumulated during the product uses a larger type than the input matrices. SME2 -has the instructions to deal efficiently with this common case scenario. +This pattern works well in languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates. Templates can also handle cases where accumulation uses a wider data type than the input matrices, which is a common requirement. SME2 supports this with widening multiply-accumulate instructions. -This enables the library developer to focus on the algorithm, testing, and -optimizations, while allowing the compiler to generate multiple variants. +By expressing the algorithm generically, you let the compiler generate multiple variants while you focus on: -## Unroll further +- Algorithm design +- Testing and verification +- SME2-specific optimization -You might have noticed that ``matmul_intr_impl`` computes only one tile at a -time, for the sake of simplicity. +## Unroll loops to compute multiple tiles -SME2 does support multi-vector instructions, and some were used in -``preprocess_l_intr``, for example, ``svld1_x2``. +For clarity, the `matmul_intr_impl` function in this Learning Path processes one tile at a time. But SME2 supports multi-vector operations, and you can take advantage of them to improve performance. +For example, `preprocess_l_intr` uses: + +```c +svld1_x2(...); // Load two vectors at once + +--- +title: Going further +weight: 12 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +--- +title: Going further +weight: 12 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +In this section, you'll explore ways to optimize and extend the matrix multiplication algorithm beyond the current SME2 implementation. 
These improvements include generalization, loop unrolling, and strategic use of matrix properties. + +## Generalize the algorithm for other data types + +So far, this Learning Path has focused on multiplying floating-point matrices. In practice, matrix operations are also performed on various integer types. + +The overall structure of the algorithm - preprocessing with tiling and outer product–based multiplication - remains the same across data types. You only need to adapt how the data is loaded, stored, and accumulated. + +This pattern works well in languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates. Templates can also handle cases where accumulation uses a wider data type than the input matrices, which is a common requirement. SME2 supports this with widening multiply-accumulate instructions. + +By expressing the algorithm generically, you let the compiler generate multiple variants while you focus on: + +- Algorithm design +- Testing and verification +- SME2-specific optimization + +## Unroll loops to compute multiple tiles + +For clarity, the `matmul_intr_impl` function in this Learning Path processes one tile at a time. But SME2 supports multi-vector operations, and you can take advantage of them to improve performance. + +For example, `preprocess_l_intr` uses: + +```c +svld1_x2(...); // Load two vectors at once +``` Loading two vectors at a time enables the simultaneous computing of more tiles, and as the input matrices have been laid out in memory in a neat way, the consecutive loading of the data is efficient. Implementing this approach can @@ -70,3 +107,11 @@ and one column. Although our current code handles it correctly from a results point of view, a different algorithm and use of instructions might be more efficient. Can you think of another way? 
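If you attempt the unrolling exercise, a scalar sketch of the two-tiles-per-iteration idea may help. The dimensions and names here are illustrative, not the actual intrinsics:

```c
// Scalar sketch of unrolling: accumulate into two tiles per iteration,
// mirroring what multi-vector loads such as svld1_x2 enable.
#define TDIM 4

void outer_product_acc_x2(float tile0[TDIM][TDIM], float tile1[TDIM][TDIM],
                          const float left_col0[TDIM],
                          const float left_col1[TDIM],
                          const float right_row[TDIM]) {
    // One load of right_row now feeds 2 * TDIM * TDIM maccs instead of
    // TDIM * TDIM, improving the macc to load ratio.
    for (int i = 0; i < TDIM; i++)
        for (int j = 0; j < TDIM; j++) {
            tile0[i][j] += left_col0[i] * right_row[j];
            tile1[i][j] += left_col1[i] * right_row[j];
        }
}
```

The same right-hand vector is reused across both tiles, which is where the extra efficiency comes from.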
+ + +In order to check your understanding of SME2, you can try to implement this +unrolling yourself in the intrinsic version (the asm version already has this +optimization). You can check your work by comparing your results to the expected +reference values. + + From e57f9a11986dcdd36c5e557bac7d3dcf02b638cf Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Sat, 5 Jul 2025 22:22:27 +0000 Subject: [PATCH 20/29] Improvements including new internal links. --- .../1-get-started.md | 27 ++++++++++--------- .../multiplying-matrices-with-sme2/_index.md | 2 +- .../overview.md | 10 +++---- 3 files changed, 20 insertions(+), 19 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 41ec7c1030..23e6215b67 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -10,17 +10,17 @@ layout: learningpathall Before you can build or run any SME2-accelerated code, you need to set up your development environment. -This section walks you through the required tools and two supported execution options: +This section walks you through the required tools and the two supported execution options, which are: -* **Native SME2 hardware** - build and run directly on a system with SME2 support. For supported devices, see [Devices with SME2 support](#devices-with-sme2-support). +* [**Native SME2 hardware**](#set-up-a-system-with-native-SME2-support) - build and run directly on a system with SME2 support. For supported devices, see [Devices with SME2 support](#devices-with-sme2-support). -* **Docker-based emulation** - use a container to emulate SME2 in bare metal mode (without an OS). 
+* [**Docker-based emulation**](#set-up-a-system-using-sme2-emulation-with-docker) - use a container to emulate SME2 in bare metal mode (without an OS).
 
 ## Download and explore the code examples
 
 To get started, begin by [downloading the code examples](https://gitlab.arm.com/learning-cde-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2).
 
-Now extract the archive and change to the target directory:
+Now extract the archive, and change to the target directory:
 ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2.``
 
 ```BASH
@@ -65,30 +65,31 @@ It includes:
 - `run-fvp.sh` to run the FVP model.
 - A `docker` directory containing:
   - `assets.source_me` to provide toolchain paths.
-  - `build-my-container.sh`, a script that automates building the Docker image from the sme2-environment.docker file. It runs the docker build command with the right arguments so you don’t have to remember them.
-  - `sme2-environment.docker`, a Dockerfile that defines the steps to build the SME2 container image. It installs all necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator.
-  - `build-all-containers.sh` used to build multi-architecture images.
+  - `build-my-container.sh`, a script that automates building the Docker image from the `sme2-environment.docker` file. It runs the docker build command with the correct arguments so you don’t have to remember them.
+  - `sme2-environment.docker`, a Dockerfile that defines the steps to build the SME2 container image. It installs all the necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator.
  - `build-all-containers.sh`, a script to build multi-architecture images.
 - ``.devcontainer/devcontainer.json`` for VS Code container support.
{{% notice Note %}}
 From this point, all instructions assume that your current directory is
-``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. So to follow along, ensure that you are in the correct place before proceeding.
+``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. So to follow along, ensure that you are in the correct directory before proceeding.
 {{% /notice %}}
 
 ## Set up a system with native SME2 support
 
 To run SME2 code natively, ensure your system includes SME2 hardware and uses a compiler version that supports SME2.
 
-Use [Clang](https://www.llvm.org/) version 18 or later, or [GCC](https://gcc.gnu.org/) version 14 or later. This Learning Path uses ``clang``.
+For the compiler, you can use [Clang](https://www.llvm.org/) version 18 or later, or [GCC](https://gcc.gnu.org/) version 14 or later. This Learning Path uses ``clang``.
 
 {{% notice Note %}}
 At the time of writing, macOS ships with `clang` version 17.0.0, which doesn't support SME2. Use a newer version, such as 20.1.7, available through Homebrew.
 {{% /notice %}}
 
-To check your compiler version:``clang --version``
+You can check your compiler version using the command: ``clang --version``
 
 ### Install Clang
 
+Install Clang using the instructions below, selecting either macOS or Linux/Ubuntu, as appropriate:
+
 {{< tabpane code=true >}}
 
   {{< tab header="Linux/Ubuntu" language="bash">}}
@@ -117,8 +118,7 @@ for emulating code with SME2 instructions. You can run the provided image or bui
 Docker is optional, but if you don’t use it, you must manually install the compiler and FVP, and ensure they’re in your path.
{{% /notice %}} -Start by checking that ``docker`` is installed on your machine by typing the -following command line in a terminal: +Start by checking that ``docker`` is installed on your machine: ```BASH { output_lines="2" } docker --version @@ -178,6 +178,7 @@ https://docs.docker.com/get-started/ You can use Docker in the following ways: - Directly from the command line. For example, when you are working from a terminal on your local machine. + - Within a containerized environment. Configure VS Code to execute all the commands inside a Docker container, allowing you to work seamlessly within the Docker environment. diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md index 42bc39f111..918fcc44f8 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md @@ -15,7 +15,7 @@ learning_objectives: prerequisites: - Working knowledge of Arm’s SVE and SME instruction sets - - Intermediate proficiency with C and Armv9-A assembly language + - Intermediate proficiency with the C programming language and the Armv9-A assembly language - A computer running Linux, macOS, or Windows - Installations of Git and Docker for project setup and emulation - A platform that supports SME2 (see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md index f9535b213f..ed46aca044 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md +++ 
b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md @@ -29,7 +29,7 @@ SME2 adds several capabilities to the original SME architecture: * A **predicate-as-counter mechanism**, which is a new predication mechanism that is added alongside the original SVE approach to enable fine-grained control over operations across multiple vector registers. -* **Compressed neural network support** using dedicated lookup table and outer product instructions that support binary neural network workloads. +* **Compressed neural network support**, using dedicated lookup table and outer product instructions that support binary neural network workloads. * A **512-bit architectural register ZT0**, which is a dedicated register that enables fast, table-driven data transformations. @@ -39,12 +39,12 @@ This Learning Path does assume some basic understanding of SVE, SME, and matrix On matrix multiplication: -- This [Wikipedia article on Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) +- The [Wikipedia article](https://en.wikipedia.org/wiki/Matrix_multiplication) On SVE and SME: - [Introducing the Scalable Matrix Extension for the Armv9-A Architecture - Martin Weidmann, Arm](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture) -- [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction) -- [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2) -- [Part 3: Matrix-matrix multiplication. 
Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/.matrix-matrix-multiplication-neon-sve-and-sme-compared) +- [Arm Scalable Matrix Extension (SME) Introduction (Part 1) - Zenon Xiu](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction) +- [Arm Scalable Matrix Extension (SME) Introduction (Part 2) - Zenon Xiu](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2) +- [Matrix-matrix multiplication. Neon, SVE, and SME compared (Part 3)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/.matrix-matrix-multiplication-neon-sve-and-sme-compared) - [Learn about function multiversioning - Alexandros Lamprineas, Arm](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file From ca40e031857abc737c01b229a239160525b17eb0 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Sat, 5 Jul 2025 22:55:59 +0000 Subject: [PATCH 21/29] Updates --- .../10-going-further.md | 52 ++----------------- 1 file changed, 4 insertions(+), 48 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md index c5b3d20d81..c42f6d48d4 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md @@ -6,58 +6,17 @@ weight: 12 layout: learningpathall --- -In this section, you'll explore ways to optimize and extend the matrix multiplication algorithm beyond the current SME2 implementation. These improvements include generalization, loop unrolling, and strategic use of matrix properties. 
+This section presents different ways that you can optimize and extend the matrix multiplication algorithm beyond the SME2 implementation that you've explored in this Learning Path. Improvements that you might like to build on include generalization, loop unrolling, and the strategic use of matrix properties. ## Generalize the algorithm for other data types -So far, this Learning Path has focused on multiplying floating-point matrices. In practice, matrix operations are also performed on various integer types. +So far, this Learning Path has focused on multiplying floating-point matrices. In practice, matrix operations are performed on various integer types. The overall structure of the algorithm - preprocessing with tiling and outer product–based multiplication - remains the same across data types. You only need to adapt how the data is loaded, stored, and accumulated. This pattern works well in languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates. Templates can also handle cases where accumulation uses a wider data type than the input matrices, which is a common requirement. SME2 supports this with widening multiply-accumulate instructions. -By expressing the algorithm generically, you let the compiler generate multiple variants while you focus on: - -- Algorithm design -- Testing and verification -- SME2-specific optimization - -## Unroll loops to compute multiple tiles - -For clarity, the `matmul_intr_impl` function in this Learning Path processes one tile at a time. But SME2 supports multi-vector operations, and you can take advantage of them to improve performance. 
- -For example, `preprocess_l_intr` uses: - -```c -svld1_x2(...); // Load two vectors at once - ---- -title: Going further -weight: 12 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - ---- -title: Going further -weight: 12 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -In this section, you'll explore ways to optimize and extend the matrix multiplication algorithm beyond the current SME2 implementation. These improvements include generalization, loop unrolling, and strategic use of matrix properties. - -## Generalize the algorithm for other data types - -So far, this Learning Path has focused on multiplying floating-point matrices. In practice, matrix operations are also performed on various integer types. - -The overall structure of the algorithm - preprocessing with tiling and outer product–based multiplication - remains the same across data types. You only need to adapt how the data is loaded, stored, and accumulated. - -This pattern works well in languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates. Templates can also handle cases where accumulation uses a wider data type than the input matrices, which is a common requirement. SME2 supports this with widening multiply-accumulate instructions. - -By expressing the algorithm generically, you let the compiler generate multiple variants while you focus on: +By expressing the algorithm generically, you benefit from the compiler generating multiple variants, allowing you the opportunity to focus on: - Algorithm design - Testing and verification @@ -72,10 +31,7 @@ For example, `preprocess_l_intr` uses: ```c svld1_x2(...); // Load two vectors at once ``` -Loading two vectors at a time enables the simultaneous computing of more tiles, -and as the input matrices have been laid out in memory in a neat way, the -consecutive loading of the data is efficient. 
Implementing this approach can
-make improvements to the ``macc`` to load ``ratio``.
+Loading two vectors at a time enables the simultaneous computing of more tiles, and as the input matrices have been laid out in memory in a neat way, the consecutive loading of the data is efficient. Implementing this approach can make improvements to the ``macc`` to ``load`` ratio.
 
 In order to check your understanding of SME2, you can try to implement this
 unrolling yourself in the intrinsic version (the asm version already has this
 optimization). You can check your work by comparing your results to the expected
 reference values.

From ef38e398158aeddd0bc73116ae8bbe6a7e931f2b Mon Sep 17 00:00:00 2001
From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com>
Date: Sun, 6 Jul 2025 10:33:05 +0000
Subject: [PATCH 22/29] Updates

---
 .../10-going-further.md | 44 ++++++++++++++-----
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md
index c42f6d48d4..63ea7b9724 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md
@@ -1,40 +1,62 @@
 ---
-title: Going further
+title: Beyond this implementation
 weight: 12
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-This section presents different ways that you can optimize and extend the matrix multiplication algorithm beyond the SME2 implementation that you've explored in this Learning Path. Improvements that you might like to build on include generalization, loop unrolling, and the strategic use of matrix properties.
+## Going further
+
+There are many different ways that you can extend and optimize the matrix multiplication algorithm beyond the specific SME2 implementation that you've explored in this Learning Path.
While the current approach is tuned for performance on a specific hardware target, further improvements can make your code more general, more efficient, and better suited to a wider range of applications. -So far, this Learning Path has focused on multiplying floating-point matrices. In practice, matrix operations are performed on various integer types. +Advanced optimization techniques are essential when adapting algorithms to real-world scenarios. These often include processing matrices of different shapes and sizes, handling mixed data types, or maximizing throughput for large batch operations. The ability to generalize and fine-tune your implementation opens the door to more scalable and reusable code that performs well across workloads. -The overall structure of the algorithm - preprocessing with tiling and outer product–based multiplication - remains the same across data types. You only need to adapt how the data is loaded, stored, and accumulated. +Whether you're targeting different data types, improving parallelism, or adapting to unusual matrix shapes, these advanced techniques give you more control over both correctness and performance. -This pattern works well in languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates. Templates can also handle cases where accumulation uses a wider data type than the input matrices, which is a common requirement. SME2 supports this with widening multiply-accumulate instructions. +Some ideas of improvements that you might like to test out include: + +* Generalization +* Loop unrolling +* The strategic use of matrix properties + +## Generalize the algorithm for different data types + +So far, you've focused on multiplying floating-point matrices. In practice, matrix operations often involve integer types as well. + +The structure of the algorithm remains consistent across data types. It uses preprocessing with tiling and outer product–based multiplication. 
To adapt it for other data types, you only need to change how values are:
+
+* Loaded from memory
+* Accumulated (often with widening)
+* Stored to the output
+
+Languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates, make this easier.
+
+Templates allow you to:
+
+* Swap data types flexibly
+* Handle accumulation in a wider format (a common requirement)
+* Reuse algorithm logic across multiple matrix types
 
 By expressing the algorithm generically, you benefit from the compiler generating multiple variants, allowing you the opportunity to focus on:
 
-- Algorithm design
+- Efficient algorithm design
 - Testing and verification
 - SME2-specific optimization
 
 ## Unroll loops to compute multiple tiles
 
-For clarity, the `matmul_intr_impl` function in this Learning Path processes one tile at a time. But SME2 supports multi-vector operations, and you can take advantage of them to improve performance.
+For clarity, the `matmul_intr_impl` function in this Learning Path processes one tile at a time. However, SME2 supports multi-vector operations that enable better performance through loop unrolling.
 
-For example, `preprocess_l_intr` uses:
+For example, the `preprocess_l_intr` function uses:
 
 ```c
 svld1_x2(...); // Load two vectors at once
 ```
-Loading two vectors at a time enables the simultaneous computing of more tiles, and as the input matrices have been laid out in memory in a neat way, the consecutive loading of the data is efficient. Implementing this approach can make improvements to the ``macc`` to load ``ratio``.
+Loading two vectors at a time enables the simultaneous computing of more tiles. Since the matrices are already laid out efficiently in memory, consecutive loading is fast. Implementing this approach improves the ``macc``-to-load ``ratio``, that is, the number of multiply-accumulate operations performed per load.
In order to check your understanding of SME2, you can try to implement this -unrolling yourself in the intrinsic version (the asm version already has this +unrolling yourself in the intrinsic version (the assembly version already has this optimization). You can check your work by comparing your results to the expected reference values. From 11296f4d788dc33cc917dc58812ac6259b8b203f Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Sun, 6 Jul 2025 12:19:30 +0000 Subject: [PATCH 23/29] Added internal links to sections --- .../1-get-started.md | 20 +++++++++---------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 23e6215b67..391f1f9c4a 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -68,7 +68,7 @@ It includes: - `build-my-container.sh`, a script that automates building the Docker image from the `sme2-environment.docker` file. It runs the docker build command with the correct arguments so you don’t have to remember them. - `sme2-environment.docker`, a Docker file that defines the steps to build the SME2 container image. It installs all the necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator. - `build-all-containers.sh`, a script to build multi-architecture images. - - `.devcontainer/devcontainer.json`` for VS Code container support. + - `.devcontainer/devcontainer.json` for VS Code container support. 
{{% notice Note %}} From this point, all instructions assume that your current directory is @@ -88,7 +88,7 @@ You can check your compiler version using the command:``clang --version`` ### Install Clang -Install Clang using the instructions below, selecting either macOS or Linux/Ubuntu, as appropriate: +Install Clang using the instructions below, selecting either macOS or Linux/Ubuntu, depending on your setup: {{< tabpane code=true >}} @@ -102,7 +102,7 @@ Install Clang using the instructions below, selecting either macOS or Linux/Ubun {{< /tabpane >}} -You are now all set to start hacking with SME2! +You are now all set to start hacking with SME2. ## Set up a system using SME2 emulation with Docker @@ -115,17 +115,17 @@ for emulating code with SME2 instructions. You can run the provided image or bui ### Install and verify Docker {{% notice Note %}} -Docker is optional, but if you don’t use it, you must manually install the compiler and FVP, and ensure they’re in your path. +Docker is optional, but if you don’t use it, you must manually install the compiler and FVP, and ensure they’re in your `PATH`. {{% /notice %}} -Start by checking that ``docker`` is installed on your machine: +To begin, start by checking that ``docker`` is installed on your machine: ```BASH { output_lines="2" } docker --version Docker version 27.3.1, build ce12230 ``` -If the above command fails with a message similar to "``docker: command not found``" then follow the steps from the [Docker Install Guide](https://learn.arm.com/install-guides/docker/). +If the above command fails with a message similar to "``docker: command not found``", then follow the steps from the [Docker Install Guide](https://learn.arm.com/install-guides/docker/) to install it. {{% notice Note %}} You might need to log out and back in again or restart your machine for the changes to take @@ -176,12 +176,10 @@ https://docs.docker.com/get-started/ ``` You can use Docker in the following ways: -- Directly from the command line. 
For example, when you are working from a - terminal on your local machine. +- [Directly from the command line](#run-commands-from-a-terminal-using-docker). For example, when you are working from a terminal on your local machine. -- Within a containerized environment. Configure VS Code to execute all the - commands inside a Docker container, allowing you to work seamlessly within the - Docker environment. +- [Within a containerized environment](#use-an-interactive-docker-shell). Configure VS Code to execute all the commands inside a Docker container, allowing you to work seamlessly within the +Docker environment. ### Run commands from a terminal using Docker From 3ddf238be9e9733a97d699c9594757aa637ae1ef Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Sun, 6 Jul 2025 12:33:29 +0000 Subject: [PATCH 24/29] format apple device list --- .../1-get-started.md | 49 +++++++++++++++---- 1 file changed, 39 insertions(+), 10 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 391f1f9c4a..85ba3f4ee4 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -248,16 +248,45 @@ part. 
### Devices with native SME2 support -By chip: +#### Apple devices (by product type) + +- iPad + + - iPad Pro 11" + + - iPad Pro 13" + +- iPhone + + - iPhone 16 + + - iPhone 16 Plus + + - iPhone 16e + + - iPhone 16 Pro + + - iPhone 16 Pro Max + +- iMac + +- MacBook Air + + - MacBook Air 13" + + - MacBook Air 15" + +- Mac mini + + + +- MacBook Pro + + - MacBook Pro 14" + + - MacBook Pro 16" + +- Mac Studio -| Manufacturer | Chip | Devices | -|--------------|--------|---------| -| Apple | M4 | iPad Pro 11" & 13", iMac, Mac mini, MacBook Air 13" & 15"| -| Apple | M4 Pro | Mac mini, MacBook Pro 14" & 16" | -| Apple | M4 Max | MacBook Pro 14" & 16", Mac Studio | -By product: -| Manufacturer | Product family | Models | -|--------------|----------------|--------| -| Apple | iPhone 16 | iPhone 16, iPhone 16 Plus, iPhone 16e, iPhone 16 Pro, iPhone 16 Pro Max | \ No newline at end of file From b575f5fc33f4f2d1ba5b653d7dca19e9b064ae49 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Sun, 6 Jul 2025 22:04:28 +0000 Subject: [PATCH 25/29] Clarifying --- .../6-sme2-matmul-asm.md | 56 ++++++++++--------- 1 file changed, 31 insertions(+), 25 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md index 3b1f278896..1cc285b3e7 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md @@ -5,31 +5,37 @@ weight: 8 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Overview -In this chapter, you will use an SME2-optimized matrix multiplication written -directly in assembly. +In this section, you'll learn how to run an SME2-optimized matrix multiplication implemented directly in assembly. 
-## About the SME2 assembly implementation +This implementation is based on the algorithm described in [Arm's SME Programmer's +Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication) and has been adapted to integrate with the existing C and intrinsics-based code in this Learning Path. It demonstrates how to apply low-level optimizations for matrix multiplication using the SME2 instruction set, with a focus on preprocessing and outer-product accumulation. + +You'll explore how the assembly implementation works in practice, how it interfaces with C wrappers, and how to verify or benchmark its performance. Whether you're validating correctness or measuring execution speed, this example provides a clear, modular foundation for working with SME2 features in your own codebase. -### Description +By mastering this assembly implementation, you'll gain deeper insight into SME2 execution patterns and how to integrate low-level optimizations in high-performance workloads. + +## About the SME2 assembly implementation -This Learning Path reuses the assembly version provided in the [SME Programmer's +This Learning Path reuses the assembly version described in [The SME Programmer's Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication) -where you will find a high-level and an in-depth description of the two steps -performed. +where you will find both high-level concepts and in-depth descriptions of the two key steps: +preprocessing and matrix multiplication. -The assembly versions have been modified so they coexist nicely with -the intrinsic versions. The modifications include: -- let the compiler manage the switching back and forth from streaming mode, -- don't use register `x18` which is used as a platform register. +The assembly code has been modified to work seamlessly alongside the intrinsic version. 

The key changes include:
+* Delegating streaming mode control to the compiler
+* Avoiding register `x18`, which is reserved as a platform register
+
+Here:
+- The `preprocess` function is named `preprocess_l_asm` and is defined in
 `preprocess_l_asm.S`
-- the outer product-based matrix multiplication is named `matmul_asm_impl`and
-  is defined in `matmul_asm_impl.S`.
+- The outer product-based matrix multiplication is named `matmul_asm_impl` and
+  is defined in `matmul_asm_impl.S`
 
-Those 2 functions are declared in `matmul.h`:
+Both functions are declared in `matmul.h`:
 
 ```C
 // Matrix preprocessing, in assembly.
@@ -43,10 +49,9 @@ void matmul_asm_impl(
     float *restrict matResult) __arm_streaming __arm_inout("za");
 ```
 
-You will note that they have been marked with 2 attributes: `__arm_streaming`
+You can see that they have been marked with two attributes: `__arm_streaming`
 and `__arm_inout("za")`. This instructs the compiler that these functions
-expect the streaming mode to be active, and that they don't new to save /
-restore the ZA storage.
+expect the streaming mode to be active, and that they don't need to save or restore the ZA storage.
 
 These two functions are stitched together in `matmul_asm.c` with the same
 prototype as the reference implementation of matrix multiplication, so that
@@ -63,13 +68,14 @@ __arm_new("za") __arm_locally_streaming void matmul_asm(
 }
 ```
 
-Note that `matmul_asm` has been annotated with 2 attributes:
-`__arm_new("za")` and `__arm_locally_streaming`. This instructs the compiler
-to swith to streaming mode and save the ZA storage (and restore it when the
-function returns).
+You can see that `matmul_asm` has been annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`.
These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return.

## How it integrates with the main function

The same `main.c` file supports both the intrinsic and assembly implementations. The implementation to use is selected at compile time via the `IMPL` macro. This design reduces duplication and simplifies maintenance.

## Execution modes

- On a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (or intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
```C { line_numbers="true" } From 332d396fdeda1b5765a9fff23b54f477214425bc Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Mon, 7 Jul 2025 09:58:06 +0000 Subject: [PATCH 26/29] Reducing ambuiguity --- .../1-get-started.md | 28 +++++++++++-------- 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 85ba3f4ee4..faff5f373f 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -20,7 +20,7 @@ This section walks you through the required tools and the two supported executio To get started, begin by [downloading the code examples](https://gitlab.arm.com/learning-cde-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2). -Now extract the archive, and change to the target directory: +Now extract the archive, and change directory to: ``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2.`` ```BASH @@ -59,16 +59,16 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/ └── sme2_check.c ``` -It includes: +Amongst other files, it includes: - Code examples. - A `Makefile` to build the code. - `run-fvp.sh` to run the FVP model. - A `docker` directory containing: - `assets.source_me` to provide toolchain paths. - - `build-my-container.sh`, a script that automates building the Docker image from the `sme2-environment.docker` file. It runs the docker build command with the correct arguments so you don’t have to remember them. + - `build-my-container.sh`, a script that automates building the Docker image from the `sme2-environment.docker` file. 
It runs the Docker build command with the correct arguments so you don’t have to remember them. - `sme2-environment.docker`, a Docker file that defines the steps to build the SME2 container image. It installs all the necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator. - `build-all-containers.sh`, a script to build multi-architecture images. - - `.devcontainer/devcontainer.json` for VS Code container support. +- `.devcontainer/devcontainer.json` for VS Code container support. {{% notice Note %}} From this point, all instructions assume that your current directory is @@ -108,9 +108,11 @@ You are now all set to start hacking with SME2. If your machine doesn't support SME2, or you want to emulate it, you can use the Docker-based environment that this Learning Path models. -The Docker container includes a compiler and [Arm's Fixed Virtual Platform (FVP) +The Docker container includes both a compiler and [Arm's Fixed Virtual Platform (FVP) model](https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms) -for emulating code with SME2 instructions. You can run the provided image or build it using the included Dockerfile.and follow the ``sme2-environment.docker`` Docker file instructions to install the tools on your machine. +for emulating code that uses SME2 instructions. You can either run the prebuilt container image provided in this Learning Path or build it yourself using the Docker file that is included. + +If building manually, follow the instructions in the ``sme2-environment.docker`` file to install the required tools on your machine. ### Install and verify Docker @@ -118,14 +120,14 @@ for emulating code with SME2 instructions. You can run the provided image or bui Docker is optional, but if you don’t use it, you must manually install the compiler and FVP, and ensure they’re in your `PATH`. 
{{% /notice %}}
 
-To begin, start by checking that ``docker`` is installed on your machine:
+Start by checking that Docker is installed on your machine:
 
 ```BASH { output_lines="2" }
 docker --version
 Docker version 27.3.1, build ce12230
 ```
 
-If the above command fails with a message similar to "``docker: command not found``", then follow the steps from the [Docker Install Guide](https://learn.arm.com/install-guides/docker/) to install it.
+If the above command fails with an error message similar to "``docker: command not found``", then follow the steps from the [Docker Install Guide](https://learn.arm.com/install-guides/docker/) to install Docker.
 
 {{% notice Note %}}
 You might need to log out and back in again or restart your machine for the changes to take
@@ -176,12 +178,9 @@ https://docs.docker.com/get-started/
 ```
 You can use Docker in the following ways:
-- [Directly from the command line](#run-commands-from-a-terminal-using-docker). For example, when you are working from a terminal on your local machine.
+- [Directly from the command line](#run-commands-from-a-terminal-using-docker) - for example, when you are working from a terminal on your local machine.
 
-- [Within a containerized environment](#use-an-interactive-docker-shell). Configure VS Code to execute all the commands inside a Docker container, allowing you to work seamlessly within the
+- [Within a containerized environment](#use-an-interactive-docker-shell) - by configuring VS Code to execute all the commands inside a Docker container, allowing you to work seamlessly within the
### Run commands from a terminal using Docker @@ -209,13 +211,15 @@ docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-enviro ### Use an interactive Docker shell -The above commands are long and error-prone, so you can instead choose to work interactively within the terminal, which would save you from prepending the ``docker run ...`` magic before each command you want to execute. To work in this mode, run Docker without any command (note the ``-it`` command line argument to the Docker invocation). Start an interactive session to avoid repeating the docker run prefix: +The standard `docker run` commands can be long and repetitive. To streamline your workflow, you can start an interactive Docker session that allows you to run commands directly - without having to prepend docker run each time. + +To launch an interactive shell inside the container, use the `-it` flag: ```BASH docker run --rm -it -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 ``` -You are now in the Docker container; you can execute all commands directly. For +You are now in the Docker container, and you can execute all commands directly. 
For example, the ``make`` command can now be simply invoked with:

```BASH
make
```

From fd7704f1d01e88f51092a72fa315f0aefe Mon Sep 17 00:00:00 2001
From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com>
Date: Mon, 7 Jul 2025 11:18:09 +0000
Subject: [PATCH 27/29] Further improvements

---
 .../1-get-started.md                          | 13 ++++---
 .../2-check-your-environment.md               | 30 +++++++++------
 .../3-streaming-mode.md                       | 34 +++++++++----------
 .../4-vanilla-matmul.md                       | 12 +++----
 4 files changed, 47 insertions(+), 42 deletions(-)

diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
index faff5f373f..5a97288740 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
@@ -8,13 +8,12 @@ layout: learningpathall
 
 ## Choose your SME2 setup: native or emulated
 
-Before you can build or run any SME2-accelerated code, you need to set up your development environment.
-
-This section walks you through the required tools and the two supported execution options, which are:
+To build or run SME2-accelerated code, first set up your development environment.
+This section walks you through the required tools and two supported setup options:
 
 * [**Native SME2 hardware**](#set-up-a-system-with-native-sme2-support) - build and run directly on a system with SME2 support. For supported devices, see [Devices with SME2 support](#devices-with-sme2-support).
-* [**Docker-based emulation**](#set-up-a-system-using-sme2-emulation-with-dockerset-up-a-system-using-SME2-emulation-with-Docker) - use a container to emulate SME2 in bare metal mode (without an OS).
+* [**Docker-based emulation**](#set-up-a-system-using-sme2-emulation-with-docker) - use a container to emulate SME2 in bare metal mode (without an OS).
## Download and explore the code examples @@ -66,7 +65,7 @@ Amongst other files, it includes: - A `docker` directory containing: - `assets.source_me` to provide toolchain paths. - `build-my-container.sh`, a script that automates building the Docker image from the `sme2-environment.docker` file. It runs the Docker build command with the correct arguments so you don’t have to remember them. - - `sme2-environment.docker`, a Docker file that defines the steps to build the SME2 container image. It installs all the necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator. + - `sme2-environment.docker`, a custom Docker file that defines the steps to build the SME2 container image. It installs all the necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator. - `build-all-containers.sh`, a script to build multi-architecture images. - `.devcontainer/devcontainer.json` for VS Code container support. @@ -234,7 +233,7 @@ If you are using Visual Studio Code as your IDE, the container setup is already Make sure you have the [Microsoft Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension installed. -Then select the **Reopen in Container** menu entry as Figure 1 shows. +Then select the **Reopen in Container** menu entry as shown below. It automatically finds and uses ``.devcontainer/devcontainer.json``: @@ -252,7 +251,7 @@ part. ### Devices with native SME2 support -#### Apple devices (by product type) +These Apple devices support SME2 natively. 
- iPad diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md index c5a454597f..5c8a6e3f19 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md @@ -6,13 +6,11 @@ weight: 4 layout: learningpathall --- -In this section, you will verify that your environment is set up and ready to -develop with SME2. This will be your first hands-on experience with the -environment. +In this section, you'll verify that your environment is ready for SME2 development. This is your first hands-on task and confirms that the toolchain, hardware (or emulator), and compiler are set up correctly. -## Compile the examples +## Build the code examples -First, build the code examples by running `make`: +Use the `make` command to compile all examples and generate assembly listings: {{< tabpane code=true >}} {{< tab header="Native SME2 support" language="bash" output_lines="2-19">}} @@ -66,6 +64,8 @@ The `make` command performs the following tasks: - It creates the assembly listings for the four executables: `hello.lst`, `sme2_check.lst`, `sme2_matmul_asm.lst`, and `sme2_matmul_intr.lst`. + These targets compile and link all example programs and generate disassembly listings for inspection. + At any point, you can clean the directory of all the files that have been built by invoking `make clean`: @@ -114,12 +114,20 @@ Run the `hello` program with: {{< /tab >}} {{< /tabpane >}} -In the emulated case, you may see that the FVP prints out extra lines. The key confirmation is the presence of "Hello, world!" in the output. it demonstrates that the generic code can be compiled and executed. +In the emulated case, you may see that the FVP prints out extra lines. 
The key confirmation is the presence of "Hello, world!" in the output. It demonstrates that the generic code can be compiled and executed.

## Check SME2 availability

You will now run the `sme2_check` program, which verifies that SME2 works as expected. This checks that both the compiler and the CPU (or the emulated CPU) properly support SME2.

It confirms that:

* The compiler supports SME2 (via `__ARM_FEATURE_SME2`)

* The system or emulator reports SME2 capability

* Streaming mode works as expected

The source code is found in `sme2_check.c`:

```C { line_numbers="true" }
@@ -191,10 +199,7 @@ The ``sme2_check`` program then displays whether SVE, SME and SME2 are supported
at line 24. The checking of SVE, SME and SME2 is done differently depending on
``BAREMETAL``. This platform-specific behaviour is abstracted by
``display_cpu_features()``:
- In baremetal mode, the program has access to system registers and can inspect them directly to see what the silicon supports. The program will print the SVE field of the ``ID_AA64PFR0_EL1`` system register and the SME field of the ``ID_AA64PFR1_EL1`` system register.
- In non-baremetal mode, on an Apple platform, the program needs to use a higher
level API call.

@@ -213,6 +218,8 @@ annotated with the ``__arm_locally_streaming`` attribute, which instructs the
compiler to automatically switch to streaming mode when invoking this function.
Streaming mode will be discussed in more depth in the next section.
+Look for the following confirmation messages in the output: + {{< tabpane code=true >}} {{< tab header="Native SME2 support" language="bash" output_lines="2-9">}} ./sme2_check @@ -243,5 +250,4 @@ Streaming mode will be discussed in more depth in the next section. {{< /tab >}} {{< /tabpane >}} -You have now checked that the code can be compiled and run with full SME2 -support. You are all set to move to the next section. +You've now confirmed that your environment can compile and run SME2 code, and that SME2 features like streaming mode are working correctly. You're ready to continue to the next section and start working with SME2 in practice. diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md index 5c33457777..ac84bd8eef 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md @@ -1,5 +1,5 @@ --- -title: Streaming mode and ZA State in SME +title: Streaming mode and ZA state in SME weight: 5 ### FIXED, DO NOT MODIFY @@ -8,7 +8,7 @@ layout: learningpathall ## Understanding streaming mode -In large-scale software, programs often switch between streaming and non-streaming mode. Some streaming-mode functions may call others, requiring portions of processor state, such as the ZA storage, to be saved and restored. This behavior is defined in the Arm C Language Extensions (ACLE) and is supported by the compiler. +Programs can switch between streaming and non-streaming mode during execution. When one streaming-mode function calls another, parts of the processor state - such as ZA storage - might need to be saved and restored. This behavior is governed by the Arm C Language Extensions (ACLE) and is managed by the compiler. 

To use streaming mode, you simply annotate the relevant functions with the appropriate keywords. The compiler handles the low-level mechanics of streaming mode management, removing the need for error-prone, manual work.
 
@@ -18,29 +18,28 @@ For more information, see the [Introduction to streaming and non-streaming mode]
 
 ## Streaming mode behavior and compiler handling
 
+Streaming mode changes how the processor and compiler manage execution context. Here's how it works:
+
 * The AArch64 architecture defines a concept called *streaming mode*, controlled by a processor state bit `PSTATE.SM`.
-* At any given point in time, the processor is either in streaming mode (`PSTATE.SM==1`) or in non-streaming mode (`PSTATE.SM==0`).
+* At any given point in time, the processor is either in streaming mode (`PSTATE.SM == 1`) or in non-streaming mode (`PSTATE.SM == 0`).
 * To enter streaming mode, there is the instruction `SMSTART`, and to return to non-streaming mode, the instruction is `SMSTOP`.
 * Streaming mode affects C and C++ code in the following ways:
   - It can change the length of SVE vectors and predicates. The length of an SVE vector in streaming mode is called the *Streaming Vector Length* (SVL), which might differ from the non-streaming vector length. See [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl) for further information.
-  - Some instructions, and their associated ACLE intrinsics, can only be executed in streaming mode.These intrinsics are called *streaming intrinsics*.
-  - Other instructions are restricted to non-streaming mode, and their instrinsics are called *non-streaming intrinsics*.
+  - Some instructions, and their associated ACLE intrinsics, can only be executed in streaming mode. These are called *streaming intrinsics*.
+  - Other instructions are restricted to non-streaming mode. These are called *non-streaming intrinsics*.
The ACLE specification extends the C and C++ abstract machine model to include streaming mode. At any given time, the abstract machine is either in streaming or non-streaming mode. This distinction between abstract machine mode and processor mode is mostly a specification detail. At runtime, the processor’s mode may differ from the abstract machine’s mode - as long as the observable program behavior remains consistent (as per the "as-if" rule). -One -practical consequence of this is that C and C++ code does not specify the exact -placement of `SMSTART` and `SMSTOP` instructions; the source code simply places -limits on where such instructions go. For example, when stepping through a -program in a debugger, the processor mode might sometimes be different from the -one implied by the source code. +{{% notice Note %}} +One practical consequence of this is that C and C++ code does not specify the exact placement of `SMSTART` and `SMSTOP` instructions; the source code simply places limits on where such instructions go. For example, when stepping through a program in a debugger, the processor mode might sometimes be different from the one implied by the source code. +{{% /notice %}} ACLE provides attributes that specify whether the abstract machine executes statements: @@ -56,10 +55,11 @@ is enabled. In C and C++, ZA usage is specified at the function level: a function either uses ZA or it doesn't. That is, a function either has ZA state or it does not. -If a function does have ZA state, the function can either share that ZA state -with the function's caller or create new ZA state. In the latter -case, it is the compiler's responsibility to free up ZA so that the function can -use it - see the description of the lazy saving scheme in -[AAPCS64](https://arm-software.github.io/acle/main/acle.html#AAPCS64) for details -about how the compiler does this. 
+Functions that use ZA can either: + +- Share the caller’s ZA state +- Allocate a new ZA state for themselves + +When new state is needed, the compiler is responsible for preserving the caller’s state using a *lazy saving* scheme. For more information, see the [AAPCS64 section of the ACLE spec](https://arm-software.github.io/acle/main/acle.html#AAPCS64). + \ No newline at end of file diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md index 8089bce757..55abb5f567 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md @@ -8,7 +8,7 @@ layout: learningpathall ## Overview -In this section, you'll implement a basic matrix multiplication algorithm in C, using a row-major memory layout. This version serves as a reference implementation for validating optimized versions later in the Learning Path. +In this section, you'll implement a basic matrix multiplication algorithm in C using row-major memory layout. This version acts as a reference implementation that you'll use to validate the correctness of optimized versions later in the Learning Path. ## Vanilla matrix multiplication algorithm @@ -21,6 +21,8 @@ It produces an output matrix C [`Cr` rows x `Cc` columns]. The algorithm works by iterating over each row of A and each column of B. It multiplies the corresponding elements and sums the products to generate each element of matrix C, as shown in the figure below. 
+The diagram below shows how matrix C is computed by iterating over rows of A and columns of B:
+
 ![Standard Matrix Multiplication alt-text#center](matmul.png "Figure 2: Standard matrix multiplication.")

 This implies that the A, B, and C matrices have some constraints on their
@@ -34,16 +36,14 @@ properties and use, see this [Wikipedia article on Matrix Multiplication](https:

 ## Variable mappings in this Learning Path

-In this Learning Path, you'll use the following variable names:
+The following variable names are used throughout the Learning Path to represent matrix dimensions and operands:

-- `matLeft` corresponds to the left-hand side argument of the matrix
-  multiplication.
+- `matLeft` corresponds to the left-hand side argument of the matrix multiplication.
 - `matRight` corresponds to the right-hand side of the matrix multiplication.
 - `M` is `matLeft` number of rows.
 - `K` is `matLeft` number of columns (and `matRight` number of rows).
 - `N` is `matRight` number of columns.
-- `matResult`corresponds to the result of the matrix multiplication, with
-  `M` rows and `N` columns.
+- `matResult` corresponds to the result of the matrix multiplication, with `M` rows and `N` columns.
## C implementation From e63fa6daf3b3bce3819c0ce27365df4a6a027ee7 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Mon, 7 Jul 2025 14:17:05 +0000 Subject: [PATCH 28/29] Updates --- .../1-get-started.md | 2 +- .../4-vanilla-matmul.md | 6 ++-- .../5-outer-product.md | 29 ++++++++++--------- 3 files changed, 19 insertions(+), 18 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md index 5a97288740..8b9c5cb3ee 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md @@ -71,7 +71,7 @@ Amongst other files, it includes: {{% notice Note %}} From this point, all instructions assume that your current directory is -``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. So to follow along, ensure that you are in the correct directory before proceeding. +``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``, so ensure that you are in the correct directory before proceeding. 
{{% /notice %}} ## Set up a system with native SME2 support diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md index 55abb5f567..f8524ebeae 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md @@ -47,7 +47,7 @@ The following variable names are used throughout the Learning Path to represent ## C implementation -The file matmul_vanilla.c contains a reference implementation of the algorithm: +Here is the full reference implementation from `matmul_vanilla.c`: ```C { line_numbers="true" } void matmul(uint64_t M, uint64_t K, uint64_t N, @@ -69,8 +69,8 @@ void matmul(uint64_t M, uint64_t K, uint64_t N, ## Memory layout and pointer annotations -In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above. +In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above. The pointers to `matLeft`, `matRight` and `matResult` have been annotated as `restrict`, which informs the compiler that the memory areas designated by those pointers do not alias. This means that they do not overlap in any way, so that the compiler does not need to insert extra instructions to deal with these cases. The pointers to `matLeft` and `matRight` are marked as `const` as neither of these two matrices are modified by `matmul`. -You now have a working baseline for the matrix multiplication function. 
You'll use it later on in this Learning Path to ensure that the assembly version and the intrinsics version of the multiplication algorithm do not contain errors. \ No newline at end of file +This function gives you a working baseline for matrix multiplication. You'll use it later in the Learning Path to verify the correctness of optimized implementations using SME2 intrinsics and assembly. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md index 7389b33021..1e28558f2d 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md @@ -6,8 +6,12 @@ weight: 7 layout: learningpathall --- +## Overview + In this section, you'll learn how to improve matrix multiplication performance using the SME engine and outer product operations. +This approach increases the number of multiply-accumulate (MACC) operations per memory load, reducing bandwidth pressure and improving overall throughput. + ## Increase MACC efficiency using outer products In the vanilla implementation, the core multiply-accumulate step looks like this: @@ -16,14 +20,12 @@ In the vanilla implementation, the core multiply-accumulate step looks like this acc += matLeft[m * K + k] * matRight[k * N + n]; ``` -This translates to one multiply-accumulate operation, known as `macc`, for two -loads (`matLeft[m * K + k]` and `matRight[k * N + n]`). It therefore has a 1:2 -`macc` to `load` ratio of multiply-accumulate operations (MACCs) to memory loads - one multiply-accumulate and two loads per iteration. This ratio limits efficiency, especially in triple-nested loops where memory bandwidth becomes a bottleneck. 
+This translates to one multiply-accumulate operation, known as `macc`, for two loads (`matLeft[m * K + k]` and `matRight[k * N + n]`). It therefore has a 1:2 ratio of multiply-accumulate operations (MACCs) to memory loads - one `macc` and two loads per iteration - which is inefficient. This becomes more pronounced in triple-nested loops and when matrices exceed cache capacity.

-To make matters worse, large matrices might not fit in cache. To improve matrix multiplication efficiency, the goal is to increase the `macc` to `load` ratio, which means increasing the number of multiply-accumulate operations per load - you can express matrix multiplication as a sum of column-by-row outer products.
+To improve performance, you want to increase the `macc` to `load` ratio, which means increasing the number of multiply-accumulate operations per load - you can express matrix multiplication as a sum of column-by-row outer products.

-Figure 3 below illustrates how the matrix multiplication of `matLeft` (3 rows, 2
-columns) by `matRight` (2 rows, 3 columns) can be decomposed as the sum of outer
+The diagram below illustrates how the matrix multiplication of `matLeft` (3 rows, 2
+columns) by `matRight` (2 rows, 3 columns) can be decomposed into a sum of column-by-row outer
 products:

![example image alt-text#center](outer_product.png "Figure 3: Outer product-based matrix multiplication.")
@@ -45,17 +47,16 @@
row-major order to column-major order. This transformation affects only the
memory layout. From a mathematical perspective, `matLeft` is not transposed. It is reorganized for better data locality.
{{% /notice %}}

-### Transposition in the real world
+### Transposition in practice

-Just as trees don't reach the sky, the SME engine has physical implementation limits. It operates on *tiles* - 2D blocks of data stored in the ZA storage. SME has dedicated instructions to load, store, and compute on these tiles efficiently.
For example, the
-[fmopa](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en)
-instruction takes two vectors as inputs and accumulates all the outer products
-into a 2D tile. The tile in ZA storage allows SME to increase the `macc` to
-`load` ratio by loading all the tile elements to be used with the SME outer
+The SME engine operates on tiles - 2D blocks of data stored in the ZA storage. SME provides dedicated instructions to load, store, and compute on tiles efficiently.
+
+For example, the [FMOPA](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en) instruction takes two vectors as input and accumulates their outer product into a tile. The tile in ZA storage allows SME to increase the `macc` to `load` ratio by loading all the tile elements to be used with the SME outer
 product instructions.

+But since ZA storage is finite, you need to preprocess `matLeft` to match the tile dimensions - this includes transposing portions of the matrix and padding where needed.
+ +### Preprocessing with preprocess_l The following function shows how `preprocess_l` transforms the matrix at the algorithmic level: From 69e0aa82f5a92ca62f772a7c6c16ef205f61ae70 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Mon, 7 Jul 2025 20:24:59 +0000 Subject: [PATCH 29/29] Tweaks --- .../10-going-further.md | 48 ++++++----------- .../6-sme2-matmul-asm.md | 53 ++++++------------- .../7-sme2-matmul-intr.md | 23 +++----- .../8-benchmarking.md | 6 +-- .../9-debugging.md | 2 +- 5 files changed, 38 insertions(+), 94 deletions(-) diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md index 63ea7b9724..6cc5e2382d 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md @@ -1,12 +1,12 @@ --- -title: Beyond this implementation +title: Going further weight: 12 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Going further +## Beyond this implementation There are many different ways that you can extend and optimize the matrix multiplication algorithm beyond the specific SME2 implementation that you've explored in this Learning Path. While the current approach is tuned for performance on a specific hardware target, further improvements can make your code more general, more efficient, and better suited to a wider range of applications. @@ -22,9 +22,9 @@ Some ideas of improvements that you might like to test out include: ## Generalize the algorithm for different data types -So far, you've focused on multiplying floating-point matrices. In practice, matrix operations often involve integer types as well. +So far, you've focused on multiplying floating-point matrices. 
In practice, matrix operations often involve integer types as well.

+The core structure of the algorithm - preprocessing with tiling, then outer product–based multiplication with accumulation - remains consistent across data types. To adapt it for other data types, you only need to change how values are:

* Loaded from memory
* Accumulated (often with widening)
@@ -35,10 +35,10 @@ Languages that support [generic programming](https://en.wikipedia.org/wiki/Gener

Templates allow you to:

* Swap data types flexibly
-* Handle accumulation in a wider format (a common requirement)
+* Handle accumulation in a wider format when needed
* Reuse algorithm logic across multiple matrix types

-By expressing the algorithm generically, you benefit from the compiler generating multiple variants, allowing you the opportunity to focus on:
+By expressing the algorithm generically, you benefit from the compiler generating multiple optimized variants, allowing you the opportunity to focus on:

- Creating efficient algorithm design
- Testing and verification
@@ -55,41 +55,23 @@

svld1_x2(...); // Load two vectors at once
```
Loading two vectors at a time enables the simultaneous computing of more tiles. Since the matrices are already laid out efficiently in memory, consecutive loading is fast. Implementing this approach can improve the ``macc`` to ``load`` ratio.
-In order to check your understanding of SME2, you can try to implement this
-unrolling yourself in the intrinsic version (the assembly version already has this
-optimization). You can check your work by comparing your results to the expected
-reference values.
+In order to check your understanding of SME2, you can try to implement this unrolling yourself in the intrinsic version (the assembly version already has this optimization). You can check your work by comparing your results to the expected reference values.

-## Apply strategies
+## Optimize for special matrix shapes

-One method for optimization is to use strategies that are flexible depending on
-the matrices' dimensions. This is especially easy to set up when working in C or
-C++, rather than directly in assembly language.
+One method for optimization is to use strategies that are flexible depending on the matrices' dimensions. This is especially easy to set up when working in C or C++, rather than directly in assembly language.

-By playing with the mathematical properties of matrix multiplication and the
-outer product, it is possible to minimize data movement as well as reduce the
-overall number of operations to perform.
+By playing with the mathematical properties of matrix multiplication and the outer product, it is possible to minimize data movement as well as reduce the overall number of operations to perform.

-For example, it is common that one of the matrices is actually a vector, meaning
-that it has a single row or column, and then it becomes advantageous to
-transpose it. Can you see why?
+For example, it is common that one of the matrices is actually a vector, meaning that it has a single row or column, and then it becomes advantageous to transpose it. Can you see why?

-The answer is that as the elements are stored contiguously in memory, an ``Nx1``
-and ``1xN`` matrices have the exact same memory layout. The transposition
-becomes a no-op, and the matrix elements stay in the same place in memory.
+The answer is that, as the elements are stored contiguously in memory, an ``Nx1`` matrix and a ``1xN`` matrix have the exact same memory layout. The transposition becomes a no-op, and the matrix elements stay in the same place in memory.
An even more *degenerate* case that is easy to manage is when one of the matrices is essentially a scalar, which means that it is a matrix with one row and one column.

Although the current code used here handles it correctly from a results point of view, a different algorithm and use of instructions might be more efficient. Can you think of another way?

In order to check your understanding of SME2, you can try to implement this yourself in the intrinsic version (the assembly version already has this optimization). You can check your work by comparing your results to the expected reference values.
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md index 1cc285b3e7..e41965f946 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md @@ -30,10 +30,8 @@ The key changes include: * Avoiding register `x18`, which is reserved as a platform register Here: -- The `preprocess` function is named `preprocess_l_asm` and is defined in - `preprocess_l_asm.S` -- The outer product-based matrix multiplication is named `matmul_asm_impl`and - is defined in `matmul_asm_impl.S` +- The `preprocess` function is named `preprocess_l_asm` and is defined in `preprocess_l_asm.S` +- The outer product-based matrix multiplication is named `matmul_asm_impl` and is defined in `matmul_asm_impl.S` Both functions are declared in `matmul.h`: @@ -49,13 +47,9 @@ void matmul_asm_impl( float *restrict matResult) __arm_streaming __arm_inout("za"); ``` -You can see that they have been marked with two attributes: `__arm_streaming` -and `__arm_inout("za")`. This instructs the compiler that these functions -expect the streaming mode to be active, and that they don't need to save or restore the ZA storage. +Both functions are annotated with the `__arm_streaming` and `__arm_inout("za")` attributes. These indicate that the function expects streaming mode to be active and does not need to save or restore the ZA storage. 
-These two functions are stitched together in `matmul_asm.c` with the -same prototype as the reference implementation of matrix multiplication, so that -a top-level `matmul_asm` can be called from the `main` function: +These two functions are stitched together in `matmul_asm.c` with the same prototype as the reference implementation of matrix multiplication, so that a top-level `matmul_asm` can be called from the `main` function: ```C __arm_new("za") __arm_locally_streaming void matmul_asm( @@ -68,7 +62,7 @@ __arm_new("za") __arm_locally_streaming void matmul_asm( } ``` -You can see that `matmul_asm` has been annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return +You can see that `matmul_asm` is annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return. ## How it integrates with the main function @@ -76,7 +70,7 @@ The same `main.c` file supports both the intrinsic and assembly implementations. ## Execution modes -- on a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (resp. intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available. +- On a baremetal platform, the program runs in *verification mode*, where it compares the results of the assembly-based matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available. ```C { line_numbers="true" } #ifndef __ARM_FEATURE_SME2 @@ -227,8 +221,8 @@ int main(int argc, char **argv) { float *matResult_ref = (float *)malloc(M * N * sizeof(float)); // Initialize matrices. 
Input matrices are initialized with random values in
-  // non debug mode. In debug mode, all matrices are initialized with linear
-  // or known values values for easier debugging.
+  // non-debug mode. In debug mode, all matrices are initialized with linear
+  // or known values for easier debugging.
 #ifdef DEBUG
   initialize_matrix(matLeft, M * K, LINEAR_INIT);
   initialize_matrix(matRight, K * N, LINEAR_INIT);
@@ -327,36 +321,19 @@ int main(int argc, char **argv) {
 }
 ```

-The same `main.c` file is used for the assembly and intrinsic-based versions
-of the matrix multiplication. It first sets the `M`, `K` and `N`
-parameters, to either the arguments supplied on the command line (lines 93-95)
-or uses the default value (lines 73-75). In non-baremetal mode, it also accepts
-(lines 82-89 and lines 98-108), as first parameter, an iteration count `I`
+The same `main.c` file is used for the assembly and intrinsic-based versions of the matrix multiplication. It first sets the `M`, `K` and `N` parameters either to the arguments supplied on the command line (lines 93-95) or to the default values (lines 73-75). In non-baremetal mode, it also accepts (lines 82-89 and lines 98-108), as first parameter, an iteration count `I`
 used for benchmarking.

-Depending on the `M`, `K`, `N` dimension parameters, `main` allocates
-memory for all the matrices and initializes `matLeft` and `matRight` with
-random data. The actual matrix multiplication implementation is provided through
-the `IMPL` macro.
+Depending on the `M`, `K`, `N` dimension parameters, `main` allocates memory for all the matrices and initializes `matLeft` and `matRight` with random data. The actual matrix multiplication implementation is provided through the `IMPL` macro.

-In *verification mode*, it then runs the matrix multiplication from `IMPL`
-(line 167) and computes the reference values for the preprocessed matrix as well
-as the result matrix (lines 170 and 171). It then compares the actual values to
-the reference values and reports errors, if there are any (lines 173-177).
-Finally, all the memory is deallocated (lines 236-243) before exiting the
+In *verification mode*, it then runs the matrix multiplication from `IMPL` (line 167) and computes the reference values for the preprocessed matrix as well as the result matrix (lines 170 and 171). It then compares the actual values to the reference values and reports errors, if there are any (lines 173-177). Finally, all the memory is deallocated (lines 236-243) before exiting the
 program with a success or failure return code at line 245.

-In *benchmarking mode*, it will first run the vanilla reference matrix
-multiplication (resp. assembly- or intrinsic-based matrix multiplication) 10
-times without measuring elapsed time to warm-up the CPU. It will then measure
-the elapsed execution time of the vanilla reference matrix multiplication (resp.
-assembly- or intrinsic-based matrix multiplication) `I` times and then compute
+In *benchmarking mode*, it will first run the vanilla reference matrix multiplication (resp. assembly- or intrinsic-based matrix multiplication) 10 times without measuring elapsed time to warm up the CPU. It will then measure the elapsed execution time of the vanilla reference matrix multiplication (resp. assembly- or intrinsic-based matrix multiplication) `I` times and then compute
 and report the minimum, maximum and average execution times.

{{% notice Note %}}
-Benchmarking and profiling are not simple tasks. The purpose of this Learning Path
-is to provide some basic guidelines on the performance improvement that can be
-obtained with SME2.
+Benchmarking and profiling are not simple tasks. The purpose of this Learning Path is to provide some basic guidelines on the performance improvement that can be obtained with SME2.
{{% /notice %}}

 ### Compile and run it
@@ -401,7 +378,7 @@
 whether the preprocessing and matrix multiplication passed (`PASS`) or failed
 (`FAILED`) the comparison with the vanilla reference implementation.

 {{% notice Tip %}}
-The example above uses the default values for the `M` (125), `K`(25) and `N`(70)
+The example above uses the default values for the `M` (125), `K` (70) and `N` (70)
 parameters. You can override this and provide your own values on the command line:

{{< tabpane code=true >}}
@@ -414,5 +391,5 @@ parameters. You can override this and provide your own values on the command lin
 {{< /tab >}}
 {{< /tabpane >}}

-Here the values `M=7`, `K=8` and `N=9` are used instead.
+In this example, `M=7`, `K=8`, and `N=9` are used.
 {{% /notice %}}
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md
index a170de702d..eba6850aaf 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md
@@ -1,30 +1,22 @@
 ---
-title: SME2 intrinsics matrix multiplication
+title: Matrix multiplication using SME2 intrinsics in C
 weight: 9

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-In this section, you will write an SME2 optimized matrix multiplication in C
-using the intrinsics that the compiler provides.
+In this section, you will write an SME2-optimized matrix multiplication routine in C using the intrinsics that the compiler provides.

-## Matrix multiplication with SME2 intrinsics
+## What are intrinsics?

-*Intrinsics*, also know known as *compiler intrinsics* or *intrinsic functions*,
-are the functions available to application developers that the compiler has an
-intimate knowledge of. This enables the compiler to either translate the
-function to a specific instruction or to perform specific optimizations, or
-both.
+*Intrinsics*, also known as *compiler intrinsics* or *intrinsic functions*, are the functions available to application developers that the compiler has intimate knowledge of. This enables the compiler to either translate the function to a specific instruction or to perform specific optimizations, or both.

You can learn more about intrinsics in this [Wikipedia Article on Intrinsic
Function](https://en.wikipedia.org/wiki/Intrinsic_function).

Using intrinsics allows the programmer to use the specific instructions required
-to achieve the required performance while writing in C all the
-typically-required standard code, such as loops. This produces performance close
-to what can be reached with hand-written assembly whilst being significantly
-more maintainable and portable.
+to achieve the required performance while writing in C all the typically-required standard code, such as loops. This produces performance close to what can be reached with hand-written assembly whilst being significantly more maintainable and portable.

All Arm-specific intrinsics are specified in the [ACLE](https://github.com/ARM-software/acle), which is the Arm C Language Extension. ACLE
@@ -51,10 +43,7 @@
Note the `__arm_new("za")` and `__arm_locally_streaming` at line 1 that will
make the compiler save the ZA storage so we can use it without destroying its
content if it was still in use by one of the callers.

-`SVL`, the dimension of the ZA storage, is requested from the underlying
-hardware with the `svcntsw()` function call at line 5, and passed down to the
-`preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a
-function provided be the ACLE library.
+`SVL`, the dimension of the ZA storage, is requested from the underlying hardware with the `svcntsw()` function call at line 5, and passed down to the `preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a function provided by the ACLE library. ### Matrix preprocessing diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md index 097320cb04..da241b1ad7 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md @@ -39,11 +39,7 @@ Reference implementation: min time = 101 us, max time = 438 us, avg time = 139.4 SME2 implementation *intr*: min time = 1 us, max time = 8 us, avg time = 1.82 us ``` -The execution time is reported in microseconds. A wide spread between the -minimum and maximum figures can be noted and is expected as the way of doing the -benchmarking is simplified for the purpose of simplicity. You will, however, -note that the intrinsic version of the matrix multiplication brings on average a -76x execution time reduction. +The execution time is reported in microseconds. A wide spread between the minimum and maximum figures can be noted and is expected as the way of doing the benchmarking is simplified for the purpose of simplicity. You will, however, note that the intrinsic version of the matrix multiplication brings on average a 76x execution time reduction. 
{{% notice Tip %}} You can override the default values for `M` (125), `K` (25), and `N` (70) and diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md index 081b75a3e7..d05e5a7ea0 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md @@ -105,5 +105,5 @@ Tracing is disabled by default because it significantly slows down simulation an ## Use debug mode for matrix inspection -It can be helpful when debugging to understand where an element in the Tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you look into `main.c`, you will notice that the matrix initialization is no +It can be helpful when debugging to understand where an element in the tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you look into `main.c`, you will notice that the matrix initialization is no longer random, but instead initializes each element with its linear index. This makes it *easier* to find where the matrix elements are loaded in the tile in tarmac trace, for example.