diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
index 562ba90582..8b9c5cb3ee 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md
@@ -1,33 +1,33 @@
 ---
-title: Set up your Environment
+title: Set up your SME2 development environment
 weight: 3 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-To follow this Learning Path, you will need to set up an environment to develop
-with SME2 and download the code examples. This learning path assumes two
-different ways of working, and you will need to select the one appropriate for
-your machine:
-- Case #1: Your machine has native SME2 support --- check the [list of devices
-  with SME2 support](#devices-with-sme2-support).
-- Case #2: Your machine does not have native SME2 support. This learning path
-  supports this use case by enabling you to run code with SME2 instructions in
-  an emulator in bare metal mode, i.e., the emulator runs the SME2 code
-  *without* an operating system.
+## Choose your SME2 setup: native or emulated

-## Code examples
+To build or run SME2-accelerated code, first set up your development environment.
+This section walks you through the required tools and two supported setup options:

-[Download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2) for this learning path, expand the archive, and change your current directory to:
-``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2`` :
+* [**Native SME2 hardware**](#set-up-a-system-with-native-sme2-support) - build and run directly on a system with SME2 support.
For supported devices, see [Devices with SME2 support](#devices-with-native-sme2-support).
+
+* [**Docker-based emulation**](#set-up-a-system-using-sme2-emulation-with-docker) - use a container to emulate SME2 in bare metal mode (without an OS).
+
+## Download and explore the code examples
+
+To get started, [download the code examples](https://gitlab.arm.com/learning-code-examples/code-examples/-/archive/main/code-examples-main.tar.gz?path=learning-paths/cross-platform/multiplying-matrices-with-sme2).
+
+Now extract the archive, and change directory to
+``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``:

 ```BASH
 tar xfz code-examples-main-learning-paths-cross-platform-multiplying-matrices-with-sme2.tar.gz -s /code-examples-main-learning-paths-cross-platform-multiplying-matrices-with-sme2/code-examples/
 cd code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2
 ```

-The listing of the content of this directory should look like this:
+The directory structure should look like this:

 ```TXT
 code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/
@@ -58,48 +58,36 @@ code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2/
 └── sme2_check.c
 ```

-It contains:
-- The code examples that will be used throughout this learning path.
-- A ``Makefile`` that builds the code examples.
-- A shell script called ``run-fvp.sh`` that runs the FVP (used in the emulated
-  SME2 case).
-- A directory called ``docker`` that contains materials related to Docker, which
-  are:
-  - A script called ``assets.source_me`` that provides the FVP and compiler
-    toolchain references.
-  - A Docker recipe called ``sme2-environment.docker`` to build the container
-    that you will use.
-  - A shell script called ``build-my-container.sh`` that you can use if you want
-    to build the Docker container. This is not essential; ready-made images are
-    available for you.
- - A script called ``build-all-containers.sh`` that was used to create the - image for you to download to provide multi-architecture support for both - x86_64 and AArch64. -- A configuration script for VS Code to be able to use the container from the - IDE called ``.devcontainer/devcontainer.json``. +Amongst other files, it includes: +- Code examples. +- A `Makefile` to build the code. +- `run-fvp.sh` to run the FVP model. +- A `docker` directory containing: + - `assets.source_me` to provide toolchain paths. + - `build-my-container.sh`, a script that automates building the Docker image from the `sme2-environment.docker` file. It runs the Docker build command with the correct arguments so you don’t have to remember them. + - `sme2-environment.docker`, a custom Docker file that defines the steps to build the SME2 container image. It installs all the necessary dependencies, including the SME2-compatible compiler and Arm FVP emulator. + - `build-all-containers.sh`, a script to build multi-architecture images. +- `.devcontainer/devcontainer.json` for VS Code container support. {{% notice Note %}} -From this point in the Learning Path, all instructions assume that your current -directory is -``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``. +From this point, all instructions assume that your current directory is +``code-examples/learning-paths/cross-platform/multiplying-matrices-with-sme2``, so ensure that you are in the correct directory before proceeding. {{% /notice %}} -## Platforms with native SME2 support +## Set up a system with native SME2 support + +To run SME2 code natively, ensure your system includes SME2 hardware and uses a compiler version that supports SME2. + +For the compiler, you can use [Clang](https://www.llvm.org/) version 18 or later, or [GCC](https://gcc.gnu.org/) version 14 or later. This Learning Path uses ``clang``. 
+
+{{% notice Note %}}
+At the time of writing, macOS ships with `clang` version 17.0.0, which doesn't support SME2. Use a newer version, such as 20.1.7, available through Homebrew.
+{{% /notice %}}

-If your machine has native support for SME2, then you only need to ensure that
-you have a compiler with support for SME2 instructions.
+You can check your compiler version using the command: ``clang --version``

-A recent enough version of the compiler is required because SME2 is a recent
-addition to the Arm instruction set. Compiler versions that are too old will
-have incomplete or no SME2 support, leading to compilation errors or
-non-functional code. You can use [Clang](https://www.llvm.org/) version 18 or
-later, or [GCC](https://gcc.gnu.org/) version 14 or later. This Learning Path
-uses ``clang``.
+### Install Clang

-At the time of writing, the ``clang`` version shipped with macOS is ``17.0.0``,
-which forces us to use the version from ``homebrew`` (which has version
-``20.1.7``). Ensure the ``clang`` compiler you are using is recent enough with
-``clang --version``:
+Install Clang using the instructions below, selecting either macOS or Linux/Ubuntu, depending on your setup:

 {{< tabpane code=true >}}
@@ -113,52 +101,40 @@ which forces us to use the version from ``homebrew`` (which has version
 {{< /tabpane >}}

-You are now all set to start hacking with SME2!
+You are now all set to start hacking with SME2.

-## Platforms with emulated SME2 support
+## Set up a system using SME2 emulation with Docker

-If your machine does not have SME2 support or if you want to run SME2 with an
-emulator, you will need to install Docker. Docker containers provide
-functionality to execute commands in an isolated environment, where you have all
-the necessary tools you require without cluttering your machine. The containers
-run independently, meaning they do not interfere with other containers on the
-same machine or server.
+If your machine doesn't support SME2, or you want to emulate it, you can use the Docker-based environment that this Learning Path provides.

-This learning path provides a Docker image that has a compiler and [Arm's Fixed
-Virtual Platform (FVP)
+The Docker container includes both a compiler and [Arm's Fixed Virtual Platform (FVP)
 model](https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms)
-for emulating code with SME2 instructions. The Docker image recipe is provided
-(with the code examples) so you can study it and build it yourself. You could
-also decide not to use the Docker image and follow the
-``sme2-environment.docker`` Docker file instructions to install the tools on
-your machine.
+for emulating code that uses SME2 instructions. You can either run the prebuilt container image provided in this Learning Path or build it yourself using the Docker file that is included.

+If building manually, follow the instructions in the ``sme2-environment.docker`` file to install the required tools on your machine.
+
+### Install and verify Docker

 {{% notice Note %}}
-This Learning Path works without ``docker``, but the compiler and the FVP must
-be available in your search path.
+Docker is optional, but if you don’t use it, you must manually install the compiler and FVP, and ensure they’re in your `PATH`.
 {{% /notice %}}

-Start by checking that ``docker`` is installed on your machine by typing the
-following command line in a terminal:
+Start by checking that Docker is installed on your machine:

 ```BASH { output_lines="2" }
 docker --version
 Docker version 27.3.1, build ce12230
 ```

-If the above command fails with a message similar to "``docker: command not
-found``" then follow the steps from the [Docker Install
-Guide](https://learn.arm.com/install-guides/docker/).
+If the above command fails with an error message similar to "``docker: command not found``", then follow the steps from the [Docker Install Guide](https://learn.arm.com/install-guides/docker/) to install Docker. {{% notice Note %}} -You might need to log in again or restart your machine for the changes to take +You might need to log out and back in again or restart your machine for the changes to take effect. {{% /notice %}} Once you have confirmed that Docker is installed on your machine, you can check -that it is operating normally with the following: +that it is working with the following: ```BASH { output_lines="2-27" } docker run hello-world @@ -201,13 +177,12 @@ https://docs.docker.com/get-started/ ``` You can use Docker in the following ways: -- Directly from the command line. For example, when you are working from a - terminal on your local machine. -- Within a containerized environment. Configure VS Code to execute all the - commands inside a Docker container, allowing you to work seamlessly within the - Docker environment. +- [Directly from the command line](#run-commands-from-a-terminal-using-docker) - for example, when you are working from a terminal on your local machine. + +- [Within a containerized environment](#use-an-interactive-docker-shell) - by configuring VS Code to execute all the commands inside a Docker container, allowing you to work seamlessly within the +Docker environment. 
-### Working with Docker from a terminal
+### Run commands from a terminal using Docker

 When a command is executed in the Docker container environment, you must prepend
 it with instructions on the command line so that your shell executes it within
@@ -233,64 +208,88 @@ For example, to run ``make``, you need to enter:
 docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 make
 ```

-### Working within the Docker container from the terminal
+### Use an interactive Docker shell

-The above commands are long and error-prone, so you can instead choose to work
-interactively within the terminal, which would save you from prepending the
-``docker run ...`` magic before each command you want to execute. To work in
-this mode, run Docker without any command (note the ``-it`` command line
-argument to the Docker invocation):
+The standard `docker run` commands can be long and repetitive. To streamline your workflow, you can start an interactive Docker session that allows you to run commands directly, without having to prepend `docker run` each time.
+
+To launch an interactive shell inside the container, use the `-it` flag:

 ```BASH
 docker run --rm -it -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2
 ```

-You are now in the Docker container; you can execute all commands directly. For
+You are now in the Docker container, and you can execute all commands directly. For
 example, the ``make`` command can now be simply invoked with:

 ```BASH
 make
 ```

-To exit the container, simply hit CTRL+D. Note that the container is not
-persistent (it was invoked with ``--rm``), so each invocation will use a
-container freshly built from the image. All the files reside outside the
-container, so changes you make to them will be persistent.
+To exit the container, simply hit CTRL+D. Note that the container is not persistent (it was invoked with ``--rm``), so each invocation will use a container freshly built from the image.
All the files reside outside the container, so changes you make to them will be persistent.

-### Working within the Docker container with VSCode
+### Develop with Docker in Visual Studio Code

-If you are using Visual Studio Code as your IDE, it can use the container as is.
+If you are using Visual Studio Code as your IDE, the container setup is already configured with `.devcontainer/devcontainer.json`. Make sure you have the [Microsoft Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension installed.

-Then select the **Reopen in Container** menu entry as Figure 1 shows.
+Then select the **Reopen in Container** menu entry as shown below. It automatically finds and uses ``.devcontainer/devcontainer.json``:

-![example image alt-text#center](VSCode.png "Figure 1: Setting up the Docker Container.")
+![VSCode Docker alt-text#center](VSCode.png "Figure 1: Setting up the Docker container.")

 All your commands now run within the container, so there is no need to prepend
 them with a Docker invocation, as VS Code handles all this seamlessly for you.

 {{% notice Note %}}
 For the rest of this Learning Path, shell commands include the full Docker
-invocation so that users not using VS Code can copy the complete command line.
+invocation so that, if you are not using VS Code, you can copy the complete command line.
 However, if you are using VS Code, you only need to use the `COMMAND ARGUMENTS` part.
 {{% /notice %}}

-### Devices with SME2 support
+### Devices with native SME2 support
+
+These Apple devices support SME2 natively.
+ +- iPad + + - iPad Pro 11" + + - iPad Pro 13" + +- iPhone + + - iPhone 16 + + - iPhone 16 Plus + + - iPhone 16e + + - iPhone 16 Pro + + - iPhone 16 Pro Max + +- iMac + +- MacBook Air + + - MacBook Air 13" + + - MacBook Air 15" + +- Mac mini + + + +- MacBook Pro + + - MacBook Pro 14" + + - MacBook Pro 16" -By chip: +- Mac Studio -| Manufacturer | Chip | Devices | -|--------------|--------|---------| -| Apple | M4 | iPad Pro 11" & 13", iMac, Mac mini, MacBook Air 13" & 15"| -| Apple | M4 Pro | Mac mini, MacBook Pro 14" & 16" | -| Apple | M4 Max | MacBook Pro 14" & 16", Mac Studio | -By product: -| Manufacturer | Product family | Models | -|--------------|----------------|--------| -| Apple | iPhone 16 | iPhone 16, iPhone 16 Plus, iPhone 16e, iPhone 16 Pro, iPhone 16 Pro Max | \ No newline at end of file diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md index 441c3be98f..6cc5e2382d 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further.md @@ -6,68 +6,72 @@ weight: 12 layout: learningpathall --- -In this section, you will learn about the many different optimizations that are -available to you. +## Beyond this implementation -## Generalize the algorithms +There are many different ways that you can extend and optimize the matrix multiplication algorithm beyond the specific SME2 implementation that you've explored in this Learning Path. While the current approach is tuned for performance on a specific hardware target, further improvements can make your code more general, more efficient, and better suited to a wider range of applications. -In this Learning Path, you focused on using SME2 for matrix multiplication with -floating point numbers. 
However in practice, any library or framework supporting -matrix multiplication should also handle various integer types. +Advanced optimization techniques are essential when adapting algorithms to real-world scenarios. These often include processing matrices of different shapes and sizes, handling mixed data types, or maximizing throughput for large batch operations. The ability to generalize and fine-tune your implementation opens the door to more scalable and reusable code that performs well across workloads. -You can see that the algorithm structure for matrix preprocessing as well as -multiplication with the outer product does not change at all for other data -types - they only need to be adapted. +Whether you're targeting different data types, improving parallelism, or adapting to unusual matrix shapes, these advanced techniques give you more control over both correctness and performance. -This is suitable for languages with [generic -programming](https://en.wikipedia.org/wiki/Generic_programming) like C++ with -templates. You can even make the template manage a case where the value -accumulated during the product uses a larger type than the input matrices. SME2 -has the instructions to deal efficiently with this common case scenario. +Some ideas of improvements that you might like to test out include: -This enables the library developer to focus on the algorithm, testing, and -optimizations, while allowing the compiler to generate multiple variants. +* Generalization +* Loop unrolling +* The strategic use of matrix properties -## Unroll further +## Generalize the algorithm for different data types -You might have noticed that ``matmul_intr_impl`` computes only one tile at a -time, for the sake of simplicity. +So far, you've focused on multiplying floating-point matrices. In practice, matrix operations often involve integer types as well. -SME2 does support multi-vector instructions, and some were used in -``preprocess_l_intr``, for example, ``svld1_x2``. 
+The structure of the algorithm (The core logic - tiling, outer product, and accumulation) remains consistent across data types. It uses preprocessing with tiling and outer product–based multiplication. To adapt it for other data types, you only need to change how values are: -Loading two vectors at a time enables the simultaneous computing of more tiles, -and as the input matrices have been laid out in memory in a neat way, the -consecutive loading of the data is efficient. Implementing this approach can -make improvements to the ``macc`` to load ``ratio``. +* Loaded from memory +* Accumulated (often with widening) +* Stored to the output -In order to check your understanding of SME2, you can try to implement this -unrolling yourself in the intrinsic version (the asm version already has this -optimization). You can check your work by comparing your results to the expected -reference values. +Languages that support [generic programming](https://en.wikipedia.org/wiki/Generic_programming), such as C++ with templates, make this easier. -## Apply strategies +Templates allow you to: -One method for optimization is to use strategies that are flexible depending on -the matrices' dimensions. This is especially easy to set up when working in C or -C++, rather than directly in assembly language. +* Swap data types flexibly +* Handle accumulation in a wider format when needed +* Reuse algorithm logic across multiple matrix types -By playing with the mathematical properties of matrix multiplication and the -outer product, it is possible to minimize data movement as well as reduce the -overall number of operations to perform. +By expressing the algorithm generically, you benefit from the compiler generating multiple optimized variants, allowing you the opportunity to focus on: -For example, it is common that one of the matrices is actually a vector, meaning -that it has a single row or column, and then it becomes advantageous to -transpose it. Can you see why? 
+- Creating efficient algorithm design +- Testing and verification +- SME2-specific optimization -The answer is that as the elements are stored contiguously in memory, an ``Nx1`` -and ``1xN`` matrices have the exact same memory layout. The transposition -becomes a no-op, and the matrix elements stay in the same place in memory. +## Unroll loops to compute multiple tiles + +For clarity, the `matmul_intr_impl` function in this Learning Path processes one tile at a time. However SME2 supports multi-vector operations that enable better performance through loop unrolling. + +For example, the `preprocess_l_intr` function uses: + +```c +svld1_x2(...); // Load two vectors at once +``` +Loading two vectors at a time enables the simultaneous computing of more tiles. Since the matrices are already laid out efficiently in memory, consecutive loading is fast. Implementing this approach can make improvements to the ``macc`` to load ``ratio``. + +In order to check your understanding of SME2, you can try to implement this unrolling yourself in the intrinsic version (the assembly version already has this optimization). You can check your work by comparing your results to the expected reference values. + +## Optimize for special matrix shapes + +One method for optimization is to use strategies that are flexible depending on the matrices' dimensions. This is especially easy to set up when working in C or C++, rather than directly in assembly language. + +By playing with the mathematical properties of matrix multiplication and the outer product, it is possible to minimize data movement as well as reduce the overall number of operations to perform. + +For example, it is common that one of the matrices is actually a vector, meaning that it has a single row or column, and then it becomes advantageous to transpose it. Can you see why? + +The answer is that as the elements are stored contiguously in memory, an ``Nx1``and ``1xN`` matrices have the exact same memory layout. 
The transposition becomes a no-op, and the matrix elements stay in the same place in memory.
+
+An even more *degenerated* case that is easy to manage is when one of the matrices is essentially a scalar, which means that it is a matrix with one row and one column.
+
+Although the current code used here handles it correctly from a results point of view, a different algorithm and use of instructions might be more efficient. Can you think of another way?

-An even more *degenerated* case that is easy to manage is when one of the
-matrices is essentially a scalar, which means that it is a matrix with one row
-and one column.

-Although our current code handles it correctly from a results point of view, a
-different algorithm and use of instructions might be more efficient. Can you
-think of another way?
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md
index 0e824002e2..5c8a6e3f19 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/2-check-your-environment.md
@@ -1,18 +1,16 @@
 ---
-title: Test your environment
+title: Test your SME2 development environment
 weight: 4 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-In this section, you will verify that your environment is set up and ready to
-develop with SME2. This will be your first hands-on experience with the
-environment.
+In this section, you'll verify that your environment is ready for SME2 development.
This is your first hands-on task and confirms that the toolchain, hardware (or emulator), and compiler are set up correctly.

-## Compile the examples
+## Build the code examples

-First, build the code examples by running `make`:
+Use the `make` command to compile all examples and generate assembly listings:

 {{< tabpane code=true >}}
 {{< tab header="Native SME2 support" language="bash" output_lines="2-19">}}
@@ -66,6 +64,8 @@ The `make` command performs the following tasks:
 - It creates the assembly listings for the four executables: `hello.lst`,
   `sme2_check.lst`, `sme2_matmul_asm.lst`, and `sme2_matmul_intr.lst`.

+These targets compile and link all example programs and generate disassembly listings for inspection.
+
 At any point, you can clean the directory of all the files that have been built
 by invoking `make clean`:

@@ -81,7 +81,7 @@ by invoking `make clean`:
 {{< /tab >}}
 {{< /tabpane >}}

-## Basic Checks
+## Run a Hello World program

 The very first program that you should run is the famous "Hello, world!" example
 that will tell you if your environment is set up correctly.

@@ -114,15 +114,19 @@ Run the `hello` program with:
 {{< /tab >}}
 {{< /tabpane >}}

-In the emulated case, there are extra lines that are printed out by the FVP, but
-the important line here is "Hello, world!": it demonstrates that the generic
-code can be compiled and executed.
+In the emulated case, you may see extra lines printed by the FVP. The key confirmation is the presence of "Hello, world!" in the output. It demonstrates that the generic code can be compiled and executed.
+
+## Check SME2 availability
+
+You will now run the `sme2_check` program, which checks that both the compiler and the CPU (or the emulated CPU) properly support SME2.

-## SME2 checks
+The `sme2_check` program verifies that SME2 is available and working. It confirms:

-You will now run the `sme2_check` program, which verifies that SME2 works as
-expected.
This checks both the compiler and the CPU (or the emulated CPU) are
-properly supporting SME2.

+* The compiler supports SME2 (via `__ARM_FEATURE_SME2`)
+
+* The system or emulator reports SME2 capability
+
+* Streaming mode works as expected

 The source code is found in `sme2_check.c`:

@@ -186,19 +190,16 @@ emulated SME2 support), where no operating system has done the setup of the
 processor for the user land programs, an additional step is required to turn
 SME2 on. This is the purpose of the ``setup_sme_baremetal()`` call at line 21.
 In environments where SME2 is natively supported, nothing needs to be done,
-which is why the execution of this function is condionned by the ``BAREMETAl``
+which is why the execution of this function is conditioned by the ``BAREMETAL``
 macro. ``BAREMETAL`` is set to 1 in the ``Makefile`` when the FVP is targeted,
 and set to 0 otherwise. The body of the ``setup_sme_baremetal`` function is
 defined in ``misc.c``.

 The ``sme2_check`` program then displays whether SVE, SME and SME2 are supported
 at line 24. The checking of SVE, SME and SME2 is done differently depending on
-``BAREMETAL``. This platform specific behaviour is abstract by the
-``display_cpu_features()`` :
-- In baremetal mode, our program has access to system registers and can thus do
-  some low level peek at what the silicon actually supports. The program will
-  print the SVE field of the ``ID_AA64PFR0_EL1`` system register and the SME
-  field of the ``ID_AA64PFR1_EL1`` system register.
+``BAREMETAL``. This platform-specific behaviour is abstracted by the
+``display_cpu_features()`` function:
+- In baremetal mode, the program has direct access to system registers and can inspect them for SME2 support. The program will print the SVE field of the ``ID_AA64PFR0_EL1`` system register and the SME field of the ``ID_AA64PFR1_EL1`` system register.
 - In non baremetal mode, on an Apple platform the program needs to use a higher
   level API call.
@@ -217,6 +218,8 @@ annotated with the ``__arm_locally_streaming`` attribute, which instructs the compiler to automatically switch to streaming mode when invoking this function. Streaming mode will be discussed in more depth in the next section. +Look for the following confirmation messages in the output: + {{< tabpane code=true >}} {{< tab header="Native SME2 support" language="bash" output_lines="2-9">}} ./sme2_check @@ -247,5 +250,4 @@ Streaming mode will be discussed in more depth in the next section. {{< /tab >}} {{< /tabpane >}} -You have now checked that the code can be compiled and run with full SME2 -support. You are all set to move to the next section. +You've now confirmed that your environment can compile and run SME2 code, and that SME2 features like streaming mode are working correctly. You're ready to continue to the next section and start working with SME2 in practice. diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md index 0240b6efd8..ac84bd8eef 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/3-streaming-mode.md @@ -1,60 +1,45 @@ --- -title: Streaming mode +title: Streaming mode and ZA state in SME weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -In real-world large-scale software, a program moves back and forth from -streaming mode, and some streaming mode routines call other streaming mode -routines, which means that some state needs to be saved and restored. This -includes the ZA storage. This is defined in the ACLE and supported by the -compiler: the programmer *just* has to annotate the functions with some keywords -and let the compiler automatically perform the low-level tasks of managing the -streaming mode. 
This frees the developer from a tedious and error-prone task. -See [Introduction to streaming and non-streaming -mode](https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode) -for further information. The rest of this section references information from -the ACLE. - -## About streaming mode - -The AArch64 architecture defines a concept called *streaming mode*, controlled -by a processor state bit called `PSTATE.SM`. At any given point in time, the -processor is either in streaming mode (`PSTATE.SM==1`) or in non-streaming mode -(`PSTATE.SM==0`). There is an instruction called `SMSTART` to enter streaming mode -and an instruction called `SMSTOP` to return to non-streaming mode. - -Streaming mode has three main effects on C and C++ code: - -- It can change the length of SVE vectors and predicates: the length of an SVE - vector in streaming mode is called the “streaming vector length” (SVL), which - might be different from the normal non-streaming vector length. See - [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl) - for more details. -- Some instructions can only be executed in streaming mode, which means that - their associated ACLE intrinsics can only be used in streaming mode. These - intrinsics are called “streaming intrinsics”. -- Some other instructions can only be executed in non-streaming mode, which - means that their associated ACLE intrinsics can only be used in non-streaming - mode. These intrinsics are called “non-streaming intrinsics”. - -The C and C++ standards define the behavior of programs in terms of an *abstract -machine*. As an extension, the ACLE specification applies the distinction -between streaming mode and non-streaming mode to this abstract machine: at any -given point in time, the abstract machine is either in streaming mode or in -non-streaming mode. 
- -This distinction between processor mode and abstract machine mode is mostly just -a specification detail. However, the usual “as if” rule applies: the -processor's actual mode at runtime can be different from the abstract machine's -mode, provided that this does not alter the behavior of the program. One -practical consequence of this is that C and C++ code does not specify the exact -placement of `SMSTART` and `SMSTOP` instructions; the source code simply places -limits on where such instructions go. For example, when stepping through a -program in a debugger, the processor mode might sometimes be different from the -one implied by the source code. +## Understanding streaming mode + +Programs can switch between streaming and non-streaming mode during execution. When one streaming-mode function calls another, parts of the processor state - such as ZA storage - might need to be saved and restored. This behavior is governed by the Arm C Language Extensions (ACLE) and is managed by the compiler. + +To use streaming mode, you simply annotate the relevant functions with the appropriate keywords. The compiler handles the low-level mechanics of streaming mode management, removing the need for error-prone, manual work. + +{{% notice Note %}} +For more information, see the [Introduction to streaming and non-streaming mode](https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode). The rest of this section references content from the ACLE specification. +{{% /notice %}} + +## Streaming mode behavior and compiler handling + +Streaming mode changes how the processor and compiler manage execution context. Here's how it works: + +* The AArch64 architecture defines a concept called *streaming mode*, controlled +by a processor state bit `PSTATE.SM`. + +* At any given point in time, the processor is either in streaming mode (`PSTATE.SM == 1`) or in non-streaming mode (`PSTATE.SM == 0`). 
+
+* The `SMSTART` instruction enters streaming mode, and the `SMSTOP` instruction returns to non-streaming mode.
+
+* Streaming mode affects C and C++ code in the following ways:
+
+  - It can change the length of SVE vectors and predicates. The length of an SVE vector in streaming mode is called the *Streaming Vector Length* (SVL), which might differ from the non-streaming vector length. See [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl) for further information.
+  - Some instructions, and their associated ACLE intrinsics, can only be executed in streaming mode. These are called *streaming intrinsics*.
+  - Other instructions are restricted to non-streaming mode. These are called *non-streaming intrinsics*.
+
+The ACLE specification extends the C and C++ abstract machine model to include streaming mode. At any given time, the abstract machine is either in streaming or non-streaming mode.
+
+This distinction between abstract machine mode and processor mode is mostly a specification detail. At runtime, the processor’s mode may differ from the abstract machine’s mode - as long as the observable program behavior remains consistent (as per the "as-if" rule).
+
+{{% notice Note %}}
+One practical consequence of this is that C and C++ code does not specify the exact placement of `SMSTART` and `SMSTOP` instructions; the source code simply places limits on where such instructions go. For example, when stepping through a program in a debugger, the processor mode might sometimes be different from the one implied by the source code.
+{{% /notice %}}

ACLE provides attributes that specify whether the abstract machine executes
statements:

@@ -62,17 +47,19 @@ ACLE provides attributes that specify whether the abstract machine executes stat

- In streaming mode, in which case they are called *streaming statements*.
- In either mode, in which case they are called *streaming-compatible statements*.
-SME provides an area of storage called ZA, of size `SVL.B` x `SVL.B` bytes. It +## Working with ZA state + +SME also introduces a matrix storage area called ZA, sized `SVL.B` × `SVL.B` bytes. It also provides a processor state bit called `PSTATE.ZA` to control whether ZA is enabled. -In C and C++ code, access to ZA is controlled at function granularity: a -function either uses ZA or it does not. Another way to say this is that a -function either “has ZA state” or it does not. +In C and C++, ZA usage is specified at the function level: a function either uses ZA or it doesn't. That is, a function either has ZA state or it does not. + +Functions that use ZA can either: + +- Share the caller’s ZA state +- Allocate a new ZA state for themselves + +When new state is needed, the compiler is responsible for preserving the caller’s state using a *lazy saving* scheme. For more information, see the [AAPCS64 section of the ACLE spec](https://arm-software.github.io/acle/main/acle.html#AAPCS64). -If a function does have ZA state, the function can either share that ZA state -with the function's caller or create new ZA state “from scratch”. In the latter -case, it is the compiler's responsibility to free up ZA so that the function can -use it; see the description of the lazy saving scheme in -[AAPCS64](https://arm-software.github.io/acle/main/acle.html#AAPCS64) for details -about how the compiler does this. 
+ \ No newline at end of file diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md index c91630a18a..f8524ebeae 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/4-vanilla-matmul.md @@ -6,41 +6,48 @@ weight: 6 layout: learningpathall --- -In this section, you will learn about an example of standard matrix multiplication in C. +## Overview + +In this section, you'll implement a basic matrix multiplication algorithm in C using row-major memory layout. This version acts as a reference implementation that you'll use to validate the correctness of optimized versions later in the Learning Path. ## Vanilla matrix multiplication algorithm -The vanilla matrix multiplication operation takes two input matrices, A [Ar -rows x Ac columns] and B [Br rows x Bc columns], to produce an output matrix C -[Cr rows x Cc columns]. The operation consists of iterating on each row of A -and each column of B, multiplying each element of the A row with its corresponding -element in the B column then summing all these products, as Figure 2 shows. +The vanilla matrix multiplication operation takes two input matrices: + +* Matrix A [`Ar` rows x `Ac` columns] +* Matrix B [`Br` rows x `Bc` columns] + +It produces an output matrix C [`Cr` rows x `Cc` columns]. + +The algorithm works by iterating over each row of A and each column of B. It multiplies the corresponding elements and sums the products to generate each element of matrix C, as shown in the figure below. 
 
-![example image alt-text#center](matmul.png "Figure 2: Standard Matrix Multiplication.")
+![Standard Matrix Multiplication alt-text#center](matmul.png "Figure 2: Standard matrix multiplication.")
 
This implies that the A, B, and C matrices have some constraints on their
dimensions:
-- A's number of columns must match B's number of rows: Ac == Br.
-- C has the dimensions Cr == Ar and Cc == Bc.
 
-You can learn more about matrix multiplication, including its history,
-properties and use, by reading this [Wikipedia
-article on Matrix Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication).
+- The number of columns in A must equal the number of rows in B: `Ac == Br`.
+- Matrix C must have the dimensions `Cr == Ar` and `Cc == Bc`.
 
-In this Learning Path, you will see the following variable names:
-- `matLeft` corresponds to the left-hand side argument of the matrix
-  multiplication.
+For more information about matrix multiplication, including its history,
+properties and use, see this [Wikipedia article on Matrix Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication).
+
+## Variable mappings in this Learning Path
+
+The following variable names are used throughout the Learning Path to represent matrix dimensions and operands:
+
+- `matLeft` corresponds to the left-hand side argument of the matrix multiplication.
- `matRight` corresponds to the right-hand side of the matrix multiplication.
- `M` is `matLeft` number of rows.
- `K` is `matLeft` number of columns (and `matRight` number of rows).
- `N` is `matRight` number of columns.
-- `matResult`corresponds to the result of the matrix multiplication, with
-  `M` rows and `N` columns.
## C implementation -A literal implementation of the textbook matrix multiplication algorithm, as -described above, can be found in file `matmul_vanilla.c`: +Here is the full reference implementation from `matmul_vanilla.c`: ```C { line_numbers="true" } void matmul(uint64_t M, uint64_t K, uint64_t N, @@ -60,17 +67,10 @@ void matmul(uint64_t M, uint64_t K, uint64_t N, } ``` -In this Learning Path, the matrices are laid out in memory as contiguous -sequences of elements, in [Row-Major Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). -The `matmul` function performs the algorithm described above. +## Memory layout and pointer annotations + +In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The `matmul` function performs the algorithm described above. -The pointers to `matLeft`, `matRight` and `matResult` have been annotated -as `restrict`, which informs the compiler that the memory areas designated by -those pointers do not alias. This means that they do not overlap in any way, so -that the compiler does not need to insert extra instructions to deal with these -cases. The pointers to `matLeft` and `matRight` are marked as `const` as -neither of these two matrices are modified by `matmul`. +The pointers to `matLeft`, `matRight` and `matResult` have been annotated as `restrict`, which informs the compiler that the memory areas designated by those pointers do not alias. This means that they do not overlap in any way, so that the compiler does not need to insert extra instructions to deal with these cases. The pointers to `matLeft` and `matRight` are marked as `const` as neither of these two matrices are modified by `matmul`. -You now have a reference standard matrix multiplication function. 
You will use -it later on in this Learning Path to ensure that the assembly version and the -intrinsics version of the multiplication algorithm do not contain errors. \ No newline at end of file +This function gives you a working baseline for matrix multiplication. You'll use it later in the Learning Path to verify the correctness of optimized implementations using SME2 intrinsics and assembly. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md index b49a5d2ba8..1e28558f2d 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/5-outer-product.md @@ -6,83 +6,59 @@ weight: 7 layout: learningpathall --- -In this section, you will learn how to use the outer product with the SME engine -to improve matrix multiplication execution performances. +## Overview -## Matrix multiplication with the outer product +In this section, you'll learn how to improve matrix multiplication performance using the SME engine and outer product operations. -In the vanilla matrix multiplication example, the core of the computation is: +This approach increases the number of multiply-accumulate (MACC) operations per memory load, reducing bandwidth pressure and improving overall throughput. + +## Increase MACC efficiency using outer products + +In the vanilla implementation, the core multiply-accumulate step looks like this: ```C acc += matLeft[m * K + k] * matRight[k * N + n]; ``` -This translates to one multiply-accumulate operation, known as `macc`, for two -loads (`matLeft[m * K + k]` and `matRight[k * N + n]`). It therefore has a 1:2 -`macc` to `load` ratio. 
-
-From a memory system perspective, this is not efficient, especially since this
-computation is done within a triple-nested loop, repeatedly loading data from
-memory.
+This translates to one multiply-accumulate operation, known as a `macc`, for two loads (`matLeft[m * K + k]` and `matRight[k * N + n]`) - a 1:2 ratio of multiply-accumulate operations (MACCs) to memory loads, which is inefficient. The inefficiency becomes more pronounced in triple-nested loops and when matrices exceed cache capacity.
 
-To make matters worse, large matrices might not fit in cache. To improve matrix
-multiplication efficiency, the goal is to increase the `macc` to `load` ratio,
-which means increasing the number of multiply-accumulate operations per load.
+To improve performance, you want to increase the `macc` to `load` ratio, which means increasing the number of multiply-accumulate operations per load. One way to achieve this is to express matrix multiplication as a sum of column-by-row outer products.
 
-Figure 3 below illustrates how the matrix multiplication of `matLeft` (3 rows, 2
-columns) by `matRight` (2 rows, 3 columns) can be decomposed as the sum of outer
+The diagram below illustrates how the matrix multiplication of `matLeft` (3 rows, 2
+columns) by `matRight` (2 rows, 3 columns) can be decomposed into a sum of column-by-row outer products:
 
-![example image alt-text#center](outer_product.png "Figure 3: Outer Product-based Matrix Multiplication.")
+![example image alt-text#center](outer_product.png "Figure 3: Outer product-based matrix multiplication.")
 
The SME engine builds on the [Outer
Product](https://en.wikipedia.org/wiki/Outer_product) because matrix
multiplication can be expressed as the [sum of column-by-row outer
products](https://en.wikipedia.org/wiki/Outer_product#Connection_with_the_matrix_product).
-
-## About transposition
+## Optimize memory layout with transposition
 
-From the previous page, you will recall that matrices are laid out in row-major
-order. This means that loading row-data from memory is efficient as the memory
-system operates efficiently with contiguous data. An example of this is where
-caches are loaded row by row, and data prefetching is simple - just load the
-data from `current address + sizeof(data)`. This is not the case for loading
+From the previous page, you will recall that matrices are laid out in row-major order. This means that loading row data from memory is fast, because the memory system handles contiguous data efficiently: caches are loaded row by row, and data prefetching is simple - just load the data from `current address + sizeof(data)`. This is not the case for loading
column-data from memory though, as it requires more work from the memory system.
 
-To further improve matrix multiplication effectiveness, it is therefore
-desirable to change the layout in memory of the left-hand side matrix, called
-`matLeft` in the code examples in this Learning Path. The improved layout would
-ensure that elements from the same column are located next to each other in
-memory. This is essentially a matrix transposition, which changes `matLeft` from
+To further improve matrix multiplication effectiveness, it is desirable to change the layout in memory of the left-hand side matrix, called `matLeft` in the code examples in this Learning Path. The improved layout ensures that elements from the same column are located next to each other in memory. This is essentially a matrix transposition, which changes `matLeft` from
row-major order to column-major order.
 
{{% notice Important %}}
-It is important to note here that this reorganizes the layout of the matrix in
-memory to make the algorithm implementation more efficient. The transposition
-affects only the memory layout.
`matLeft` is transformed to column-major order,
-but from a mathematical perspective, `matLeft` is *not* transposed.
+This transformation affects only the memory layout. From a mathematical perspective, `matLeft` is not transposed. It is reorganized for better data locality.
{{% /notice %}}
 
-### Transposition in the real world
-
-Just as trees don't reach the sky, the SME engine has physical implementation
-limits. It operates with tiles in the ZA storage. Tiles are 2D portions of the
-matrices being processed. SME has dedicated instructions to load and store data
-from tiles efficiently, as well as instructions to operate with and on tiles.
-For example, the
-[fmopa](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en)
-instruction takes two vectors as inputs and accumulates all the outer products
-into a 2D tile. The tile in ZA storage allows SME to increase the `macc` to
-`load` ratio by loading all the tile elements to be used with the SME outer
+### Transposition in practice
+
+The SME engine operates on tiles - 2D blocks of data stored in the ZA storage. SME provides dedicated instructions to load, store, and compute on tiles efficiently.
+
+For example, the [FMOPA](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en) instruction takes two vectors as input and accumulates their outer product into a tile. The tile in ZA storage allows SME to increase the `macc` to `load` ratio by loading all the tile elements to be used with the SME outer
product instructions.
 
-Considering that ZA storage is finite, the desired transposition of the
-`matLeft` matrix discussed in the previous section needs to be adapted to the
-tile dimensions, so that a tile is easy to access.
The `matLeft` preprocessing
-thus involves some aspects of transposition but also takes into account tiling,
-referred to in the code as `preprocess`.
+But since ZA storage is finite, you need to preprocess `matLeft` to match the tile dimensions - this includes transposing portions of the matrix and padding where needed.
+
+### Preprocessing with preprocess_l
 
-Here is what `preprocess_l` does in practice, at the algorithmic level:
+The following function shows how `preprocess_l` transforms the matrix at the algorithmic level:
 
```C { line_numbers = "true" }
void preprocess_l(uint64_t nbr, uint64_t nbc, uint64_t SVL,
@@ -108,12 +84,10 @@ void preprocess_l(uint64_t nbr, uint64_t nbc, uint64_t SVL,
}
```
 
-`preprocess_l` will be used to check that the assembly and intrinsic versions of
-the matrix multiplication perform the preprocessing step correctly. This code is
-located in the file `preprocess_vanilla.c`.
+This routine is defined in `preprocess_vanilla.c`. It's used to ensure the assembly and intrinsics-based matrix multiplication routines work with the expected input format.
 
{{% notice Note %}}
-In real-world applications, it might be possible to arrange for `matLeft` to be
+In production environments, it might be possible to arrange for `matLeft` to be
stored in column-major order, eliminating the need for transposition and making
the preprocessing step unnecessary.
Matrix processing frameworks and libraries often have attributes within the matrix object to track if it is in row- or diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md index 61c5821116..e41965f946 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/6-sme2-matmul-asm.md @@ -5,31 +5,35 @@ weight: 8 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Overview -In this chapter, you will use an SME2-optimized matrix multiplication written -directly in assembly. +In this section, you'll learn how to run an SME2-optimized matrix multiplication implemented directly in assembly. -## Matrix multiplication with SME2 in assembly +This implementation is based on the algorithm described in [Arm's SME Programmer's +Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication) and has been adapted to integrate with the existing C and intrinsics-based code in this Learning Path. It demonstrates how to apply low-level optimizations for matrix multiplication using the SME2 instruction set, with a focus on preprocessing and outer-product accumulation. -### Description +You'll explore how the assembly implementation works in practice, how it interfaces with C wrappers, and how to verify or benchmark its performance. Whether you're validating correctness or measuring execution speed, this example provides a clear, modular foundation for working with SME2 features in your own codebase. -This Learning Path reuses the assembly version provided in the [SME Programmer's +By mastering this assembly implementation, you'll gain deeper insight into SME2 execution patterns and how to integrate low-level optimizations in high-performance workloads. 
+
+## About the SME2 assembly implementation
+
+This Learning Path reuses the assembly version described in [The SME Programmer's
Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication)
-where you will find a high-level and an in-depth description of the two steps
-performed.
+where you will find both high-level concepts and in-depth descriptions of the two key steps:
+preprocessing and matrix multiplication.
+
+The assembly code has been modified to work seamlessly alongside the intrinsic version.
 
-The assembly versions have been modified so they coexist nicely with
-the intrinsic versions. The modifications include:
-- let the compiler manage the switching back and forth from streaming mode,
-- don't use register `x18` which is used as a platform register.
+The key changes include:
+* Delegating streaming mode control to the compiler
+* Avoiding register `x18`, which is reserved as a platform register
 
-In this Learning Path:
-- the `preprocess` function is named `preprocess_l_asm` and is defined in
-  `preprocess_l_asm.S`
-- the outer product-based matrix multiplication is named `matmul_asm_impl`and
-  is defined in `matmul_asm_impl.S`.
+In this Learning Path:
+- The `preprocess` function is named `preprocess_l_asm` and is defined in `preprocess_l_asm.S`
+- The outer product-based matrix multiplication is named `matmul_asm_impl` and is defined in `matmul_asm_impl.S`
 
-Those 2 functions are declared in `matmul.h`:
+Both functions are declared in `matmul.h`:
 
```C
// Matrix preprocessing, in assembly.
@@ -43,14 +47,9 @@ void matmul_asm_impl(
    float *restrict matResult) __arm_streaming __arm_inout("za");
```
 
-You will note that they have been marked with 2 attributes: `__arm_streaming`
-and `__arm_inout("za")`. This instructs the compiler that these functions
-expect the streaming mode to be active, and that they don't new to save /
-restore the ZA storage.
+Both functions are annotated with the `__arm_streaming` and `__arm_inout("za")` attributes. These indicate that each function expects streaming mode to be active and does not need to save or restore the ZA storage.
 
-These two functions are stitched together in `matmul_asm.c` with the
-same prototype as the reference implementation of matrix multiplication, so that
-a top-level `matmul_asm` can be called from the `main` function:
+These two functions are stitched together in `matmul_asm.c` with the same prototype as the reference implementation of matrix multiplication, so that a top-level `matmul_asm` can be called from the `main` function:
 
```C
__arm_new("za") __arm_locally_streaming void matmul_asm(
@@ -63,14 +62,15 @@ __arm_new("za") __arm_locally_streaming void matmul_asm(
}
```
 
-Note that `matmul_asm` has been annotated with 2 attributes:
-`__arm_new("za")` and `__arm_locally_streaming`. This instructs the compiler
-to swith to streaming mode and save the ZA storage (and restore it when the
-function returns).
+You can see that `matmul_asm` is annotated with two attributes: `__arm_new("za")` and `__arm_locally_streaming`. These attributes instruct the compiler to enable streaming mode and manage ZA state on entry and return.
+
+## How it integrates with the main function
+
+The same `main.c` file supports both the intrinsic and assembly implementations. The implementation to use is selected at compile time via the `IMPL` macro. This design reduces duplication and simplifies maintenance.
+
+## Execution modes
 
-The high-level `matmul_asm` function is called from `main.c`. This file might look a bit complex at first sight, but fear not, here are some explanations:
-- the same `main.c` is used for the assembly- and intrinsic-based versions of the matrix multiplication --- this is parametrized at compilation time with the `IMPL` macro. This avoids code duplication and improves maintenance.
-- on a baremetal platform, the program only works in *verification mode*, where it compares the results of the assembly-based (resp. intrinsic-based) matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
+- On a baremetal platform, the program runs in *verification mode*, where it compares the results of the assembly-based matrix multiplication with the vanilla reference implementation. When targeting a non-baremetal platform, a *benchmarking mode* is also available.
 
```C { line_numbers="true" }
#ifndef __ARM_FEATURE_SME2
@@ -221,8 +221,8 @@ int main(int argc, char **argv) {
  float *matResult_ref = (float *)malloc(M * N * sizeof(float));
 
  // Initialize matrices. Input matrices are initialized with random values in
-  // non debug mode. In debug mode, all matrices are initialized with linear
-  // or known values values for easier debugging.
+  // non-debug mode. In debug mode, all matrices are initialized with linear
+  // or known values for easier debugging.
#ifdef DEBUG
  initialize_matrix(matLeft, M * K, LINEAR_INIT);
  initialize_matrix(matRight, K * N, LINEAR_INIT);
@@ -321,36 +321,19 @@ int main(int argc, char **argv) {
}
```
 
-The same `main.c` file is used for the assembly and intrinsic-based versions
-of the matrix multiplication. It first sets the `M`, `K` and `N`
-parameters, to either the arguments supplied on the command line (lines 93-95)
-or uses the default value (lines 73-75). In non-baremetal mode, it also accepts
-(lines 82-89 and lines 98-108), as first parameter, an iteration count `I`
+The same `main.c` file is used for the assembly and intrinsic-based versions of the matrix multiplication. It first sets the `M`, `K` and `N` parameters either to the arguments supplied on the command line (lines 93-95) or to their default values (lines 73-75).
In non-baremetal mode, it also accepts (lines 82-89 and lines 98-108), as first parameter, an iteration count `I` used for benchmarking. -Depending on the `M`, `K`, `N` dimension parameters, `main` allocates -memory for all the matrices and initializes `matLeft` and `matRight` with -random data. The actual matrix multiplication implementation is provided through -the `IMPL` macro. +Depending on the `M`, `K`, `N` dimension parameters, `main` allocates memory for all the matrices and initializes `matLeft` and `matRight` with random data. The actual matrix multiplication implementation is provided through the `IMPL` macro. -In *verification mode*, it then runs the matrix multiplication from `IMPL` -(line 167) and computes the reference values for the preprocessed matrix as well -as the result matrix (lines 170 and 171). It then compares the actual values to -the reference values and reports errors, if there are any (lines 173-177). -Finally, all the memory is deallocated (lines 236-243) before exiting the +In *verification mode*, it then runs the matrix multiplication from `IMPL` (line 167) and computes the reference values for the preprocessed matrix as well as the result matrix (lines 170 and 171). It then compares the actual values to the reference values and reports errors, if there are any (lines 173-177). Finally, all the memory is deallocated (lines 236-243) before exiting the program with a success or failure return code at line 245. -In *benchmarking mode*, it will first run the vanilla reference matrix -multiplication (resp. assembly- or intrinsic-based matrix multiplication) 10 -times without measuring elapsed time to warm-up the CPU. It will then measure -the elapsed execution time of the vanilla reference matrix multiplication (resp. -assembly- or intrinsic-based matrix multiplication) `I` times and then compute +In *benchmarking mode*, it will first run the vanilla reference matrix multiplication (resp. 
assembly- or intrinsic-based matrix multiplication) 10 times without measuring elapsed time to warm up the CPU. It will then measure the elapsed execution time of the vanilla reference matrix multiplication (resp. assembly- or intrinsic-based matrix multiplication) `I` times and then compute
and report the minimum, maximum and average execution times.
 
{{% notice Note %}}
-Benchmarking and profiling are not simple tasks. The purpose of this learning path
-is to provide some basic guidelines on the performance improvement that can be
-obtained with SME2.
+Benchmarking and profiling are not simple tasks. The purpose of this Learning Path is to provide some basic guidelines on the performance improvement that can be obtained with SME2.
{{% /notice %}}
 
### Compile and run it
@@ -395,7 +378,7 @@ whether the preprocessing and matrix multiplication passed (`PASS`) or failed
(`FAILED`) the comparison with the vanilla reference implementation.
 
{{% notice Tip %}}
-The example above uses the default values for the `M` (125), `K`(25) and `N`(70)
+The example above uses the default values for the `M` (125), `K` (70) and `N` (70)
parameters. You can override this and provide your own values on the command line:
 
{{< tabpane code=true >}}
@@ -408,5 +391,5 @@ parameters. You can override this and provide your own values on the command lin
{{< /tab >}}
{{< /tabpane >}}
 
-Here the values `M=7`, `K=8` and `N=9` are used instead.
+In this example, `M=7`, `K=8`, and `N=9` are used.
{{% /notice %}}
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md
index a170de702d..eba6850aaf 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/7-sme2-matmul-intr.md
@@ -1,30 +1,22 @@
---
-title: SME2 intrinsics matrix multiplication
+title: Matrix multiplication using SME2 intrinsics in C
weight: 9
 
### FIXED, DO NOT MODIFY
layout: learningpathall
---
 
-In this section, you will write an SME2 optimized matrix multiplication in C
-using the intrinsics that the compiler provides.
+In this section, you will write an SME2-optimized matrix multiplication routine in C using the intrinsics that the compiler provides.
 
-## Matrix multiplication with SME2 intrinsics
+## What are intrinsics?
 
-*Intrinsics*, also know known as *compiler intrinsics* or *intrinsic functions*,
-are the functions available to application developers that the compiler has an
-intimate knowledge of. This enables the compiler to either translate the
-function to a specific instruction or to perform specific optimizations, or
-both.
+*Intrinsics*, also known as *compiler intrinsics* or *intrinsic functions*, are the functions available to application developers that the compiler has intimate knowledge of. This enables the compiler to either translate the function to a specific instruction or to perform specific optimizations, or both.
 
You can learn more about intrinsics in this [Wikipedia Article on Intrinsic
Function](https://en.wikipedia.org/wiki/Intrinsic_function).
 
Using intrinsics allows the programmer to use the specific instructions required
This produces performance close
-to what can be reached with hand-written assembly whilst being significantly
-more maintainable and portable.
+to achieve the required performance while writing all the typically required standard code, such as loops, in C. This produces performance close to what can be reached with hand-written assembly whilst being significantly more maintainable and portable.

All Arm-specific intrinsics are specified in the
[ACLE](https://github.com/ARM-software/acle), which is the Arm C Language Extension. ACLE
@@ -51,10 +43,7 @@
Note the `__arm_new("za")` and `__arm_locally_streaming` at line 1 that will
make the compiler save the ZA storage so we can use it without destroying its
content if it was still in use by one of the callers.

-`SVL`, the dimension of the ZA storage, is requested from the underlying
-hardware with the `svcntsw()` function call at line 5, and passed down to the
-`preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a
-function provided be the ACLE library.
+`SVL`, the dimension of the ZA storage, is requested from the underlying hardware with the `svcntsw()` function call at line 5, and passed down to the `preprocess_l_intr` and `matmul_intr_impl` functions. `svcntsw()` is a function provided by the ACLE library.

### Matrix preprocessing

diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md
index c38a7d3a94..da241b1ad7 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/8-benchmarking.md
@@ -6,36 +6,28 @@ weight: 10
layout: learningpathall
---

-In this section, if your machine supports native execution of SME2 instructions,
-you will perform benchmarking of the matrix multiplication improvement thanks to
-SME2.
+If your machine supports native execution of SME2 instructions, this section shows you how to benchmark the matrix multiplication performance improvement from SME2.

## About benchmarking and emulation

Emulation is generally not the best way to assess the performance of a piece of
-code. Emulation focuses on correctly simulating instructions and leaves out many
-details necessary for precise execution time measurement. For example, as
-explained in the section on the outer product, the goal was to increase the
-`macc` to `load` ratio. Emulators, including the FVP, do not model in detail the
-cache effects or the timing effects of the memory accesses. At best, an emulator
-can provide an instruction count for the vanilla reference implementation versus
-the assembly-/intrinsic-based versions of the matrix multiplication, but this is
-known to be a poor proxy for execution time comparisons.
+code. Emulation focuses on correctly simulating instructions rather than on accurate execution timing. For example, as explained in the [outer product section](../5-outer-product/), improving performance involves increasing the `macc`-to-`load` ratio.

-## Benchmarking on platform with native SME2 support

+Emulators, including the FVP, do not model memory bandwidth, cache behavior, or latency in detail. At best, an emulator provides an instruction count for the vanilla reference implementation versus the assembly-/intrinsic-based versions of the matrix multiplication, which is useful for functional validation but not for precise benchmarking.
+
+## Benchmarking on a platform with native SME2 support

{{% notice Note %}}
-Benchmarking and profiling are not simple tasks. The purpose of this learning path
-is to provide some basic guidelines on the performance improvement that can be
-obtained with SME2.
+Benchmarking and profiling are complex tasks. This Learning Path provides a *simplified* framework for observing SME2-related performance improvements.
{{% /notice %}}

-If your machine natively supports SME2, then benchmarking becomes possible. When
+If your machine natively supports SME2, then benchmarking is possible. When
`sme2_matmul_asm` and `sme2_matmul_intr` were compiled with `BAREMETAL=0`, the
-*benchmarking mode* becomes available.
+*benchmarking mode* is available.
+
+*Benchmarking mode* is enabled by prepending the `M`, `K`, `N` optional parameters with an iteration count (`I`).

-*Benchmarking mode* is enabled by prepending the `M`, `K`, `N` optional
-parameters with an iteration count (`I`).
+## Run the intrinsic version

Now measure the execution time of `sme2_matmul_intr` for 1000 multiplications
of matrices with the default sizes:
@@ -47,11 +39,7 @@ Reference implementation: min time = 101 us, max time = 438 us, avg time = 139.4
SME2 implementation *intr*: min time = 1 us, max time = 8 us, avg time = 1.82 us
```

-The execution time is reported in microseconds. A wide spread between the
-minimum and maximum figures can be noted and is expected as the way of doing the
-benchmarking is simplified for the purpose of simplicity. You will, however,
-note that the intrinsic version of the matrix multiplication brings on average a
-76x execution time reduction.
+The execution time is reported in microseconds. A wide spread between the minimum and maximum figures is expected, because the benchmarking method is deliberately simplified. Note, however, that the intrinsic version of the matrix multiplication delivers, on average, a 76x reduction in execution time.

{{% notice Tip %}}
You can override the default values for `M` (125), `K` (25), and `N` (70) and
@@ -86,7 +74,7 @@ here is far from being an apples-to-apples comparison:
- Firstly, the assembly version has some requirements on the `K` parameter that
  the intrinsic version does not have.
- Second, the assembly version has an optimization that the intrinsic version, - for the sake of readability in this learning path, does not have (see the + for the sake of readability in this Learning Path, does not have (see the [Going further section](/learning-paths/cross-platform/multiplying-matrices-with-sme2/10-going-further/) to know more). diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md index a6debde6f4..d05e5a7ea0 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/9-debugging.md @@ -6,19 +6,15 @@ weight: 11 layout: learningpathall --- -In practice, writing code can be complex and debugging code is required. +Debugging is an essential part of development, especially when working close to the hardware. -In this section, you will learn about the different ways to debug SME2 code. +In this section, you will learn about the different ways to debug and troubleshoot SME2 code. -## Looking at the generated code +## Inspect the generated assembly -In some cases, it is useful to look at the code generated by the compiler. In -this Learning Path, the assembly listings have been produced and you can inspect -them. +Sometimes it's helpful to review the assembly code generated by the compiler. In this Learning Path, listings have already been generated for you. You can inspect these files to verify that SME2 instructions were emitted correctly. -For example, the inner loop with the outer product and the accumulation of the -matrix multiplication with intrinsics from the listing file -`sme2_matmul_intr.lst` looks like this: +For example, here’s a snippet from `sme2_matmul_intr.lst`, showing the inner loop of the matrix multiplication using intrinsics: ```TXT ... 
@@ -31,27 +27,21 @@ matrix multiplication with intrinsics from the listing file
8000186c: 54ffff41     	b.ne	0x80001854
...
```
+This sequence shows how `ld1w` loads vector registers, followed by the `fmopa` outer product operation.

-### With debuggers
+### Debug with gdb or lldb

-Both of the main debuggers, `gdb` and `lldb`, have some support for
-debugging SME2 code. Their usage is not shown in this Learning Path though.
+Both of the main debuggers, `gdb` and `lldb`, have some support for debugging SME2 code. However, their usage is not shown in this Learning Path.

-Note that debugging on the emulator might require some more steps as this is a
-simplistic, and minimalistic environment, without an operating system, for
-example. Debug mode requires a debug monitor to interface between the debugger,
-the program, and the CPU.
+{{% notice Note %}}
+If you're using the FVP emulator, debugging is more complex. Because there's no operating system, you'll need a debug monitor to interface between your program, the CPU, and your debugger.
+{{% /notice %}}

-### With trace
+### Analyze instruction trace with Tarmac

-The FVP can emit an instruction trace file in text format, known as the Tarmac
-trace. This provides a convenient way for you to understand what the program is
-doing.
+The FVP can emit an instruction trace file in text format, known as the Tarmac trace. This trace shows instruction-by-instruction execution and register contents, which is helpful for low-level debugging.

-In the excerpt shown below, you can see that the SVE register `z0` has been
-loaded with 16 values, as predicate `p0` was true, with an `LD1W`
-instruction, whereas `z1` was loaded with only two values, as `p1`. `z0`,
-and `z1` are later used by the `fmopa` instruction to compute the outer
+In the excerpt shown below, you can see that the SVE register `z0` has been loaded with 16 values, as predicate `p0` was true, with an `LD1W` instruction, whereas `z1` was loaded with only two values, as dictated by predicate `p1`.
`z0` and `z1` are later used by the `fmopa` instruction to compute the outer
product, and the trace displays the content of the ZA storage.

```TXT
@@ -103,21 +93,17 @@ product, and the trace displays the content of the ZA storage.
923580000 ps R ZA0H_S_15 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4479e70a_44f4223e
```

-You can get a Tarmac trace when invoking `run-fvp.sh` by adding the
-`--trace` option as the *first* argument, for example:
+You can get a Tarmac trace when invoking `run-fvp.sh` by adding the `--trace` option as the *first* argument, for example:

```BASH
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 ./run-fvp.sh --trace sme2_matmul_asm
```

-Tracing is not enabled by default. It slows down the simulation significantly and the trace file can become very large for programs with large matrices.
-
-{{% notice Debugging tip %}}
-It can be helpful when debugging to understand where an element in the
-Tile is coming from. The current code base allows you to do that in `debug`
-mode, when `-DDEBUG` is passed to the compiler in the `Makefile`. If you
-look into `main.c`, you will notice that the matrix initialization is no
-longer random, but instead initializes each element with its linear
-index. This makes it *easier* to find where the matrix elements are loaded in
-the tile in tarmac trace, for example.
-{{% /notice %}} \ No newline at end of file
+{{% notice Tip %}}
+Tracing is disabled by default because it significantly slows down simulation and generates large files for big matrices.
+{{% /notice %}}
+
+## Use debug mode for matrix inspection
+
+It can be helpful when debugging to understand where an element in the tile is coming from. The current code base allows you to do that in `debug` mode, when `-DDEBUG` is passed to the compiler in the `Makefile`.
If you look into `main.c`, you will notice that the matrix initialization is no
+longer random, but instead sets each element to its linear index. This makes it *easier* to find where the matrix elements are loaded in the tile in the Tarmac trace, for example.
diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md
index 3533edb21f..918fcc44f8 100644
--- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md
+++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/_index.md
@@ -1,29 +1,25 @@
---
title: Accelerate Matrix Multiplication Performance with SME2
-draft: true
-cascade:
-  draft: true
+minutes_to_complete: 60

-minutes_to_complete: 30
-
-who_is_this_for: This Learning Path is an advanced topic for developers who want to learn about accelerating the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2).
+who_is_this_for: This Learning Path is an advanced topic for developers who want to accelerate the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2).

learning_objectives:
-    - Implement a reference matrix multiplication without using SME2.
-    - Use SME2 assembly instructions to improve the matrix multiplication performance.
-    - Use SME2 intrinsics to improve the matrix multiplication performance using the C programming language.
-    - Compile code with SME2 instructions.
-    - Run code with SME2 instructions, on a platform with SME2 support or with an emulator.
+ - Implement a baseline matrix multiplication kernel in C without SME2 + - Use SME2 assembly instructions to accelerate matrix multiplication performance + - Use SME2 intrinsics to vectorize and optimize matrix multiplication + - Compile code with SME2 intrinsics and assembly + - Benchmark and validate SME2-accelerated matrix multiplication on Arm hardware or in a Linux-based emulation environment + - Compare performance metrics between baseline and SME2-optimized implementations prerequisites: - - Basic knowledge of Arm's Scalable Matrix Extension (SME). - - Basic knowledge of Arm's Scalable Vector Extension (SVE). - - An intermediate understanding of C programming language and assembly language. - - A computer running Linux, macOS, or Windows. - - Installations of Git and Docker. - - A platform that support SME2 (see the [list of devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions. - - A compiler with support for SME2 instructions. 
+ - Working knowledge of Arm’s SVE and SME instruction sets + - Intermediate proficiency with the C programming language and the Armv9-A assembly language + - A computer running Linux, macOS, or Windows + - Installations of Git and Docker for project setup and emulation + - A platform that supports SME2 (see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-sme2-support)) or an emulator to run code with SME2 instructions + - Compiler support for SME2 instructions (for example, LLVM 17+ with SME2 backend support) author: Arnaud de Grandmaison @@ -37,6 +33,7 @@ tools_software_languages: - C - Clang - Runbook + - LLVM operatingsystems: - Linux diff --git a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md index 008b274479..ed46aca044 100644 --- a/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md +++ b/content/learning-paths/cross-platform/multiplying-matrices-with-sme2/overview.md @@ -6,33 +6,45 @@ weight: 2 layout: learningpathall --- -# Overview of Arm's Scalable Matrix Extension Version 2 +## Arm's Scalable Matrix Extension Version 2 (SME2) -### What is SME2? +Arm’s Scalable Matrix Extension Version 2 (SME2) is a hardware feature designed to accelerate dense linear algebra operations, enabling high-throughput execution of matrix-based workloads. -The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture. The Scalable Matrix Extension Version 2 (SME2) extends the SME architecture by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication. 
+Whether you're building for AI inference, HPC, or scientific computing, SME2 provides fine-grained control and high-performance vector processing.

## Extending the SME architecture

-SME2 extends SME by introducing multi-vector data-processing instructions, load to and store from multi-vectors, and a multi-vector predication mechanism.
-Additional architectural features of SME2 include:
+SME is an extension to the Armv9-A architecture and is designed to accelerate matrix-heavy computations, such as outer products and matrix-matrix multiplications.

-* Multi-vector multiply-accumulate instructions, with Z vectors as multiplier and multiplicand inputs and accumulating results into ZA array vectors, including widening multiplies that accumulate into more vectors than they read.
+SME2 builds on SME by accelerating vector operations to increase the number of applications that can benefit from the computational efficiency of SME, beyond its initial focus on outer products and matrix-matrix multiplication.

-* Multi-vector load, store, move, permute, and convert instructions, that use multiple SVE Z vectors as source and destination registers to pre-process inputs and post-process outputs of the ZA-targeting SME2 instructions.
+## Key architectural features of SME2

-* *Predicate-as-counter*, which is an alternative predication mechanism that is added to the original SVE predication mechanism, to control operations performed on multiple vector registers.
+SME2 adds several capabilities to the original SME architecture:

-* Compressed neural network capability using dedicated lookup table instructions and outer product instructions that support binary neural networks.
+* **Multi-vector multiply-accumulate instructions** that use Z vectors as multiplier and multiplicand inputs and accumulate results into ZA array vectors.
This includes widening multiplies that write to more vectors than they read from.

-### Suggested reading
+* **Multi-vector load, store, move, permute, and convert instructions** that use multiple SVE Z vectors as source and destination registers to efficiently pre-process inputs and post-process outputs of the ZA-targeting SME2 instructions.

-If you are not familiar with matrix multiplication, or would benefit from refreshing your knowledge, this [Wikipedia article on Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) is a good start.
+* A **predicate-as-counter mechanism**, added alongside the original SVE predication mechanism to enable fine-grained control over operations across multiple vector registers.

-This Learning Path assumes some basic understanding of SVE and SME. If you are not familiar with SVE or SME, these are some useful resources that you can read first:
-- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture).
-- [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction).
-- [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2).
-- [Part 3: Matrix-matrix multiplication. Neon, SVE, and SME compared](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared).
-- [Build adaptive libraries with multiversioning](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file
+* **Compressed neural network support**, using dedicated lookup table and outer product instructions that support binary neural network workloads.
+
+* A **512-bit architectural register ZT0** that supports the lookup table feature, enabling fast, table-driven data transformations.
+
+## Further information
+
+This Learning Path assumes some basic understanding of SVE, SME, and matrix multiplication. However, if you want to refresh or grow your knowledge, you might find these resources helpful:
+
+On matrix multiplication:
+
+- The [Wikipedia article](https://en.wikipedia.org/wiki/Matrix_multiplication)
+
+On SVE and SME:
+
+- [Introducing the Scalable Matrix Extension for the Armv9-A Architecture - Martin Weidmann, Arm](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture)
+- [Arm Scalable Matrix Extension (SME) Introduction (Part 1) - Zenon Xiu](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction)
+- [Arm Scalable Matrix Extension (SME) Introduction (Part 2) - Zenon Xiu](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2)
+- [Matrix-matrix multiplication. Neon, SVE, and SME compared (Part 3)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared)
+- [Learn about function multiversioning - Alexandros Lamprineas, Arm](https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/) \ No newline at end of file