diff --git a/content/learning-paths/cross-platform/sme2/1-get-started.md b/content/learning-paths/cross-platform/sme2/1-get-started.md new file mode 100644 index 0000000000..61031df5d2 --- /dev/null +++ b/content/learning-paths/cross-platform/sme2/1-get-started.md @@ -0,0 +1,212 @@ +--- +title: Set up your Environment +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Installing software for this Learning Path + +To follow this Learning Path, you will need to set up an environment to develop with SME2. + +You will require: + + - A compiler with support for SME2 instructions. You can use [Clang](https://www.llvm.org/) + version 18 or later, or [GCC](https://gcc.gnu.org/) version 14, or later. This Learning + Path uses ``Clang``. + + - An emulator to execute code with the SME2 instructions. This Learning + Path uses [Arm's Fixed Virtual Platform (FVP) model](https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms). + +You will also require Git and Docker installed on your machine. + +### Set up Git + +To check if Git is already installed on your machine, use the following command line in a terminal: + +```BASH { output_lines=2 } +git --version +git version 2.47.1 +``` + +If the above command line fails with a message similar to "``git: command not found``", then install Git following the steps for your machine's OS. + +{{< tabpane code=true >}} + {{< tab header="Linux/Ubuntu" language="bash">}} +sudo apt install git + {{< /tab >}} + {{< tab header="macOS" language="bash">}} +brew install git + {{< /tab >}} +{{< /tabpane >}} + +### Docker + +To enable you to get started easily and with the tools that you need, you can fetch a Docker container with the required compiler and FVP. Alternatively, if you do wish to build the container yourself, the ``Dockerfile`` is also available. + + +{{% notice Note %}} +This Learning Path works without ``docker``, but the compiler and the FVP must be available in your search path. 
+{{% /notice %}} + +Start by checking that ``docker`` is installed on your machine by typing the following +command line in a terminal: + +```BASH { output_lines="2" } +docker --version +Docker version 27.3.1, build ce12230 +``` + +If the above command fails with a message similar to "``docker: command not found``" +then follow the steps from the [Docker Install Guide](https://learn.arm.com/install-guides/docker/). + +{{% notice Note %}} +You might need to login again or restart your machine for the changes to take effect. +{{% /notice %}} + +Once you have confirmed that Docker is installed on your machine, you can check that it is operating normally with the following: + +```BASH { output_lines="2-27" } +docker run hello-world +Unable to find image 'hello-world:latest' locally +latest: Pulling from library/hello-world +478afc919002: Pull complete +Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966 +Status: Downloaded newer image for hello-world:latest + +Hello from Docker! +This message shows that your installation appears to be working correctly. + +To generate this message, Docker followed these steps: + + 1. The Docker client contacted the Docker daemon. + + 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. + (arm64v8) + + 3. The Docker daemon created a new container from that image which runs the + executable that produces the output you are currently reading. + + 4. The Docker daemon streamed that output to the Docker client, which sent it + to your terminal. 
+ +To try something more ambitious, you can run an Ubuntu container with: + $ docker run -it ubuntu bash + +Share images, automate workflows, and more with a free Docker ID: + https://hub.docker.com/ + +For more examples and ideas, visit: + https://docs.docker.com/get-started/ +``` + +## Environment + +Now, using Git, clone the environment for experimenting with SME2 to a directory +named ``SME2.git``: + +```BASH +git clone https://gitlab.arm.com/learning-code-examples/TODO_SOME_PATH SME2-learning-path.git +``` + +This list of content in the repository should look like this : + +```TXT +SME2-learning-path.git/ +├── .clang-format +├── .devcontainer/ +│ └── devcontainer.json +├── .git/ +├── .gitignore +├── Makefile +├── README.rst +├── docker/ +│ ├── assets.source_me +│ ├── build-all-containers.sh +│ ├── build-my-container.sh +│   └── sme2-environment.docker +├── hello.c +├── main.c +├── matmul.h +├── matmul_asm.c +├── matmul_asm_impl.S +├── matmul_intr.c +├── matmul_vanilla.c +├── misc.c +├── misc.h +├── preprocess_l_asm.S +├── preprocess_vanilla.c +├── run-fvp.sh +└── sme2_check.c +``` + +It contains: +- Code examples. +- A ``Makefile`` that builds the code examples. +- A shell script called ``run-fvp.sh`` that runs the FVP. +- A directory called ``docker`` that contains materials related to Docker, which are: + - A script called ``assets.source_me`` that provides the FVP and compiler toolchain references. + - A Docker recipe called ``sme2-environment.docker`` to build the container that + you will use. + - A shell script called ``build-my-container.sh`` that you can use if you want to build the Docker container. This is not essential however, as ready-made images are made available for you. + - A script called ``build-all-containers.sh`` that was used to create the image for you to download to provide multi-architecture support for both x86_64 and AArch64. 
- A VS Code configuration file, ``.devcontainer/devcontainer.json``, that enables the IDE to use the container.

The next step is to change directory to your checkout:

```BASH
cd SME2-learning-path.git
```
{{% notice Note %}}
From this point in the Learning Path, all instructions assume that your current
directory is ``SME2-learning-path.git``.{{% /notice %}}


## Using the environment

Docker containers enable you to execute commands in an isolated environment that has all the tools you require, without cluttering your machine. Containers run independently, which means that they do not interfere with other containers on the same machine or server.

You can use Docker in the following ways:
- Directly from the command line. For example, when you are working from a terminal on your local machine.
- Within a containerized environment. Configure VS Code to run all the commands inside a Docker container, allowing you to work seamlessly within the Docker environment.

### Working from a terminal

When a command is executed in the Docker container environment, you must prepend it with instructions on the command line so that your shell executes it within the container.

For example, to execute ``COMMAND ARGUMENTS`` in the SME2 Docker container, the command line looks like this:

```SH
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 COMMAND ARGUMENTS
```

This invokes Docker, using the
``armswdev/sme2-learning-path:sme2-environment-v1`` container
image, mounts the current working directory (``SME2-learning-path.git``)
inside the container at ``/work``, then sets ``/work`` as the
working directory and runs ``COMMAND ARGUMENTS`` in this environment.
+ +For example, to run ``make``, you need to enter: + +```SH +docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 make +``` + +### Working from within the Docker container + +Make sure you have the [Microsoft Dev +Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) +extension installed. + +Then select the **Reopen in Container** menu entry as Figure 1 shows. + +It automatically finds and uses ``.devcontainer/devcontainer.json``: + +![example image alt-text#center](VSCode.png "Figure 1: Setting up the Docker Container.") + +All your commands now run within the container, so there is no need to prepend them with a Docker invocation, as VS Code handles all this seamlessly for you. + +{{% notice Note %}} +For the rest of this Learning Path, shell commands include the full Docker invocation so that users not using VS Code can copy the complete command line. However, if you are using VS Code, you only need to use the `COMMAND ARGUMENTS` part. +{{% /notice %}} \ No newline at end of file diff --git a/content/learning-paths/cross-platform/sme2/2-check-your-environment.md b/content/learning-paths/cross-platform/sme2/2-check-your-environment.md new file mode 100644 index 0000000000..267feb92ce --- /dev/null +++ b/content/learning-paths/cross-platform/sme2/2-check-your-environment.md @@ -0,0 +1,173 @@ +--- +title: Test your environment +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +In this section, you will check that your environment is all set up and ready to develop with SME2. This will be your first hands-on experience with the environment. 
+ +## Compile the examples + +First, compile the example code with Clang: + +```BASH { output_lines="2-19" } +docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 make +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -nostartfiles -lcrt0-semihost -lsemihost -Wl,--defsym=__boot_flash=0x80000000 -Wl,--defsym=__flash=0x80001000 -Wl,--defsym=__ram=0x81000000 -T picolibc.ld -o hello hello.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o sme2_check.o sme2_check.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o misc.o misc.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -nostartfiles -lcrt0-semihost -lsemihost -Wl,--defsym=__boot_flash=0x80000000 -Wl,--defsym=__flash=0x80001000 -Wl,--defsym=__ram=0x81000000 -T picolibc.ld -o sme2_check sme2_check.o misc.o +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -DIMPL=asm -c -o main_asm.o main.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o matmul_asm.o matmul_asm.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o matmul_asm_impl.o matmul_asm_impl.S +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o preprocess_l_asm.o preprocess_l_asm.S +clang --target=aarch64-none-elf -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o matmul_vanilla.o matmul_vanilla.c +clang --target=aarch64-none-elf -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o 
preprocess_vanilla.o preprocess_vanilla.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -nostartfiles -lcrt0-semihost -lsemihost -Wl,--defsym=__boot_flash=0x80000000 -Wl,--defsym=__flash=0x80001000 -Wl,--defsym=__ram=0x81000000 -T picolibc.ld -o sme2_matmul_asm main_asm.o matmul_asm.o matmul_asm_impl.o preprocess_l_asm.o matmul_vanilla.o preprocess_vanilla.o misc.o +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -DIMPL=intr -c -o main_intr.o main.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -c -o matmul_intr.o matmul_intr.c +clang --target=aarch64-none-elf -march=armv9.4-a+sme2 -fno-exceptions -fno-rtti -mno-unaligned-access -O2 -Wall -std=c99 -nostartfiles -lcrt0-semihost -lsemihost -Wl,--defsym=__boot_flash=0x80000000 -Wl,--defsym=__flash=0x80001000 -Wl,--defsym=__ram=0x81000000 -T picolibc.ld -o sme2_matmul_intr main_intr.o matmul_intr.o matmul_vanilla.o preprocess_vanilla.o misc.o +llvm-objdump --demangle -d hello > hello.lst +llvm-objdump --demangle -d sme2_check > sme2_check.lst +llvm-objdump --demangle -d sme2_matmul_asm > sme2_matmul_asm.lst +llvm-objdump --demangle -d sme2_matmul_intr > sme2_matmul_intr.lst +``` + + Executed within the docker ``armswdev/sme2-learning-path:sme2-environment-v1`` environment, the ``make`` command performs the following tasks: + +- It builds four executables: ``hello``, ``sme2_check``, ``sme2_matmul_asm``, and ``sme2_matmul_intr``. +- It creates the assembly listings for the four executables: ``hello.lst``, ``sme2_check.lst``, ``sme2_matmul_asm.lst``, and ``sme2_matmul_intr.lst``. 
{{% notice Note %}}
At any point, you can clean the directory of all the files that have been built by invoking the ``make clean`` target:

```BASH
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 make clean
```
{{% /notice %}}

## Basic checks

The first program that you should run is the famous "Hello, world !" example, which
tells you whether your environment is set up correctly.

The source code is contained in ``hello.c`` and looks like this:

```C
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  printf("Hello, world !\n");
  return EXIT_SUCCESS;
}
```

Run the FVP simulation of the ``hello`` program with:

```BASH { output_lines="2-4" }
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 ./run-fvp.sh hello
Hello, world !

Info: /OSCI/SystemC: Simulation stopped by user.
```

The important line here is "``Hello, world !``", as it demonstrates that generic code
can be compiled and run on the FVP.

## SME2 checks

You will now run the ``sme2_check`` program, which checks that SME2 works as
expected, in both the compiler and in the FVP.
The source code is found in
``sme2_check.c``:

```C
#include <stdio.h>
#include <stdlib.h>

#include "misc.h"

#ifdef __ARM_FEATURE_SME2
#include <arm_sme.h>
#else
#error __ARM_FEATURE_SME2 is not defined
#endif

#define get_cpu_ftr(regId, feat, msb, lsb)                                     \
  ({                                                                           \
    unsigned long __val;                                                       \
    __asm__("mrs %0, " #regId : "=r"(__val));                                  \
    printf("%-20s: 0x%016lx\n", #regId, __val);                                \
    printf(" - %-10s: 0x%08lx\n", #feat,                                       \
           (__val >> lsb) & ((1 << (msb - lsb)) - 1));                         \
  })

int main(int argc, char *argv[]) {
  get_cpu_ftr(ID_AA64PFR0_EL1, SVE, 35, 32);
  get_cpu_ftr(ID_AA64PFR1_EL1, SME, 27, 24);

  int n = 0;
#ifdef __ARM_FEATURE_SME2
  setup_sme();
  n = svcntb() * 8;
#endif
  if (n) {
    printf("SVE is available with length %d\n", n);
  } else {
    printf("SVE is unavailable.\n");
    exit(EXIT_FAILURE);
  }

  printf("Checking has_sme: %d\n", __arm_has_sme());
  printf("Checking in_streaming_mode: %d\n", __arm_in_streaming_mode());

  printf("Starting streaming mode...\n");
  __asm__("smstart");

  printf("Checking in_streaming_mode: %d\n", __arm_in_streaming_mode());

  printf("Stopping streaming mode...\n");
  __asm__("smstop");

  printf("Checking in_streaming_mode: %d\n", __arm_in_streaming_mode());

  return EXIT_SUCCESS;
}
```

The ``sme2_check`` program displays the SVE field of the ``ID_AA64PFR0_EL1`` system register and the SME field of the ``ID_AA64PFR1_EL1`` system register. It then checks that SVE and SME are available, and finally switches into streaming mode and back out of it.

The ``__ARM_FEATURE_SME2`` macro is provided by the compiler when compiling for an SME2-capable target, which is specified with the ``-march=armv9.4-a+sme2`` command line option to ``clang`` in
the ``Makefile``.

The ``arm_sme.h`` include file is part of the Arm C Language
Extension ([ACLE](https://arm-software.github.io/acle/main/)).
The ACLE provides types and function declarations that enable C/C++ programmers to make the best possible use of the Arm architecture. You can use the SME-related part of the library, but it also provides support for Neon and other Arm architectural extensions.

Run ``sme2_check`` on the FVP with:

```BASH
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 ./run-fvp.sh sme2_check
```

The output should be similar to:

```TXT
ID_AA64PFR0_EL1     : 0x1101101131111112
 - SVE       : 0x00000001
ID_AA64PFR1_EL1     : 0x0000101002000001
 - SME       : 0x00000002
SVE is available with length 512
Checking has_sme: 1
Checking in_streaming_mode: 0
Starting streaming mode...
Checking in_streaming_mode: 1
Stopping streaming mode...
Checking in_streaming_mode: 0

Info: /OSCI/SystemC: Simulation stopped by user.
```

You have now checked that the code can be compiled and run with full SME2 support, and are all set to move on to the next section.
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/sme2/3-vanilla-matmul.md b/content/learning-paths/cross-platform/sme2/3-vanilla-matmul.md
new file mode 100644
index 0000000000..e78e399bbc
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/3-vanilla-matmul.md
@@ -0,0 +1,80 @@
---
title: Vanilla matrix multiplication
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Vanilla matrix multiplication

In this section, you will learn about an example of standard matrix multiplication in C.

### Algorithm description

The vanilla matrix multiplication operation takes two input matrices, A [Ar
rows x Ac columns] and B [Br rows x Bc columns], to produce an output matrix C
[Cr rows x Cc columns]. The operation consists of iterating over each row of A
and each column of B, multiplying each element of the A row with its corresponding
element in the B column, then summing all these products, as Figure 2 shows.
![example image alt-text#center](matmul.png "Figure 2: Standard Matrix Multiplication.")

This implies that the A, B, and C matrices have some constraints on their
dimensions:

- A's number of columns must match B's number of rows: Ac == Br.
- C has the dimensions Cr == Ar and Cc == Bc.

You can learn more about matrix multiplication, including its history,
properties, and uses, by reading this [Wikipedia
article on Matrix Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication).

In this Learning Path, you will see the following variable names:

- ``matLeft`` corresponds to the left-hand side argument of the matrix
  multiplication.
- ``matRight`` corresponds to the right-hand side of the matrix multiplication.
- ``M`` is ``matLeft``'s number of rows.
- ``K`` is ``matLeft``'s number of columns (and ``matRight``'s number of rows).
- ``N`` is ``matRight``'s number of columns.
- ``matResult`` corresponds to the result of the matrix multiplication, with
  ``M`` rows and ``N`` columns.

### C implementation

A literal implementation of the textbook matrix multiplication algorithm, as
described above, can be found in file ``matmul_vanilla.c``:

```C
void matmul(uint64_t M, uint64_t K, uint64_t N,
            const float *restrict matLeft, const float *restrict matRight,
            float *restrict matResult) {
  for (uint64_t m = 0; m < M; m++) {
    for (uint64_t n = 0; n < N; n++) {

      float acc = 0.0;

      for (uint64_t k = 0; k < K; k++)
        acc += matLeft[m * K + k] * matRight[k * N + n];

      matResult[m * N + n] = acc;
    }
  }
}
```

In this Learning Path, the matrices are laid out in memory as contiguous
sequences of elements, in [Row-Major
Order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The
``matmul`` function performs the algorithm described above.
The pointers ``matLeft``, ``matRight`` and ``matResult`` have been annotated with
``restrict``, which informs the compiler that the memory areas designated by
those pointers do not alias. This means that they do not overlap in any way, so the
compiler does not need to insert extra instructions to deal with such cases.
The pointers ``matLeft`` and ``matRight`` are marked as ``const`` as neither of these two matrices is modified by ``matmul``.

You now have a reference standard matrix multiplication function. You will use it later
on in this Learning Path to ensure that the assembly version and the intrinsics
version of the multiplication algorithm do not contain errors.
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/sme2/4-outer-product.md b/content/learning-paths/cross-platform/sme2/4-outer-product.md
new file mode 100644
index 0000000000..a550a6d449
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/4-outer-product.md
@@ -0,0 +1,100 @@
---
title: Outer product
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Matrix multiplication with the outer product

In this section, you will learn how you can use the outer product with the SME engine to improve matrix multiplication.

In this standard matrix multiplication example, the core of the computation can be represented as:

```C
  acc += matLeft[m * K + k] * matRight[k * N + n];
```

This translates to one multiply-accumulate, also known as a ``macc``, for two loads (``matLeft[m * K + k]``
and ``matRight[k * N + n]``). It therefore has a 1:2 ``macc`` to ``load`` ratio.

From a memory system perspective, this is not efficient, especially since this
computation is done within a triple-nested loop, repeatedly loading data from
memory.

To exacerbate matters, large matrices might not fit in cache.
In order to improve the matrix multiplication efficiency, the goal is to increase the ``macc`` to ``load`` ratio, that is, to perform more multiply-accumulate operations per load.

Figure 3 below shows how the matrix multiplication of ``matLeft`` (3 rows, 2
columns) by ``matRight`` (2 rows, 3 columns) can be decomposed as the sum of
outer products:

![example image alt-text#center](outer_product.png "Figure 3: Outer Product-based Matrix Multiplication.")

The SME engine builds on the
[Outer Product](https://en.wikipedia.org/wiki/Outer_product), as matrix
multiplication can be expressed as the
[sum of column-by-row outer products](https://en.wikipedia.org/wiki/Outer_product#Connection_with_the_matrix_product).


## About transposition

From the previous page, you will recall that matrices are laid out in row-major
order. This means that loading row data from memory is efficient, as the memory
system operates efficiently with contiguous data. An example of this is that caches are loaded row by row, and data prefetching is simple: just load the data from ``current address + sizeof(data)``. This is not the case for loading column data from memory, which requires more work from the memory system.

In order to further improve the effectiveness of the matrix multiplication, it
is therefore desirable to change the in-memory layout of the left-hand side matrix, called ``matLeft`` in the code examples in this Learning Path. This essentially performs a matrix
transposition, so that row data is loaded from memory instead of column data.

{{% notice Important %}}
It is important to note here that this reorganizes the layout of the matrix in
memory in order for the algorithm implementation to be more efficient. The
transposition affects only the memory layout. ``matLeft`` is transformed to
column-major order, but from a mathematical perspective, ``matLeft`` is
*not* transposed.
{{% /notice %}}

### Transposition in the real world

In the same way that trees don't reach the sky, the SME engine has physical implementation limits. It operates with tiles in the ZA storage. Tiles are 2D portions of the matrices being processed. SME has dedicated instructions to load data to, and store data from, tiles efficiently, as well as instructions to operate with and on tiles, for example the [fmopa](https://developer.arm.com/documentation/ddi0602/latest/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en)
instruction, which takes two vectors as inputs and accumulates their outer
product into a 2D tile. The tile in ZA storage is what allows SME to increase the
``macc`` to ``load`` ratio, as all the tile elements are loaded to the tile, to
be used with the SME outer product instructions.

Because the ZA storage is finite, the desirable transposition
of the ``matLeft`` matrix that was discussed in the previous section needs to be
adapted to the tile dimensions, so that a tile is easy to access. The
``matLeft`` preprocessing thus has some aspects of a transposition, but takes
the tiling into account as well, and is referred to in the code as
``preprocess``.

Here is, at the algorithmic level, what ``preprocess_l`` does in practice:

```C
void preprocess_l(uint64_t nbr, uint64_t nbc, uint64_t SVL,
                  const float *restrict a, float *restrict a_mod) {

  // For all tiles of SVL x SVL data
  for (uint64_t By = 0; By < nbr; By += SVL) {
    for (uint64_t Bx = 0; Bx < nbc; Bx += SVL) {
      // For this tile
      const uint64_t dest = By * nbc + Bx * SVL;
      for (uint64_t j = 0; j < SVL; j++) {
        for (uint64_t i = 0; i < SVL && (Bx + i) < nbc; i++) {
          if (By + j < nbr) {
            a_mod[dest + i * SVL + j] = a[(By + j) * nbc + Bx + i];
          } else {
            // These elements are outside of matrix a, so zero them.
+ a_mod[dest + i * SVL + j] = 0.0; + } + } + } + } + } +} +``` + +``preprocess_l`` will be used to check the assembly and intrinsic versions of +the matrix multiplication perform the preprocessing step correctly. This code is +located in file ``preprocess_vanilla.c``. diff --git a/content/learning-paths/cross-platform/sme2/5-SME2-matmul-asm.md b/content/learning-paths/cross-platform/sme2/5-SME2-matmul-asm.md new file mode 100644 index 0000000000..9504eff06d --- /dev/null +++ b/content/learning-paths/cross-platform/sme2/5-SME2-matmul-asm.md @@ -0,0 +1,202 @@ +--- +title: SME2 assembly matrix multiplication +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- +## Matrix multiplication with SME2 in assembly + +In this chapter, you will use an SME2-optimized matrix multiplication written +directly in assembly. + +### Description + +This Learning Path reuses the assembly version provided in the [SME Programmer's +Guide](https://developer.arm.com/documentation/109246/0100/matmul-fp32--Single-precision-matrix-by-matrix-multiplication) +where you will find a high-level and an in-depth description of the two steps +performed. + +The assembly versions have been modified so they coexist nicely with +the intrinsic versions. In this Learning Path, the ``preprocess`` function is +defined in ``preprocess_l_asm.S`` and the outer product-based matrix +multiplication is found in ``matmul_asm_impl.S``. 
+ +These two functions have been stitched together in ``matmul_asm.c`` with the same prototype as the reference implementation of matrix multiplication, so that a top-level ``matmul_asm`` can +be called from the ``main`` function: + +```C +void matmul_asm(uint64_t M, uint64_t K, uint64_t N, + const float *restrict matLeft, const float *restrict matRight, + float *restrict matLeft_mod, float *restrict matResult) { + __asm volatile("" + : + : + : "p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", + "p10", "p11", "p12", "p13", "p14", "p15", "z0", "z1", "z2", + "z3", "z4", "z5", "z6", "z7", "z8", "z9", "z10", "z11", + "z12", "z13", "z14", "z15", "z16", "z17", "z18", "z19", + "z20", "z21", "z22", "z23", "z24", "z25", "z26", "z27", + "z28", "z29", "z30", "z31"); + + preprocess_l_asm(M, K, matLeft, matLeft_mod); + matmul_asm_impl(M, K, N, matLeft_mod, matRight, matResult); + + __asm volatile("" + : + : + : "p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", + "p10", "p11", "p12", "p13", "p14", "p15", "z0", "z1", "z2", + "z3", "z4", "z5", "z6", "z7", "z8", "z9", "z10", "z11", + "z12", "z13", "z14", "z15", "z16", "z17", "z18", "z19", + "z20", "z21", "z22", "z23", "z24", "z25", "z26", "z27", + "z28", "z29", "z30", "z31"); +} +``` + +Note here the use of the ``__asm`` statement forcing the compiler to save the SVE/SME registers. 
+ +The high-level ``matmul_asm`` function is called from ``main.c``: + +```C +#include "matmul.h" +#include "misc.h" + +#include +#include +#include +#include + +#ifndef __ARM_FEATURE_SME2 +#error __ARM_FEATURE_SME2 is not defined +#endif + +#ifndef IMPL +#error matmul implementation selection macro IMPL is not defined +#endif + +#define STRINGIFY_(I) #I +#define STRINGIFY(I) STRINGIFY_(I) +#define FN(M, I) M##I +#define MATMUL(I, M, K, N, mL, mR, mM, m) FN(matmul_, I)(M, K, N, mL, mR, mM, m) + +// Assumptions: +// nbr in matLeft (M): any +// nbc in matLeft, nbr in matRight (K): any K > 2 (for the asm version) +// nbc in matRight (N): any + +int main(int argc, char **argv) { + + /* Size parameters */ + uint64_t M, N, K; + if (argc >= 4) { + M = strtoul(argv[1], NULL, 0); + K = strtoul(argv[2], NULL, 0); + N = strtoul(argv[3], NULL, 0); + } else { + /* Default: 125x35x70 */ + M = 125; + K = 35; + N = 70; + } + + printf("\nSME2 Matrix Multiply fp32 *%s* example with args %lu %lu %lu\n", + STRINGIFY(IMPL), M, K, N); + + setup_sme(); + + const uint64_t SVL = svcntsw(); + + /* Calculate M of transformed matLeft. */ + const uint64_t M_mod = SVL * (M / SVL + (M % SVL != 0 ? 
1 : 0)); + + float *matRight = (float *)malloc(K * N * sizeof(float)); + + float *matLeft = (float *)malloc(M * K * sizeof(float)); + float *matLeft_mod = (float *)malloc(M_mod * K * sizeof(float)); + float *matLeft_mod_ref = (float *)malloc(M_mod * K * sizeof(float)); + + float *matResult = (float *)malloc(M * N * sizeof(float)); + float *matResult_ref = (float *)malloc(M * N * sizeof(float)); + +#ifdef DEBUG + initialize_matrix(matLeft, M * K, LINEAR_INIT); + initialize_matrix(matRight, K * N, LINEAR_INIT); + initialize_matrix(matLeft_mod, M_mod * K, DEAD_INIT); + initialize_matrix(matResult, M * N, DEAD_INIT); + + print_matrix(M, K, matLeft, "matLeft"); + print_matrix(K, N, matRight, "matRight"); +#else + initialize_matrix(matLeft, M * K, RANDOM_INIT); + initialize_matrix(matRight, K * N, RANDOM_INIT); +#endif + + MATMUL(IMPL, M, K, N, matLeft, matRight, matLeft_mod, matResult); + + // Compute the reference values with the vanilla implementations. + matmul(M, K, N, matLeft, matRight, matResult_ref); + preprocess_l(M, K, SVL, matLeft, matLeft_mod_ref); + + unsigned error = compare_matrices(K, M_mod, matLeft_mod_ref, matLeft_mod, + "Matrix preprocessing"); + if (!error) + error = compare_matrices(M, N, matResult_ref, matResult, + "Matrix multiplication"); + + free(matRight); + + free(matLeft); + free(matLeft_mod); + free(matLeft_mod_ref); + + free(matResult); + free(matResult_ref); + + return error ? EXIT_FAILURE : EXIT_SUCCESS; +} +``` + +The same ``main.c`` file is used for the assembly and intrinsic-based versions +of the matrix multiplication. It first sets the ``M``, ``K`` and ``N`` +parameters, to either the arguments supplied on the command line or uses the default +value. + +Depending on the ``M``, ``K``, ``N`` dimension parameters, ``main`` allocates memory for all the matrices and initializes ``matLeft`` and ``matRight`` with random data. The actual matrix multiplication implementation is provided through the ``IMPL`` macro. 
It then runs the matrix multiplication from ``IMPL`` and computes the reference values for the preprocessed matrix as well as the result matrix. Next, it compares the actual values to the reference values and reports errors, if there are any. Finally, all the memory is deallocated before exiting the program with a success or failure return code.

### Compile and run it

First, make sure that the ``sme2_matmul_asm`` executable is up-to-date:

```BASH
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 make sme2_matmul_asm
```

Then execute ``sme2_matmul_asm`` on the FVP:

```BASH
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 ./run-fvp.sh sme2_matmul_asm
```

The output should be something similar to:

```TXT
SME2 Matrix Multiply fp32 *asm* example with args 125 35 70
Matrix preprocessing: PASS !
Matrix multiplication: PASS !

Info: /OSCI/SystemC: Simulation stopped by user.
```

{{% notice Tip %}}
The example above uses the default values for the ``M`` (125), ``K`` (35), and ``N`` (70)
parameters. You can override this and provide your own values on the command line:

```BASH
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 ./run-fvp.sh sme2_matmul_asm 7 8 9
```

Here the values ``M=7``, ``K=8`` and ``N=9`` are used instead.
+{{% /notice %}}
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/sme2/6-SME2-matmul-intr.md b/content/learning-paths/cross-platform/sme2/6-SME2-matmul-intr.md
new file mode 100644
index 0000000000..413f54bf48
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/6-SME2-matmul-intr.md
@@ -0,0 +1,344 @@
+---
+title: SME2 intrinsics matrix multiplication
+weight: 8
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+## Matrix multiplication with SME2 intrinsics
+
+In this section, you will write an SME2 optimized matrix multiplication in C using the intrinsics that the compiler provides.
+
+*Intrinsics*, also known as *compiler intrinsics* or *intrinsic functions*, are functions available to application developers that the compiler has an
+intimate knowledge of. This enables the compiler to either translate the function to a specific instruction or to perform specific optimizations, or both.
+
+You can learn more about intrinsics in this [Wikipedia
+Article on Intrinsic Function](https://en.wikipedia.org/wiki/Intrinsic_function).
+
+Using intrinsics allows the programmer to use the specific instructions
+needed to achieve the required performance, while writing all the standard surrounding code, such as loops, in C. This produces performance close to what can be reached with hand-written assembly whilst being significantly more maintainable and portable.
+
+All Arm-specific intrinsics are specified in the
+[ACLE](https://github.com/ARM-software/acle), the Arm C Language Extensions. ACLE
+is supported by the main compilers, most notably [GCC](https://gcc.gnu.org/) and
+[Clang](https://clang.llvm.org).
+
+## Streaming mode
+
+On the previous page, assembly language provided the programmer with full access to processor features. However, this comes at the cost of increased complexity and maintenance, particularly when managing large codebases with deeply nested function calls.
Additionally, the assembly version operates at a very low level and does not fully handle the SME state.
+
+In real-world large-scale software, the program moves back and forth between streaming and non-streaming mode, and some streaming mode routines call other streaming mode routines, which means that some state, including the ZA storage, needs to be saved and restored. This handling is defined in the ACLE and
+supported by the compiler: the programmer *just* has to annotate the function
+with some keywords and set up some registers (see function ``setup_sme`` in
+``misc.c`` for an example). See [Introduction to streaming and non-streaming mode](https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode)
+for further information. The rest of this section references information from the ACLE.
+
+The AArch64 architecture defines a concept called *streaming mode*, controlled
+by a processor state bit called ``PSTATE.SM``. At any given point in time, the
+processor is either in streaming mode (``PSTATE.SM==1``) or in non-streaming mode
+(``PSTATE.SM==0``). The ``SMSTART`` instruction enters streaming mode
+and the ``SMSTOP`` instruction returns to non-streaming mode.
+
+Streaming mode has three main effects on C and C++ code:
+
+- It can change the length of SVE vectors and predicates: the length of an SVE
+  vector in streaming mode is called the “streaming vector length” (SVL), which
+  might be different from the normal non-streaming vector length. See
+  [Effect of streaming mode on VL](https://arm-software.github.io/acle/main/acle.html#effect-of-streaming-mode-on-vl)
+  for more details.
+- Some instructions can only be executed in streaming mode, which means that
+  their associated ACLE intrinsics can only be used in streaming mode. These
+  intrinsics are called “streaming intrinsics”.
+- Some other instructions can only be executed in non-streaming mode, which + means that their associated ACLE intrinsics can only be used in non-streaming + mode. These intrinsics are called “non-streaming intrinsics”. + +The C and C++ standards define the behavior of programs in terms of an *abstract +machine*. As an extension, the ACLE specification applies the distinction +between streaming mode and non-streaming mode to this abstract machine: at any +given point in time, the abstract machine is either in streaming mode or in +non-streaming mode. + +This distinction between processor mode and abstract machine mode is mostly just +a specification detail. However, the usual “as if” rule applies: the +processor's actual mode at runtime can be different from the abstract machine's +mode, provided that this does not alter the behavior of the program. One +practical consequence of this is that C and C++ code does not specify the exact +placement of ``SMSTART`` and ``SMSTOP`` instructions; the source code simply places +limits on where such instructions go. For example, when stepping through a +program in a debugger, the processor mode might sometimes be different from the +one implied by the source code. + +ACLE provides attributes that specify whether the abstract machine executes statements: + +- In non-streaming mode, in which case they are called *non-streaming statements*. +- In streaming mode, in which case they are called *streaming statements*. +- In either mode, in which case they are called *streaming-compatible statements*. + +SME provides an area of storage called ZA, of size ``SVL.B`` x ``SVL.B`` bytes. It +also provides a processor state bit called ``PSTATE.ZA`` to control whether ZA +is enabled. + +In C and C++ code, access to ZA is controlled at function granularity: a +function either uses ZA or it does not. Another way to say this is that a +function either “has ZA state” or it does not. 
+
+If a function does have ZA state, the function can either share that ZA state
+with the function's caller or create new ZA state “from scratch”. In the latter
+case, it is the compiler's responsibility to free up ZA so that the function can
+use it; see the description of the lazy saving scheme in
+[AAPCS64](https://arm-software.github.io/acle/main/acle.html#AAPCS64) for details
+about how the compiler does this.
+
+## Implementation
+
+Here again, a top-level function named ``matmul_intr`` in ``matmul_intr.c``
+stitches together the preprocessing and the multiplication:
+
+```C "{ line_numbers = true }"
+__arm_new("za") __arm_locally_streaming void matmul_intr(
+    uint64_t M, uint64_t K, uint64_t N, const float *restrict matLeft,
+    const float *restrict matRight, float *restrict matLeft_mod,
+    float *restrict matResult) {
+  uint64_t SVL = svcntsw();
+  preprocess_l_intr(M, K, SVL, matLeft, matLeft_mod);
+  matmul_intr_impl(M, K, N, SVL, matLeft_mod, matRight, matResult);
+}
+```
+
+Note the ``__arm_new("za")`` and ``__arm_locally_streaming`` keywords at line 1: they make
+the compiler save the ZA storage, so the function can use it without destroying
+content that might still be in use by one of its callers.
+
+``SVL``, the dimension of the ZA storage, is requested from the underlying
+hardware with the ``svcntsw()`` function call at line 5, and passed down to the
+``preprocess_l_intr`` and ``matmul_intr_impl`` functions. ``svcntsw()`` is a
+function provided by the ACLE.
+
+### Matrix preprocessing
+
+```C "{ line_numbers = true }"
+void preprocess_l_intr(
+    uint64_t M, uint64_t K, uint64_t SVL, const float *restrict a,
+    float *restrict a_mod) __arm_streaming __arm_inout("za") {
+  const uint64_t M_mod = SVL * (M / SVL + (M % SVL != 0 ? 1 : 0));
+
+  // The outer loop, iterating over rows (M dimension)
+  for (uint64_t row = 0; row < M; row += SVL) {
+
+    svbool_t pMDim = svwhilelt_b32(row, M);
+
+    // The inner loop, iterating on columns (K dimension).
+ for (uint64_t col = 0; col < K; col += 2 * SVL) { + + svcount_t pKDim = svwhilelt_c32(col, K, 2); + + // Load-as-rows + for (uint64_t trow = 0; trow < SVL; trow += 4) { + svcount_t p0 = svpsel_lane_c32(pKDim, pMDim, trow + 0); + svcount_t p1 = svpsel_lane_c32(pKDim, pMDim, trow + 1); + svcount_t p2 = svpsel_lane_c32(pKDim, pMDim, trow + 2); + svcount_t p3 = svpsel_lane_c32(pKDim, pMDim, trow + 3); + + const uint64_t tile_UL_corner = (row + trow) * K + col; + svfloat32x2_t zp0 = svld1_x2(p0, &a[tile_UL_corner + 0 * K]); + svfloat32x2_t zp1 = svld1_x2(p1, &a[tile_UL_corner + 1 * K]); + svfloat32x2_t zp2 = svld1_x2(p2, &a[tile_UL_corner + 2 * K]); + svfloat32x2_t zp3 = svld1_x2(p3, &a[tile_UL_corner + 3 * K]); + + svfloat32x4_t zq0 = svcreate4(svget2(zp0, 0), svget2(zp1, 0), + svget2(zp2, 0), svget2(zp3, 0)); + svfloat32x4_t zq1 = svcreate4(svget2(zp0, 1), svget2(zp1, 1), + svget2(zp2, 1), svget2(zp3, 1)); + svwrite_hor_za32_f32_vg4( + /* tile: */ 0, /* slice: */ trow, zq0); + svwrite_hor_za32_f32_vg4( + /* tile: */ 1, /* slice: */ trow, zq1); + } + + // Read-as-columns and store + const uint64_t dest_0 = row * K + col * SVL; + const uint64_t dest_1 = dest_0 + SVL * SVL; + for (uint64_t tcol = 0; tcol < SVL; tcol += 4) { + svcount_t p0 = svwhilelt_c32(dest_0 + tcol * SVL, K * M_mod, 4); + svcount_t p1 = svwhilelt_c32(dest_1 + tcol * SVL, K * M_mod, 4); + svfloat32x4_t zq0 = + svread_ver_za32_f32_vg4(/* tile: */ 0, /* slice: */ tcol); + svfloat32x4_t zq1 = + svread_ver_za32_f32_vg4(/* tile: */ 1, /* slice: */ tcol); + svst1(p0, &a_mod[dest_0 + tcol * SVL], zq0); + svst1(p1, &a_mod[dest_1 + tcol * SVL], zq1); + } + } + } +} +``` + +Note that ``preprocess_l_intr`` has been annotated at line 3 with: + +- ``__arm_streaming``, because this function is using streaming instructions, + +- ``__arm_inout("za")``, because ``preprocess_l_intr`` reuses the ZA storage + from its caller. 
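Before walking through the loops in detail, it can help to see the net effect of this transformation expressed in plain, scalar C. The sketch below is illustrative only (``preprocess_model`` is not part of the project sources): within each block of ``SVL`` rows, it makes the ``SVL`` elements of every column contiguous, zero-padding past row ``M``.

```c
#include <stddef.h>

/* Scalar model (illustration only) of what the SME2 preprocessing
 * computes: within each block of SVL rows of the row-major M x K matrix
 * `a`, the SVL elements of every column k become contiguous in `a_mod`,
 * with zero padding for rows past M. After this step,
 * &a_mod[block_base + k * SVL] is exactly the slice of column k that a
 * single vector load feeds into the outer-product accumulation. */
static void preprocess_model(size_t M, size_t K, size_t SVL,
                             const float *a, float *a_mod) {
  size_t M_mod = SVL * (M / SVL + (M % SVL != 0 ? 1 : 0));
  for (size_t r = 0; r < M_mod; r += SVL)   /* one block of SVL rows */
    for (size_t k = 0; k < K; k++)          /* one column at a time  */
      for (size_t i = 0; i < SVL; i++)      /* SVL lanes of column k */
        a_mod[r * K + k * SVL + i] =
            (r + i < M) ? a[(r + i) * K + k] : 0.0f;
}
```

The SME2 version below computes the same layout, but an entire ``SVL x SVL`` tile at a time, using the ZA storage to do the row-to-column transposition.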
+
+The matrix preprocessing is performed in a doubly nested loop, over the ``M``
+(line 7) and ``K`` (line 12) dimensions of the input matrix ``a``. Both loops
+have an ``SVL`` step increment, which corresponds to the horizontal and vertical
+dimensions of the ZA storage that will be used. The dimensions of ``a`` may not
+be perfect multiples of ``SVL``, though, which is why the predicates ``pMDim``
+(line 9) and ``pKDim`` (line 14) are computed, in order to know which rows
+(respectively columns) are valid.
+
+The core of ``preprocess_l_intr`` is made of two parts:
+
+- Lines 17 - 37: load the matrix tile as rows. In this part, loop unrolling has
+  been used at two different levels. At the lowest level, four rows are loaded at
+  a time (lines 24-27). But it goes further: because SME2 has multi-vector
+  operations (hence the ``svld1_x2`` intrinsic, which loads 2 consecutive rows
+  into 2 vector registers), the function also loads the consecutive row, which
+  happens to be the row from the neighboring tile on the right. This means two
+  tiles are processed at once. At lines 29-32, the pairs of vector registers are
+  rearranged into quads of vector registers so they can be stored horizontally in
+  the two tiles' ZA storage at lines 33-36 with the ``svwrite_hor_za32_f32_vg4``
+  intrinsic. Of course, as the input matrix may not have dimensions that are
+  perfect multiples of ``SVL``, the ``p0``, ``p1``, ``p2`` and ``p3`` predicates
+  are computed with the ``svpsel_lane_c32`` intrinsic (lines 18-21) so that
+  elements outside of the input matrix are set to 0 when they are loaded at
+  lines 24-27.
+
+- Lines 39 - 51: read the matrix tile as columns and store them.
Now that the two
+  tiles have been loaded *horizontally*, they are read *vertically* with the
+  ``svread_ver_za32_f32_vg4`` intrinsic into quads of vector registers (``zq0``
+  and ``zq1``) at lines 45-48, and then stored with the ``svst1`` intrinsic to
+  the relevant location in the destination matrix ``a_mod`` (lines 49-50). Note
+  again the usage of the predicates ``p0`` and ``p1`` (computed at lines 43-44),
+  passed to ``svst1`` to prevent writing out of the matrix bounds.
+
+Using intrinsics simplifies function development significantly, provided one has a good understanding of the SME2 instruction set.
+Predicates, which are fundamental to SVE and SME, enable a natural expression of algorithms while handling corner cases efficiently.
+Notably, there is no explicit condition checking within the loops to account for rows or columns extending beyond matrix bounds.
+
+### Outer-product multiplication
+
+```C "{ line_numbers = true }"
+void matmul_intr_impl(
+    uint64_t M, uint64_t K, uint64_t N, uint64_t SVL,
+    const float *restrict matLeft_mod, const float *restrict matRight,
+    float *restrict matResult) __arm_streaming __arm_inout("za") {
+
+  // Build the result matrix tile by tile.
+  for (uint64_t row = 0; row < M; row += SVL) {
+
+    svbool_t pMDim = svwhilelt_b32(row, M);
+
+    for (uint64_t col = 0; col < N; col += SVL) {
+
+      svbool_t pNDim = svwhilelt_b32(col, N);
+
+      // Outer product + accumulation
+      svzero_za();
+      const uint64_t matLeft_pos = row * K;
+      const uint64_t matRight_UL_corner = col;
+      for (uint64_t k = 0; k < K; k++) {
+        svfloat32_t zL =
+            svld1(pMDim, &matLeft_mod[matLeft_pos + k * SVL]);
+        svfloat32_t zR =
+            svld1(pNDim, &matRight[matRight_UL_corner + k * N]);
+        svmopa_za32_m(0, pMDim, pNDim, zL, zR);
+      }
+
+      // Store ZA to matResult.
+      const uint64_t result_tile_UL_corner = row * N + col;
+      for (uint64_t trow = 0; trow < SVL && row + trow < M; trow += 4) {
+        svbool_t p0 = svpsel_lane_b32(pNDim, pMDim, row + trow + 0);
+        svbool_t p1 = svpsel_lane_b32(pNDim, pMDim, row + trow + 1);
+        svbool_t p2 = svpsel_lane_b32(pNDim, pMDim, row + trow + 2);
+        svbool_t p3 = svpsel_lane_b32(pNDim, pMDim, row + trow + 3);
+
+        svst1_hor_za32(
+            /* tile: */ 0, /* slice: */ trow + 0, p0,
+            &matResult[result_tile_UL_corner + (trow + 0) * N]);
+        svst1_hor_za32(
+            /* tile: */ 0, /* slice: */ trow + 1, p1,
+            &matResult[result_tile_UL_corner + (trow + 1) * N]);
+        svst1_hor_za32(
+            /* tile: */ 0, /* slice: */ trow + 2, p2,
+            &matResult[result_tile_UL_corner + (trow + 2) * N]);
+        svst1_hor_za32(
+            /* tile: */ 0, /* slice: */ trow + 3, p3,
+            &matResult[result_tile_UL_corner + (trow + 3) * N]);
+      }
+    }
+  }
+}
+```
+
+Note again that the ``matmul_intr_impl`` function has been annotated at line 4 with:
+
+- ``__arm_streaming``, because the function is using streaming instructions,
+
+- ``__arm_inout("za")``, because the function reuses the ZA storage from its caller.
+
+The multiplication with the outer product is performed in a doubly nested loop,
+over the ``M`` (line 7) and ``N`` (line 11) dimensions of the input matrices
+``matLeft_mod`` and ``matRight``. Both loops have an ``SVL`` step increment,
+which corresponds to the horizontal and vertical dimensions of the ZA storage,
+as one tile is processed at a time. The ``M`` and ``N``
+dimensions of the inputs may not be perfect multiples of ``SVL``, so the
+predicates ``pMDim`` (line 9) and ``pNDim`` (line 13) are computed in order
+to know which rows (respectively columns) are valid.
+
+The core of the multiplication is done in two parts:
+
+- Outer-product and accumulation at lines 15-25. As ``matLeft`` has been
+  laid out perfectly in memory with ``preprocess_l_intr``, this part becomes
+  straightforward.
First, the tile is zeroed with the ``svzero_za`` intrinsic
+  at line 16 so the outer products can be accumulated in the tile. The outer
+  products are computed and accumulated over the common ``K`` dimension with
+  the loop at line 19: the column of ``matLeft_mod`` and the row of ``matRight``
+  are loaded with the ``svld1`` intrinsic at lines 20-23 into vector registers
+  ``zL`` and ``zR``, which are then used at line 24 with the ``svmopa_za32_m``
+  intrinsic to perform the outer product and accumulation (to tile 0). This
+  is exactly what was shown in Figure 2 earlier in the Learning Path.
+  Note again the usage of the ``pMDim`` and ``pNDim`` predicates to deal
+  correctly with the rows and columns, respectively, that are out of bounds.
+
+- Storing of the result matrix at lines 27-46. The previous section computed the matrix multiplication result for the current tile, which now needs
+  to be written back to memory. This is done with the loop at line 29, which
+  iterates over all rows of the tile: the ``svst1_hor_za32`` intrinsic at lines
+  35-46 stores directly from the tile to memory. Note that the loop has been
+  unrolled by a factor of 4, hence the ``trow += 4`` increment at line 29 and the
+  four ``svst1_hor_za32`` calls. Again, the ``pMDim`` and ``pNDim`` predicates deal
+  gracefully with the parts of the tile that are out of bounds for the
+  destination matrix ``matResult``.
+
+Once again, intrinsics make it easy to fully leverage SME2, provided you have a solid understanding of its available instructions.
+Predicates handle corner cases elegantly, ensuring robust execution. Most importantly, the code adapts to different SVL values across various hardware implementations without requiring recompilation.
+This follows the key principle of compile-once, run-everywhere, allowing systems with larger SVL to execute computations more efficiently while using the same binary.
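The outer-product formulation itself is independent of SME2 and can be written in plain scalar C. The sketch below is for illustration (``matmul_outer`` is not part of the project sources); a single ``SVL x SVL`` ``FMOPA`` instruction performs one tile's worth of the two inner loops at once.

```c
#include <stddef.h>
#include <string.h>

/* Scalar model of the outer-product formulation: the M x N result is the
 * sum over k of the outer product of column k of the left matrix with
 * row k of the right matrix. All matrices are row-major. */
static void matmul_outer(size_t M, size_t K, size_t N, const float *left,
                         const float *right, float *result) {
  memset(result, 0, M * N * sizeof *result);  /* like svzero_za() */
  for (size_t k = 0; k < K; k++)     /* one rank-1 update per k ...  */
    for (size_t i = 0; i < M; i++)   /* ... accumulated over the     */
      for (size_t j = 0; j < N; j++) /* whole M x N result           */
        result[i * N + j] += left[i * K + k] * right[k * N + j];
}
```

Reordering the loops this way, with ``k`` outermost, is exactly what makes each step of the ``K`` loop map onto one ``FMOPA`` per output tile.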
+
+### Compile and run
+
+The main function is exactly the same as the one used for the assembly version,
+with the ``IMPL`` macro defined to be ``intr`` in the ``Makefile``.
+
+First, make sure that the ``sme2_matmul_intr`` executable is up-to-date:
+
+```BASH
+docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 make sme2_matmul_intr
+```
+
+Then execute ``sme2_matmul_intr`` on the FVP:
+
+```BASH
+docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 ./run-fvp.sh sme2_matmul_intr
+```
+
+This should output something similar to:
+
+```TXT
+SME2 Matrix Multiply fp32 *intr* example with args 125 35 70
+Matrix preprocessing: PASS !
+Matrix multiplication: PASS !
+
+Info: /OSCI/SystemC: Simulation stopped by user.
+```
diff --git a/content/learning-paths/cross-platform/sme2/7-debugging.md b/content/learning-paths/cross-platform/sme2/7-debugging.md
new file mode 100644
index 0000000000..3d490bd8af
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/7-debugging.md
@@ -0,0 +1,110 @@
+---
+title: Debugging
+weight: 9
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Debugging
+
+### Looking at the generated code
+
+In some cases, it is useful to look at the code generated by the compiler. In this Learning Path, the assembly listings have been produced for you, and you can
+inspect them.
+
+For example, the inner loop of the intrinsics-based matrix multiplication (the outer product and accumulation) looks like this in the listing file ``sme2_matmul_intr.lst``:
+
+```TXT
+...
+80001854: a540a1c0     ld1w    { z0.s }, p0/z, [x14]
+80001858: a540a661     ld1w    { z1.s }, p1/z, [x19]
+8000185c: f10006b5     subs    x21, x21, #0x1
+80001860: 8b0d0273     add     x19, x19, x13
+80001864: 8b0a01ce     add     x14, x14, x10
+80001868: 80812000     fmopa   za0.s, p0/m, p1/m, z0.s, z1.s
+8000186c: 54ffff41     b.ne    0x80001854
+...
+```
+
+### With debuggers
+
+Both of the main debuggers, ``gdb`` and ``lldb``, have some support for debugging SME2 code. Their usage is not shown in this Learning Path though, mainly
+because this Learning Path focuses on the CPU in *baremetal* mode.
+
+This is a simple, minimal environment, without an operating system, for example. Debugging in this mode requires a debug monitor to interface between the debugger, the program, and the CPU.
+
+### With trace
+
+The FVP can emit an instruction trace file in text format, known as the Tarmac trace. This provides a convenient way for you to understand what the program is doing.
+
+In the excerpt shown below, you can see that the SVE register ``z0`` has been loaded with 16 values, as predicate ``p0`` was all true, with an ``LD1W``
+instruction, whereas ``z1`` was loaded with only two values, as only the first two lanes of ``p1`` were active. ``z0`` and ``z1`` are later used by the ``fmopa`` instruction to compute the
+outer product, and the trace displays the content of the ZA storage.
+ +```TXT +923530000 ps IT (92353) 80001b08 a540a1a0 O EL3h_s : LD1W {z0.S},p0/Z,[x13] +923530000 ps MR4 81000868:000081000868 40000000 +923530000 ps MR4 8100086c:00008100086c 40800000 +923530000 ps MR4 81000870:000081000870 40c00000 +923530000 ps MR4 81000874:000081000874 41000000 +923530000 ps MR4 81000878:000081000878 41200000 +923530000 ps MR4 8100087c:00008100087c 41400000 +923530000 ps MR4 81000880:000081000880 41600000 +923530000 ps MR4 81000884:000081000884 41800000 +923530000 ps MR4 81000888:000081000888 41900000 +923530000 ps MR4 8100088c:00008100088c 41a00000 +923530000 ps MR4 81000890:000081000890 41b00000 +923530000 ps MR4 81000894:000081000894 41c00000 +923530000 ps MR4 81000898:000081000898 41d00000 +923530000 ps MR4 8100089c:00008100089c 41e00000 +923530000 ps MR4 810008a0:0000810008a0 41f00000 +923530000 ps MR4 810008a4:0000810008a4 42000000 +923530000 ps R Z0 42000000_41f00000_41e00000_41d00000_41c00000_41b00000_41a00000_41900000_41800000_41600000_41400000_41200000_41000000_40c00000_40800000_40000000 +923540000 ps IT (92354) 80001b0c a540a441 O EL3h_s : LD1W {z1.S},p1/Z,[x2] +923540000 ps MR4 81000780:000081000780 42027ae1 +923540000 ps MR4 81000784:000081000784 c16b5c29 +923540000 ps R Z1 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_c16b5c29_42027ae1 +923550000 ps IT (92355) 80001b10 f1000484 O EL3h_s : SUBS x4,x4,#1 +923550000 ps R cpsr 600003cd +923550000 ps R X4 0000000000000000 +923560000 ps IT (92356) 80001b14 8b0a0042 O EL3h_s : ADD x2,x2,x10 +923560000 ps R X2 0000000081000788 +923570000 ps IT (92357) 80001b18 8b1701ad O EL3h_s : ADD x13,x13,x23 +923570000 ps R X13 00000000810008A8 +923580000 ps IT (92358) 80001b1c 80812000 O EL3h_s : FMOPA ZA0.S,p0/M,p1/M,z0.S,z1.S +923580000 ps R ZA0H_S_0 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4190147b_42bd23d7 +923580000 ps R ZA0H_S_1 
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_42a6e668_435a7852 +923580000 ps R ZA0H_S_2 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4314e3d7_43ab2f5c +923580000 ps R ZA0H_S_3 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4356547c_43e92290 +923580000 ps R ZA0H_S_4 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_438be28f_44138ae2 +923580000 ps R ZA0H_S_5 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_43ac9ae1_4432847b +923580000 ps R ZA0H_S_6 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_43cd5334_44517e15 +923580000 ps R ZA0H_S_7 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_43ee0b86_447077ae +923580000 ps R ZA0H_S_8 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_440761eb_4487b8a4 +923580000 ps R ZA0H_S_9 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4417be14_44973571 +923580000 ps R ZA0H_S_10 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_44281a3e_44a6b23e +923580000 ps R ZA0H_S_11 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_44387667_44b62f0a +923580000 ps R ZA0H_S_12 
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4448d28f_44c5abd7 +923580000 ps R ZA0H_S_13 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_44592eb8_44d528a4 +923580000 ps R ZA0H_S_14 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_44698ae1_44e4a571 +923580000 ps R ZA0H_S_15 00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_4479e70a_44f4223e +``` + +You can get a Tarmac trace when invoking ``run-fvp.sh`` by adding the ``--trace`` option as the *first* argument, for example: + +```BASH +docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 ./run-fvp.sh --trace sme2_matmul_asm +``` + +Tracing is not enabled by default. It slows down the simulation significantly and the trace file can become very large for programs with large matrices. + +{{% notice Debugging tip %}} +It can be helpful when debugging to understand where an element in the +Tile is coming from. The current code base allows you to do that in ``debug`` +mode, when ``-DDEBUG`` is passed to the compiler in the ``Makefile``. If you +look into ``main.c``, you will notice that the matrix initialization is no +longer random, but instead initializes each element with its linear +index. This makes it *easier* to find where the matrix elements are loaded in +the tile in tarmac trace, for example. 
+{{% /notice %}}
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/sme2/8-going-further.md b/content/learning-paths/cross-platform/sme2/8-going-further.md
new file mode 100644
index 0000000000..be2ac04bad
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/8-going-further.md
@@ -0,0 +1,51 @@
+---
+title: Going further
+weight: 10
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Generalize the algorithms
+
+In this Learning Path, you focused on using SME2 for matrix
+multiplication with floating point numbers. However, in practice, any library or framework supporting matrix multiplication should
+also handle various integer types.
+
+You can see that the algorithm structure for matrix preprocessing, as well
+as multiplication with the outer product, does not change at all for other data
+types: only the element types and the corresponding instructions need to be adapted.
+
+This is suitable for languages with [generic
+programming](https://en.wikipedia.org/wiki/Generic_programming) like C++ with
+templates. You can even make the template manage a case where the value
+accumulated during the product uses a larger type than the input matrices. SME2 has the instructions to deal efficiently with this common scenario.
+
+This enables the library developer to focus on the algorithm, testing, and optimizations, while allowing the compiler to generate multiple variants.
+
+## Unroll further
+
+You might have noticed that ``matmul_intr_impl`` computes only one tile at a time, for the sake of simplicity.
+
+SME2 does support multi-vector instructions, and some were used in ``preprocess_l_intr``, for example, ``svld1_x2``.
+
+Loading two vectors at a time enables the simultaneous computing of more tiles, and as the input matrices have been laid out in memory in a neat way, the consecutive
+loading of the data is efficient. Implementing this approach improves the multiply-accumulate (``macc``) to load ratio.
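To see why, you can count the work per iteration of the ``K`` loop: a ``t x t`` block of output tiles needs ``2 * t`` vector loads (``t`` slices of each input) and performs ``t * t`` outer products. The sketch below is a back-of-the-envelope model; the lane count used afterwards is an assumption for illustration only.

```c
/* Multiply-accumulates per vector load when computing a t x t block of
 * output tiles, with `lanes` fp32 elements per vector: each K-loop
 * iteration issues 2*t vector loads (t slices of each input matrix) and
 * t*t FMOPAs, each performing lanes*lanes multiply-accumulates. */
static unsigned maccs_per_load(unsigned t, unsigned lanes) {
  return (t * t * lanes * lanes) / (2 * t);
}
```

Assuming 16 fp32 lanes (a 512-bit SVL, purely an example value), a single tile achieves 128 maccs per load, while a 2x2 block of tiles achieves 256: unrolling to ``t`` tiles per dimension scales the ratio linearly with ``t``.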
+
+In order to check your understanding of SME2, you can try to implement this unrolling yourself. You can check your work by comparing your results to the expected
+reference values.
+
+## Apply strategies
+
+One optimization method is to use different strategies depending on the matrices' dimensions. This is especially easy to set up when working in C or C++,
+rather than directly in assembly language.
+
+By playing with the mathematical properties of matrix multiplication and the outer product, it is possible to minimize data movement as well as reduce the overall number of operations to perform.
+
+For example, it is common that one of the matrices is actually a vector, meaning that it has a single row or column, and then it becomes advantageous to transpose it. Can you see why?
+
+The answer is that, as the elements are stored contiguously in memory, ``Nx1`` and ``1xN`` matrices have exactly the same memory layout. The transposition becomes a no-op, and the matrix elements stay in the same place in memory.
+
+An even more *degenerate* case that is easy to manage is when one of the matrices is essentially a scalar, that is, a matrix with one row and one column.
+
+Although the current code handles it correctly from a results point of view, a different algorithm and use of instructions might be more efficient. Can you think of another way?
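The claim above, that ``Nx1`` and ``1xN`` matrices occupy identical memory, can be checked directly in a few lines of C (an illustration, not part of the project sources):

```c
#include <string.h>

/* Returns 1 when an N x 1 column vector and a 1 x N row vector holding
 * the same values occupy identical bytes, i.e. transposing one into the
 * other is a no-op for row-major storage. */
static int transpose_is_noop(void) {
  float col[5][1] = {{1}, {2}, {3}, {4}, {5}}; /* 5 x 1, row-major */
  float row[1][5] = {{1, 2, 3, 4, 5}};         /* 1 x 5, row-major */
  return memcmp(col, row, sizeof col) == 0;
}
```

Both arrays are five contiguous ``float`` values, so ``memcmp`` reports them as byte-for-byte identical: only the *interpretation* of the dimensions changes.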
diff --git a/content/learning-paths/cross-platform/sme2/VSCode.png b/content/learning-paths/cross-platform/sme2/VSCode.png new file mode 100644 index 0000000000..b43c0030d1 Binary files /dev/null and b/content/learning-paths/cross-platform/sme2/VSCode.png differ diff --git a/content/learning-paths/cross-platform/sme2/_index.md b/content/learning-paths/cross-platform/sme2/_index.md new file mode 100644 index 0000000000..8763bbba6e --- /dev/null +++ b/content/learning-paths/cross-platform/sme2/_index.md @@ -0,0 +1,49 @@ +--- +title: Accelerate Matrix Multiplication Performance with SME2 + +minutes_to_complete: 30 + +who_is_this_for: This Learning Path is an advanced topic for developers who want to learn about accelerating the performance of matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2). + +learning_objectives: + - Implement a reference matrix multiplication without using SME2. + - Use SME2 assembly instructions to improve the matrix multiplication performance. + - Use SME2 intrinsics to improve the matrix multiplication performance using the C programming language. + - Compile and run code with SME2 instructions. + +prerequisites: + - Basic knowledge of Arm's Scalable Matrix Extension (SME). + - Basic knowledge of Arm's Scalable Vector Extension (SVE). + - An intermediate understanding of C programming language and assembly language. + - A computer running Linux, MacOS, or Windows. + - Installations of Git and Docker. + - An emulator to run code with SME2 instructions. + - A compiler with support for SME2 instructions. 
+
+
+author_primary: Arnaud de Grandmaison
+
+
+### Tags
+skilllevels: Advanced
+subjects: Performance and Architecture
+armips:
+    - Neoverse
+    - Cortex-A
+tools_software_languages:
+    - C
+    - Clang
+operatingsystems:
+    - Linux, MacOS or Windows
+shared_path: true
+shared_between:
+    - servers-and-cloud-computing
+    - laptops-and-desktops
+    - smartphones-and-mobile
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/cross-platform/sme2/_next-steps.md b/content/learning-paths/cross-platform/sme2/_next-steps.md
new file mode 100644
index 0000000000..514cabf9c7
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/_next-steps.md
@@ -0,0 +1,64 @@
+---
+next_step_guidance: PLACEHOLDER TEXT 1
+
+recommended_path: /learning-paths/PLACEHOLDER_CATEGORY/PLACEHOLDER_LEARNING_PATH/
+
+further_reading:
+
+    - resource:
+        title: SVE Programming Examples
+        link: https://developer.arm.com/documentation/dai0548/latest/
+        type: documentation
+
+    - resource:
+        title: Port Code to Arm Scalable Vector Extension (SVE)
+        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/sve
+        type: website
+
+    - resource:
+        title: Arm Scalable Matrix Extension (SME) Introduction (Part 1)
+        link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
+        type: blog
+
+    - resource:
+        title: Introducing the Scalable Matrix Extension for the Armv9-A Architecture
+        link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture
+        type: website
+
+    - resource:
+        title: Arm Scalable Matrix Extension (SME) Introduction (Part 2)
+        link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
+        type: blog
+
+    - resource:
+        title: SME Programmer’s Guide
+        link: https://developer.arm.com/documentation/109246/latest
+        type: documentation
+
+    - resource:
+        title: Matrix Multiplication
+        link: https://en.wikipedia.org/wiki/Matrix_multiplication
+        type: website
+
+    - resource:
+        title: Compiler Intrinsics
+        link: https://en.wikipedia.org/wiki/Intrinsic_function
+        type: website
+
+    - resource:
+        title: ACLE --- Arm C Language Extensions
+        link: https://github.com/ARM-software/acle
+        type: website
+
+    - resource:
+        title: Application Binary Interface for the Arm Architecture
+        link: https://github.com/ARM-software/abi-aa
+        type: website
+
+# ================================================================================
+# FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 21                  # set to always be larger than the content in this path, and one more than 'review'
+title: "Next Steps"         # Always the same
+layout: "learningpathall"   # All files under learning paths have this same wrapper
+---
diff --git a/content/learning-paths/cross-platform/sme2/_review.md b/content/learning-paths/cross-platform/sme2/_review.md
new file mode 100644
index 0000000000..49682188ae
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/_review.md
@@ -0,0 +1,45 @@
+---
+review:
+    - questions:
+        question: >
+            How does SME2 accelerate matrix multiplication?
+        answers:
+            - The matrix multiplication operation is a sum of outer products.
+            - Quantum physics.
+        correct_answer: 1
+        explanation: >
+            The matrix multiplication operation can be expressed as a sum of outer products,
+            which allows the SME engine to perform many multiplications at once.
+
+    - questions:
+        question: >
+            Why is the ZA storage so important for SME2?
+        answers:
+            - It is infinite.
+            - It holds a 2D view of matrices.
+        correct_answer: 2
+        explanation: >
+            The ZA storage offers a 2D view of part of a matrix, also known as a tile. SME can operate
+            on complete tiles, or on horizontal or vertical slices of tiles, which is a useful
+            and often-used feature in numerous algorithms. ZA storage is finite, with dimensions
+            (SVL/8) x (SVL/8) bytes, where SVL is the Streaming Vector Length.
+
+    - questions:
+        question: >
+            What are predicates?
+        answers:
+            - Parts of a sentence or clause containing a verb and stating something about the subject.
+            - Predicates select the active lanes in a vector operation.
+            - Predicates are another word for flags from the Processor Status Register (PSR).
+        correct_answer: 2
+        explanation: >
+            SVE is a predicate-centric architecture. Predicates enable Vector Length Agnosticism (VLA),
+            support complex nested conditions and loops, and reduce vector-loop management overhead
+            by allowing lane predication in vector operations. Predicates have their own dedicated registers.
+
+
+
+# ================================================================================
+# FIXED, DO NOT MODIFY
+# ================================================================================
+title: "Review"             # Always the same title
+weight: 20                  # Set to always be larger than the content in this path
+layout: "learningpathall"   # All files under learning paths have this same wrapper
+---
diff --git a/content/learning-paths/cross-platform/sme2/matmul.png b/content/learning-paths/cross-platform/sme2/matmul.png
new file mode 100644
index 0000000000..8a45b76307
Binary files /dev/null and b/content/learning-paths/cross-platform/sme2/matmul.png differ
diff --git a/content/learning-paths/cross-platform/sme2/outer_product.png b/content/learning-paths/cross-platform/sme2/outer_product.png
new file mode 100644
index 0000000000..ed163f703d
Binary files /dev/null and b/content/learning-paths/cross-platform/sme2/outer_product.png differ
diff --git a/content/learning-paths/cross-platform/sme2/overview.md b/content/learning-paths/cross-platform/sme2/overview.md
new file mode 100644
index 0000000000..db395e1ff9
--- /dev/null
+++ b/content/learning-paths/cross-platform/sme2/overview.md
@@ -0,0 +1,40 @@
+---
+title: Overview
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Overview of Arm's Scalable Matrix Extension Version 2
+
+### What is SME2?
+
+The Scalable Matrix Extension (SME) is an extension to the Armv9-A architecture. The Scalable Matrix Extension Version 2 (SME2) builds on SME by accelerating vector operations, increasing the number of applications that can benefit from the computational efficiency of SME beyond its initial focus on outer products and matrix-matrix multiplication.
+
+SME2 extends SME by introducing multi-vector data-processing instructions, loads to and stores from multi-vectors, and a multi-vector predication mechanism.
+
+Additional architectural features of SME2 include:
+
+* Multi-vector multiply-accumulate instructions, with Z vectors as multiplier and multiplicand inputs, accumulating results into ZA array vectors, including widening multiplies that accumulate into more vectors than they read.
+
+* Multi-vector load, store, move, permute, and convert instructions that use multiple SVE Z vectors as source and destination registers to pre-process inputs and post-process outputs of the ZA-targeting SME2 instructions.
+
+* *Predicate-as-counter*, an alternative predication mechanism added alongside the original SVE predication mechanism, to control operations performed on multiple vector registers.
+
+* Compressed neural network capability, using dedicated lookup table instructions and outer product instructions that support binary neural networks.
+
+* A 512-bit architectural register, ZT0, that supports the lookup table feature.
+
+### Suggested reading
+
+If you are not familiar with matrix multiplication, or would like to refresh your knowledge, this [Wikipedia article on matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) is a good starting point.
+
+This Learning Path assumes a basic understanding of SVE and SME. If you are not familiar with them, these resources are useful background reading:
+
+ - [Introducing the Scalable Matrix Extension for the Armv9-A Architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture).
+ - [Arm Scalable Matrix Extension (SME) Introduction (Part 1)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction).
+ - [Arm Scalable Matrix Extension (SME) Introduction (Part 2)](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2).
\ No newline at end of file