diff --git a/content/learning-paths/cross-platform/simd-loops/1-about.md b/content/learning-paths/cross-platform/simd-loops/1-about.md
index 6d798ad108..45f47e1713 100644
--- a/content/learning-paths/cross-platform/simd-loops/1-about.md
+++ b/content/learning-paths/cross-platform/simd-loops/1-about.md
@@ -17,7 +17,7 @@ extensions introduce new instructions, more flexible programming models, and
 support for concepts like predication, scalable vectors, and streaming modes.
 However, they also come with a learning curve.

-That is where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) becomes a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
+[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.

 SIMD Loops is designed to help you learn how to write SVE and SME code. It is a collection
diff --git a/content/learning-paths/cross-platform/simd-loops/2-using.md b/content/learning-paths/cross-platform/simd-loops/2-using.md
index 86328c023d..ce01b381f5 100644
--- a/content/learning-paths/cross-platform/simd-loops/2-using.md
+++ b/content/learning-paths/cross-platform/simd-loops/2-using.md
@@ -13,15 +13,46 @@ git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
 cd simd-loops.git
 ```

+Confirm you are using an Arm machine by running:
+
+```bash
+uname -m
+```
+
+The output on Linux should be:
+
+```output
+aarch64
+```
+
+And for macOS:
+
+```output
+arm64
+```
+
 ## SIMD Loops structure

-In the SIMD Loops project, the
-source code for the loops is organized under the loops directory. The complete
-list of loops is documented in the loops.inc file, which includes a brief
+In the SIMD Loops project, the source code for the loops is organized under the loops directory. The complete
+list of loops is documented in the `loops.inc` file, which includes a brief
 description and the purpose of each loop. Every loop is associated with a
 uniquely named source file following the naming pattern `loop_.c`, where
 `` represents the loop number.

+A subset of the `loops.inc` file is below:
+
+```output
+LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
+LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
+LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
+LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
+LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
+```
+
 A loop is structured as follows:

 ```C
@@ -55,23 +86,79 @@ void inner_loop_(struct loop__data *data) { ... }

 Each loop is implemented in several SIMD extension variants, and
 conditional compilation is used to select one of the optimizations for the
-`inner_loop_` function. The native C implementation is written first, and
-it can be generated either when building natively (HAVE_NATIVE) or through
-compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
-SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
+`inner_loop_` function.
+
+The native C implementation is written first, and
+it can be generated either when building natively with `-DHAVE_NATIVE` or through
+compiler auto-vectorization `-DHAVE_AUTOVEC`.
+
+When SIMD ACLE is supported (SME, SVE, or NEON),
+the code is compiled using high-level intrinsics. If ACLE
 support is not available, the build process falls back to handwritten inline
 assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
-SVE2.1, SVE2, and others. The overall code structure also includes setup and
+SVE2.1, SVE2, and others.
+
+The overall code structure also includes setup and
 cleanup code in the main function, where memory buffers are allocated, the
 selected loop kernel is executed, and results are verified for correctness.

 At compile time, you can select which loop optimization to compile, whether it
 is based on SME or SVE intrinsics, or one of the available inline assembly
-variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics
-sme_intrinsics` ...).
-
-As the result of the build, two types of binaries are generated. The first is a
-single executable named `simd_loops`, which includes all the loop
-implementations. A specific loop can be selected by passing parameters to the
-program (e.g., `simd_loops -k -n `). The second type consists
-of individual standalone binaries, each corresponding to a specific loop.
+variants.
+
+```console
+make
+```
+
+With no target specified the list of targets is printed:
+
+```output
+all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
+```
+
+You can build all loops for all targets using:
+
+```console
+make all
+```
+
+You can build all loops for a single target, such as NEON, using:
+
+```console
+make neon
+```
+
+As the result of the build, two types of binaries are generated.
+
+The first is a single executable named `simd_loops`, which includes all the loop implementations.
+
+A specific loop can be selected by passing parameters to the
+program.
+
+For example, to run loop 1 for 5 iterations using the NEON target:
+
+```console
+build/neon/bin/simd_loops -k 1 -n 5
+```
+
+The output is:
+
+```output
+Loop 001 - FP32 inner product
+ - Purpose: Use of fp32 MLA instruction
+ - Checksum correct.
+```
+
+The second type of binary is an individual loop.
+
+To run loop 1 as a standalone binary:
+
+```console
+build/neon/standalone/bin/loop_001.elf
+```
+
+The output is:
+
+```output
+ - Checksum correct.
+```
diff --git a/content/learning-paths/cross-platform/simd-loops/3-example.md b/content/learning-paths/cross-platform/simd-loops/3-example.md
index fa3b614a40..805e9f76ec 100644
--- a/content/learning-paths/cross-platform/simd-loops/3-example.md
+++ b/content/learning-paths/cross-platform/simd-loops/3-example.md
@@ -6,12 +6,16 @@ weight: 5
 layout: learningpathall
 ---

-To illustrate the structure and design principles of simd-loops, consider loop
-202 as an example. `inner_loop_202` is defined at lines 69-79 in file
+To illustrate the structure and design principles of SIMD Loops, consider loop
+202 as an example.
+
+Use a text editor to look at the file `loops/loop_202.c`
+
+The function `inner_loop_202()` is defined at lines 60-70 in file
 `loops/loops_202.c` and calls the `matmul_fp32` routine defined in
 `matmul_fp32.c`.

-Open `loops/matmul_fp32.c`.
+Use a text editor to look at the file `loops/matmul_fp32.c`

 This loop implements a single precision floating point matrix multiplication
 of the form:
@@ -39,10 +43,10 @@ struct loop_202_data {
 ```

 For this loop:
-- The first input matrix (A) is stored in column-major format in memory.
+- The first input matrix (a) is stored in column-major format in memory.
 - The second input matrix (b) is stored in row-major format in memory.
-- None of the memory area designated by `a`, `b` anf `c` alias (i.e. they
-  overlap in some way) --- as indicated by the `restrict` keyword.
+- None of the memory areas designated by `a`, `b`, and `c` alias (that is, they
+  do not overlap), as indicated by the `restrict` keyword.

 This layout choice helps optimize memory access patterns for all the targeted
 SIMD architectures.
@@ -59,7 +63,7 @@ This design enables portability across different SIMD extensions.
 ## Function implementation

-The `matmul_fp32` function from file `loops/matmul_fp32.c` provides several
+The `matmul_fp32()` function from file `loops/matmul_fp32.c` provides several
 optimizations of the single-precision floating-point matrix multiplication,
 including the ACLE intrinsics-based code, and the assembly hand-optimized code.

diff --git a/content/learning-paths/cross-platform/simd-loops/4-conclusion.md b/content/learning-paths/cross-platform/simd-loops/4-conclusion.md
index d1e85d10d0..337d1a8d08 100644
--- a/content/learning-paths/cross-platform/simd-loops/4-conclusion.md
+++ b/content/learning-paths/cross-platform/simd-loops/4-conclusion.md
@@ -7,15 +7,17 @@ layout: learningpathall
 ---

 SIMD Loops is an invaluable
-resource for developers looking to learn or master the intricacies of SVE and
-SME on modern Arm architectures. By providing practical, hands-on examples, it
+resource for developers looking to learn the intricacies of SVE and
+SME on a variety of Arm architectures. By providing practical, hands-on examples, it
 bridges the gap between the architecture specification and real-world
-application. Whether you're transitioning from NEON or starting fresh with SVE
+application.
+
+Whether you're transitioning from NEON or starting fresh with SVE
 and SME, SIMD Loops offers a comprehensive toolkit to enhance your understanding
 and proficiency.

 With its extensive collection of loop kernels, detailed documentation, and
-flexible build options, SIMD Loops empowers you to explore
+flexible build options, SIMD Loops helps you to explore
 and leverage the full potential of Arm's advanced vector extensions. Dive into
 the project, experiment with the examples, and take your high-performance coding
 skills for Arm to the next level.
@@ -23,4 +25,4 @@ skills for Arm to the next level.
 For more information and to get started, visit the GitLab project and refer to
 the [README.md](https://gitlab.arm.com/architecture/simd-loops/-/blob/main/README.md)
-for instructions on building and running the code.
+for the latest instructions on building and running the code.
diff --git a/content/learning-paths/cross-platform/simd-loops/_index.md b/content/learning-paths/cross-platform/simd-loops/_index.md
index d10ffa777f..bed384fa7c 100644
--- a/content/learning-paths/cross-platform/simd-loops/_index.md
+++ b/content/learning-paths/cross-platform/simd-loops/_index.md
@@ -31,7 +31,6 @@ operatingsystems:
 tools_software_languages:
     - GCC
     - Clang
-    - FVP

 shared_path: true
 shared_between: