From a79e5da38dac048e5e7df9dd9833299d1dc32557 Mon Sep 17 00:00:00 2001 From: Arnaud de Grandmaison Date: Fri, 11 Jul 2025 10:25:12 +0200 Subject: [PATCH 1/2] Add a new learning path on SIMD Loops. --- assets/contributors.csv | 3 +- .../cross-platform/simd-loops/1-about.md | 79 +++++ .../cross-platform/simd-loops/2-using.md | 71 +++++ .../cross-platform/simd-loops/3-example.md | 281 ++++++++++++++++++ .../cross-platform/simd-loops/4-conclusion.md | 28 ++ .../cross-platform/simd-loops/_index.md | 84 ++++++ .../cross-platform/simd-loops/_next-steps.md | 8 + .../ai-camera-pipelines/3-build.md | 3 +- 8 files changed, 555 insertions(+), 2 deletions(-) create mode 100644 content/learning-paths/cross-platform/simd-loops/1-about.md create mode 100644 content/learning-paths/cross-platform/simd-loops/2-using.md create mode 100644 content/learning-paths/cross-platform/simd-loops/3-example.md create mode 100644 content/learning-paths/cross-platform/simd-loops/4-conclusion.md create mode 100644 content/learning-paths/cross-platform/simd-loops/_index.md create mode 100644 content/learning-paths/cross-platform/simd-loops/_next-steps.md diff --git a/assets/contributors.csv b/assets/contributors.csv index e65a28cb1c..ef6f06ea90 100644 --- a/assets/contributors.csv +++ b/assets/contributors.csv @@ -100,4 +100,5 @@ Ann Cheng,Arm,anncheng-arm,hello-ann,, Fidel Makatia Omusilibwa,,,,, Ker Liu,,,,, Rui Chang,,,,, - +Alejandro Martinez Vicente,Arm,,,, +Mohamad Najem,Arm,,,, diff --git a/content/learning-paths/cross-platform/simd-loops/1-about.md b/content/learning-paths/cross-platform/simd-loops/1-about.md new file mode 100644 index 0000000000..9516ac447e --- /dev/null +++ b/content/learning-paths/cross-platform/simd-loops/1-about.md @@ -0,0 +1,79 @@ +--- +title: About SIMD Loops +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +Writing high-performance software for Arm processors often involves delving into +its SIMD technologies. 
For many developers, that journey started with Neon --- a
familiar, fixed-width vector extension that has been around for years. But as
Arm architectures continue to evolve, so do their SIMD technologies.

Enter the world of SVE and SME: two powerful, scalable vector extensions designed for modern
workloads. Unlike Neon, they aren’t just wider --- they’re different. These
extensions introduce new instructions, more flexible programming models, and
support for concepts like predication, scalable vectors, and streaming modes.
However, they also come with a learning curve.

That’s where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) comes
in.

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is designed to help
you learn how to write SVE and SME code. It is a collection
of self-contained, real-world loop kernels --- written in a mix of C, ACLE
intrinsics, and inline assembly --- that target everything from simple arithmetic
to matrix multiplication, sorting, and string processing. You can compile them,
run them, step through them, and use them as a foundation for your own SIMD
work.

If you’re familiar with Neon intrinsics and would like to explore what SVE and
SME have to offer, the [SIMD
Loops](https://gitlab.arm.com/architecture/simd-loops) project is for you!

## What is SIMD Loops?

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is an open-source
project built to help you learn how to write SIMD code for modern Arm
architectures --- specifically using SVE (Scalable Vector Extension) and SME
(Scalable Matrix Extension). It is designed for programmers who already know
their way around Neon intrinsics but are now facing the more powerful --- and
more complex --- world of SVE and SME.

The goal of SIMD Loops is to provide working, readable examples that demonstrate
how to use the full range of features available in SVE, SVE2, and SME2.
Each
example is a self-contained loop kernel --- a small piece of code that performs
a specific task like matrix multiplication, vector reduction, histogram, or
memory copy --- and shows how that task can be implemented across different
vector instruction sets.

Unlike a cookbook that tries to provide a recipe for every problem, SIMD Loops
takes the opposite approach: it aims to showcase the architecture, not the
problem. The loop kernels are chosen to be realistic and meaningful, but the
main goal is to demonstrate how specific features and instructions work in
practice. If you’re trying to understand scalability, predication,
gather/scatter, streaming mode, ZA storage, compact instructions, or the
mechanics of matrix tiles --- this is where you’ll see them in action.

The project includes:
- Dozens of numbered loop kernels, each focused on a specific feature or pattern
- Reference C implementations to establish expected behavior
- Inline assembly and/or intrinsics for scalar, Neon, SVE, SVE2, and SME2
- Build support for different instruction sets, with runtime validation
- A simple command-line runner to execute any loop interactively
- Optional standalone binaries for bare-metal and simulator use

You don’t need to worry about auto-vectorization, compiler flags, or tooling
quirks. Each loop is hand-written and annotated to make the use of SIMD features
clear. The intent is that you can study, modify, and run each loop as a learning
exercise --- and use the project as a foundation for your own exploration of
Arm’s vector extensions.

## Where to get it?

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is available as
open-source code licensed under the BSD 3-Clause license.
You can access the source code
from the following GitLab project:
https://gitlab.arm.com/architecture/simd-loops

diff --git a/content/learning-paths/cross-platform/simd-loops/2-using.md b/content/learning-paths/cross-platform/simd-loops/2-using.md
new file mode 100644
index 0000000000..0c0c0fd7ad
--- /dev/null
+++ b/content/learning-paths/cross-platform/simd-loops/2-using.md
@@ -0,0 +1,71 @@
---
title: Using SIMD Loops
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

First, clone [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) and
change the current directory to it with:

```BASH
git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
cd simd-loops.git
```

## SIMD Loops structure

In the [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) project, the
source code for the loops is organized under the `loops` directory. The complete
list of loops is documented in the `loops.inc` file, which includes a brief
description and the purpose of each loop. Every loop is associated with a
uniquely named source file following the naming pattern `loop_<n>.c`, where
`<n>` represents the loop number.

A loop is structured as follows:

```C
// Includes and loop_<n>_data structure definition

#if defined(HAVE_xxx_INTRINSICS)

// Intrinsics versions: xxx = SME, SVE, or SIMD (Neon) versions
void inner_loop_<n>(struct loop_<n>_data *data) { ... }

#elif defined(HAVE_xxx)

// Hand-written inline assembly: xxx = SME2P1, SME2, SVE2P1, SVE2, SVE, or SIMD
void inner_loop_<n>(struct loop_<n>_data *data) { ... }

#else

// Equivalent C code
void inner_loop_<n>(struct loop_<n>_data *data) { ... }

#endif

// Main of loop: buffer allocations, loop function call, functional checking of results
```

Each loop is implemented in several SIMD extension variants, and conditional
compilation is used to select one of the optimisations for the
`inner_loop_<n>` function. When ACLE is supported (e.g.
SME, SVE, or +SIMD/Neon), a high-level intrinsic implementation is compiled. If ACLE is not +available, the tool falls back to handwritten inline assembly targeting one of +the various SIMD extensions, including SME2.1, SME2, SVE2.1, SVE2, and others. +If no handwritten inline assembly is detected, a fallback implementation in +native C is used. The overall code structure also includes setup and cleanup +code in the main function, where memory buffers are allocated, the selected loop +kernel is executed, and results are verified for correctness. + +At compile time, you can select which loop optimisation to compile, whether it +is based on SME or SVE intrinsics, or one of the available inline assembly +variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics +sme_intrinsics` ...). + +As the result of the build, two types of binaries are generated. The first is a +single executable named `simd_loops`, which includes all the loop +implementations. A specific loop can be selected by passing parameters to the +program (e.g., `simd_loops -k -n `). The second type consists +of individual standalone binaries, each corresponding to a specific loop. diff --git a/content/learning-paths/cross-platform/simd-loops/3-example.md b/content/learning-paths/cross-platform/simd-loops/3-example.md new file mode 100644 index 0000000000..924b9e357d --- /dev/null +++ b/content/learning-paths/cross-platform/simd-loops/3-example.md @@ -0,0 +1,281 @@ +--- +title: Example +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +To illustrate the structure and design principles of simd-loops, consider loop +202 as an example. `inner_loop_202` is defined at lines 69-79 in file +`loops/loops_202.c` and calls the `matmul_fp32` routine defined in +`matmul_fp32.c`. + +Open `loops/matmul_fp32.c`. 
This loop implements a single-precision floating-point matrix multiplication of
the form:

`C[M x N] = A[M x K] x B[K x N]`

A matrix multiplication can be understood in two equivalent ways:
- As the dot product between each row of matrix `A` and each column of matrix `B`.
- As the sum of outer products between the columns of `A` and the rows of `B`.

## Data structure

The loop begins by defining the data structure, which captures the matrix
dimensions (`M`, `K`, `N`) along with input and output buffers:

```C
struct loop_202_data {
  uint64_t m;
  uint64_t n;
  uint64_t k;
  float *restrict a;
  float *restrict b;
  float *restrict c;
};
```

For this loop:
- The first input matrix (`A`) is stored in column-major format in memory.
- The second input matrix (`B`) is stored in row-major format in memory.
- None of the memory areas designated by `a`, `b`, and `c` alias each other
  (i.e., they do not overlap in any way) --- as indicated by the `restrict`
  keyword.

This layout choice helps optimize memory access patterns for all the targeted
SIMD architectures.

## Loop attributes

Next, the loop attributes are specified depending on the target architecture:
- For SME targets, the function `inner_loop_202` must be invoked with the
  `__arm_streaming` attribute, using a shared `ZA` register context
  (`__arm_inout("za")`). These attributes are wrapped in the `LOOP_ATTR` macro.
- For SVE or Neon targets, no additional attributes are required.

This design enables portability across different SIMD extensions.

## Function implementation

The `matmul_fp32` function from the file `loops/matmul_fp32.c` provides several
optimizations of the single-precision floating-point matrix multiplication,
including the ACLE intrinsics-based code and the hand-optimized assembly code.

### Scalar code

A scalar C implementation is provided at lines 40-52.
This version follows the +dot-product formulation of matrix multiplication, serving both as a functional +reference and a baseline for auto-vectorization: + +```C { line_numbers="true", line_start="40" } + for (uint64_t x = 0; x < m; x++) { + for (uint64_t y = 0; y < n; y++) { + c[x * n + y] = 0.0f; + } + } + + // Loops ordered for contiguous memory access in inner loop + for (uint64_t z = 0; z < k; z++) + for (uint64_t x = 0; x < m; x++) { + for (uint64_t y = 0; y < n; y++) { + c[x * n + y] += a[z * m + x] * b[z * n + y]; + } + } +``` + +### SVE optimized code + +The SVE implementation uses the indexed floating-point multiply-accumulate +(`fmla`) instruction to optimize the matrix multiplication operation. In this +formulation, the outer-product is decomposed into multiple indexed +multiplication steps, with results accumulated directly into `Z` registers. + +In the intrinsic version (lines 167-210), the innermost loop is structured as follows: + +```C { line_numbers = "true", line_start="167"} + for (m_idx = 0; m_idx < m; m_idx += 8) { + for (n_idx = 0; n_idx < n; n_idx += svcntw() * 2) { + ZERO_PAIR(0); + ZERO_PAIR(1); + ZERO_PAIR(2); + ZERO_PAIR(3); + ZERO_PAIR(4); + ZERO_PAIR(5); + ZERO_PAIR(6); + ZERO_PAIR(7); + + ptr_a = &a[m_idx]; + ptr_b = &b[n_idx]; + while (ptr_a < cnd_k) { + lda_0 = LOADA_PAIR(0); + lda_1 = LOADA_PAIR(1); + ldb_0 = LOADB_PAIR(0); + ldb_1 = LOADB_PAIR(1); + + MLA_GROUP(0); + MLA_GROUP(1); + MLA_GROUP(2); + MLA_GROUP(3); + MLA_GROUP(4); + MLA_GROUP(5); + MLA_GROUP(6); + MLA_GROUP(7); + + ptr_a += m * 2; + ptr_b += n * 2; + } + + ptr_c = &c[n_idx]; + STORE_PAIR(0); + STORE_PAIR(1); + STORE_PAIR(2); + STORE_PAIR(3); + STORE_PAIR(4); + STORE_PAIR(5); + STORE_PAIR(6); + STORE_PAIR(7); + } + c += n * 8; + } +``` + +At the beginning of the loop, the accumulators (`Z` registers) are explicitly +initialized to zero. This is achieved using `svdup` intrinsic (or its equivalent +`dup` assembly instruction), encapsulated in the `ZERO_PAIR` macro. 
Within each iteration over the `K` dimension:
- 128 bits (four consecutive floating-point values) are loaded from the matrix
  `A`, using the load-replicate `svld1rq` intrinsic (or `ld1rqw` in assembly)
  in the `LOADA_PAIR` macro.
- Two consecutive vectors are loaded from matrix `B`, using the SVE load
  instructions, called by the `LOADB_PAIR` macro.
- A sequence of indexed multiply-accumulate operations is performed, computing
  the product of each element from `A` with the vectors from `B`.
- The results are accumulated across the 16 `Z` register accumulators,
  progressively building the partial results of the matrix multiplication.

After completing all iterations across the `K` dimension, the accumulated
results in the `Z` registers are stored back to memory. The `STORE_PAIR` macro
writes the values into the corresponding locations of the output matrix `C`.

The equivalent SVE hand-optimized assembly code is at lines 478-598.

This loop showcases how SVE registers and indexed `fmla` instructions enable
efficient decomposition of the outer-product formulation into parallel,
vectorized accumulation steps.

For more details on SVE/SVE2 instruction semantics, optimization guidelines,
and other documents, refer to the [Scalable Vector Extensions
resources](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions).

### SME2 optimized code

The SME2 implementation leverages the outer-product formulation of the matrix
multiplication function, utilizing the `fmopa` SME instruction to perform the
outer product and accumulate partial results in `ZA` tiles.
+ +A snippet of the loop is shown below: + +```C { line_numbers = "true", line_start="78"} +#if defined(__ARM_FEATURE_SME2p1) + svzero_za(); +#endif + + for (m_idx = 0; m_idx < m; m_idx += svl_s * 2) { + for (n_idx = 0; n_idx < n; n_idx += svl_s * 2) { +#if !defined(__ARM_FEATURE_SME2p1) + svzero_za(); +#endif + + ptr_a = &a[m_idx]; + ptr_b = &b[n_idx]; + while (ptr_a < cnd_k) { + vec_a0 = svld1_x2(c_all, &ptr_a[0]); + vec_b0 = svld1_x2(c_all, &ptr_b[0]); + vec_a1 = svld1_x2(c_all, &ptr_a[m]); + vec_b1 = svld1_x2(c_all, &ptr_b[n]); + + MOPA_TILE(0, 0, 0, 0); + MOPA_TILE(1, 0, 0, 1); + MOPA_TILE(2, 0, 1, 0); + MOPA_TILE(3, 0, 1, 1); + MOPA_TILE(0, 1, 0, 0); + MOPA_TILE(1, 1, 0, 1); + MOPA_TILE(2, 1, 1, 0); + MOPA_TILE(3, 1, 1, 1); + + ptr_a += m * 2; + ptr_b += n * 2; + } + + ptr_c = &c[n_idx]; + for (l_idx = 0; l_idx < l_cnd; l_idx += 8) { +#if defined(__ARM_FEATURE_SME2p1) + vec_c0 = svreadz_hor_za8_u8_vg4(0, l_idx + 0); + vec_c1 = svreadz_hor_za8_u8_vg4(0, l_idx + 4); +#else + vec_c0 = svread_hor_za8_u8_vg4(0, l_idx + 0); + vec_c1 = svread_hor_za8_u8_vg4(0, l_idx + 4); +#endif + + STORE_PAIR(0, 0, 1, 0); + STORE_PAIR(1, 0, 1, n); + STORE_PAIR(0, 2, 3, c_blk); + STORE_PAIR(1, 2, 3, c_off); + + ptr_c += n * 2; + } + } + c += c_blk * 2; + } +``` + +Within the SME2 intrinsics code (lines 91-106), the innermost loop iterates across +the `K` dimension - corresponding to the columns of matrix `A` and the rows of +matrix `B`. + +In each iteration: +- Two consecutive vectors are loaded from `A` and two consecutive vectors are + loaded from `B` (`vec_a`, and `vec_b`), using the multi-vector load + instructions. +- The `fmopa` instruction, encapsulated within the `MOPA_TILE` macro, computes + the outer product of the input vectors. +- The results are accumulated into the four 32-bit `ZA` tiles. 
After all iterations over the `K` dimension, the accumulated results are stored
back to memory through a store loop at lines 111-124.

During this phase, four rows of `ZA` tiles are read out into four `Z` vectors
using the `svread_hor_za8_u8_vg4` intrinsic (or the equivalent `mova` assembly
instruction). The vectors are then stored into the output buffer with SME
multi-vector `st1w` store instructions, wrapped in the `STORE_PAIR` macro.

The equivalent SME2 hand-optimized code is at lines 229-340.

For more details on instruction semantics and SME/SME2 optimization guidelines,
refer to the official [SME Programmer's
Guide](https://developer.arm.com/documentation/109246/latest/).

## Other optimizations

Beyond the SME2 and SVE implementations shown above, this loop also includes
several alternative optimized versions, each leveraging architecture-specific
features.

### Neon

The Neon version (lines 612-710) relies on multiple-structure load/store
instructions combined with indexed `fmla` instructions to vectorize the matrix
multiplication operation.

### SVE2.1

The SVE2.1 implementation (lines 355-462) extends the base SVE approach by
utilizing multi-vector load and store instructions.

### SME2.1

The SME2.1 version leverages the `movaz` instruction (or the
`svreadz_hor_za8_u8_vg4` intrinsic) to reinitialize the `ZA` tile accumulators
while moving data out to registers.

diff --git a/content/learning-paths/cross-platform/simd-loops/4-conclusion.md b/content/learning-paths/cross-platform/simd-loops/4-conclusion.md
new file mode 100644
index 0000000000..ac9814a73a
--- /dev/null
+++ b/content/learning-paths/cross-platform/simd-loops/4-conclusion.md
@@ -0,0 +1,28 @@
---
title: Conclusion
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is an invaluable
resource for developers looking to learn or master the intricacies of SVE and
SME on modern Arm architectures.
By providing practical, hands-on examples, it +bridges the gap between the architecture specification and real-world +application. Whether you're transitioning from Neon or starting fresh with SVE +and SME, SIMD Loops offers a comprehensive toolkit to enhance your understanding +and proficiency. + +With its extensive collection of loop kernels, detailed documentation, and +flexible build options, [SIMD +Loops](https://gitlab.arm.com/architecture/simd-loops) empowers you to explore +and leverage the full potential of Arm's advanced vector extensions. Dive into +the project, experiment with the examples, and take your high-performance coding +skills for Arm to the next level. + +For more information and to get started, visit the [SIMD +Loops](https://gitlab.arm.com/architecture/simd-loops) GitLab project and refer +to the +[README.md](https://gitlab.arm.com/architecture/simd-loops/-/blob/main/README.md) +for detailed instructions on building and running the code. Happy coding! diff --git a/content/learning-paths/cross-platform/simd-loops/_index.md b/content/learning-paths/cross-platform/simd-loops/_index.md new file mode 100644 index 0000000000..fa624d0115 --- /dev/null +++ b/content/learning-paths/cross-platform/simd-loops/_index.md @@ -0,0 +1,84 @@ +--- +title: "Code kata: perfect your SVE and SME instructions skills with SIMD Loops" + +minutes_to_complete: 30 + +who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2 and SME2 to improve the performance of their software for Arm processors. + +learning_objectives: + - Improve your writing of SIMD code with SVE and SME. + +prerequisites: + - An AArch64 computer running Linux or macOS. You can use cloud instances, see this list of [Arm cloud service providers](/learning-paths/servers-and-cloud-computing/csp/). 
+ - Some familiarity of SIMD programming and Neon intrinsics + +author: + - Alejandro Martinez Vicente + - Mohamad Najem + +### Tags +skilllevels: Advanced +subjects: Performance and Architecture +armips: + - Neoverse +operatingsystems: + - Linux + - macOS +tools_software_languages: + - GCC + - Clang + - FVP + +further_reading: + - resource: + title: SVE Programming Examples + link: https://developer.arm.com/documentation/dai0548/latest + type: documentation + - resource: + title: Port Code to Arm Scalable Vector Extension (SVE) + link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/sve + type: website + - resource: + title: Introducing the Scalable Matrix Extension for the Armv9-A Architecture + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture + type: website + - resource: + title: Arm Scalable Matrix Extension (SME) Introduction (Part 1) + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction + type: blog + - resource: + title: Arm Scalable Matrix Extension (SME) Introduction (Part 2) + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2 + type: blog + - resource: + title: (Part 3) Matrix-matrix multiplication. 
Neon, SVE, and SME compared + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared + type: blog + - resource: + title: Build adaptive libraries with multiversioning + link: https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/ + type: website + - resource: + title: SME Programmer's Guide + link: https://developer.arm.com/documentation/109246/latest + type: documentation + - resource: + title: Compiler Intrinsics + link: https://en.wikipedia.org/wiki/Intrinsic_function + type: website + - resource: + title: ACLE - Arm C Language Extension + link: https://github.com/ARM-software/acle + type: website + - resource: + title: Application Binary Interface for the Arm Architecture + link: https://github.com/ARM-software/abi-aa + type: website + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/cross-platform/simd-loops/_next-steps.md b/content/learning-paths/cross-platform/simd-loops/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/cross-platform/simd-loops/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. 
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md index 6257f4cba2..f96e92d4fc 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md +++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md @@ -29,7 +29,8 @@ Build the Docker container used to compile the pipelines: ```bash docker build -t ai-camera-pipelines -f docker/Dockerfile \ --build-arg DOCKERHUB_MIRROR=docker.io \ - --build-arg CI_UID=$(id -u) . + --build-arg CI_UID=$(id -u) \ + docker ``` ## Build the AI Camera Pipelines From d81d700d0d5a9c75e68d4f62be6cd98f6fedff8a Mon Sep 17 00:00:00 2001 From: Arnaud de Grandmaison Date: Mon, 8 Sep 2025 14:34:14 +0200 Subject: [PATCH 2/2] [simd loops] Fix code snippet + wording improvements. --- .../cross-platform/simd-loops/1-about.md | 2 +- .../cross-platform/simd-loops/2-using.md | 31 ++++++++++++------- 2 files changed, 20 insertions(+), 13 deletions(-) diff --git a/content/learning-paths/cross-platform/simd-loops/1-about.md b/content/learning-paths/cross-platform/simd-loops/1-about.md index 9516ac447e..49b53a29e5 100644 --- a/content/learning-paths/cross-platform/simd-loops/1-about.md +++ b/content/learning-paths/cross-platform/simd-loops/1-about.md @@ -59,7 +59,7 @@ mechanics of matrix tiles --- this is where you’ll see them in action. 
The project includes: - Dozens of numbered loop kernels, each focused on a specific feature or pattern - Reference C implementations to establish expected behavior -- Inline assembly and/or intrinsics for scalar, Neon, SVE, SVE2, and SME2 +- Inline assembly and/or intrinsics for scalar, Neon, SVE, SVE2, SVE2.1, SME2 and SME2.1 - Build support for different instruction sets, with runtime validation - A simple command-line runner to execute any loop interactively - Optional standalone binaries for bare-metal and simulator use diff --git a/content/learning-paths/cross-platform/simd-loops/2-using.md b/content/learning-paths/cross-platform/simd-loops/2-using.md index 0c0c0fd7ad..7ca3276628 100644 --- a/content/learning-paths/cross-platform/simd-loops/2-using.md +++ b/content/learning-paths/cross-platform/simd-loops/2-using.md @@ -28,20 +28,26 @@ A loop is structured as follows: ```C // Includes and loop__data structure definition +#if defined(HAVE_NATIVE) || defined(HAVE_AUTOVEC) + +// C code +void inner_loop_(struct loop__data *data) { ... } + #if defined(HAVE_xxx_INTRINSICS) // Intrinsics versions: xxx = SME, SVE, or SIMD (Neon) versions void inner_loop_(struct loop__data *data) { ... } -#elif defined(HAVE_xxx) +#elif defined() -// Hand-written inline assembly : xxx = SME2P1, SME2, SVE2P1, SVE2, SVE, or SIMD + // Hand-written inline assembly : +// = __ARM_FEATURE_SME2p1, __ARM_FEATURE_SME2, __ARM_FEATURE_SVE2p1, +// __ARM_FEATURE_SVE2, __ARM_FEATURE_SVE, or __ARM_NEON void inner_loop_(struct loop__data *data) { ... } #else -// Equivalent C code -void inner_loop_(struct loop__data *data) { ... } +#error "No implementations available for this target." #endif @@ -50,14 +56,15 @@ void inner_loop_(struct loop__data *data) { ... } Each loop is implemented in several SIMD extension variants, and conditional compilation is used to select one of the optimisations for the -`inner_loop_` function. When ACLE is supported (e.g. 
SME, SVE, or -SIMD/Neon), a high-level intrinsic implementation is compiled. If ACLE is not -available, the tool falls back to handwritten inline assembly targeting one of -the various SIMD extensions, including SME2.1, SME2, SVE2.1, SVE2, and others. -If no handwritten inline assembly is detected, a fallback implementation in -native C is used. The overall code structure also includes setup and cleanup -code in the main function, where memory buffers are allocated, the selected loop -kernel is executed, and results are verified for correctness. +`inner_loop_` function. The native C implementation is written first, and +it can be generated either when building natively (HAVE_NATIVE) or through +compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g., +SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE +support is not available, the build process falls back to handwritten inline +assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, +SVE2.1, SVE2, and others. The overall code structure also includes setup and +cleanup code in the main function, where memory buffers are allocated, the +selected loop kernel is executed, and results are verified for correctness. At compile time, you can select which loop optimisation to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly