3 changes: 2 additions & 1 deletion assets/contributors.csv
@@ -100,4 +100,5 @@ Ann Cheng,Arm,anncheng-arm,hello-ann,,
Fidel Makatia Omusilibwa,,,,,
Ker Liu,,,,,
Rui Chang,,,,,

Alejandro Martinez Vicente,Arm,,,,
Mohamad Najem,Arm,,,,
79 changes: 79 additions & 0 deletions content/learning-paths/cross-platform/simd-loops/1-about.md
@@ -0,0 +1,79 @@
---
title: About SIMD Loops
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Writing high-performance software for Arm processors often involves delving into
their SIMD technologies. For many developers, that journey started with Neon --- a
familiar, fixed-width vector extension that has been around for years. But as
Arm architectures continue to evolve, so do their SIMD extensions.

Enter the world of SVE and SME: two powerful, scalable vector extensions designed for modern
workloads. Unlike Neon, they aren’t just wider --- they’re different. These
extensions introduce new instructions, more flexible programming models, and
support for concepts like predication, scalable vectors, and streaming modes.
However, they also come with a learning curve.

That’s where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) comes
in.

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is designed to help
you learn how to write SVE and SME code. It is a collection of self-contained,
real-world loop kernels --- written in a mix of C, ACLE intrinsics, and inline
assembly --- that target everything from simple arithmetic to matrix
multiplication, sorting, and string processing. You can compile them, run them,
step through them, and use them as a foundation for your own SIMD work.

If you’re familiar with Neon intrinsics and would like to explore what SVE and
SME have to offer, the [SIMD
Loops](https://gitlab.arm.com/architecture/simd-loops) project is for you!

## What is SIMD Loops?

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is an open-source
project built to help you learn how to write SIMD code for modern Arm
architectures --- specifically using SVE (Scalable Vector Extension) and SME
(Scalable Matrix Extension). It is designed for programmers who already know
their way around Neon intrinsics but are now facing the more powerful --- and
more complex --- world of SVE and SME.

The goal of SIMD Loops is to provide working, readable examples that demonstrate
how to use the full range of features available in SVE, SVE2, and SME2. Each
example is a self-contained loop kernel --- a small piece of code that performs
a specific task like matrix multiplication, vector reduction, histogram or
memory copy --- and shows how that task can be implemented across different
vector instruction sets.
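
As a flavor of what such a kernel looks like, here is a minimal reference C
implementation of a vector reduction, written in the spirit of the project's
loop kernels. The structure and function names below follow the `loop_<NNN>`
naming convention described later in this Learning Path, but they are
illustrative placeholders rather than code taken from the repository.

```C
#include <stddef.h>
#include <stdint.h>

// Hypothetical data structure for a reduction kernel (illustrative only).
struct loop_example_data {
    size_t n;              // number of elements
    const int32_t *input;  // input buffer
    int64_t result;        // reduction result
};

// Reference C implementation: sum all elements of the input buffer.
// In SIMD Loops, the same kernel is then rewritten explicitly with
// Neon, SVE, and SME instructions or intrinsics.
void inner_loop_example(struct loop_example_data *data) {
    int64_t acc = 0;
    for (size_t i = 0; i < data->n; i++) {
        acc += data->input[i];
    }
    data->result = acc;
}
```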

Unlike a cookbook that tries to provide a recipe for every problem, SIMD Loops
takes the opposite approach: it aims to showcase the architecture, not the
problem. The loop kernels are chosen to be realistic and meaningful, but the
main goal is to demonstrate how specific features and instructions work in
practice. If you’re trying to understand scalability, predication,
gather/scatter, streaming mode, ZA storage, compact instructions, or the
mechanics of matrix tiles --- this is where you’ll see them in action.
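
To make one of those concepts concrete, the sketch below shows how the
reduction kernel above could be written with SVE ACLE intrinsics. It is a
hypothetical example, not code from the project, but it illustrates two of the
ideas mentioned: the loop is vector-length agnostic (nothing in it assumes a
particular vector width), and the final partial vector is handled by a
predicate rather than a scalar tail loop.

```C
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical SVE version of a 32-bit integer reduction.
// Build with SVE enabled, for example -march=armv8-a+sve.
int64_t reduce_i32_sve(const int32_t *input, size_t n) {
    int64_t acc = 0;
    // svcntw() returns the number of 32-bit lanes per vector, so the loop
    // adapts to whatever vector length the hardware implements (scalability).
    for (size_t i = 0; i < n; i += svcntw()) {
        // The while-less-than predicate disables the lanes that fall past
        // the end of the buffer, so no scalar epilogue is needed (predication).
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        svint32_t va = svld1_s32(pg, &input[i]);
        acc += svaddv_s32(pg, va); // horizontal add of the active lanes only
    }
    return acc;
}
```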

The project includes:
- Dozens of numbered loop kernels, each focused on a specific feature or pattern
- Reference C implementations to establish expected behavior
- Inline assembly and/or intrinsics for scalar, Neon, SVE, SVE2, SVE2.1, SME2, and SME2.1
- Build support for different instruction sets, with runtime validation
- A simple command-line runner to execute any loop interactively
- Optional standalone binaries for bare-metal and simulator use

You don’t need to worry about auto-vectorization, compiler flags, or tooling
quirks. Each loop is hand-written and annotated to make the use of SIMD features
clear. The intent is that you can study, modify, and run each loop as a learning
exercise --- and use the project as a foundation for your own exploration of
Arm’s vector extensions.

## Where to get it?

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is available as
open-source code under the BSD 3-Clause license. You can access the source code
from the following GitLab project:
https://gitlab.arm.com/architecture/simd-loops

78 changes: 78 additions & 0 deletions content/learning-paths/cross-platform/simd-loops/2-using.md
@@ -0,0 +1,78 @@
---
title: Using SIMD Loops
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

First, clone [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) and
change into the cloned directory:

```BASH
git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
cd simd-loops.git
```

## SIMD Loops structure

In the [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) project, the
source code for the loops is organized under the `loops` directory. The complete
list of loops is documented in the `loops.inc` file, which includes a brief
description and the purpose of each loop. Every loop is associated with a
uniquely named source file following the naming pattern `loop_<NNN>.c`, where
`<NNN>` represents the loop number.

A loop is structured as follows:

```C
// Includes and loop_<NNN>_data structure definition

#if defined(HAVE_NATIVE) || defined(HAVE_AUTOVEC)

// C code
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }

#elif defined(HAVE_xxx_INTRINSICS)

// Intrinsics version: xxx = SME, SVE, or SIMD (Neon)
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }

#elif defined(<ASM_COND>)

// Hand-written inline assembly:
// <ASM_COND> = __ARM_FEATURE_SME2p1, __ARM_FEATURE_SME2, __ARM_FEATURE_SVE2p1,
// __ARM_FEATURE_SVE2, __ARM_FEATURE_SVE, or __ARM_NEON
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }

#else

#error "No implementations available for this target."

#endif

// Main of the loop: buffer allocation, loop function call, functional checking of results
```

Each loop is implemented in several SIMD extension variants, and conditional
compilation is used to select one of the optimisations for the
`inner_loop_<NNN>` function. The native C implementation comes first, and it is
used either when building the plain native variant (`HAVE_NATIVE`) or when
relying on compiler auto-vectorization (`HAVE_AUTOVEC`). When the SIMD ACLE is
supported (for example, SME, SVE, or Neon), the code is compiled using
high-level intrinsics. If ACLE support is not available, the build process
falls back to handwritten inline assembly targeting one of the available SIMD
extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others. The overall code
structure also includes setup and cleanup code in the main function, where
memory buffers are allocated, the selected loop kernel is executed, and results
are verified for correctness.
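
As a rough sketch of that harness structure (with placeholder names and a
trivially checkable kernel, not the project's actual code), the main function
of a loop looks conceptually like this:

```C
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Placeholder data structure standing in for a loop_<NNN>_data definition.
struct loop_example_data {
    size_t n;
    int32_t *src;
    int32_t *dst;
};

// One of the conditionally compiled implementations would be selected here;
// a plain memory copy keeps this sketch simple.
void inner_loop_example(struct loop_example_data *data) {
    memcpy(data->dst, data->src, data->n * sizeof(int32_t));
}

int main(void) {
    struct loop_example_data data = { .n = 1024, .src = NULL, .dst = NULL };

    // Buffer allocation and initialization.
    data.src = malloc(data.n * sizeof(int32_t));
    data.dst = calloc(data.n, sizeof(int32_t));
    for (size_t i = 0; i < data.n; i++) {
        data.src[i] = (int32_t)i;
    }

    // Run the selected loop kernel.
    inner_loop_example(&data);

    // Functional check of the results against the expected behavior.
    int ok = memcmp(data.src, data.dst, data.n * sizeof(int32_t)) == 0;
    printf("loop example: %s\n", ok ? "PASS" : "FAIL");

    free(data.src);
    free(data.dst);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```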

At compile time, you can select which variant of each loop to build, whether it
is based on SME or SVE intrinsics or on one of the available inline assembly
implementations, by passing the corresponding targets to make
(`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics sme_intrinsics`, and so on).

The build produces two types of binaries. The first is a single executable
named `simd_loops`, which includes all the loop implementations. A specific
loop can be selected by passing parameters to the program (for example,
`simd_loops -k <NNN> -n <iterations>`). The second type consists of individual
standalone binaries, each corresponding to a specific loop.