@@ -17,7 +17,7 @@ extensions introduce new instructions, more flexible programming models, and
support for concepts like predication, scalable vectors, and streaming modes.
However, they also come with a learning curve.

[SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) is a valuable resource that helps you quickly and effectively learn how to write high-performance SIMD code.

SIMD Loops is designed to help
you learn how to write SVE and SME code. It is a collection
119 changes: 103 additions & 16 deletions content/learning-paths/cross-platform/simd-loops/2-using.md
@@ -13,15 +13,46 @@ git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
cd simd-loops.git
```

Confirm you are using an Arm machine by running:

```bash
uname -m
```

The output on Linux should be:

```output
aarch64
```

On macOS, the output is:

```output
arm64
```

## SIMD Loops structure

In the SIMD Loops project, the source code for the loops is organized under the loops directory. The complete
list of loops is documented in the `loops.inc` file, which includes a brief
description and the purpose of each loop. Every loop is associated with a
uniquely named source file following the naming pattern `loop_<NNN>.c`, where
`<NNN>` represents the loop number.

A subset of the `loops.inc` file is shown below:

```output
LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
```

A loop is structured as follows:

```C
struct loop_<NNN>_data { ... };

void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
```

Each loop is implemented in several SIMD extension variants, and conditional
compilation is used to select one of the optimizations for the
`inner_loop_<NNN>` function. The native C implementation is written first, and
it can be generated either when building natively (HAVE_NATIVE) or through
compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
`inner_loop_<NNN>` function.

The native C implementation is written first. It can be selected either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.

When SIMD ACLE is supported (SME, SVE, or NEON),
the code is compiled using high-level intrinsics. If ACLE
support is not available, the build process falls back to handwritten inline
assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
SVE2.1, SVE2, and others.

The overall code structure also includes setup and
cleanup code in the main function, where memory buffers are allocated, the
selected loop kernel is executed, and results are verified for correctness.

At compile time, you can select which loop optimization to compile, whether it
is based on SME or SVE intrinsics, or one of the available inline assembly
variants.

```console
make
```

With no target specified, the list of available targets is printed:

```output
all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
```

You can build all loops for all targets using:

```console
make all
```

You can build all loops for a single target, such as NEON, using:

```console
make neon
```

As the result of the build, two types of binaries are generated.

The first is a single executable named `simd_loops`, which includes all the loop implementations.

A specific loop can be selected by passing parameters to the
program.

For example, to run loop 1 for 5 iterations using the NEON target:

```console
build/neon/bin/simd_loops -k 1 -n 5
```

The output is:

```output
Loop 001 - FP32 inner product
- Purpose: Use of fp32 MLA instruction
- Checksum correct.
```

The second type consists of individual standalone binaries, one per loop.

To run loop 1 as a standalone binary:

```console
build/neon/standalone/bin/loop_001.elf
```

The output is:

```output
- Checksum correct.
```
18 changes: 11 additions & 7 deletions content/learning-paths/cross-platform/simd-loops/3-example.md
@@ -6,12 +6,16 @@ weight: 5
layout: learningpathall
---

To illustrate the structure and design principles of SIMD Loops, consider loop
202 as an example.

Use a text editor to look at the file `loops/loop_202.c`.

The function `inner_loop_202()` is defined at lines 60-70 in the file
`loops/loop_202.c` and calls the `matmul_fp32()` routine defined in
`matmul_fp32.c`.

Use a text editor to look at the file `loops/matmul_fp32.c`.

This loop implements a single precision floating point matrix multiplication of
the form:
@@ -39,10 +43,10 @@ struct loop_202_data {
```

For this loop:
- The first input matrix (a) is stored in column-major format in memory.
- The second input matrix (b) is stored in row-major format in memory.
- None of the memory areas designated by `a`, `b`, and `c` alias (overlap
in any way), as indicated by the `restrict` keyword.

This layout choice helps optimize memory access patterns for all the targeted
SIMD architectures.
@@ -59,7 +63,7 @@ This design enables portability across different SIMD extensions.

## Function implementation

The `matmul_fp32()` function from file `loops/matmul_fp32.c` provides several
optimizations of the single-precision floating-point matrix multiplication,
including the ACLE intrinsics-based code, and the assembly hand-optimized code.

12 changes: 7 additions & 5 deletions content/learning-paths/cross-platform/simd-loops/4-conclusion.md
@@ -7,20 +7,22 @@ layout: learningpathall
---

SIMD Loops is an invaluable
resource for developers looking to learn the intricacies of SVE and
SME on a variety of Arm architectures. By providing practical, hands-on examples, it
bridges the gap between the architecture specification and real-world
application.

Whether you're transitioning from NEON or starting fresh with SVE
and SME, SIMD Loops offers a comprehensive toolkit to enhance your understanding
and proficiency.

With its extensive collection of loop kernels, detailed documentation, and
flexible build options, SIMD Loops helps you to explore
and leverage the full potential of Arm's advanced vector extensions. Dive into
the project, experiment with the examples, and take your high-performance coding
skills for Arm to the next level.

For more information and to get started, visit the GitLab project and refer
to the
[README.md](https://gitlab.arm.com/architecture/simd-loops/-/blob/main/README.md)
for the latest instructions on building and running the code.
1 change: 0 additions & 1 deletion content/learning-paths/cross-platform/simd-loops/_index.md
@@ -31,7 +31,6 @@ operatingsystems:
tools_software_languages:
- GCC
- Clang

shared_path: true
shared_between: