Merged

48 commits
277d394
Deploy Nginx on the Microsoft Azure Cobalt 100 processors
odidev Jul 31, 2025
bba0ef0
Deploy Golang on the Microsoft Azure Cobalt 100 processors
odidev Aug 19, 2025
ad515bf
PostgreSQL Tune: Updated page cache guidance
jsrz Sep 15, 2025
3d965da
Update kernel_comp_lib.md
pareenaverma Sep 16, 2025
c8e251f
Merge pull request #2317 from jsrz/postgresql_tune_updated
pareenaverma Sep 16, 2025
63d431a
Update stats_current_test_info.yml
pareenaverma Sep 16, 2025
7d50fb7
PostgreSQL Tuning Update
jsrz Sep 16, 2025
d4f8152
Merge pull request #2318 from ArmDeveloperEcosystem/update-stats-curr…
jasonrandrews Sep 16, 2025
b35ffc9
Merge pull request #2319 from jsrz/postgresql_recycle
jasonrandrews Sep 16, 2025
385d9f8
Tech review of java azure LP
pareenaverma Sep 16, 2025
3e0e6ff
Changed references to AVH in Zephyr learning path
vovamarch Sep 16, 2025
6e5f1ce
Merge pull request #2320 from pareenaverma/content_review
pareenaverma Sep 16, 2025
677c057
Update zephyr.md
pareenaverma Sep 16, 2025
4f19004
Update _index.md
pareenaverma Sep 16, 2025
fd4955d
Merge pull request #2208 from odidev/nginx_LP
pareenaverma Sep 16, 2025
2254ab3
Merge pull request #2321 from vovamarch/main
pareenaverma Sep 16, 2025
7549347
Update to ALP: Vision LLM inference on Android with KleidiAI and MNN
Sep 17, 2025
147ea60
Update 1-devenv-and-model.md
pareenaverma Sep 17, 2025
29720ca
Update 1-devenv-and-model.md
pareenaverma Sep 17, 2025
f6eb493
Merge pull request #2322 from amalaugustinejose/vision-llm-inference-…
pareenaverma Sep 17, 2025
3c80a1b
Update 4-build-model.md
pareenaverma Sep 17, 2025
44f6a5e
Merge pull request #2323 from pareenaverma/main
pareenaverma Sep 17, 2025
ad9bed0
Final tech review of SIMD Loops
jasonrandrews Sep 17, 2025
88e4fa1
Merge pull request #2324 from jasonrandrews/review
jasonrandrews Sep 17, 2025
d4b7e7c
Tech review of NGINX Azure LP
pareenaverma Sep 17, 2025
f26c92b
Merge pull request #2325 from pareenaverma/content_review
pareenaverma Sep 17, 2025
5969b49
starting content review
madeline-underwood Sep 17, 2025
171c6cf
Update zephyr.md
pareenaverma Sep 18, 2025
0fab38b
Merge pull request #2326 from pareenaverma/content_review
pareenaverma Sep 18, 2025
ee3896a
content development
madeline-underwood Sep 18, 2025
0322c41
Updated LOs
madeline-underwood Sep 18, 2025
4a918b7
Removed superflous fields
madeline-underwood Sep 18, 2025
bcc7868
Updates
madeline-underwood Sep 18, 2025
fce20ad
Corrected numbering on weightings
madeline-underwood Sep 18, 2025
f7094a4
Streamlining
madeline-underwood Sep 18, 2025
420959f
Mapping vector extensions to Arm
jasonrandrews Sep 18, 2025
a16d086
Merge pull request #2327 from jasonrandrews/review
jasonrandrews Sep 18, 2025
f54274e
Merge branch 'ArmDeveloperEcosystem:main' into simd_loops
madeline-underwood Sep 18, 2025
34ad463
Fixed issues with index file
madeline-underwood Sep 18, 2025
f8c811f
Merge branch 'simd_loops' of https://github.com/madeline-underwood/ar…
madeline-underwood Sep 18, 2025
9653244
Added further LOs
madeline-underwood Sep 18, 2025
c31d636
Tweaks
madeline-underwood Sep 18, 2025
5f96715
Tweaked LOs to remove superfluous ones
madeline-underwood Sep 18, 2025
39004e0
Update _index.md
pareenaverma Sep 18, 2025
d7c1a28
Merge pull request #2238 from odidev/golang_LP
pareenaverma Sep 18, 2025
3c2ff7d
Merge pull request #2328 from madeline-underwood/simd_loops
jasonrandrews Sep 18, 2025
2db241a
category and tag updates
jasonrandrews Sep 19, 2025
d3c5f3b
Merge pull request #2329 from jasonrandrews/review
jasonrandrews Sep 19, 2025
16 changes: 10 additions & 6 deletions content/learning-paths/automotive/_index.md
@@ -12,20 +12,24 @@ title: Automotive
weight: 4
subjects_filter:
- Containers and Virtualization: 3
- Performance and Architecture: 2
- Performance and Architecture: 5
operatingsystems_filter:
- Baremetal: 1
- Linux: 4
- Linux: 7
- macOS: 1
- RTOS: 1
tools_software_languages_filter:
- Automotive: 1
- C: 1
- Arm Development Studio: 1
- Arm Zena CSS: 1
- C: 2
- C++: 1
- Clang: 2
- DDS: 1
- Docker: 2
- GCC: 2
- Python: 2
- Raspberry Pi: 1
- ROS 2: 1
- ROS2: 2
- ROS 2: 3
- Rust: 1
- Zenoh: 1
---
59 changes: 12 additions & 47 deletions content/learning-paths/cross-platform/simd-loops/1-about.md
@@ -1,70 +1,35 @@
---
title: About single instruction, multiple data (SIMD) loops
weight: 3
title: About Single Instruction, Multiple Data loops
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Writing high-performance software for Arm processors often involves delving into
SIMD technologies. For many developers, that journey started with NEON, a
familiar, fixed-width vector extension that has been around for many years. But as
Arm architectures continue to evolve, so do their SIMD technologies.
## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs

Enter the world of Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME): two powerful, scalable vector extensions designed for modern
workloads. Unlike NEON, they are not just wider; they are fundamentally different. These
extensions introduce new instructions, more flexible programming models, and
support for concepts like predication, scalable vectors, and streaming modes.
However, they also come with a learning curve.
Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.

That is where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) becomes a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code.
This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
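To make predication and vector-length-agnostic (VLA) programming concrete, here is a minimal sketch (not code from the SIMD Loops project) of an array-add kernel guarded by the standard ACLE feature macro. With SVE, `svwhilelt` builds a predicate that masks off the tail elements, so the same binary works at any hardware vector length with no scalar clean-up loop; on targets without SVE the scalar fallback compiles instead:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Add two float arrays of arbitrary length n.
void vec_add(float *dst, const float *a, const float *b, size_t n) {
#if defined(__ARM_FEATURE_SVE)
    // VLA loop: svcntw() is the number of 32-bit lanes at runtime,
    // and the predicate pg disables lanes past the end of the array.
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));
    }
#else
    // Scalar fallback for targets without SVE.
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

The feature macro `__ARM_FEATURE_SVE` is defined by the compiler when SVE code generation is enabled, which is the same selection mechanism the SIMD Loops build uses to pick an implementation.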

SIMD Loops is designed to help
you learn how to write SVE and SME code. It is a collection
of self-contained, real-world loop kernels written in a mix of C, Arm C Language Extensions (ACLE)
intrinsics, and inline assembly. These kernels target tasks ranging from simple arithmetic
to matrix multiplication, sorting, and string processing. You can compile them,
run them, step through them, and use them as a foundation for your own SIMD
work.
## What is the SIMD Loops project?

If you are familiar with NEON intrinsics, you can use SIMD Loops to learn and explore SVE and SME.
The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads.

## What is SIMD Loops?
Visit the [SIMD Loops Repo](https://gitlab.arm.com/architecture/simd-loops).

SIMD Loops is an open-source
project, licensed under BSD 3-Clause, built to help you learn how to write SIMD code for modern Arm
architectures, specifically using SVE and SME.
It is designed for programmers who already know
their way around NEON intrinsics but are now facing the more powerful and
complex world of SVE and SME.
This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It’s aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel - a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets.
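As a hypothetical illustration of the kernel style (the struct name and fields below are invented for this sketch, not the project's actual definitions), a reference C implementation of an FP32 inner product, the task loop 001 implements with fp32 MLA instructions, might look like:

```c
#include <stddef.h>

// Illustrative only: real loop_<NNN>.c files define their own data layout.
struct loop_001_data {
    size_t n;        // element count
    const float *a;  // first input vector
    const float *b;  // second input vector
    float result;    // inner product accumulator
};

// Scalar reference version; SIMD variants replace this under
// conditional compilation while keeping the same interface.
void inner_loop_001(struct loop_001_data *data) {
    float acc = 0.0f;
    for (size_t i = 0; i < data->n; i++)
        acc += data->a[i] * data->b[i];
    data->result = acc;
}
```

Because every variant shares the same function signature and data struct, the reference version doubles as the ground truth for checking the SIMD implementations.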

The goal of SIMD Loops is to provide working, readable examples that demonstrate
how to use the full range of features available in SVE, SVE2, and SME2. Each
example is a self-contained loop kernel, a small piece of code that performs
a specific task like matrix multiplication, vector reduction, histogram, or
memory copy. These examples show how that task can be implemented across different
vector instruction sets.

Unlike a cookbook that tries to provide a recipe for every problem, SIMD Loops
takes the opposite approach. It aims to showcase the architecture rather than
the problem. The loop kernels are chosen to be realistic and meaningful, but the
main goal is to demonstrate how specific features and instructions work in
practice. If you are trying to understand scalability, predication,
gather/scatter, streaming mode, ZA storage, compact instructions, or the
mechanics of matrix tiles, this is where you will see them in action.
Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action.

The project includes:
- Dozens of numbered loop kernels, each focused on a specific feature or pattern
- Many numbered loop kernels, each focused on a specific feature or pattern
- Reference C implementations to establish expected behavior
- Inline assembly and/or intrinsics for scalar, NEON, SVE, SVE2, SVE2.1, SME2, and SME2.1
- Build support for different instruction sets, with runtime validation
- A simple command-line runner to execute any loop interactively
- Optional standalone binaries for bare-metal and simulator use

You do not need to worry about auto-vectorization, compiler flags, or tooling
quirks. Each loop is hand-written and annotated to make the use of SIMD features
clear. The intent is that you can study, modify, and run each loop as a learning
exercise, and use the project as a foundation for your own exploration of
Arm’s vector extensions.
You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.


140 changes: 105 additions & 35 deletions content/learning-paths/cross-platform/simd-loops/2-using.md
@@ -1,45 +1,74 @@
---
title: Using SIMD Loops
weight: 4
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

To get started, clone the SIMD Loops project and change current directory:
## Set up your development environment

To get started, clone the SIMD Loops project and change to the project directory:

```bash
git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
cd simd-loops.git
```

Confirm that you are using an Arm machine:

```bash
uname -m
```

Expected output on Linux:

```output
aarch64
```

Expected output on macOS:

```output
arm64
```

## SIMD Loops structure

In the SIMD Loops project, the
source code for the loops is organized under the loops directory. The complete
list of loops is documented in the loops.inc file, which includes a brief
description and the purpose of each loop. Every loop is associated with a
uniquely named source file following the naming pattern `loop_<NNN>.c`, where
`<NNN>` represents the loop number.
In the SIMD Loops project, the source code for the loops is organized under the `loops` directory. The complete list of loops is documented in the `loops.inc` file, which includes a brief description and the purpose of each loop. Every loop is associated with a uniquely named source file following the pattern `loop_<NNN>.c`, where `<NNN>` represents the loop number.

A subset of the `loops.inc` file is below:

```output
LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
```

A loop is structured as follows:

```C
```c
// Includes and loop_<NNN>_data structure definition

#if defined(HAVE_NATIVE) || defined(HAVE_AUTOVEC)

// C code
// C reference or auto-vectorized version
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }

#if defined(HAVE_xxx_INTRINSICS)

// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON) versions
// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON)
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }

#elif defined(<ASM_COND>)

// Hand-written inline assembly :
// Hand-written inline assembly
// <ASM_COND> = __ARM_FEATURE_SME2p1, __ARM_FEATURE_SME2, __ARM_FEATURE_SVE2p1,
// __ARM_FEATURE_SVE2, __ARM_FEATURE_SVE, or __ARM_NEON
void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }
@@ -50,28 +79,69 @@ void inner_loop_<NNN>(struct loop_<NNN>_data *data) { ... }

#endif

// Main of loop: Buffers allocations, loop function call, result functional checking
// Main of loop: buffer allocation, loop function call, result checking
```

Each loop is implemented in several SIMD extension variants. Conditional compilation selects one of the implementations for the `inner_loop_<NNN>` function.

The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.

When SIMD ACLE is supported (SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.

The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.
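The allocate-run-verify pattern described above can be sketched as follows. This is an assumption-laden illustration, not the project's code: the stand-in kernel, function names, and the exact checking scheme are all hypothetical.

```c
#include <stdlib.h>

// Stand-in kernel: doubles each element (illustrative only).
static void inner_loop(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

// Allocate buffers, run the kernel, verify results, clean up.
// Returns 1 on success, 0 on failure.
int run_and_check(size_t n) {
    float *src = malloc(n * sizeof *src);
    float *dst = malloc(n * sizeof *dst);
    if (!src || !dst) { free(src); free(dst); return 0; }

    for (size_t i = 0; i < n; i++)
        src[i] = (float)i;

    inner_loop(dst, src, n);  // the selected loop implementation

    // Functional check against the expected result.
    int ok = 1;
    for (size_t i = 0; i < n; i++)
        if (dst[i] != 2.0f * (float)i) ok = 0;

    free(src);
    free(dst);
    return ok;
}
```

The real harness compares each SIMD variant's output against the reference C implementation in the same way, which is what the "Checksum correct" message in the runner output reflects.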

At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.

```console
make
```

With no target specified, the list of targets is printed:

```output
all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
```

Build all loops for all targets:

```console
make all
```

Build all loops for a single target, such as NEON:

```console
make neon
```

As a result of the build, two types of binaries are generated.

The first is a single executable named `simd_loops`, which includes all loop implementations.

Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the NEON target:

```console
build/neon/bin/simd_loops -k 1 -n 5
```

Example output:

```output
Loop 001 - FP32 inner product
- Purpose: Use of fp32 MLA instruction
- Checksum correct.
```

Each loop is implemented in several SIMD extension variants, and conditional
compilation is used to select one of the optimizations for the
`inner_loop_<NNN>` function. The native C implementation is written first, and
it can be generated either when building natively (HAVE_NATIVE) or through
compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
support is not available, the build process falls back to handwritten inline
assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
SVE2.1, SVE2, and others. The overall code structure also includes setup and
cleanup code in the main function, where memory buffers are allocated, the
selected loop kernel is executed, and results are verified for correctness.

At compile time, you can select which loop optimization to compile, whether it
is based on SME or SVE intrinsics, or one of the available inline assembly
variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics
sme_intrinsics` ...).

As the result of the build, two types of binaries are generated. The first is a
single executable named `simd_loops`, which includes all the loop
implementations. A specific loop can be selected by passing parameters to the
program (e.g., `simd_loops -k <NNN> -n <iterations>`). The second type consists
of individual standalone binaries, each corresponding to a specific loop.
The second type is a set of individual standalone binaries, one per loop.

To run loop 1 as a standalone binary:

```console
build/neon/standalone/bin/loop_001.elf
```

Example output:

```output
- Checksum correct.
```