diff --git a/content/learning-paths/cross-platform/vectorization-comparison/1-vectorization.md b/content/learning-paths/cross-platform/vectorization-comparison/1-vectorization.md new file mode 100644 index 0000000000..d17480fcbc --- /dev/null +++ b/content/learning-paths/cross-platform/vectorization-comparison/1-vectorization.md @@ -0,0 +1,145 @@ +--- +title: Migrating SIMD code to the Arm architecture +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Vectorization on x86 vs. Arm + +Migrating SIMD (Single Instruction, Multiple Data) code from x86 extensions to Arm extensions is an important task for software developers aiming to optimize performance on Arm platforms. + +Understanding the mapping between x86 instruction sets like SSE, AVX, and AMX to Arm's NEON, SVE, and SME extensions is essential for ensuring portability and high performance. This Learning Path provides an overview to help you design a migration plan, leveraging Arm features such as scalable vector lengths and advanced matrix operations, to effectively adapt your code. + +Vectorization is a key optimization strategy where one instruction processes multiple data elements simultaneously. It drives performance in HPC, AI/ML, signal processing, and data analytics. + +Both x86 and Arm processors offer rich SIMD capabilities, but they differ in philosophy and design. The x86 architecture provides fixed-width vector units of 128, 256, and 512 bits. The Arm architecture offers a mix of fixed-width, for NEON, and scalable vectors for SVE and SME ranging from 128 to 2048 bits. + +If you are interested in migrating SIMD software to Arm, understanding these differences ensures portable, high-performance code. + +## Arm vector and matrix extensions + +### NEON + +NEON is a 128-bit SIMD extension available across all Armv8 cores, including both mobile and Neoverse platforms. It is particularly well-suited for multimedia processing, digital signal processing (DSP), and packet processing workloads. Conceptually, NEON is equivalent to x86 SSE or AVX-128, making it the primary target for migrating SSE workloads. Compiler support for auto-vectorization to NEON is mature, simplifying the migration process for developers. + +### Scalable Vector Extension (SVE) + +SVE introduces a revolutionary approach to SIMD with its vector-length agnostic (VLA) design. Registers in SVE can range from 128 to 2048 bits, with the exact width determined by the hardware implementation in multiples of 128 bits. This flexibility allows the same binary to run efficiently across different hardware generations. SVE also features advanced capabilities like per-element predication, which eliminates branch divergence, and native support for gather/scatter operations, enabling efficient handling of irregular memory accesses. While SVE is ideal for high-performance computing (HPC) and future-proof portability, developers must adapt to its unique programming model, which differs significantly from fixed-width SIMD paradigms. SVE is most similar to AVX-512 on x86, offering greater portability and scalability. + +### Scalable Matrix Extension (SME) + +SME is designed to accelerate matrix multiplication and is similar to AMX. Unlike AMX, which relies on dot-product-based operations, SME employs outer-product-based operations, providing greater flexibility for custom AI and HPC kernels. SME integrates seamlessly with SVE, utilizing scalable tiles and a streaming mode to optimize performance. 
It is particularly well-suited for AI training and inference workloads, as well as dense linear algebra in HPC applications. + +## x86 vector and matrix extensions + +### Streaming SIMD Extensions (SSE) + +The SSE instruction set provides 128-bit XMM registers and supports both integer and floating-point SIMD operations. Despite being an older technology, SSE remains a baseline for many libraries due to its widespread adoption. + +However, its fixed-width design and limited throughput make it less competitive compared to more modern extensions like AVX. When migrating code from SSE to Arm, developers will find that SSE maps well to Arm NEON, enabling a relatively straightforward transition. + +### Advanced Vector Extensions (AVX) + +The AVX extensions introduce 256-bit YMM registers with AVX and 512-bit ZMM registers with AVX-512, offering significant performance improvements over SSE. Key features include Fused Multiply-Add (FMA) operations, masked operations in AVX-512, and VEX/EVEX encodings that allow for more operands and flexibility. + +Migrating AVX code to Arm requires careful consideration, as AVX maps to NEON for up to 128-bit operations or to SVE for scalable-width operations. Since SVE is vector-length agnostic, porting AVX code often involves refactoring to accommodate this new paradigm. + +### Advanced Matrix Extensions (AMX) + +AMX is a specialized instruction set designed for accelerating matrix operations using dedicated matrix-tile registers, effectively treating 2D arrays as first-class citizens. It is particularly well-suited for AI workloads, such as convolutions and General Matrix Multiplications (GEMMs). + +When migrating AMX workloads to Arm, you can leverage Arm SME, which conceptually aligns with AMX but employs a different programming model based on outer products rather than dot products. This difference requires you to adapt their code to fully exploit SME's capabilities. + +## Comparison tables + +## SSE vs. NEON + +| Feature | SSE | NEON | +|-----------------------|---------------------------------------------------------------|----------------------------------------------------------------| +| **Register width** | 128-bit (XMM registers) | 128-bit (Q registers) | +| **Vector length model**| Fixed 128 bits | Fixed 128 bits | +| **Predication / masking**| Minimal predication; SSE lacks full mask registers | Conditional select instructions; no hardware mask registers | +| **Gather / Scatter** | No native gather/scatter (introduced in AVX2 and later) | No native gather/scatter; requires software emulation | +| **Instruction set scope**| Arithmetic, logical, shuffle, blend, conversion, basic SIMD | Arithmetic, logical, shuffle, saturating ops, multimedia, crypto extensions (AES, SHA)| +| **Floating-point support**| Single and double precision floating-point SIMD operations | Single and double precision floating-point SIMD operations | +| **Typical applications**| Legacy SIMD workloads; general-purpose vector arithmetic | Multimedia processing, DSP, cryptography, embedded compute | +| **Extensibility** | Extended by AVX/AVX2/AVX-512 for wider vectors and advanced features| NEON fixed at 128-bit vectors; ARM SVE offers scalable vectors but is separate | +| **Programming model** | Intrinsics supported in C/C++; assembly used for optimization | Intrinsics widely used; inline assembly less common | + + +## AVX vs. 
SVE (SVE2) + +| Feature | x86: AVX / AVX-512 | ARM: SVE / SVE2 | +|-----------------------|---------------------------------------------------------|---------------------------------------------------------------| +| **Register width** | Fixed: 256-bit (YMM), 512-bit (ZMM) | Scalable: 128 to 2048 bits (in multiples of 128 bits) | +| **Vector length model**| Fixed vector length; requires multiple code paths or compiler dispatch for different widths | Vector-length agnostic; same binary runs on any hardware vector width | +| **Predication / masking**| Mask registers for per-element operations (AVX-512) | Rich predication with per-element predicate registers | +| **Gather/Scatter** | Native gather/scatter support (AVX2 and AVX-512) | Native gather/scatter with efficient implementation across vector widths | +| **Key operations** | Wide SIMD, fused multiply-add (FMA), conflict detection, advanced masking | Wide SIMD, fused multiply-add (FMA), predicated operations, gather/scatter, reduction operations, bit manipulation | +| **Best suited for** | HPC, AI workloads, scientific computing, data analytics | HPC, AI, scientific compute, cloud and scalable workloads | +| **Limitations** | Power and thermal throttling on heavy 512-bit usage; complex software ecosystem | Requires vector-length agnostic programming style; ecosystem and hardware adoption still maturing | + +## AMX vs. SME + +| Feature | x86: AMX | ARM: SME | +|-----------------------|---------------------------------------------------------|------------------------------------------------------------| +| **Register width** | Tile registers with fixed dimensions: 16×16 for BF16, 64×16 for INT8 (about 1 KB total) | Scalable matrix tiles integrated with SVE, implementation-dependent tile dimensions | +| **Vector length model**| Fixed tile dimensions based on data type | Implementation-dependent tile dimensions, scales with SVE vector length | +| **Predication / masking**| No dedicated predication or masking in AMX tiles | Predication integrated through SVE predicate registers | +| **Gather/Scatter** | Not supported within AMX; handled by other instructions | Supported via integration with SVE’s gather/scatter features | +| **Key operations** | Focused on dot-product based matrix multiplication, optimized for GEMM and convolutions | Focus on outer-product matrix multiplication with streaming mode for dense linear algebra | +| **Best suited for** | AI/ML workloads such as training and inference, specifically GEMM and convolution kernels | AI/ML training and inference, scientific computing, dense linear algebra workloads | +| **Limitations** | Hardware and software ecosystem currently limited (primarily Intel Xeon platforms) | Emerging hardware support; compiler and library ecosystem in development | + + +## Key Differences for Developers + +When migrating from x86 SIMD extensions to Arm SIMD, there are several important architectural and programming differences for you to consider. + +### Vector Length Model + +x86 SIMD extensions such as SSE, AVX, and AVX-512 operate on fixed vector widths, 128, 256, or 512 bits. This often necessitates multiple code paths or compiler dispatch techniques to efficiently exploit available hardware SIMD capabilities. Arm NEON, similar to SSE, uses a fixed 128-bit vector width, making it a familiar, fixed-size SIMD baseline. + +In contrast, Arm’s Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME) introduce a vector-length agnostic model. 
This allows vectors to scale from 128 bits up to 2048 bits depending on the hardware, enabling the same binary to run efficiently across different implementations without modification. + +### Programming and Intrinsics + +x86 offers a comprehensive and mature set of SIMD intrinsics that increase in complexity especially with AVX-512 due to advanced masking and lane-crossing operations. Arm NEON intrinsics resemble SSE intrinsics and are relatively straightforward for porting existing SIMD code. However, Arm SVE and SME intrinsics are designed for a more predicated and vector-length agnostic style of programming. + +When migrating to SVE/SME you are encouraged to leverage compiler auto-vectorization with predication support, moving away from heavy reliance on low-level intrinsics to achieve scalable, portable performance. + +### Matrix Acceleration + +For matrix computation, AMX provides fixed-size tile registers optimized for dot-product operations such as GEMM and convolutions. In comparison, Arm SME extends the scalable vector compute model with scalable matrix tiles designed around outer-product matrix multiplication and novel streaming modes. + +SME’s flexible, hardware-adaptable tile sizes and tight integration with SVE’s predication model provide a highly adaptable platform for AI training, inference, and scientific computing. + +Both AMX and SME are currently available on limited set of platforms. + +### Overall Summary + +Migrating from x86 SIMD to Arm SIMD entails embracing Arm’s scalable and predicated SIMD programming model embodied by SVE and SME, which supports future-proof, portable code across a wide range of hardware. + +NEON remains important for fixed-width SIMD similar to SSE but may be less suited for emerging HPC and AI workloads that demand scale and flexibility. + +You need to adapt to Arm’s newer vector-length agnostic programming and tooling to fully leverage scalable SIMD and matrix architectures. + +Understanding these key differences in vector models, programming paradigms, and matrix acceleration capabilities helps you migrate and achieve good performance on Arm. + +## Migration tools + +There are tools and libraries that help translate SSE intrinsics to NEON intrinsics, which can shorten the migration effort and produce efficient Arm code. These libraries enable many SSE operations to be mapped to NEON equivalents, but some SSE features have no direct NEON counterparts and require workarounds or redesign. + +Overall, NEON is the standard for SIMD on Arm much like SSE for x86, making it the closest analogue for porting SIMD-optimized software from x86 to ARM. + +[sse2neon](https://github.com/DLTcollab/sse2neon) is an open-source header library that provides a translation layer from Intel SSE2 intrinsics to Arm NEON intrinsics. It enables many SSE2-optimized codebases to be ported to Arm platforms with minimal code modification by mapping familiar SSE2 instructions to their NEON equivalents. + + +[SIMD Everywhere (SIMDe)](https://github.com/simd-everywhere/simde) is a comprehensive, header-only library designed to ease the transition of SIMD code between different architectures. It provides unified implementations of SIMD intrinsics across x86 SSE/AVX, Arm NEON, and other SIMD instruction sets, facilitating portable and maintainable SIMD code. SIMDe supports a wide range of SIMD extensions and includes implementations that fall back to scalar code when SIMD is unavailable, maximizing compatibility. 
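+
+To get a concrete feel for how these translation layers work, the short sketch below compiles a block of unmodified SSE2 intrinsics on an Arm system by routing them through SIMDe. This is a minimal example rather than library documentation: it assumes the SIMDe headers are available on your include path, and it uses the `SIMDE_ENABLE_NATIVE_ALIASES` macro so the original `_mm_*` names can be kept as-is.
+
+```c
+// Minimal sketch: unmodified SSE2 intrinsics compiled on Arm through SIMDe.
+// Assumes the SIMDe headers are available on the include path.
+#define SIMDE_ENABLE_NATIVE_ALIASES   // keep the original _mm_* intrinsic names
+#include <simde/x86/sse2.h>
+#include <stdio.h>
+
+int main(void) {
+    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
+    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
+    float c[4];
+
+    __m128 va = _mm_loadu_ps(a);          // becomes a NEON load on Arm
+    __m128 vb = _mm_loadu_ps(b);
+    _mm_storeu_ps(c, _mm_add_ps(va, vb)); // becomes a NEON add and store on Arm
+
+    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
+    return 0;
+}
+```
+
+The same file builds with a plain `gcc -O3` invocation on both Arm and x86, which is the main attraction of this approach.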
+
+[Google Highway](https://github.com/google/highway) is a C++ library from Google that provides portable SIMD with runtime dispatch. You write vector code once against its length-agnostic API, and the library selects an implementation for the target instruction set, including Arm NEON and SVE as well as x86 SSE, AVX2, and AVX-512. Highway is well-suited for image and data processing, machine learning, and other performance-critical applications that need efficient SIMD across architectures.
+
+You can also review [Porting architecture specific intrinsics](/learning-paths/cross-platform/intrinsics/) for more information.
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md b/content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md
new file mode 100644
index 0000000000..015060804c
--- /dev/null
+++ b/content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md
@@ -0,0 +1,352 @@
+---
+title: Vector extension code examples
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## SAXPY example code
+
+To get some hands-on experience, you can study and run example code to better understand the vector extensions. The example used here is SAXPY.
+
+SAXPY stands for "Single-Precision A·X Plus Y" and is a fundamental operation in linear algebra. It computes the result of the equation `y[i] = a * x[i] + y[i]` for all elements in the arrays `x` and `y`.
+
+SAXPY is widely used in numerical computing, particularly in vectorized and parallelized environments, due to its simplicity and efficiency.
+
+### Reference version
+
+Below is a plain C implementation of SAXPY without any vector extensions.
+
+This serves as a reference for the optimized examples provided later.
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy(float a, const float *x, float *y, size_t n) {
+    for (size_t i = 0; i < n; ++i) {
+        y[i] = a * x[i] + y[i];
+    }
+}
+
+int main() {
+    size_t n = 1000;
+    float* x = malloc(n * sizeof(float));
+    float* y = malloc(n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) {
+        sum += y[i];
+    }
+    printf("Plain C SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_plain.c` and build and run the code using:
+
+```bash
+gcc -O3 -o saxpy_plain saxpy_plain.c
+./saxpy_plain
+```
+
+You can use Clang for any of the examples by replacing `gcc` with `clang` on the command line.
+
+### Arm NEON version (128-bit SIMD, 4 floats per operation)
+
+NEON operates on fixed 128-bit registers, able to process 4 single-precision float values simultaneously in every vector instruction.
+
+This extension is available on most Arm-based devices and is excellent for accelerating loops and signal processing tasks in mobile and embedded workloads.
+
+The example below processes 16 floats per iteration using four separate NEON operations to improve instruction-level parallelism and reduce loop overhead.
+
+```c
+#include <arm_neon.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_neon(float a, const float *x, float *y, size_t n) {
+    size_t i = 0;
+    float32x4_t va = vdupq_n_f32(a);
+    for (; i + 16 <= n; i += 16) {
+        float32x4_t x0 = vld1q_f32(x + i);
+        float32x4_t y0 = vld1q_f32(y + i);
+        float32x4_t x1 = vld1q_f32(x + i + 4);
+        float32x4_t y1 = vld1q_f32(y + i + 4);
+        float32x4_t x2 = vld1q_f32(x + i + 8);
+        float32x4_t y2 = vld1q_f32(y + i + 8);
+        float32x4_t x3 = vld1q_f32(x + i + 12);
+        float32x4_t y3 = vld1q_f32(y + i + 12);
+        vst1q_f32(y + i, vfmaq_f32(y0, va, x0));
+        vst1q_f32(y + i + 4, vfmaq_f32(y1, va, x1));
+        vst1q_f32(y + i + 8, vfmaq_f32(y2, va, x2));
+        vst1q_f32(y + i + 12, vfmaq_f32(y3, va, x3));
+    }
+    for (; i < n; ++i) y[i] = a * x[i] + y[i];
+}
+
+int main() {
+    size_t n = 1000;
+    float* x = aligned_alloc(16, n * sizeof(float));
+    float* y = aligned_alloc(16, n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_neon(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("NEON SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_neon.c`.
+
+First, verify your system supports NEON:
+
+```bash
+grep -m1 -ow asimd /proc/cpuinfo
+```
+
+If NEON is supported, you should see `asimd` in the output. If no output appears, NEON is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -march=armv8-a+simd -o saxpy_neon saxpy_neon.c
+./saxpy_neon
+```
+
+### AVX2 (256-bit SIMD, 8 floats per operation)
+
+AVX2 doubles the SIMD width compared to NEON, processing 8 single-precision floats at a time in 256-bit registers.
+
+This wider SIMD capability enables higher data throughput for numerical and HPC workloads on Intel and AMD CPUs.
+
+```c
+#include <immintrin.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_avx2(float a, const float *x, float *y, size_t n) {
+    const __m256 va = _mm256_set1_ps(a);
+    size_t i = 0;
+    for (; i + 8 <= n; i += 8) {
+        __m256 vx = _mm256_loadu_ps(x + i);
+        __m256 vy = _mm256_loadu_ps(y + i);
+        __m256 vout = _mm256_fmadd_ps(va, vx, vy);
+        _mm256_storeu_ps(y + i, vout);
+    }
+    for (; i < n; ++i) y[i] = a * x[i] + y[i];
+}
+
+int main() {
+    size_t n = 1000;
+    float* x = aligned_alloc(32, n * sizeof(float));
+    float* y = aligned_alloc(32, n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_avx2(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("AVX2 SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_avx2.c`.
+
+First, verify your system supports AVX2:
+
+```bash
+grep -m1 -ow avx2 /proc/cpuinfo
+```
+
+If AVX2 is supported, you should see `avx2` in the output. If no output appears, AVX2 is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -mavx2 -mfma -o saxpy_avx2 saxpy_avx2.c
+./saxpy_avx2
+```
+
+### Arm SVE (hardware dependent: 4 to 16+ floats per operation)
+
+Arm SVE lets the hardware determine the register width, which can range from 128 up to 2048 bits. This means each operation can process from 4 to 64 single-precision floats at a time, depending on the implementation.
+
+Cloud instances using AWS Graviton, Google Axion, and Microsoft Azure Cobalt processors implement 128-bit SVE. The Fujitsu A64FX processor implements a vector length of 512 bits.
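+
+Because the width is chosen by the hardware, it is worth confirming what a given machine actually provides. The short sketch below uses the ACLE counting intrinsics to print the vector length at run time; it is illustrative only, and it must be built with the same `-march=armv8-a+sve` flag used later in this section and run on SVE-capable hardware.
+
+```c
+// Minimal sketch: report the SVE vector length of the machine you run it on.
+// svcntb() returns the vector length in bytes; svcntw() the number of 32-bit lanes.
+#include <arm_sve.h>
+#include <stdio.h>
+
+int main(void) {
+    printf("SVE vector length: %lu bits (%lu floats per vector)\n",
+           (unsigned long)(svcntb() * 8), (unsigned long)svcntw());
+    return 0;
+}
+```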
+
+SVE encourages writing vector-length agnostic code: the compiler automatically handles tail cases, and your code runs efficiently on any Arm SVE hardware.
+
+```c
+#include <arm_sve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_sve(float a, const float *x, float *y, size_t n) {
+    size_t i = 0;
+    svfloat32_t va = svdup_f32(a);
+    while (i < n) {
+        svbool_t pg = svwhilelt_b32((uint32_t)i, (uint32_t)n);
+        svfloat32_t vx = svld1(pg, x + i);
+        svfloat32_t vy = svld1(pg, y + i);
+        svfloat32_t vout = svmla_m(pg, vy, va, vx);
+        svst1(pg, y + i, vout);
+        i += svcntw();
+    }
+}
+
+int main() {
+    size_t n = 1000;
+    float* x = aligned_alloc(64, n * sizeof(float));
+    float* y = aligned_alloc(64, n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_sve(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("SVE SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_sve.c`.
+
+First, verify your system supports SVE:
+
+```bash
+grep -m1 -ow sve /proc/cpuinfo
+```
+
+If SVE is supported, you should see `sve` in the output. If no output appears, SVE is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -march=armv8-a+sve -o saxpy_sve saxpy_sve.c
+./saxpy_sve
+```
+
+### AVX-512 (512-bit SIMD, 16 floats per operation)
+
+AVX-512 provides the widest SIMD registers of mainstream x86 architectures, processing 16 single-precision floats per 512-bit operation.
+
+AVX-512 availability varies across x86 processors. It's found on Intel Xeon server processors and some high-end desktop processors, as well as select AMD EPYC models.
+
+For very large arrays and high-performance workloads, AVX-512 delivers extremely high throughput, with additional masking features for efficient tail processing.
+
+```c
+#include <immintrin.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_avx512(float a, const float* x, float* y, size_t n) {
+    const __m512 va = _mm512_set1_ps(a);
+    size_t i = 0;
+    for (; i + 16 <= n; i += 16) {
+        __m512 vx = _mm512_loadu_ps(x + i);
+        __m512 vy = _mm512_loadu_ps(y + i);
+        __m512 vout = _mm512_fmadd_ps(va, vx, vy);
+        _mm512_storeu_ps(y + i, vout);
+    }
+    const size_t r = n - i;
+    if (r) {
+        __mmask16 m = (1u << r) - 1u;
+        __m512 vx = _mm512_maskz_loadu_ps(m, x + i);
+        __m512 vy = _mm512_maskz_loadu_ps(m, y + i);
+        __m512 vout = _mm512_fmadd_ps(va, vx, vy);
+        _mm512_mask_storeu_ps(y + i, m, vout);
+    }
+}
+
+int main() {
+    size_t n = 1000;
+    float *x = aligned_alloc(64, n * sizeof(float));
+    float *y = aligned_alloc(64, n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_avx512(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("AVX-512 SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_avx512.c`.
+
+First, verify your system supports AVX-512:
+
+```bash
+grep -m1 -ow avx512f /proc/cpuinfo
+```
+
+If AVX-512 is supported, you should see `avx512f` in the output. If no output appears, AVX-512 is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -mavx512f -o saxpy_avx512 saxpy_avx512.c
+./saxpy_avx512
+```
+
+### Summary
+
+Wider data lanes mean each operation processes more elements, offering higher throughput on supported hardware. However, actual performance depends on factors like memory bandwidth, the number of execution units, and workload characteristics.
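+
+A quick way to see these effects on your own machine is to let the compiler auto-vectorize the plain C version and inspect what it emits. The commands below are a sketch assuming GCC on an SVE-capable Arm system; swap the `-march` flag for `-mavx2` or `-mavx512f` when compiling on x86 and look for the corresponding instructions.
+
+```bash
+# Report which loops were vectorized and emit assembly for inspection
+gcc -O3 -march=armv8-a+sve -fopt-info-vec -S -o saxpy_plain.s saxpy_plain.c
+
+# Look for SVE predicated-loop and fused multiply-add instructions
+grep -E "whilelo|fmla" saxpy_plain.s | head
+```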
+ +Processors also improve performance by implementing multiple SIMD execution units rather than just making vectors wider. For example, Arm Neoverse V2 has 4 SIMD units while Neoverse N2 has 2 SIMD units. Modern CPUs often combine both approaches (wider vectors and multiple execution units) to maximize parallel processing capability. + +Each vector extension requires different intrinsics, compilation flags, and programming approaches. While x86 and Arm vector extensions serve similar purposes and achieve comparable performance gains, you will need to understand the options and details to create portable code. + +You should also look for existing libraries that already work across vector extensions before you get too deep into code porting. This is often a good way to leverage the available SIMD capabilities on your target hardware. diff --git a/content/learning-paths/cross-platform/vectorization-comparison/_index.md b/content/learning-paths/cross-platform/vectorization-comparison/_index.md new file mode 100644 index 0000000000..d2a54fe293 --- /dev/null +++ b/content/learning-paths/cross-platform/vectorization-comparison/_index.md @@ -0,0 +1,85 @@ +--- +title: "Mapping x86 vector extensions to Arm: a migration overview" + +minutes_to_complete: 30 + +draft: true +cascade: + draft: true + +who_is_this_for: This is an advanced topic for software developers who want to learn how to migrate vectorized code to Arm. + +learning_objectives: + - Understand how Arm vector extensions, including NEON, Scalable Vector Extension (SVE), and Scalable Matrix Extension (SME) map to vector extensions from other architectures. + - Start planning how to migrate your SIMD code to the Arm architecture. + +prerequisites: + - Familiarity with vector extensions, SIMD programming, and compiler intrinsics. + - Access to Linux systems with NEON and SVE support. 
+ +author: + - Jason Andrews + +### Tags +skilllevels: Advanced +subjects: Performance and Architecture +armips: + - Neoverse +operatingsystems: + - Linux +tools_software_languages: + - GCC + - Clang + +shared_path: true +shared_between: + - servers-and-cloud-computing + - laptops-and-desktops + - mobile-graphics-and-gaming + - automotive + +further_reading: + - resource: + title: SVE Programming Examples + link: https://developer.arm.com/documentation/dai0548/latest + type: documentation + - resource: + title: Port Code to Arm Scalable Vector Extension (SVE) + link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/sve + type: website + - resource: + title: Introducing the Scalable Matrix Extension for the Armv9-A Architecture + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture + type: website + - resource: + title: Arm Scalable Matrix Extension (SME) Introduction (Part 1) + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction + type: blog + - resource: + title: Build adaptive libraries with multiversioning + link: https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/ + type: website + - resource: + title: SME Programmer's Guide + link: https://developer.arm.com/documentation/109246/latest + type: documentation + - resource: + title: Compiler Intrinsics + link: https://en.wikipedia.org/wiki/Intrinsic_function + type: website + - resource: + title: ACLE - Arm C Language Extension + link: https://github.com/ARM-software/acle + type: website + - resource: + title: Application Binary Interface for the Arm Architecture + link: https://github.com/ARM-software/abi-aa + type: website + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/cross-platform/vectorization-comparison/_next-steps.md b/content/learning-paths/cross-platform/vectorization-comparison/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/cross-platform/vectorization-comparison/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +---