diff --git a/content/learning-paths/automotive/_index.md b/content/learning-paths/automotive/_index.md index d58b01c2eb..97fb52787c 100644 --- a/content/learning-paths/automotive/_index.md +++ b/content/learning-paths/automotive/_index.md @@ -12,20 +12,24 @@ title: Automotive weight: 4 subjects_filter: - Containers and Virtualization: 3 -- Performance and Architecture: 2 +- Performance and Architecture: 5 operatingsystems_filter: - Baremetal: 1 -- Linux: 4 +- Linux: 7 +- macOS: 1 - RTOS: 1 tools_software_languages_filter: -- Automotive: 1 -- C: 1 +- Arm Development Studio: 1 +- Arm Zena CSS: 1 +- C: 2 +- C++: 1 +- Clang: 2 - DDS: 1 - Docker: 2 +- GCC: 2 - Python: 2 - Raspberry Pi: 1 -- ROS 2: 1 -- ROS2: 2 +- ROS 2: 3 - Rust: 1 - Zenoh: 1 --- diff --git a/content/learning-paths/cross-platform/simd-loops/1-about.md b/content/learning-paths/cross-platform/simd-loops/1-about.md index 6d798ad108..8081261413 100644 --- a/content/learning-paths/cross-platform/simd-loops/1-about.md +++ b/content/learning-paths/cross-platform/simd-loops/1-about.md @@ -1,70 +1,35 @@ --- -title: About single instruction, multiple data (SIMD) loops -weight: 3 +title: About Single Instruction, Multiple Data loops +weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -Writing high-performance software for Arm processors often involves delving into -SIMD technologies. For many developers, that journey started with NEON, a -familiar, fixed-width vector extension that has been around for many years. But as -Arm architectures continue to evolve, so do their SIMD technologies. +## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs -Enter the world of Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME): two powerful, scalable vector extensions designed for modern -workloads. Unlike NEON, they are not just wider; they are fundamentally different. These -extensions introduce new instructions, more flexible programming models, and -support for concepts like predication, scalable vectors, and streaming modes. -However, they also come with a learning curve. +Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with NEON, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you. -That is where [SIMD Loops](https://gitlab.arm.com/architecture/simd-loops) becomes a valuable resource, enabling you to quickly and effectively learn how to write high-performance SIMD code. +This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike NEON, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match. -SIMD Loops is designed to help -you learn how to write SVE and SME code. It is a collection -of self-contained, real-world loop kernels written in a mix of C, Arm C Language Extensions (ACLE) -intrinsics, and inline assembly. These kernels target tasks ranging from simple arithmetic -to matrix multiplication, sorting, and string processing. You can compile them, -run them, step through them, and use them as a foundation for your own SIMD -work. 
+## What is the SIMD Loops project? -If you are familiar with NEON intrinsics, you can use SIMD Loops to learn and explore SVE and SME. +The SIMD Loops project offers a hands-on way to climb the learning curve. It is a public codebase of self-contained, real loop kernels written in C, Arm C Language Extensions (ACLE) intrinsics, and selected inline assembly. Kernels span tasks such as matrix multiply, sorting, and string processing. You can build them, run them, step through them, and adapt them for your own SIMD workloads. -## What is SIMD Loops? +Visit the [SIMD Loops Repo](https://gitlab.arm.com/architecture/simd-loops). -SIMD Loops is an open-source -project, licensed under BSD 3-Clause, built to help you learn how to write SIMD code for modern Arm -architectures, specifically using SVE and SME. -It is designed for programmers who already know -their way around NEON intrinsics but are now facing the more powerful and -complex world of SVE and SME. +This open-source project (BSD-3-Clause) teaches SIMD development on modern Arm CPUs with SVE, SVE2, SME, and SME2. It’s aimed at developers who know NEON intrinsics and want to explore newer extensions. The goal of SIMD Loops is to provide working, readable examples that demonstrate how to use the full range of features available in SVE, SVE2, and SME2. Each example is a self-contained loop kernel - a small piece of code that performs a specific task like matrix multiplication, vector reduction, histogram, or memory copy. These examples show how that task can be implemented across different vector instruction sets. -The goal of SIMD Loops is to provide working, readable examples that demonstrate -how to use the full range of features available in SVE, SVE2, and SME2. Each -example is a self-contained loop kernel, a small piece of code that performs -a specific task like matrix multiplication, vector reduction, histogram, or -memory copy. These examples show how that task can be implemented across different -vector instruction sets. - -Unlike a cookbook that tries to provide a recipe for every problem, SIMD Loops -takes the opposite approach. It aims to showcase the architecture rather than -the problem. The loop kernels are chosen to be realistic and meaningful, but the -main goal is to demonstrate how specific features and instructions work in -practice. If you are trying to understand scalability, predication, -gather/scatter, streaming mode, ZA storage, compact instructions, or the -mechanics of matrix tiles, this is where you will see them in action. +Unlike a cookbook that attempts to provide a recipe for every problem, SIMD Loops takes the opposite approach. It aims to showcase the architecture rather than the problem itself. The loop kernels are chosen to be realistic and meaningful, but the main goal is to demonstrate how specific features and instructions work in practice. If you are trying to understand scalability, predication, gather/scatter, streaming mode, ZA storage, compact instructions, or the mechanics of matrix tiles, this is where you can see them in action. 
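+
+For a flavor of the style these kernels teach, here is a minimal vector-length-agnostic SVE loop that uses predication. This is a generic sketch for illustration, not a kernel copied from the repository:
+
+```c
+#include <arm_sve.h>
+
+// Scale an array in place: data[i] *= factor.
+// svwhilelt builds a predicate covering only the valid lanes, so the
+// same binary handles the loop tail and runs at any vector length.
+void scale_f32(float *data, int64_t n, float factor) {
+    for (int64_t i = 0; i < n; i += svcntw()) {
+        svbool_t pg = svwhilelt_b32_s64(i, n);
+        svfloat32_t v = svld1_f32(pg, data + i);
+        v = svmul_n_f32_x(pg, v, factor);
+        svst1_f32(pg, data + i, v);
+    }
+}
+```
+
+Lanes outside the predicate are left untouched, which is how SVE removes the scalar tail loop that fixed-width SIMD code usually needs.
+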
The project includes:
-- Dozens of numbered loop kernels, each focused on a specific feature or pattern
+- Many numbered loop kernels, each focused on a specific feature or pattern
- Reference C implementations to establish expected behavior
- Inline assembly and/or intrinsics for scalar, NEON, SVE, SVE2, SVE2.1, SME2, and SME2.1
- Build support for different instruction sets, with runtime validation
- A simple command-line runner to execute any loop interactively
- Optional standalone binaries for bare-metal and simulator use

-You do not need to worry about auto-vectorization, compiler flags, or tooling
-quirks. Each loop is hand-written and annotated to make the use of SIMD features
-clear. The intent is that you can study, modify, and run each loop as a learning
-exercise, and use the project as a foundation for your own exploration of
-Arm’s vector extensions.
+You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
diff --git a/content/learning-paths/cross-platform/simd-loops/2-using.md b/content/learning-paths/cross-platform/simd-loops/2-using.md
index 86328c023d..3772a55822 100644
--- a/content/learning-paths/cross-platform/simd-loops/2-using.md
+++ b/content/learning-paths/cross-platform/simd-loops/2-using.md
@@ -1,45 +1,74 @@
---
title: Using SIMD Loops
-weight: 4
+weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

-To get started, clone the SIMD Loops project and change current directory:
+## Set up your development environment
+
+To get started, clone the SIMD Loops project and change to the project directory:

```bash
git clone https://gitlab.arm.com/architecture/simd-loops simd-loops.git
cd simd-loops.git
```

+Confirm that you are using an Arm machine:
+
+```bash
+uname -m
+```
+
+Expected output on Linux:
+
+```output
+aarch64
+```
+
+Expected output on macOS:
+
+```output
+arm64
+```
+
## SIMD Loops structure

-In the SIMD Loops project, the
-source code for the loops is organized under the loops directory. The complete
-list of loops is documented in the loops.inc file, which includes a brief
-description and the purpose of each loop. Every loop is associated with a
-uniquely named source file following the naming pattern `loop_<num>.c`, where
-`<num>` represents the loop number.
+In the SIMD Loops project, the source code for the loops is organized under the `loops` directory. The complete list of loops is documented in the `loops.inc` file, which includes a brief description and the purpose of each loop. Every loop is associated with a uniquely named source file following the pattern `loop_<num>.c`, where `<num>` represents the loop number.
+
+A subset of the `loops.inc` file is below:
+
+```output
+LOOP(001, "FP32 inner product", "Use of fp32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(002, "UINT32 inner product", "Use of u32 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(003, "FP64 inner product", "Use of fp64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(004, "UINT64 inner product", "Use of u64 MLA instruction", STREAMING_COMPATIBLE)
+LOOP(005, "strlen short strings", "Use of FF and NF loads instructions")
+LOOP(006, "strlen long strings", "Use of FF and NF loads instructions")
+LOOP(008, "Precise fp64 add reduction", "Use of FADDA instructions")
+LOOP(009, "Pointer chasing", "Use of CTERM and BRK instructions")
+LOOP(010, "Conditional reduction (fp)", "Use of CLAST (SIMD&FP scalar) instructions", STREAMING_COMPATIBLE)
+```

A loop is structured as follows:

-```C
+```c
// Includes and loop_<num>_data structure definition

#if defined(HAVE_NATIVE) || defined(HAVE_AUTOVEC)
-// C code
+// C reference or auto-vectorized version
void inner_loop_<num>(struct loop_<num>_data *data) { ... }

#if defined(HAVE_xxx_INTRINSICS)
-// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON) versions
+// Intrinsics versions: xxx = SME, SVE, or SIMD (NEON)
void inner_loop_<num>(struct loop_<num>_data *data) { ... }

#elif defined(<FEATURE>)
- // Hand-written inline assembly :
+// Hand-written inline assembly
// <FEATURE> = __ARM_FEATURE_SME2p1, __ARM_FEATURE_SME2, __ARM_FEATURE_SVE2p1,
// __ARM_FEATURE_SVE2, __ARM_FEATURE_SVE, or __ARM_NEON
void inner_loop_<num>(struct loop_<num>_data *data) { ... }
@@ -50,28 +79,69 @@ void inner_loop_<num>(struct loop_<num>_data *data) { ... }

#endif

-// Main of loop: Buffers allocations, loop function call, result functional checking
+// Main of loop: buffer allocation, loop function call, result checking
+```
+
+Each loop is implemented in several SIMD extension variants. Conditional compilation selects one of the implementations for the `inner_loop_<num>` function.
+
+The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.
+
+When SIMD ACLE is supported (SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
+
+The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.
+
+At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.
+
+```console
+make
+```
+
+With no target specified, the list of targets is printed:
+
+```output
+all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
+```
+
+Build all loops for all targets:
+
+```console
+make all
+```
+
+Build all loops for a single target, such as NEON:
+
+```console
+make neon
+```
+
+As a result of the build, two types of binaries are generated.
+
+The first is a single executable named `simd_loops`, which includes all loop implementations.
+
+Select a specific loop by passing parameters to the program.
For example, to run loop 1 for 5 iterations using the NEON target:
+
+```console
+build/neon/bin/simd_loops -k 1 -n 5
+```
+
+Example output:
+
+```output
+Loop 001 - FP32 inner product
+ - Purpose: Use of fp32 MLA instruction
+ - Checksum correct.
```

-Each loop is implemented in several SIMD extension variants, and conditional
-compilation is used to select one of the optimizations for the
-`inner_loop_<num>` function. The native C implementation is written first, and
-it can be generated either when building natively (HAVE_NATIVE) or through
-compiler auto-vectorization (HAVE_AUTOVEC). When SIMD ACLE is supported (e.g.,
-SME, SVE, or NEON), the code is compiled using high-level intrinsics. If ACLE
-support is not available, the build process falls back to handwritten inline
-assembly targeting one of the available SIMD extensions, such as SME2.1, SME2,
-SVE2.1, SVE2, and others. The overall code structure also includes setup and
-cleanup code in the main function, where memory buffers are allocated, the
-selected loop kernel is executed, and results are verified for correctness.
-
-At compile time, you can select which loop optimization to compile, whether it
-is based on SME or SVE intrinsics, or one of the available inline assembly
-variants (`make scalar neon sve2 sme2 sve2p1 sme2p1 sve_intrinsics
-sme_intrinsics` ...).
-
-As the result of the build, two types of binaries are generated. The first is a
-single executable named `simd_loops`, which includes all the loop
-implementations. A specific loop can be selected by passing parameters to the
-program (e.g., `simd_loops -k <loop number> -n <iterations>`). The second type consists
-of individual standalone binaries, each corresponding to a specific loop.
+The second type consists of standalone binaries, one for each loop.
+
+To run loop 1 as a standalone binary:
+
+```console
+build/neon/standalone/bin/loop_001.elf
+```
+
+Example output:
+
+```output
+ - Checksum correct.
+```
diff --git a/content/learning-paths/cross-platform/simd-loops/3-example.md b/content/learning-paths/cross-platform/simd-loops/3-example.md
index fa3b614a40..54e1512df9 100644
--- a/content/learning-paths/cross-platform/simd-loops/3-example.md
+++ b/content/learning-paths/cross-platform/simd-loops/3-example.md
@@ -1,31 +1,32 @@
---
title: Code example
-weight: 5
+weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

-To illustrate the structure and design principles of simd-loops, consider loop
-202 as an example. `inner_loop_202` is defined at lines 69-79 in file
-`loops/loops_202.c` and calls the `matmul_fp32` routine defined in
-`matmul_fp32.c`.
+## Overview: loop 202 matrix multiplication example

-Open `loops/matmul_fp32.c`.
+To illustrate the structure and design principles of SIMD Loops, consider loop 202 as an example.

-This loop implements a single precision floating point matrix multiplication of
-the form:
+Use a text editor to open `loops/loop_202.c`.

-`C[M x N] = A[M x K] x B[K x N]`
+The function `inner_loop_202()` is defined around lines 60–70 in `loops/loop_202.c` and calls the `matmul_fp32` routine defined in `loops/matmul_fp32.c`.

-A matrix multiplication can be understood in two equivalent ways:
-- As the dot product between each row of matrix `A` and each column of matrix `B`.
-- As the sum of outer products between the columns of `A` and the rows of `B`.
+Open `loops/matmul_fp32.c` in your editor.
-## Data structure +This loop implements single-precision floating-point matrix multiplication of the form: -The loop begins by defining the data structure, which captures the matrix -dimensions (`M`, `K`, `N`) along with input and output buffers: +`C[M × N] = A[M × K] × B[K × N]` + +You can view matrix multiplication in two equivalent ways: +- As the dot product between each row of `A` and each column of `B` +- As the sum of outer products between the columns of `A` and the rows of `B` + +## Data structure definition + +The loop begins by defining a data structure that captures the matrix dimensions (`M`, `K`, `N`) along with input and output buffers: ```C struct loop_202_data { @@ -39,35 +40,27 @@ struct loop_202_data { ``` For this loop: -- The first input matrix (A) is stored in column-major format in memory. -- The second input matrix (b) is stored in row-major format in memory. -- None of the memory area designated by `a`, `b` anf `c` alias (i.e. they - overlap in some way) --- as indicated by the `restrict` keyword. +- Matrix `a` is stored in column-major order +- Matrix `b` is stored in row-major order +- The memory regions referenced by `a`, `b`, and `c` do not alias, as indicated by the `restrict` keyword -This layout choice helps optimize memory access patterns for all the targeted -SIMD architectures. +This layout helps optimize memory access patterns across the targeted SIMD architectures. -## Loop attributes +## Loop attributes by architecture -Next, the loop attributes are specified depending on the target architecture: -- For SME targets, the function `inner_loop_202` must be invoked with the - `__arm_streaming` attribute, using a shared `ZA` register context - (`__arm_inout("za")`). There attributes are wrapped in the LOOP_ATTR macro. -- For SVE or NEON targets, no additional attributes are required. +Loop attributes are specified per target architecture: +- **SME targets** — `inner_loop_202` is invoked with the `__arm_streaming` attribute and uses a shared `ZA` register context (`__arm_inout("za")`). These attributes are wrapped in the `LOOP_ATTR` macro +- **SVE or NEON targets** — no additional attributes are required -This design enables portability across different SIMD extensions. +This design enables portability across SIMD extensions. -## Function implementation +## Function implementation in loops/matmul_fp32.c -The `matmul_fp32` function from file `loops/matmul_fp32.c` provides several -optimizations of the single-precision floating-point matrix multiplication, -including the ACLE intrinsics-based code, and the assembly hand-optimized code. +`loops/matmul_fp32.c` provides several optimizations of matrix multiplication, including ACLE intrinsics and hand-optimized assembly. ### Scalar code -A scalar C implementation is provided at lines 40-52. This version follows the -dot-product formulation of matrix multiplication, serving both as a functional -reference and a baseline for auto-vectorization: +A scalar C implementation appears around lines 40–52. It follows the dot-product formulation and serves as both a functional reference and an auto-vectorization baseline: ```C { line_numbers="true", line_start="40" } for (uint64_t x = 0; x < m; x++) { @@ -85,16 +78,13 @@ reference and a baseline for auto-vectorization: } ``` -### SVE optimized code +### SVE-optimized code -The SVE implementation uses the indexed floating-point multiply-accumulate -(`fmla`) instruction to optimize the matrix multiplication operation. 
In this -formulation, the outer-product is decomposed into multiple indexed -multiplication steps, with results accumulated directly into `Z` registers. +The SVE version uses indexed floating-point multiply–accumulate (`fmla`) to optimize the matrix multiplication operation. The outer product is decomposed into indexed multiply steps, and results accumulate directly in `Z` registers. -In the intrinsic version (lines 167-210), the innermost loop is structured as follows: +In the intrinsics version (lines 167–210), the innermost loop is structured as follows: -```C { line_numbers = "true", line_start="167"} +```C { line_numbers="true", line_start="167" } for (m_idx = 0; m_idx < m; m_idx += 8) { for (n_idx = 0; n_idx < n; n_idx += svcntw() * 2) { ZERO_PAIR(0); @@ -141,36 +131,23 @@ In the intrinsic version (lines 167-210), the innermost loop is structured as fo } ``` -At the beginning of the loop, the accumulators (`Z` registers) are explicitly -initialized to zero. This is achieved using `svdup` intrinsic (or its equivalent -`dup` assembly instruction), encapsulated in the `ZERO_PAIR` macro. +At the beginning of the loop, the accumulators (`Z` registers) are zeroed using `svdup` (or `dup` in assembly), encapsulated in the `ZERO_PAIR` macro. Within each iteration over the `K` dimension: -- 128 bits (four consecutive floating point values) are loaded from the matrix - `A`, using the load replicate `svld1rq` intrinsics (or `ld1rqw` in assembly) - in `LOADA_PAIR` macro. -- Two consecutive vectors are loaded from matrix `B`, using the SVE load - instructions, called by the `LOADB_PAIR` macro. -- A sequence of indexed multiply-accumulate operations is performed, computing - the product of each element from `A` with the vectors from `B`. -- The results are accumulated across the 16 `Z` register accumulators, - progressively building the partial results of the matrix multiplication. +- 128 bits (four consecutive floating-point values) are loaded from `A` using replicate loads `svld1rq` (or `ld1rqw`), through `LOADA_PAIR` +- Two vectors are loaded from `B` using SVE vector loads, using `LOADB_PAIR` +- Indexed `fmla` operations compute element–vector products and accumulate into 16 `Z` register accumulators +- Partial sums build up the output tile -After completing all iterations across the `K` dimension, the accumulated -results in the `Z` registers are stored back to memory. The `STORE_PAIR` macro -writes the values into the corresponding locations of the output matrix `C`. +After all `K` iterations, results in the `Z` registers are stored to `C` using the `STORE_PAIR` macro. -The equivalent SVE hand-optimized assembly code is written at lines 478-598. +The equivalent SVE hand-optimized assembly appears around lines 478–598. -This loop showcases how SVE registers and indexed `fmla` instructions enable -efficient decomposition of the outer-product formulation into parallel, -vectorized accumulation steps. +This loop shows how SVE registers and indexed `fmla` enable efficient decomposition of the outer-product formulation into parallel, vectorized accumulation. -For more details on SVE/SVE2 instruction semantics, optimization guidelines and -other documents refer to the [Scalable Vector Extensions -resources](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions). +For SVE/SVE2 semantics and optimization guidance, see the [Scalable Vector Extensions resources](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions). 
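+
+To make the "indexed" aspect concrete, here is a minimal sketch of one indexed-FMLA step. The function and parameter names are assumptions for illustration, not the repository's actual macros:
+
+```c
+#include <arm_sve.h>
+
+// One indexed multiply-accumulate step (illustrative sketch).
+// svld1rq replicates a 128-bit quadword of A across the whole Z
+// register; svmla_lane then multiplies a full vector of B by lane 0
+// of each quadword and adds the products into the accumulator.
+svfloat32_t fmla_step(svbool_t pg, svfloat32_t acc,
+                      const float *a_quad, const float *b_vec) {
+    svfloat32_t va = svld1rq_f32(svptrue_b32(), a_quad); // 4 floats of A, replicated
+    svfloat32_t vb = svld1_f32(pg, b_vec);               // one vector of B
+    return svmla_lane_f32(acc, vb, va, 0);               // acc += vb * va[lane 0]
+}
+```
+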
-### SME2 optimized code
+### SME2-optimized code

The SME2 implementation leverages the outer-product formulation of the matrix
multiplication function, utilizing the `fmopa` SME instruction to perform the
@@ -178,7 +155,7 @@ outer-product and accumulate partial results in `ZA` tiles.

A snippet of the loop is shown below:

-```C { line_numbers = "true", line_start="78"}
+```C { line_numbers="true", line_start="78" }
#if defined(__ARM_FEATURE_SME2p1)
  svzero_za();
#endif
@@ -232,50 +209,28 @@ A snippet of the loop is shown below:
}
```

-Within the SME2 intrinsics code (lines 91-106), the innermost loop iterates across
-the `K` dimension - corresponding to the columns of matrix `A` and the rows of
-matrix `B`.
+Within the SME2 intrinsics code (lines 91–106), the innermost loop iterates across
+the `K` dimension - columns of `A` and rows of `B`.

In each iteration:
-- Two consecutive vectors are loaded from `A` and two consecutive vectors are
-  loaded from `B` (`vec_a`, and `vec_b`), using the multi-vector load
-  instructions.
-- The `fmopa` instruction, encapsulated within the `MOPA_TILE` macro, computes
-  the outer product of the input vectors.
-- The results are accumulated into the four 32-bit `ZA` tiles.
+- Two consecutive vectors are loaded from `A` and two from `B` (`vec_a*`, `vec_b*`) using multi-vector load intrinsics
+- `fmopa`, wrapped by `MOPA_TILE`, computes the outer product
+- Partial results accumulate in four 32-bit `ZA` tiles

-After all iterations over K dimension, the accumulated results are stored back
-to memory through a store loop at lines 111-124:
+After all `K` iterations, results are written back in a store loop (lines 111–124).

-During this phase, four rows of `ZA` tiles are read out into four `Z` vectors
-using the `svread_hor_za8_u8_vg4` intrinsic (or the equivalent `mova` assembly
-instruction). The vectors are then stored into the output buffer with SME
-multi-vector `st1w` store instructions, wrapped in the `STORE_PAIR` macro.
+During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). The vectors are then stored to the output buffer with SME multi-vector `st1w` stores, wrapped in the `STORE_PAIR` macro.

-The equivalent SME2 hand-optimized code is at lines 229-340.
+The equivalent SME2 hand-optimized assembly appears around lines 229–340.

-For more details on instruction semantics, and SME/SME2 optimization guidelines,
-refer to the official [SME Programmer's
-Guide](https://developer.arm.com/documentation/109246/latest/).
+For instruction semantics and SME/SME2 optimization guidance, see the [SME Programmer's Guide](https://developer.arm.com/documentation/109246/latest/).

## Other optimizations

-Beyond the SME2 and SVE2 implementations shown above, this loop also includes several
-alternative optimized versions, each leveraging architecture-specific features.
-
-### NEON
-
-The neon version (lines 612-710) relies on multiple structure load/store
-combined with indexed `fmla` instructions to vectorize the matrix multiplication
-operation.
-
-### SVE2.1
+Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:

-The SVE2.1 implementation (lines 355-462) extends the base SVE approach by
-utilizing multi-vector load and store instructions.
+- **NEON**: the NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
-### SME2.1 +- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores. -The SME2.1 leverages the `movaz` instruction / `svreadz_hor_za8_u8_vg4` -intrinsic to simultaneously reinitialize `ZA` tile accumulators while moving -data out to registers. +- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers. diff --git a/content/learning-paths/cross-platform/simd-loops/4-conclusion.md b/content/learning-paths/cross-platform/simd-loops/4-conclusion.md index d1e85d10d0..ccb0de356c 100644 --- a/content/learning-paths/cross-platform/simd-loops/4-conclusion.md +++ b/content/learning-paths/cross-platform/simd-loops/4-conclusion.md @@ -1,26 +1,20 @@ --- -title: Conclusion -weight: 6 +title: How to learn with SIMD Loops +weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -SIMD Loops is an invaluable -resource for developers looking to learn or master the intricacies of SVE and -SME on modern Arm architectures. By providing practical, hands-on examples, it -bridges the gap between the architecture specification and real-world -application. Whether you're transitioning from NEON or starting fresh with SVE -and SME, SIMD Loops offers a comprehensive toolkit to enhance your understanding -and proficiency. +## Bridging the gap between specs and real code -With its extensive collection of loop kernels, detailed documentation, and -flexible build options, SIMD Loops empowers you to explore -and leverage the full potential of Arm's advanced vector extensions. Dive into -the project, experiment with the examples, and take your high-performance coding -skills for Arm to the next level. +SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications. -For more information and to get started, visit the GitLab project and refer -to the -[README.md](https://gitlab.arm.com/architecture/simd-loops/-/blob/main/README.md) -for instructions on building and running the code. +Whether you are moving from NEON or starting directly with SVE and SME, the project offers: +- A broad catalog of kernels that highlight specific features (predication, VLA programming, gather/scatter, streaming mode, ZA tiles) +- Clear, readable implementations in C, ACLE intrinsics, and selected inline assembly +- Flexible build targets and a simple runner to execute and validate loops + +Use the repository to explore, modify, and benchmark kernels so you can understand tradeoffs and apply the patterns to your own workloads. + +For more information and to get started, visit the GitLab project and see the [README.md](https://gitlab.arm.com/architecture/simd-loops/-/blob/main/README.md) for the latest instructions on building and running the code. 
diff --git a/content/learning-paths/cross-platform/simd-loops/_index.md b/content/learning-paths/cross-platform/simd-loops/_index.md
index d10ffa777f..d9cfa835ba 100644
--- a/content/learning-paths/cross-platform/simd-loops/_index.md
+++ b/content/learning-paths/cross-platform/simd-loops/_index.md
@@ -3,18 +3,19 @@ title: "Code kata: perfect your SVE and SME skills with SIMD Loops"

minutes_to_complete: 30

-draft: true
-cascade:
-  draft: true
-
-who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2 and SME2 to improve software performance on Arm processors.
+who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.

learning_objectives:
-  - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME).
+  - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
+  - Describe what SIMD Loops contains and how kernels are organized across scalar, NEON, SVE, SVE2, and SME2 variants
+  - Build and run a selected kernel with the provided runner and validate correctness against the C reference
+  - Choose the appropriate build target to compare NEON, SVE/SVE2, and SME2 implementations
+
prerequisites:
  - An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers.
  - Some familiarity with SIMD programming and NEON intrinsics.
+  - Recent toolchains that support SVE/SME (GCC 13+ or Clang 16+ recommended)

author:
  - Alejandro Martinez Vicente
@@ -29,9 +30,10 @@ operatingsystems:
  - Linux
  - macOS
tools_software_languages:
-  - GCC
-  - Clang
-  - FVP
+  - C
+  - C++
+  - GCC
+  - Clang

shared_path: true
shared_between:
diff --git a/content/learning-paths/cross-platform/vectorization-comparison/1-vectorization.md b/content/learning-paths/cross-platform/vectorization-comparison/1-vectorization.md
new file mode 100644
index 0000000000..d17480fcbc
--- /dev/null
+++ b/content/learning-paths/cross-platform/vectorization-comparison/1-vectorization.md
@@ -0,0 +1,145 @@
+---
+title: Migrating SIMD code to the Arm architecture
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Vectorization on x86 vs. Arm
+
+Migrating SIMD (Single Instruction, Multiple Data) code from x86 extensions to Arm extensions is an important task for software developers aiming to optimize performance on Arm platforms.
+
+Understanding how x86 instruction sets like SSE, AVX, and AMX map to Arm's NEON, SVE, and SME extensions is essential for ensuring portability and high performance. This Learning Path provides an overview to help you design a migration plan, leveraging Arm features such as scalable vector lengths and advanced matrix operations, to effectively adapt your code.
+
+Vectorization is a key optimization strategy where one instruction processes multiple data elements simultaneously. It drives performance in HPC, AI/ML, signal processing, and data analytics.
+
+Both x86 and Arm processors offer rich SIMD capabilities, but they differ in philosophy and design. The x86 architecture provides fixed-width vector units of 128, 256, and 512 bits.
The Arm architecture offers a mix of fixed-width vectors with NEON and scalable vectors with SVE and SME, ranging from 128 to 2048 bits.
+
+If you are interested in migrating SIMD software to Arm, understanding these differences ensures portable, high-performance code.
+
+## Arm vector and matrix extensions
+
+### NEON
+
+NEON is a 128-bit SIMD extension available across all Armv8 cores, including both mobile and Neoverse platforms. It is particularly well-suited for multimedia processing, digital signal processing (DSP), and packet processing workloads. Conceptually, NEON is equivalent to x86 SSE or AVX-128, making it the primary target for migrating SSE workloads. Compiler support for auto-vectorization to NEON is mature, simplifying the migration process for developers.
+
+### Scalable Vector Extension (SVE)
+
+SVE introduces a revolutionary approach to SIMD with its vector-length agnostic (VLA) design. Registers in SVE can range from 128 to 2048 bits, with the exact width determined by the hardware implementation in multiples of 128 bits. This flexibility allows the same binary to run efficiently across different hardware generations. SVE also features advanced capabilities like per-element predication, which eliminates branch divergence, and native support for gather/scatter operations, enabling efficient handling of irregular memory accesses. While SVE is ideal for high-performance computing (HPC) and future-proof portability, developers must adapt to its unique programming model, which differs significantly from fixed-width SIMD paradigms. SVE is most similar to AVX-512 on x86, offering greater portability and scalability.
+
+### Scalable Matrix Extension (SME)
+
+SME is designed to accelerate matrix multiplication and is similar to AMX. Unlike AMX, which relies on dot-product-based operations, SME employs outer-product-based operations, providing greater flexibility for custom AI and HPC kernels. SME integrates seamlessly with SVE, utilizing scalable tiles and a streaming mode to optimize performance. It is particularly well-suited for AI training and inference workloads, as well as dense linear algebra in HPC applications.
+
+## x86 vector and matrix extensions
+
+### Streaming SIMD Extensions (SSE)
+
+The SSE instruction set provides 128-bit XMM registers and supports both integer and floating-point SIMD operations. Despite being an older technology, SSE remains a baseline for many libraries due to its widespread adoption.
+
+However, its fixed-width design and limited throughput make it less competitive compared to more modern extensions like AVX. When migrating code from SSE to Arm, developers will find that SSE maps well to Arm NEON, enabling a relatively straightforward transition.
+
+### Advanced Vector Extensions (AVX)
+
+The AVX extensions introduce 256-bit YMM registers with AVX and 512-bit ZMM registers with AVX-512, offering significant performance improvements over SSE. Key features include Fused Multiply-Add (FMA) operations, masked operations in AVX-512, and VEX/EVEX encodings that allow for more operands and flexibility.
+
+Migrating AVX code to Arm requires careful consideration, as AVX maps to NEON for up to 128-bit operations or to SVE for scalable-width operations. Since SVE is vector-length agnostic, porting AVX code often involves refactoring to accommodate this new paradigm.
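+
+The following sketch shows the shape of that refactoring for a simple element-wise add. The function name `vec_add` and the overall structure are illustrative assumptions, not code from any particular library:
+
+```c
+#include <stddef.h>
+
+// y[i] += x[i] in two styles: AVX2 hard-codes 8 lanes and needs a
+// scalar tail loop, while SVE is vector-length agnostic and lets the
+// predicate cover the tail.
+#if defined(__AVX2__)
+#include <immintrin.h>
+void vec_add(float *y, const float *x, size_t n) {
+    size_t i = 0;
+    for (; i + 8 <= n; i += 8) {
+        __m256 vy = _mm256_loadu_ps(y + i);
+        __m256 vx = _mm256_loadu_ps(x + i);
+        _mm256_storeu_ps(y + i, _mm256_add_ps(vy, vx));
+    }
+    for (; i < n; ++i) y[i] += x[i];  // scalar tail
+}
+#elif defined(__ARM_FEATURE_SVE)
+#include <arm_sve.h>
+void vec_add(float *y, const float *x, size_t n) {
+    for (size_t i = 0; i < n; i += svcntw()) {
+        svbool_t pg = svwhilelt_b32_u64(i, n);
+        svfloat32_t vy = svld1_f32(pg, y + i);
+        svfloat32_t vx = svld1_f32(pg, x + i);
+        svst1_f32(pg, y + i, svadd_f32_x(pg, vy, vx));
+    }
+}
+#endif
+```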
+
+### Advanced Matrix Extensions (AMX)
+
+AMX is a specialized instruction set designed for accelerating matrix operations using dedicated matrix-tile registers, effectively treating 2D arrays as first-class citizens. It is particularly well-suited for AI workloads, such as convolutions and General Matrix Multiplications (GEMMs).
+
+When migrating AMX workloads to Arm, you can leverage Arm SME, which conceptually aligns with AMX but employs a different programming model based on outer products rather than dot products. This difference requires you to adapt your code to fully exploit SME's capabilities.
+
+## Comparison tables
+
+### SSE vs. NEON
+
+| Feature | SSE | NEON |
+|-----------------------|---------------------------------------------------------------|----------------------------------------------------------------|
+| **Register width** | 128-bit (XMM registers) | 128-bit (Q registers) |
+| **Vector length model**| Fixed 128 bits | Fixed 128 bits |
+| **Predication / masking**| Minimal predication; SSE lacks full mask registers | Conditional select instructions; no hardware mask registers |
+| **Gather / Scatter** | No native gather/scatter (introduced in AVX2 and later) | No native gather/scatter; requires software emulation |
+| **Instruction set scope**| Arithmetic, logical, shuffle, blend, conversion, basic SIMD | Arithmetic, logical, shuffle, saturating ops, multimedia, crypto extensions (AES, SHA)|
+| **Floating-point support**| Single and double precision floating-point SIMD operations | Single and double precision floating-point SIMD operations |
+| **Typical applications**| Legacy SIMD workloads; general-purpose vector arithmetic | Multimedia processing, DSP, cryptography, embedded compute |
+| **Extensibility** | Extended by AVX/AVX2/AVX-512 for wider vectors and advanced features| NEON fixed at 128-bit vectors; Arm SVE offers scalable vectors but is separate |
+| **Programming model** | Intrinsics supported in C/C++; assembly used for optimization | Intrinsics widely used; inline assembly less common |
+
+### AVX vs. SVE (SVE2)
+
+| Feature | x86: AVX / AVX-512 | Arm: SVE / SVE2 |
+|-----------------------|---------------------------------------------------------|---------------------------------------------------------------|
+| **Register width** | Fixed: 256-bit (YMM), 512-bit (ZMM) | Scalable: 128 to 2048 bits (in multiples of 128 bits) |
+| **Vector length model**| Fixed vector length; requires multiple code paths or compiler dispatch for different widths | Vector-length agnostic; same binary runs on any hardware vector width |
+| **Predication / masking**| Mask registers for per-element operations (AVX-512) | Rich predication with per-element predicate registers |
+| **Gather/Scatter** | Native gather/scatter support (AVX2 and AVX-512) | Native gather/scatter with efficient implementation across vector widths |
+| **Key operations** | Wide SIMD, fused multiply-add (FMA), conflict detection, advanced masking | Wide SIMD, fused multiply-add (FMA), predicated operations, gather/scatter, reduction operations, bit manipulation |
+| **Best suited for** | HPC, AI workloads, scientific computing, data analytics | HPC, AI, scientific compute, cloud and scalable workloads |
+| **Limitations** | Power and thermal throttling on heavy 512-bit usage; complex software ecosystem | Requires vector-length agnostic programming style; ecosystem and hardware adoption still maturing |
+
+### AMX vs. SME
+
+| Feature | x86: AMX | Arm: SME |
+|-----------------------|---------------------------------------------------------|------------------------------------------------------------|
+| **Register width** | Tile registers with fixed dimensions: 16×16 for BF16, 64×16 for INT8 (about 1 KB per tile) | Scalable matrix tiles integrated with SVE, implementation-dependent tile dimensions |
+| **Vector length model**| Fixed tile dimensions based on data type | Implementation-dependent tile dimensions, scales with SVE vector length |
+| **Predication / masking**| No dedicated predication or masking in AMX tiles | Predication integrated through SVE predicate registers |
+| **Gather/Scatter** | Not supported within AMX; handled by other instructions | Supported via integration with SVE’s gather/scatter features |
+| **Key operations** | Focused on dot-product based matrix multiplication, optimized for GEMM and convolutions | Focus on outer-product matrix multiplication with streaming mode for dense linear algebra |
+| **Best suited for** | AI/ML workloads such as training and inference, specifically GEMM and convolution kernels | AI/ML training and inference, scientific computing, dense linear algebra workloads |
+| **Limitations** | Hardware and software ecosystem currently limited (primarily Intel Xeon platforms) | Emerging hardware support; compiler and library ecosystem in development |
+
+## Key differences for developers
+
+When migrating from x86 SIMD extensions to Arm SIMD, there are several important architectural and programming differences for you to consider.
+
+### Vector length model
+
+x86 SIMD extensions such as SSE, AVX, and AVX-512 operate on fixed vector widths of 128, 256, or 512 bits. This often necessitates multiple code paths or compiler dispatch techniques to efficiently exploit available hardware SIMD capabilities. Arm NEON, similar to SSE, uses a fixed 128-bit vector width, making it a familiar, fixed-size SIMD baseline.
+
+In contrast, Arm’s Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME) introduce a vector-length agnostic model. This allows vectors to scale from 128 bits up to 2048 bits depending on the hardware, enabling the same binary to run efficiently across different implementations without modification.
+
+### Programming and intrinsics
+
+x86 offers a comprehensive and mature set of SIMD intrinsics that increase in complexity, especially with AVX-512 due to advanced masking and lane-crossing operations. Arm NEON intrinsics resemble SSE intrinsics and are relatively straightforward for porting existing SIMD code. However, Arm SVE and SME intrinsics are designed for a more predicated and vector-length agnostic style of programming.
+
+When migrating to SVE/SME, you are encouraged to leverage compiler auto-vectorization with predication support, moving away from heavy reliance on low-level intrinsics to achieve scalable, portable performance.
+
+### Matrix acceleration
+
+For matrix computation, AMX provides fixed-size tile registers optimized for dot-product operations such as GEMM and convolutions. In comparison, Arm SME extends the scalable vector compute model with scalable matrix tiles designed around outer-product matrix multiplication and novel streaming modes.
+
+SME’s flexible, hardware-adaptable tile sizes and tight integration with SVE’s predication model provide a highly adaptable platform for AI training, inference, and scientific computing.
+
+Both AMX and SME are currently available on a limited set of platforms.
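+
+To make the dot-product versus outer-product distinction concrete, here is a plain C sketch of the two loop orderings for `C = A × B`. Both compute the same result; they differ only in which formulation the hardware accelerates:
+
+```c
+#include <stddef.h>
+
+// Dot-product formulation (the AMX style): each c[m][n] is produced
+// by a full reduction over k.
+void matmul_dot(size_t M, size_t N, size_t K,
+                const float a[M][K], const float b[K][N], float c[M][N]) {
+    for (size_t m = 0; m < M; m++)
+        for (size_t n = 0; n < N; n++)
+            for (size_t k = 0; k < K; k++)
+                c[m][n] += a[m][k] * b[k][n];
+}
+
+// Outer-product formulation (the SME style): each k step applies a
+// rank-1 update (column k of A times row k of B) to the whole of C.
+void matmul_outer(size_t M, size_t N, size_t K,
+                  const float a[M][K], const float b[K][N], float c[M][N]) {
+    for (size_t k = 0; k < K; k++)
+        for (size_t m = 0; m < M; m++)
+            for (size_t n = 0; n < N; n++)
+                c[m][n] += a[m][k] * b[k][n];
+}
+```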
+
+### Overall summary
+
+Migrating from x86 SIMD to Arm SIMD entails embracing Arm’s scalable and predicated SIMD programming model embodied by SVE and SME, which supports future-proof, portable code across a wide range of hardware.
+
+NEON remains important for fixed-width SIMD similar to SSE but may be less suited for emerging HPC and AI workloads that demand scale and flexibility.
+
+You need to adapt to Arm’s newer vector-length agnostic programming and tooling to fully leverage scalable SIMD and matrix architectures.
+
+Understanding these key differences in vector models, programming paradigms, and matrix acceleration capabilities helps you migrate and achieve good performance on Arm.
+
+## Migration tools
+
+There are tools and libraries that help translate SSE intrinsics to NEON intrinsics, which can shorten the migration effort and produce efficient Arm code. These libraries enable many SSE operations to be mapped to NEON equivalents, but some SSE features have no direct NEON counterparts and require workarounds or redesign.
+
+Overall, NEON is the standard for SIMD on Arm, much like SSE for x86, making it the closest analogue for porting SIMD-optimized software from x86 to Arm.
+
+[sse2neon](https://github.com/DLTcollab/sse2neon) is an open-source header library that provides a translation layer from Intel SSE2 intrinsics to Arm NEON intrinsics. It enables many SSE2-optimized codebases to be ported to Arm platforms with minimal code modification by mapping familiar SSE2 instructions to their NEON equivalents.
+
+[SIMD Everywhere (SIMDe)](https://github.com/simd-everywhere/simde) is a comprehensive, header-only library designed to ease the transition of SIMD code between different architectures. It provides unified implementations of SIMD intrinsics across x86 SSE/AVX, Arm NEON, and other SIMD instruction sets, facilitating portable and maintainable SIMD code. SIMDe supports a wide range of SIMD extensions and includes implementations that fall back to scalar code when SIMD is unavailable, maximizing compatibility.
+
+[Google Highway](https://github.com/google/highway) is a C++ library from Google that provides portable SIMD intrinsics with runtime dispatch. You write vector code once against Highway's API, and the library maps it to platform-specific instruction sets, including Arm NEON and SVE as well as x86 AVX2 and AVX-512. Highway is well-suited for large-scale data processing, machine learning, and other performance-critical applications that need efficient SIMD across architectures.
+
+You can also review [Porting architecture specific intrinsics](/learning-paths/cross-platform/intrinsics/) for more information.
\ No newline at end of file
diff --git a/content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md b/content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md
new file mode 100644
index 0000000000..015060804c
--- /dev/null
+++ b/content/learning-paths/cross-platform/vectorization-comparison/2-code-examples.md
@@ -0,0 +1,352 @@
+---
+title: Vector extension code examples
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## SAXPY example code
+
+As a way to provide some hands-on experience, you can study and run example code to better understand the vector extensions. The example used here is SAXPY.
+
+SAXPY stands for "Single-Precision A·X Plus Y" and is a fundamental operation in linear algebra.
It computes the result of the equation `y[i] = a * x[i] + y[i]` for all elements in the arrays `x` and `y`.
+
+SAXPY is widely used in numerical computing, particularly in vectorized and parallelized environments, due to its simplicity and efficiency.
+
+### Reference version
+
+Below is a plain C implementation of SAXPY without any vector extensions.
+
+This serves as a reference for the optimized examples provided later.
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy(float a, const float *x, float *y, size_t n) {
+    for (size_t i = 0; i < n; ++i) {
+        y[i] = a * x[i] + y[i];
+    }
+}
+
+int main() {
+    size_t n = 1000;
+    float* x = malloc(n * sizeof(float));
+    float* y = malloc(n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) {
+        sum += y[i];
+    }
+    printf("Plain C SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_plain.c` and build and run the code using:
+
+```bash
+gcc -O3 -o saxpy_plain saxpy_plain.c
+./saxpy_plain
+```
+
+You can use Clang for any of the examples by replacing `gcc` with `clang` on the command line.
+
+### Arm NEON version (128-bit SIMD, 4 floats per operation)
+
+NEON operates on fixed 128-bit registers, able to process 4 single-precision float values simultaneously in every vector instruction.
+
+This extension is available on most Arm-based devices and is excellent for accelerating loops and signal processing tasks in mobile and embedded workloads.
+
+The example below processes 16 floats per iteration using four separate NEON operations to improve instruction-level parallelism and reduce loop overhead.
+
+```c
+#include <arm_neon.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_neon(float a, const float *x, float *y, size_t n) {
+    size_t i = 0;
+    float32x4_t va = vdupq_n_f32(a);
+    for (; i + 16 <= n; i += 16) {
+        float32x4_t x0 = vld1q_f32(x + i);
+        float32x4_t y0 = vld1q_f32(y + i);
+        float32x4_t x1 = vld1q_f32(x + i + 4);
+        float32x4_t y1 = vld1q_f32(y + i + 4);
+        float32x4_t x2 = vld1q_f32(x + i + 8);
+        float32x4_t y2 = vld1q_f32(y + i + 8);
+        float32x4_t x3 = vld1q_f32(x + i + 12);
+        float32x4_t y3 = vld1q_f32(y + i + 12);
+        vst1q_f32(y + i, vfmaq_f32(y0, va, x0));
+        vst1q_f32(y + i + 4, vfmaq_f32(y1, va, x1));
+        vst1q_f32(y + i + 8, vfmaq_f32(y2, va, x2));
+        vst1q_f32(y + i + 12, vfmaq_f32(y3, va, x3));
+    }
+    for (; i < n; ++i) y[i] = a * x[i] + y[i];
+}
+
+int main() {
+    size_t n = 1000;
+    float* x = aligned_alloc(16, n * sizeof(float));
+    float* y = aligned_alloc(16, n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_neon(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("NEON SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_neon.c`.
+
+First, verify your system supports NEON:
+
+```bash
+grep -m1 -ow asimd /proc/cpuinfo
+```
+
+If NEON is supported, you should see `asimd` in the output. If no output appears, NEON is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -march=armv8-a+simd -o saxpy_neon saxpy_neon.c
+./saxpy_neon
+```
+
+### AVX2 (256-bit SIMD, 8 floats per operation)
+
+AVX2 doubles the SIMD width compared to NEON, processing 8 single-precision floats at a time in 256-bit registers.
+
+This wider SIMD capability enables higher data throughput for numerical and HPC workloads on Intel and AMD CPUs.
+
+```c
+#include <immintrin.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_avx2(float a, const float *x, float *y, size_t n) {
+    const __m256 va = _mm256_set1_ps(a);
+    size_t i = 0;
+    for (; i + 8 <= n; i += 8) {
+        __m256 vx = _mm256_loadu_ps(x + i);
+        __m256 vy = _mm256_loadu_ps(y + i);
+        __m256 vout = _mm256_fmadd_ps(va, vx, vy);
+        _mm256_storeu_ps(y + i, vout);
+    }
+    for (; i < n; ++i) y[i] = a * x[i] + y[i];
+}
+
+int main() {
+    size_t n = 1000;
+    float* x = aligned_alloc(32, n * sizeof(float));
+    float* y = aligned_alloc(32, n * sizeof(float));
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_avx2(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("AVX2 SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_avx2.c`.
+
+First, verify your system supports AVX2:
+
+```bash
+grep -m1 -ow avx2 /proc/cpuinfo
+```
+
+If AVX2 is supported, you should see `avx2` in the output. If no output appears, AVX2 is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -mavx2 -mfma -o saxpy_avx2 saxpy_avx2.c
+./saxpy_avx2
+```
+
+### Arm SVE (hardware dependent: 4 to 16+ floats per operation)
+
+Arm SVE lets the hardware determine the register width, which can range from 128 up to 2048 bits. This means each operation can process from 4 to 64 single-precision floats at a time, depending on the implementation.
+
+Cloud instances using AWS Graviton, Google Axion, and Microsoft Azure Cobalt processors implement 128-bit SVE. The Fujitsu A64FX processor implements a vector length of 512 bits.
+
+SVE encourages writing vector-length agnostic code: the compiler automatically handles tail cases, and your code runs efficiently on any Arm SVE hardware.
+
+```c
+#include <arm_sve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_sve(float a, const float *x, float *y, size_t n) {
+    size_t i = 0;
+    svfloat32_t va = svdup_f32(a);
+    while (i < n) {
+        svbool_t pg = svwhilelt_b32((uint32_t)i, (uint32_t)n);
+        svfloat32_t vx = svld1(pg, x + i);
+        svfloat32_t vy = svld1(pg, y + i);
+        svfloat32_t vout = svmla_m(pg, vy, va, vx);
+        svst1(pg, y + i, vout);
+        i += svcntw();
+    }
+}
+
+int main() {
+    size_t n = 1000;
+    size_t bytes = (n * sizeof(float) + 63) & ~(size_t)63; // aligned_alloc needs a size that is a multiple of the alignment
+    float* x = aligned_alloc(64, bytes);
+    float* y = aligned_alloc(64, bytes);
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_sve(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("SVE SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_sve.c`.
+
+First, verify your system supports SVE:
+
+```bash
+grep -m1 -ow sve /proc/cpuinfo
+```
+
+If SVE is supported, you should see `sve` in the output. If no output appears, SVE is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -march=armv8-a+sve -o saxpy_sve saxpy_sve.c
+./saxpy_sve
+```
+
+### AVX-512 (512-bit SIMD, 16 floats per operation)
+
+AVX-512 provides the widest SIMD registers of mainstream x86 architectures, processing 16 single-precision floats per 512-bit operation.
+
+AVX-512 availability varies across x86 processors. It's found on Intel Xeon server processors and some high-end desktop processors, as well as select AMD EPYC models.
+
+For very large arrays and high-performance workloads, AVX-512 delivers extremely high throughput, with additional masking features for efficient tail processing.
+
+```c
+#include <immintrin.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+
+void saxpy_avx512(float a, const float* x, float* y, size_t n) {
+    const __m512 va = _mm512_set1_ps(a);
+    size_t i = 0;
+    for (; i + 16 <= n; i += 16) {
+        __m512 vx = _mm512_loadu_ps(x + i);
+        __m512 vy = _mm512_loadu_ps(y + i);
+        __m512 vout = _mm512_fmadd_ps(va, vx, vy);
+        _mm512_storeu_ps(y + i, vout);
+    }
+    const size_t r = n - i;
+    if (r) {
+        __mmask16 m = (1u << r) - 1u;
+        __m512 vx = _mm512_maskz_loadu_ps(m, x + i);
+        __m512 vy = _mm512_maskz_loadu_ps(m, y + i);
+        __m512 vout = _mm512_fmadd_ps(va, vx, vy);
+        _mm512_mask_storeu_ps(y + i, m, vout);
+    }
+}
+
+int main() {
+    size_t n = 1000;
+    size_t bytes = (n * sizeof(float) + 63) & ~(size_t)63; // aligned_alloc needs a size that is a multiple of the alignment
+    float *x = aligned_alloc(64, bytes);
+    float *y = aligned_alloc(64, bytes);
+    float a = 2.5f;
+
+    for (size_t i = 0; i < n; ++i) {
+        x[i] = (float)i;
+        y[i] = (float)(n - i);
+    }
+
+    saxpy_avx512(a, x, y, n);
+
+    float sum = 0.0f;
+    for (size_t i = 0; i < n; ++i) sum += y[i];
+    printf("AVX-512 SAXPY sum: %f\n", sum);
+
+    free(x);
+    free(y);
+    return 0;
+}
+```
+
+Use a text editor to copy the code to a file `saxpy_avx512.c`.
+
+First, verify your system supports AVX-512:
+
+```bash
+grep -m1 -ow avx512f /proc/cpuinfo
+```
+
+If AVX-512 is supported, you should see `avx512f` in the output. If no output appears, AVX-512 is not available.
+
+Then build and run the code using:
+
+```bash
+gcc -O3 -mavx512f -o saxpy_avx512 saxpy_avx512.c
+./saxpy_avx512
+```
+
+### Summary
+
+Wider data lanes mean each operation processes more elements, offering higher throughput on supported hardware. However, actual performance depends on factors like memory bandwidth, the number of execution units, and workload characteristics.
+
+Processors also improve performance by implementing multiple SIMD execution units rather than just making vectors wider. For example, Arm Neoverse V2 has 4 SIMD units while Neoverse N2 has 2 SIMD units. Modern CPUs often combine both approaches (wider vectors and multiple execution units) to maximize parallel processing capability.
+
+Each vector extension requires different intrinsics, compilation flags, and programming approaches. While x86 and Arm vector extensions serve similar purposes and achieve comparable performance gains, you will need to understand the options and details to create portable code.
+
+You should also look for existing libraries that already work across vector extensions before you get too deep into code porting. This is often a good way to leverage the available SIMD capabilities on your target hardware.
diff --git a/content/learning-paths/cross-platform/vectorization-comparison/_index.md b/content/learning-paths/cross-platform/vectorization-comparison/_index.md
new file mode 100644
index 0000000000..d2a54fe293
--- /dev/null
+++ b/content/learning-paths/cross-platform/vectorization-comparison/_index.md
@@ -0,0 +1,85 @@
+---
+title: "Mapping x86 vector extensions to Arm: a migration overview"
+
+minutes_to_complete: 30
+
+draft: true
+cascade:
+  draft: true
+
+who_is_this_for: This is an advanced topic for software developers who want to learn how to migrate vectorized code to Arm.
+
+learning_objectives:
+  - Understand how Arm vector extensions, including NEON, Scalable Vector Extension (SVE), and Scalable Matrix Extension (SME) map to vector extensions from other architectures.
+  - Start planning how to migrate your SIMD code to the Arm architecture.
+ +prerequisites: + - Familiarity with vector extensions, SIMD programming, and compiler intrinsics. + - Access to Linux systems with NEON and SVE support. + +author: + - Jason Andrews + +### Tags +skilllevels: Advanced +subjects: Performance and Architecture +armips: + - Neoverse +operatingsystems: + - Linux +tools_software_languages: + - GCC + - Clang + +shared_path: true +shared_between: + - servers-and-cloud-computing + - laptops-and-desktops + - mobile-graphics-and-gaming + - automotive + +further_reading: + - resource: + title: SVE Programming Examples + link: https://developer.arm.com/documentation/dai0548/latest + type: documentation + - resource: + title: Port Code to Arm Scalable Vector Extension (SVE) + link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/sve + type: website + - resource: + title: Introducing the Scalable Matrix Extension for the Armv9-A Architecture + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture + type: website + - resource: + title: Arm Scalable Matrix Extension (SME) Introduction (Part 1) + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction + type: blog + - resource: + title: Build adaptive libraries with multiversioning + link: https://learn.arm.com/learning-paths/cross-platform/function-multiversioning/ + type: website + - resource: + title: SME Programmer's Guide + link: https://developer.arm.com/documentation/109246/latest + type: documentation + - resource: + title: Compiler Intrinsics + link: https://en.wikipedia.org/wiki/Intrinsic_function + type: website + - resource: + title: ACLE - Arm C Language Extension + link: https://github.com/ARM-software/acle + type: website + - resource: + title: Application Binary Interface for the Arm Architecture + link: https://github.com/ARM-software/abi-aa + type: website + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/cross-platform/vectorization-comparison/_next-steps.md b/content/learning-paths/cross-platform/vectorization-comparison/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/cross-platform/vectorization-comparison/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/embedded-and-microcontrollers/_index.md b/content/learning-paths/embedded-and-microcontrollers/_index.md index dc4f325370..945b031b43 100644 --- a/content/learning-paths/embedded-and-microcontrollers/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/_index.md @@ -11,7 +11,7 @@ maintopic: true operatingsystems_filter: - Android: 1 - Baremetal: 30 -- Linux: 29 +- Linux: 30 - macOS: 7 - RTOS: 9 - Windows: 4 @@ -20,7 +20,7 @@ subjects_filter: - Containers and Virtualization: 6 - Embedded Linux: 4 - Libraries: 3 -- ML: 15 +- ML: 16 - Performance and Architecture: 21 - RTOS Fundamentals: 4 - Security: 2 @@ -35,18 +35,16 @@ tools_software_languages_filter: - Arm Compute Library: 2 - Arm Development Studio: 8 - Arm Fast Models: 4 -- Arm Virtual Hardware: 11 +- Arm Virtual Hardware: 12 - Assembly: 1 -- AVH: 1 -- C: 3 -- C/C++: 1 +- C: 4 +- C++: 1 - ChatGPT: 1 - Clang: 1 - CMSIS: 4 - CMSIS-DSP: 1 - CMSIS-Toolbox: 3 - CNN: 1 -- Coding: 26 - Containerd: 1 - DetectNet: 1 - Docker: 10 @@ -54,39 +52,40 @@ tools_software_languages_filter: - Edge AI: 1 - Edge Impulse: 1 - ExecuTorch: 3 -- Fixed Virtual Platform: 10 +- FastAPI: 1 - FPGA: 1 - Fusion 360: 1 -- FVP: 1 +- FVP: 10 - GCC: 9 -- GenAI: 2 +- Generative AI: 2 - GitHub: 3 - GitLab: 1 +- gpiozero: 1 - Himax SDK: 1 - Hugging Face: 3 - IP Explorer: 4 - Jupyter Notebook: 1 - K3s: 1 -- Keil: 5 -- Keil MDK: 3 +- Keil MDK: 7 +- Keil RTX RTOS: 2 +- Keil Studio Cloud: 1 - Kubernetes: 1 +- lgpio: 1 - LLM: 2 - MCP: 1 -- MDK: 1 - MPS3: 1 - MXNet: 1 -- Neon: 1 +- NEON: 1 - NumPy: 1 +- Ollama: 1 - Paddle: 1 - Porcupine: 1 -- Python: 7 +- Python: 8 - PyTorch: 3 - QEMU: 1 -- Raspberry Pi: 6 +- Raspberry Pi: 7 - Remote.It: 1 -- RTX: 2 - Runbook: 4 -- Slicing software: 1 - STM32: 2 - TensorFlow: 3 - TensorRT: 1 @@ -95,7 +94,7 @@ tools_software_languages_filter: - TrustZone: 2 - TVMC: 1 - vcpkg: 1 -- Yocto Linux: 1 +- Yocto Project: 1 - Zephyr: 1 weight: 5 --- diff --git a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md index 5559b9f147..08d2e97004 100644 --- a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md +++ b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md @@ -65,10 +65,7 @@ Then, generate a model file on the `.pte` format using the Arm examples. The Ahe ```bash cd $ET_HOME -python -m examples.arm.aot_arm_compiler --model_name=examples/arm/simple_nn.py \ ---delegate --quantize --target=ethos-u85-256 \ ---so_library=cmake-out-aot-lib/kernels/quantized/libquantized_ops_aot_lib.so \ ---system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only +python -m examples.arm.aot_arm_compiler --model_name=examples/arm/simple_nn.py --delegate --quantize --target=ethos-u85-256 --system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only ``` From the Arm Examples directory, you can build an embedded Arm runner with the `.pte` included. This allows you to optimize the performance of your model, and ensures compatibility with the CPU kernels on the FVP. Finally, generate the executable `arm_executor_runner`. @@ -122,4 +119,4 @@ I [executorch:arm_executor_runner.cpp:412] Model in 0x70000000 $ I [executorch:arm_executor_runner.cpp:414] Model PTE file loaded. Size: 3360 bytes. 
```
-You have now set up your environment for TinyML development on Arm, and tested a small PyTorch and ExecuTorch Neural Network. In the next Learning Path of this series, you will learn about optimizing neural networks to run on Arm.
\ No newline at end of file
+You have now set up your environment for TinyML development on Arm, and tested a small PyTorch and ExecuTorch Neural Network. In the next Learning Path of this series, you will learn about optimizing neural networks to run on Arm.
diff --git a/content/learning-paths/embedded-and-microcontrollers/zephyr/zephyr.md b/content/learning-paths/embedded-and-microcontrollers/zephyr/zephyr.md
index 255fb92093..c82c9ff460 100644
--- a/content/learning-paths/embedded-and-microcontrollers/zephyr/zephyr.md
+++ b/content/learning-paths/embedded-and-microcontrollers/zephyr/zephyr.md
@@ -15,15 +15,15 @@ You can get the Zephyr source, install the Zephyr SDK, build sample applications
 
 ## Host platform
 
-Zephyr SDK is available on Windows, Linux, and macOS hosts. However the FVP is only available for Windows and Linux hosts.
+Zephyr SDK is available on Windows, Linux, and macOS hosts. However, the FVP is natively available for Windows and Linux hosts, and can be run on macOS with Docker as explained in [AVH FVPs on macOS](/install-guides/fvps-on-macos/).
 
-These instructions assume an Ubuntu Linux host machine or use of Arm Virtual Hardware (AVH).
+These instructions assume an Ubuntu Linux host machine.
 
 ## Corstone-300 FVP {#fvp}
 
-The Corstone-300 FVP is available from the [Arm Ecosystem FVP](https://developer.arm.com/downloads/-/arm-ecosystem-fvps) page. Setup instructions are given in the [install guide](/install-guides/fm_fvp).
+The Corstone-300 FVP is available for download from the [Arm Ecosystem FVP](https://developer.arm.com/downloads/-/arm-ecosystem-fvps) page. Setup instructions are given in the [install guide](/install-guides/fm_fvp).
 
-Alternatively, you can access the FVP with [Arm Virtual Hardware](https://www.arm.com/products/development-tools/simulation/virtual-hardware). Setup instructions are given in the [Arm Virtual Hardware install guide](/install-guides/avh#corstone).
+Alternatively, you can access the FVP from [Arm Tools Artifactory](https://www.keil.arm.com/artifacts/#models/arm/avh-fvp). Setup instructions are given in the [AVH FVPs in Arm Tools Artifactory](https://arm-software.github.io/AVH/main/infrastructure/html/avh_fvp_artifactory.html) guide.
 ## Install the required software to build Zephyr
@@ -92,7 +92,7 @@ You can build the [hello world](https://docs.zephyrproject.org/latest/samples/he
 ```bash { env_source="/shared/zephyrproject/.venv/bin/activate",cwd="/shared" }
 cd zephyrproject/zephyr
-west build -p auto -b mps3_an547 samples/hello_world
+west build -p auto -b mps3/corstone300/fvp samples/hello_world
 ```
 
 {{% notice Note %}}
@@ -104,18 +104,11 @@ The application binaries are placed in the `~/zephyrproject/zephyr/build/zephyr/
 
 ## Run Zephyr application on Corstone-300 FVP {#runzephyr}
 
-### Using local machine with the FVP installed
+Run the following command on the machine with the Corstone-300 FVP installed:
 
 ```fvp { fvp_name="FVP_Corstone_SSE-300_Ethos-U55",cwd="/shared/zephyrproject/zephyr" }
 FVP_Corstone_SSE-300_Ethos-U55 -a build/zephyr/zephyr.elf -C mps3_board.visualisation.disable-visualisation=1 --simlimit 30
 ```
 
-### Using Arm Virtual Hardware
-
-To run on AVH:
-
-```console
-VHT_Corstone_SSE-300_Ethos-U55 -a build/zephyr/zephyr.elf -C mps3_board.visualisation.disable-visualisation=1 --simlimit 30
-```
 
 {{% notice Optional switches %}}
 `-C mps3_board.visualisation.disable-visualisation=1` disables the FVP visualization. This can speed up launch time for the FVP.
diff --git a/content/learning-paths/iot/_index.md b/content/learning-paths/iot/_index.md
index 35b89dabc2..4f5be1e835 100644
--- a/content/learning-paths/iot/_index.md
+++ b/content/learning-paths/iot/_index.md
@@ -26,20 +26,19 @@ tools_software_languages_filter:
 - Arm Virtual Hardware: 6
 - AWS IoT Greengrass: 1
 - Azure: 1
-- Balena Cloud: 1
-- Balena OS: 1
+- balenaCloud: 1
+- BalenaOS: 1
 - C: 1
-- Coding: 3
 - Docker: 2
-- Fixed Virtual Platform: 1
+- FVP: 1
 - GitHub: 3
 - Matter: 1
 - MCP: 1
 - Python: 2
 - Raspberry Pi: 4
 - Remote.It: 1
-- ROS2: 1
+- ROS 2: 1
 - Rust: 1
-- VS Code: 1
+- Visual Studio Code: 1
 - Zenoh: 1
 ---
diff --git a/content/learning-paths/laptops-and-desktops/_index.md b/content/learning-paths/laptops-and-desktops/_index.md
index 47c7b9b83f..25ff68127d 100644
--- a/content/learning-paths/laptops-and-desktops/_index.md
+++ b/content/learning-paths/laptops-and-desktops/_index.md
@@ -8,16 +8,16 @@ key_ip:
 maintopic: true
 operatingsystems_filter:
 - Android: 2
-- ChromeOS: 1
-- Linux: 31
-- macOS: 8
+- ChromeOS: 2
+- Linux: 33
+- macOS: 9
 - Windows: 44
 subjects_filter:
 - CI-CD: 5
-- Containers and Virtualization: 6
+- Containers and Virtualization: 7
 - Migration to Arm: 28
 - ML: 2
-- Performance and Architecture: 25
+- Performance and Architecture: 27
 subtitle: Create and migrate apps for power efficient performance
 title: Laptops and Desktops
 tools_software_languages_filter:
@@ -27,24 +27,21 @@ tools_software_languages_filter:
 - Arm Development Studio: 1
 - Arm Performance Libraries: 2
 - Arm64EC: 1
-- assembly: 1
-- C: 3
+- Assembly: 1
+- C: 8
 - C#: 6
-- C++: 6
-- C/C++: 4
+- C++: 11
 - CCA: 1
-- Clang: 11
-- cmake: 1
-- CMake: 2
-- Coding: 16
+- Clang: 13
+- CMake: 3
 - CSS: 1
 - Daytona: 1
 - Docker: 5
-- GCC: 10
+- GCC: 12
 - Git: 1
 - GitHub: 3
 - GitLab: 1
-- GoogleTest: 1
+- Google Test: 1
 - HTML: 2
 - Hyper-V: 1
 - i3: 1
@@ -57,7 +54,7 @@ tools_software_languages_filter:
 - llvm-mca: 1
 - MSBuild: 1
 - MTE: 1
-- Neon: 1
+- NEON: 1
 - Neovim: 1
 - Node.js: 3
 - ONNX Runtime: 1
@@ -72,9 +69,9 @@ tools_software_languages_filter:
 - SVE: 1
 - SVE2: 1
 - Trusted Firmware: 1
+- Ubuntu: 1
 - Visual Studio: 14
-- Visual Studio Code: 10
-- VS Code: 3
+- Visual Studio Code: 13
 - Windows Forms: 1
 - Windows Performance Analyzer: 1
 - Windows Presentation Foundation: 1
diff --git a/content/learning-paths/mobile-graphics-and-gaming/_index.md b/content/learning-paths/mobile-graphics-and-gaming/_index.md index 2721c997a3..aae0dcbb19 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/_index.md @@ -10,42 +10,42 @@ key_ip: maintopic: true operatingsystems_filter: - Android: 31 -- Linux: 28 -- macOS: 13 -- Windows: 12 +- Linux: 30 +- macOS: 14 +- Windows: 14 subjects_filter: - Gaming: 6 - Graphics: 6 -- ML: 10 -- Performance and Architecture: 32 +- ML: 12 +- Performance and Architecture: 34 subtitle: Optimize Android apps and build faster games using cutting-edge Arm tech title: Mobile, Graphics, and Gaming tools_software_languages_filter: - 7-Zip: 1 -- adb: 1 +- adb: 2 - Android: 4 -- Android NDK: 1 +- Android NDK: 2 - Android SDK: 1 - Android Studio: 10 - Arm Development Studio: 1 - Arm Mobile Studio: 1 - Arm Performance Studio: 3 -- assembly: 1 +- Assembly: 1 - Bazel: 1 -- C: 2 +- C: 4 - C#: 3 -- C++: 9 -- C/C++: 1 +- C++: 11 - CCA: 1 -- Clang: 10 +- Clang: 12 - CMake: 1 -- Coding: 16 +- Docker: 1 - ExecuTorch: 1 - Frame Advisor: 1 -- GCC: 10 -- GenAI: 2 +- GCC: 12 +- Generative AI: 2 - Godot: 1 -- GoogleTest: 1 +- Google Pixel 8: 1 +- Google Test: 1 - Hugging Face: 5 - Java: 6 - KleidiAI: 1 @@ -55,17 +55,14 @@ tools_software_languages_filter: - LLVM: 1 - llvm-mca: 1 - MediaPipe: 2 -- Memory Bug Report: 1 -- Memory Tagging Extension: 1 -- Mobile: 7 -- mobile: 1 -- NDK: 1 +- MTE: 2 - NEON: 1 - ONNX Runtime: 1 - OpenGL ES: 1 - Python: 4 - PyTorch: 1 - QEMU: 1 +- RenderDoc: 1 - RME: 1 - Runbook: 15 - Rust: 2 @@ -73,9 +70,11 @@ tools_software_languages_filter: - SVE2: 1 - Trusted Firmware: 1 - Unity: 6 -- Unreal Engine: 3 -- VS Code: 1 -- Vulkan: 3 +- Unreal Engine: 4 +- Visual Studio: 1 +- Visual Studio Code: 1 +- Vulkan: 4 +- Vulkan SDK: 1 - XNNPACK: 1 weight: 3 --- diff --git a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md index 22fa2701a4..b7118c9a15 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md +++ b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md @@ -46,6 +46,13 @@ pip 24.0 from /usr/lib/python3/dist-packages/pip (python 3.12) If Python 3.x is not the default version, try running `python3 --version` and `pip3 --version`. {{% /notice %}} +It is recommended to use a python virtual environment: + +```bash +python3.12 -m venv vision_llm +source vision_llm/bin/activate +``` + ## Set up Phone Connection You need to set up an authorized connection with your phone. The Android SDK Platform Tools package, included with Android Studio, provides Android Debug Bridge (ADB) for transferring files. 
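+
+Before transferring files, you can confirm the phone is connected and authorized. A minimal check, assuming `adb` from the Android SDK Platform Tools is on your `PATH`:
+
+```bash
+# Your phone should be listed with the state "device" (not "unauthorized")
+adb devices
+```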
@@ -72,7 +79,7 @@ The pre-quantized model is available in Hugging Face, you can download with the
 ```bash
 git lfs install
 git clone https://huggingface.co/taobao-mnn/Qwen2.5-VL-3B-Instruct-MNN
-git checkout 9057334b3f85a7f106826c2fa8e57c1aee727b53
+git checkout a4622194b3c518139e2cb8099e147e3d71975f7a
 ```
 
 ## (Optional) Download and Convert the Model
@@ -81,28 +88,48 @@ If you need to quantize the model with customized parameter, the following comma
 ```bash
 cd $HOME
 pip install -U huggingface_hub
-huggingface-cli download Qwen/Qwen2-VL-2B-Instruct --local-dir ./Qwen2-VL-2B-Instruct/
-git clone https://github.com/wangzhaode/llm-export
-cd llm-export && pip install .
+hf download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ./Qwen2.5-VL-3B-Instruct/
+pip install llmexport
 ```
 
-Use the `llm-export` repository to quantize the model with these options:
+Use `llmexport` to quantize the model with these options:
 
 ```bash
-llmexport --path ../Qwen2-VL-2B-Instruct/ --export mnn --quant_bit 4 \
-    --quant_block 0 --dst_path Qwen2-VL-2B-Instruct-convert-4bit-per_channel --sym
+llmexport --path ../Qwen2.5-VL-3B-Instruct/ --export mnn --quant_bit 4 \
+    --quant_block 64 --dst_path Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock
 ```
 
+{{% notice Note %}}
+If you run into issues where llmexport is not able to access utils, try the following:
+```bash
+# From your project dir (inside the venv)
+cat > llmexport_fixed.py <<'PY'
+import sys, importlib
+# make "utils" resolve to "llmexport.utils"
+sys.modules.setdefault("utils", importlib.import_module("llmexport.utils"))
+
+from llmexport.__main__ import main
+if __name__ == "__main__":
+    main()
+PY
+
+# Use this instead of the entrypoint:
+python llmexport_fixed.py \
+    --path Qwen2.5-VL-3B-Instruct \
+    --export mnn --quant_bit 4 --quant_block 64 \
+    --dst_path Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock
+```
+{{% /notice %}}
+
 The table below gives you an explanation of the different arguments:
 
 | Parameter | Description | Explanation |
 |------------------|-------------|--------------|
 | `--quant_bit` | MNN quant bit, 4 or 8, default is 4. | `4` represents q4 quantization. |
-| `--quant_block` | MNN quant block, default is 0. | `0` represents per-channel quantization; `128` represents 128 per-block quantization. |
-| `--sym` | Symmetric quantization (without zeropoint); default is False. | The quantization parameter that enables symmetrical quantization. |
+| `--quant_block` | MNN quant block, default is 0. | `0` represents per-channel quantization; `64` represents 64 per-block quantization. |
 
 To learn more about the parameters, see the [transformers README.md](https://github.com/alibaba/MNN/tree/master/transformers).
 
-Verify that the model was built correctly by checking that the `Qwen2-VL-2B-Instruct-convert-4bit-per_channel` directory is at least 1 GB in size.
+Verify that the model was built correctly by checking that the `Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock` directory is at least 2 GB in size.
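+
+As a quick size check, assuming the converted directory is in your current working directory:
+
+```bash
+# Print the total on-disk size of the converted model directory
+du -sh Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock
+```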
## Push the model to Android device diff --git a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md index b831ee28bf..863eb1a49c 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md +++ b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md @@ -29,7 +29,7 @@ Run the following commands to clone the MNN repository and checkout the source t cd $HOME git clone https://github.com/alibaba/MNN.git cd MNN -git checkout 282cebeb785118865b9c903decc4b5cd98d5025e +git checkout a739ea5870a4a45680f0e36ba9662ca39f2f4eec ``` Create a build directory and run the build script. diff --git a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/background.md b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/background.md index acea511628..15ba0c5239 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/background.md +++ b/content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/background.md @@ -12,7 +12,7 @@ MNN is a high-performance, lightweight deep learning framework designed for both **MNN-LLM** is a large language model (LLM) runtime solution built on the MNN engine. It enables local deployment of LLMs across diverse platforms, including mobile devices, PCs, and IoT systems, and supports leading models such as Qianwen, Baichuan, Zhipu, and Llama for efficient, accessible AI-powered experiences. -KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework to enhance the inference performance of LLMs. In this Learning Path, the Android app demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen Vision 2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) model. +KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework to enhance the inference performance of LLMs. In this Learning Path, the Android app demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen2.5 Vision 3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) model. ## Vision Transformer (ViT) The Vision Transformer (ViT) is a deep learning model designed for image recognition tasks. Unlike traditional convolutional neural networks (CNNs) that use convolutional layers, ViT leverages the transformer architecture originally developed for natural language processing (NLP). 
diff --git a/content/learning-paths/servers-and-cloud-computing/_index.md b/content/learning-paths/servers-and-cloud-computing/_index.md index 792fa14883..c42dd243cc 100644 --- a/content/learning-paths/servers-and-cloud-computing/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/_index.md @@ -7,9 +7,9 @@ key_ip: - Neoverse maintopic: true operatingsystems_filter: -- Android: 2 -- Linux: 157 -- macOS: 11 +- Android: 3 +- Linux: 175 +- macOS: 13 - Windows: 14 pinned_modules: - module: @@ -18,14 +18,14 @@ pinned_modules: - providers - migration subjects_filter: -- CI-CD: 6 -- Containers and Virtualization: 29 -- Databases: 15 +- CI-CD: 7 +- Containers and Virtualization: 31 +- Databases: 17 - Libraries: 9 -- ML: 29 -- Performance and Architecture: 62 +- ML: 31 +- Performance and Architecture: 71 - Storage: 1 -- Web: 10 +- Web: 12 subtitle: Optimize cloud native apps on Arm for performance and cost title: Servers and Cloud Computing tools_software_languages_filter: @@ -34,27 +34,31 @@ tools_software_languages_filter: - 5G: 1 - ACL: 1 - AI: 1 -- Amazon Web Services: 1 - Android Studio: 1 - Ansible: 2 +- Apache Bench: 1 +- Apache Spark: 2 +- Apache Tomcat: 2 - Arm Compiler for Linux: 1 - Arm Development Studio: 3 - Arm ISA: 1 - Arm Performance Libraries: 1 +- Arm Streamline: 1 - armclang: 1 - armie: 1 - ArmRAL: 1 - ASP.NET Core: 2 -- Assembly: 4 -- assembly: 1 +- Assembly: 5 - async-profiler: 1 -- AWS: 1 +- AWS: 2 - AWS CDK: 2 +- AWS Cloud Formation: 1 - AWS CodeBuild: 1 - AWS EC2: 2 - AWS Elastic Container Service (ECS): 1 - AWS Elastic Kubernetes Service (EKS): 3 - AWS Graviton: 1 +- AWS Lambda: 1 - Azure CLI: 2 - Azure Portal: 1 - Bash: 1 @@ -62,61 +66,62 @@ tools_software_languages_filter: - Bastion: 3 - BOLT: 2 - bpftool: 1 -- C: 5 +- C: 10 - C#: 2 -- C++: 8 -- C/C++: 2 +- C++: 12 - Capstone: 1 -- CCA: 7 +- CCA: 8 - Clair: 1 -- Clang: 10 +- Clang: 12 - ClickBench: 1 - ClickHouse: 1 -- CloudFormation: 1 - CMake: 1 -- Coding: 17 - conda: 1 - Daytona: 1 - Demo: 3 - Django: 1 -- Docker: 18 -- Envoy: 2 +- Docker: 22 +- Envoy: 3 - ExecuTorch: 1 - FAISS: 1 - FlameGraph: 1 - Flink: 1 - Fortran: 1 - FunASR: 1 -- FVP: 4 -- GCC: 22 +- FVP: 7 +- GCC: 24 - gdb: 1 - Geekbench: 1 -- GenAI: 12 +- Generative AI: 12 - GitHub: 6 +- GitHub Actions: 1 +- GitHub CLI: 1 - GitLab: 1 -- Glibc: 1 +- glibc: 1 - Go: 4 +- go test -bench: 1 +- Golang: 1 - Google Axion: 3 - Google Benchmark: 1 -- Google Cloud: 1 -- GoogleTest: 1 +- Google Cloud: 2 +- Google Test: 1 - HammerDB: 1 - Herd7: 1 -- Hugging Face: 10 +- Hugging Face: 11 - InnoDB: 1 - Intrinsics: 1 - iPerf3: 1 -- Java: 3 +- Java: 4 - JAX: 1 +- JMH: 1 - Kafka: 1 - Keras: 1 - Kubernetes: 10 -- Lambda: 1 - Libamath: 1 - libbpf: 1 - Linaro Forge: 1 - Litmus7: 1 -- Llama.cpp: 1 +- Llama.cpp: 2 - LLM: 10 - llvm-mca: 1 - LSE: 1 @@ -124,33 +129,36 @@ tools_software_languages_filter: - Memcached: 2 - MLPerf: 1 - ModelScope: 1 -- MongoDB: 2 +- MongoDB: 4 +- mongostat: 1 +- mongotop: 1 - mpi: 1 - MySQL: 9 -- NEON: 4 -- Neon: 3 +- NEON: 7 +- Networking: 1 - Nexmark: 1 -- Nginx: 3 +- NGINX: 4 - Node.js: 3 - Ollama: 1 - ONNX Runtime: 1 - OpenBLAS: 1 -- OpenJDK-21: 1 +- OpenJDK 21: 2 - OpenShift: 1 -- OrchardCore: 1 +- Orchard Core: 1 - PAPI: 1 -- perf: 5 -- Perf: 1 +- perf: 6 - PostgreSQL: 4 -- Python: 28 +- Python: 31 - PyTorch: 9 - QEMU: 1 - RAG: 1 - Redis: 3 - Remote.It: 2 -- RME: 7 +- RME: 8 - Runbook: 71 - Rust: 2 +- Service Mesh: 1 +- Siege: 1 - snappy: 1 - Snort3: 1 - SQL: 7 @@ -165,27 +173,27 @@ tools_software_languages_filter: - TensorFlow: 2 - 
Terraform: 11
 - ThirdAI: 1
-- Tomcat: 1
 - Trusted Firmware: 1
+- Trustee: 1
 - TSan: 1
 - TypeScript: 1
 - Vectorscan: 1
-- Veraison: 1
-- Visual Studio Code: 4
+- Veraison: 2
+- Visual Studio Code: 5
 - vLLM: 2
-- VS Code: 1
 - vvenc: 1
 - Whisper: 1
 - WindowsPerf: 1
 - WordPress: 3
-- wrk2: 1
+- wrk2: 2
 - x265: 1
+- YCSB: 1
 - zlib: 1
-- Zookeeper: 1
+- ZooKeeper: 1
 weight: 1
 cloud_service_providers_filter:
 - AWS: 17
-- Google Cloud: 13
-- Microsoft Azure: 10
+- Google Cloud: 18
+- Microsoft Azure: 15
 - Oracle: 2
 ---
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_index.md
new file mode 100644
index 0000000000..467205cda0
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_index.md
@@ -0,0 +1,59 @@
+---
+title: Deploy Golang on the Microsoft Azure Cobalt 100 processors
+
+draft: true
+cascade:
+  draft: true
+
+minutes_to_complete: 40
+
+who_is_this_for: This Learning Path is designed for software developers looking to migrate their Golang workloads from x86_64 to Arm-based platforms, specifically on the Microsoft Azure Cobalt 100 processors.
+
+learning_objectives:
+  - Provision an Azure Arm64 virtual machine using the Azure console, with Ubuntu Pro 24.04 LTS as the base image.
+  - Deploy Golang on an Arm64-based virtual machine running Ubuntu Pro 24.04 LTS.
+  - Perform Golang baseline testing and benchmarking on both x86_64 and Arm64 virtual machines.
+
+prerequisites:
+  - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
+  - Basic understanding of the Linux command line.
+  - Familiarity with [Golang](https://go.dev/) and deployment practices on Arm64 platforms.
+
+author: Jason Andrews
+
+### Tags
+skilllevels: Advanced
+subjects: Performance and Architecture
+cloud_service_providers: Microsoft Azure
+
+armips:
+  - Neoverse
+
+tools_software_languages:
+  - Golang
+  - go test -bench
+
+operatingsystems:
+  - Linux
+
+further_reading:
+  - resource:
+      title: Effective Go Benchmarking
+      link: https://go.dev/doc/effective_go#testing
+      type: Guide
+  - resource:
+      title: Testing and Benchmarking in Go
+      link: https://pkg.go.dev/testing
+      type: Official Documentation
+  - resource:
+      title: Using go test -bench for Benchmarking
+      link: https://pkg.go.dev/cmd/go#hdr-Testing_flags
+      type: Reference
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1 # _index.md always has weight of 1 to order correctly
+layout: "learningpathall" # All files under learning paths have this same wrapper
+learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_next-steps.md
new file mode 100644
index 0000000000..c3db0de5a2
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/background.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/background.md
new file mode 100644
index 0000000000..4ad80dac05
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/background.md
@@ -0,0 +1,18 @@
+---
+title: "Overview"
+
+weight: 2
+
+layout: "learningpathall"
+---
+
+## Cobalt 100 Arm-based processor
+
+Cobalt 100 is Microsoft's first-generation, in-house Arm-based processor. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+
+To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
+
+## Golang
+Golang (or Go) is an open-source programming language developed by Google, designed for simplicity, efficiency, and scalability. It provides built-in support for concurrency, strong typing, and a rich standard library, making it ideal for building reliable, high-performance applications.
+
+Go is widely used for cloud-native development, microservices, system programming, DevOps tools, and distributed systems. Learn more from the [Go official website](https://go.dev/) and its [official documentation](https://go.dev/doc/).
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/baseline-testing.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/baseline-testing.md
new file mode 100644
index 0000000000..c71870f584
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/baseline-testing.md
@@ -0,0 +1,161 @@
+---
+title: Golang Baseline Testing
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+### Baseline testing of a Golang web page on Azure Arm64
+This section demonstrates how to test your Go installation on the **Ubuntu Pro 24.04 LTS Arm64** virtual machine by creating and running a simple Go web server that serves a styled HTML page.
+
+**1. Create Project Directory**
+
+First, create a new folder called `goweb` to contain all project files, and then navigate into it:
+
+```console
+mkdir goweb && cd goweb
+```
+This command creates a new directory named `goweb` and then switches into it.
+
+**2. Create HTML Page with Bootstrap Styling**
+
+Next, create a file named `index.html` using the nano editor:
+
+```console
+nano index.html
+```
+
+Paste the following HTML code into the `index.html` file. This builds a simple, styled web page with a header, a welcome message, and a button using Bootstrap.
+
+```html
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="utf-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <title>Go Web on Azure ARM64</title>
+    <!-- Bootstrap styling via CDN -->
+    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet">
+</head>
+<body class="bg-light">
+    <div class="container py-5">
+        <div class="card shadow-sm mx-auto text-center" style="max-width: 40rem;">
+            <div class="card-body">
+                <h1 class="card-title">Go Web on Azure Arm64</h1>
+                <p class="card-text">
+                    This page is powered by Golang running on the Microsoft Azure Cobalt 100 processors.
+                </p>
+                <a href="/api/hello" class="btn btn-primary">Test API Endpoint</a>
+            </div>
+        </div>
+    </div>
+</body>
+</html>
+```
+**3. Create Golang Web Server**
+
+Now create the Go program that will serve this web page:
+
+```console
+nano main.go
+```
+Paste the following code into the `main.go` file. This sets up a very basic web server that serves files from the current folder, including the `index.html` you just created. When it runs, it will print a message showing the server address.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "log"
+    "net/http"
+    "time"
+)
+
+func main() {
+    // Serve index.html for root
+    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
+        if r.URL.Path == "/" {
+            http.ServeFile(w, r, "index.html")
+            return
+        }
+        http.FileServer(http.Dir(".")).ServeHTTP(w, r)
+    })
+    // REST API endpoint for JSON response
+    http.HandleFunc("/api/hello", func(w http.ResponseWriter, r *http.Request) {
+        w.Header().Set("Content-Type", "application/json")
+        json.NewEncoder(w).Encode(map[string]string{
+            "message": "Hello from Go on Azure ARM64!",
+            "time":    time.Now().Format(time.RFC1123),
+        })
+    })
+    log.Println("Server running on http://0.0.0.0:80")
+    log.Fatal(http.ListenAndServe(":80", nil))
+}
+```
+{{% notice Note %}}Running on port 80 requires root privileges. Use sudo with the full Go path if needed.{{% /notice %}}
+
+**4. Run the Web Server**
+
+Run your Go program with:
+
+```console
+sudo /usr/local/go/bin/go run main.go
+```
+
+This compiles and immediately starts the server. If the server starts successfully, you will see the following message in your terminal:
+
+```output
+2025/08/19 04:35:06 Server running on http://0.0.0.0:80
+```
+**5. Allow HTTP Traffic in Firewall**
+
+On **Ubuntu Pro 24.04 LTS** virtual machines, **UFW (Uncomplicated Firewall)** is used to manage firewall rules. By default, it allows only SSH (port 22) and blocks most other traffic.
+
+So even if Azure allows HTTP on port 80 (added to inbound ports during VM creation), your VM’s firewall may still block it until you run:
+
+```console
+sudo ufw allow 80/tcp
+sudo ufw enable
+```
+You can verify that HTTP is now allowed with:
+
+```console
+sudo ufw status
+```
+You should see an output similar to:
+```output
+Status: active
+
+To                         Action      From
+--                         ------      ----
+8080/tcp                   ALLOW       Anywhere
+80/tcp                     ALLOW       Anywhere
+8080/tcp (v6)              ALLOW       Anywhere (v6)
+80/tcp (v6)                ALLOW       Anywhere (v6)
+```
+
+**6. Open in Browser**
+
+Run the following command to print your VM’s public URL, then open it in a browser:
+
+```console
+echo "http://$(curl -s ifconfig.me)/"
+```
+When you visit this link, you should see the styled HTML page being served directly by your Go application, confirming a successful installation of Golang.
+
+![golang](images/go-web.png)
+
+Now, your Golang instance is ready for further benchmarking and production use.
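+
+As an optional check of the JSON endpoint, you can also query the API directly from the VM while the server is running; the timestamp in your response will differ:
+
+```console
+curl http://localhost/api/hello
+```
+
+```output
+{"message":"Hello from Go on Azure ARM64!","time":"Tue, 19 Aug 2025 04:40:12 UTC"}
+```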
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/benchmarking.md
new file mode 100644
index 0000000000..9408bdf6db
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/benchmarking.md
@@ -0,0 +1,203 @@
+---
+title: Benchmarking via go test -bench
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Run the performance tests using go test -bench
+
+`go test -bench` (the benchmarking mode of go test) is Golang’s built-in benchmarking framework that measures the performance of functions by running them repeatedly and reporting the execution time per operation (**ns/op**). With the `-benchmem` flag, it also reports memory usage and allocations. It’s simple, reliable, and requires only writing benchmark functions with the standard Golang testing package.
+
+1. Create a Project Folder
+
+Open your terminal and create a new folder for this project:
+
+```console
+mkdir gosort-bench
+cd gosort-bench
+```
+
+2. Initialize a Go Module
+
+Inside the project directory, run the following command:
+
+```console
+go mod init gosort-bench
+```
+This creates a `go.mod` file, which defines the module path (gosort-bench in this case) and marks the directory as a Go project. The `go.mod` file also allows Go to manage dependencies (external libraries) automatically, ensuring your project remains reproducible and easy to maintain.
+
+3. Add Sorting Functions
+
+Create a file called **sorting.go**:
+
+```console
+nano sorting.go
+```
+Paste this code into the **sorting.go** file:
+
+```go
+package sorting
+
+func BubbleSort(arr []int) {
+    n := len(arr)
+    for i := 0; i < n-1; i++ {
+        for j := 0; j < n-i-1; j++ {
+            if arr[j] > arr[j+1] {
+                arr[j], arr[j+1] = arr[j+1], arr[j]
+            }
+        }
+    }
+}
+
+func QuickSort(arr []int) {
+    quickSort(arr, 0, len(arr)-1)
+}
+
+func quickSort(arr []int, low, high int) {
+    if low < high {
+        pivot := partition(arr, low, high)
+        quickSort(arr, low, pivot-1)
+        quickSort(arr, pivot+1, high)
+    }
+}
+
+func partition(arr []int, low, high int) int {
+    pivot := arr[high]
+    i := low - 1
+    for j := low; j < high; j++ {
+        if arr[j] < pivot {
+            i++
+            arr[i], arr[j] = arr[j], arr[i]
+        }
+    }
+    arr[i+1], arr[high] = arr[high], arr[i+1]
+    return i + 1
+}
+```
+- The code contains **two sorting methods**, Bubble Sort and Quick Sort, which arrange numbers in order from smallest to largest.
+- **Bubble Sort** works by repeatedly comparing two numbers side by side and swapping them if they are in the wrong order. It keeps doing this until the whole list is sorted.
+- **Quick Sort** is faster. It picks a "pivot" number and splits the list into two groups: numbers smaller than the pivot and numbers bigger than it. Then it sorts each group separately.
+- The `partition` function helps Quick Sort decide where to split the list based on the pivot number.
+- In short, **Bubble Sort is simple but slow**, while **Quick Sort is smarter and usually much faster for big lists of numbers**.
+
+You create the sorting folder and then move `sorting.go` into it to organize your code properly so that the Go module can reference it as `gosort-bench/sorting`.
+
+```console
+mkdir sorting
+mv sorting.go sorting/
+```
+
+4. Add Benchmark Tests
+
+Create another file called **sorting_benchmark_test.go**:
+
+```console
+nano sorting_benchmark_test.go
+```
+
+Paste the code below:
+
+```go
+package sorting_test
+
+import (
+    "math/rand"
+    "testing"
+
+    "gosort-bench/sorting"
+)
+
+const LENGTH = 10000
+
+func makeRandomNumberSlice(n int) []int {
+    numbers := make([]int, n)
+    for i := range numbers {
+        numbers[i] = rand.Intn(n)
+    }
+    return numbers
+}
+
+func BenchmarkBubbleSort(b *testing.B) {
+    for i := 0; i < b.N; i++ {
+        b.StopTimer()
+        numbers := makeRandomNumberSlice(LENGTH)
+        b.StartTimer()
+        sorting.BubbleSort(numbers)
+    }
+}
+
+func BenchmarkQuickSort(b *testing.B) {
+    for i := 0; i < b.N; i++ {
+        b.StopTimer()
+        numbers := makeRandomNumberSlice(LENGTH)
+        b.StartTimer()
+        sorting.QuickSort(numbers)
+    }
+}
+```
+
+- The code is a **benchmark test** that checks how fast Bubble Sort and Quick Sort run in Go.
+- It first creates a **list of 10,000 random numbers** each time before running a sort, so the test is fair and consistent.
+- **BenchmarkBubbleSort** measures the speed of sorting using the slower Bubble Sort method.
+- **BenchmarkQuickSort** measures the speed of sorting using the faster Quick Sort method.
+
+When you run `go test -bench=. -benchmem`, Go will show you how long each sort takes and how much memory it uses, so you can compare the two sorting techniques.
+
+### Run the Benchmark
+
+Execute the benchmark suite using the following command:
+```console
+go test -bench=. -benchmem
+```
+- **-bench=.** - runs all functions starting with Benchmark.
+- **-benchmem** - also shows memory usage (allocations per operation).
+
+You should see output similar to this:
+
+```output
+goos: linux
+goarch: arm64
+pkg: gosort-bench
+BenchmarkBubbleSort-4                 32          36616759 ns/op               0 B/op          0 allocs/op
+BenchmarkQuickSort-4                3506            340873 ns/op               0 B/op          0 allocs/op
+PASS
+ok      gosort-bench    2.905s
+```
+### Metrics Explanation
+
+- **ns/op** - nanoseconds per operation (lower is better).
+- **B/op** - bytes of memory used per operation.
+- **allocs/op** - how many memory allocations happened per operation.
+
+### Benchmark summary on Arm64
+Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**.
+
+| Benchmark | Value on Virtual Machine |
+|-------------------|--------------------------|
+| BubbleSort (ns/op) | 36,616,759 |
+| QuickSort (ns/op) | 340,873 |
+| BubbleSort runs | 32 |
+| QuickSort runs | 3,506 |
+| Allocations/op | 0 |
+| Bytes/op | 0 |
+| Total time (s) | 2.905 |
+
+### Benchmark summary on x86_64
+Here is a summary of the benchmark results collected on x86_64 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**.
+
+| Benchmark | Value on Virtual Machine |
+|-------------------|--------------------------|
+| BubbleSort (ns/op) | 42,801,947 |
+| QuickSort (ns/op) | 512,726 |
+| BubbleSort runs | 27 |
+| QuickSort runs | 2,332 |
+| Allocations/op | 0 |
+| Bytes/op | 0 |
+| Total time (s) | 2.716 |
+
+
+### Benchmarking comparison summary
+
+When you compare the benchmarking results, you will notice that on the Azure Cobalt 100:
+
+- **Arm64 maintains consistency** – the virtual machine delivered stable and predictable results, showing that Arm64 optimizations are effective for compute workloads.
+- **BubbleSort (CPU-heavy, O(n²))** – runs in **~36.6M ns/op** versus **~42.8M ns/op** on x86_64, showing stronger raw CPU performance on Arm64 for this workload.
+- **QuickSort (efficient O(n log n))** – execution is very fast (**~341K ns/op** versus **~513K ns/op** on x86_64), demonstrating that Arm64 handles algorithmic workloads efficiently.
+- **No memory overhead** – the benchmark shows **0 B/op and 0 allocs/op**, confirming Golang’s memory efficiency is preserved on Arm64.
+- **Higher run counts** – **BubbleSort (32 runs versus 27)** and **QuickSort (3,506 runs versus 2,332)** show that `go test` completed more iterations on Arm64 in a similar time budget, consistent with the faster per-operation results.
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/create-instance.md
new file mode 100644
index 0000000000..9571395aa2
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/create-instance.md
@@ -0,0 +1,50 @@
+---
+title: Create an Arm-based cloud virtual machine using the Microsoft Cobalt 100 CPU
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Introduction
+
+There are several ways to create an Arm-based Cobalt 100 virtual machine: the Microsoft Azure console, the Azure CLI tool, or your choice of IaC (Infrastructure as Code). This guide uses the Azure console to create a virtual machine with the Arm-based Cobalt 100 processor.
+
+This Learning Path focuses on the general-purpose D-series virtual machines. Read the Microsoft Azure guide on the [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) for more information.
+
+If you have never used Microsoft Azure before, review the Microsoft [guide to create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu).
+
+#### Create an Arm-based Azure Virtual Machine
+
+Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines".
+1. Select "Create", and click on "Virtual Machine" from the drop-down list.
+2. Inside the "Basic" tab, fill in the Instance details such as "Virtual machine name" and "Region".
+3. Choose the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select "Arm64" as the VM architecture.
+4. In the "Size" field, click on "See all sizes" and select the D-Series v6 family of virtual machines. Select "D4ps_v6" from the list.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Figure 1: Select the D-Series v6 family of virtual machines")
+
+5. Select "SSH public key" as an Authentication type. Azure will automatically generate an SSH key pair for you and allow you to store it for future use. It is a fast, simple, and secure way to connect to your virtual machine.
+6. Fill in the Administrator username for your VM.
+7. Select "Generate new key pair", and select "RSA SSH Format" as the SSH Key Type. RSA keys of 3072 bits or longer offer strong security. Give a Key pair name to your SSH key.
+8. In the "Inbound port rules", select HTTP (80) and SSH (22) as the inbound ports.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Figure 2: Allow inbound port rules")
+
+9. Click on the "Review + Create" tab and review the configuration for your virtual machine. It should look like the following:
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM")
+
+10. Finally, when you are confident about your selection, click on the "Create" button, and then click on the "Download Private key and Create Resources" button.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources")
+
+11. Your virtual machine will be ready and running within a few minutes. You can SSH into the virtual machine using the downloaded private key and the VM's public IP address.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal")
+
+{{% notice Note %}}
+
+To learn more about Arm-based virtual machines in Azure, refer to "Getting Started with Microsoft Azure" in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure).
+
+{{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/deploy.md
new file mode 100644
index 0000000000..a97e52576e
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/deploy.md
@@ -0,0 +1,121 @@
+---
+title: Install Golang
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+## Install Golang on Ubuntu Pro 24.04 LTS (Arm64)
+This section covers installing the latest Golang version on **Ubuntu Pro 24.04 LTS Arm64**, configuring the environment, and verifying the setup.
+
+1. Download the Golang archive
+
+This command downloads the official Golang package for Linux Arm64 from the Golang website.
+
+```console
+wget https://go.dev/dl/go1.25.0.linux-arm64.tar.gz
+```
+{{% notice Note %}}
+Golang version 1.18 added many enhancements that have resulted in up to a 20% performance increase for Golang workloads on Arm-based servers. See [this reference content](https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/) for the details.
+
+The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) also lists Golang version 1.18 as the minimum recommended version on Arm platforms.
+{{% /notice %}}
+
+2. Extract the archive into `/usr/local`
+
+This unpacks the Golang files into the system directory `/usr/local`, which is a standard place for system-wide software.
+
+```console
+sudo tar -C /usr/local -xzf ./go1.25.0.linux-arm64.tar.gz
+```
+
+3. Add Golang to your system PATH
+
+This updates your `.bashrc` file so your shell can find the `go` command from anywhere.
+
+```console
+echo 'export PATH="$PATH:/usr/local/go/bin"' >> ~/.bashrc
+```
+
+4. Apply the PATH changes immediately
+
+This reloads your `.bashrc` so you don’t need to log out and log back in for the changes to take effect.
+
+```console
+source ~/.bashrc
+```
+
+5. Verify Golang installation
+
+This checks if Golang is installed correctly and shows the installed version.
+
+```console
+go version
+```
+
+You should see an output similar to:
+
+```output
+go version go1.25.0 linux/arm64
+```
+6. Check Golang environment settings
+
+This displays Golang’s environment variables (like GOROOT and GOPATH) to ensure they point to the correct installation.
+ +```console +go env +``` + +You should see an output similar to: + +```output +AR='ar' +CC='gcc' +CGO_CFLAGS='-O2 -g' +CGO_CPPFLAGS='' +CGO_CXXFLAGS='-O2 -g' +CGO_ENABLED='1' +CGO_FFLAGS='-O2 -g' +CGO_LDFLAGS='-O2 -g' +CXX='g++' +GCCGO='gccgo' +GO111MODULE='' +GOARCH='arm64' +GOARM64='v8.0' +GOAUTH='netrc' +GOBIN='' +GOCACHE='/home/ubuntu/.cache/go-build' +GOCACHEPROG='' +GODEBUG='' +GOENV='/home/ubuntu/.config/go/env' +GOEXE='' +GOEXPERIMENT='' +GOFIPS140='off' +GOFLAGS='' +GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build119388372=/tmp/go-build -gno-record-gcc-switches' +GOHOSTARCH='arm64' +GOHOSTOS='linux' +GOINSECURE='' +GOMOD='/dev/null' +GOMODCACHE='/home/ubuntu/go/pkg/mod' +GONOPROXY='' +GONOSUMDB='' +GOOS='linux' +GOPATH='/home/ubuntu/go' +GOPRIVATE='' +GOPROXY='https://proxy.golang.org,direct' +GOROOT='/usr/local/go' +GOSUMDB='sum.golang.org' +GOTELEMETRY='local' +GOTELEMETRYDIR='/home/ubuntu/.config/go/telemetry' +GOTMPDIR='' +GOTOOLCHAIN='auto' +GOTOOLDIR='/usr/local/go/pkg/tool/linux_arm64' +GOVCS='' +GOVERSION='go1.25.0' +GOWORK='' +PKG_CONFIG='pkg-config' +``` +Golang installation on Ubuntu Pro 24.04 LTS Arm64 is complete. You can now proceed with Golang development or baseline testing. diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/final-vm.png b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/final-vm.png new file mode 100644 index 0000000000..5207abfb41 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/final-vm.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/go-web.png b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/go-web.png new file mode 100644 index 0000000000..66618480f2 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/go-web.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance.png b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance.png new file mode 100644 index 0000000000..285cd764a5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance1.png b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance1.png new file mode 100644 index 0000000000..b9d22c352d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance4.png b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance4.png new file mode 100644 index 0000000000..2a0ff1e3b0 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/instance4.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/ubuntu-pro.png b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/ubuntu-pro.png new file mode 100644 index 0000000000..d54bd75ca6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/images/ubuntu-pro.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/java-on-azure/_index.md 
b/content/learning-paths/servers-and-cloud-computing/java-on-azure/_index.md index ff29c8cfcf..32091b4ff5 100644 --- a/content/learning-paths/servers-and-cloud-computing/java-on-azure/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/java-on-azure/_index.md @@ -7,23 +7,21 @@ cascade: minutes_to_complete: 30 -who_is_this_for: This Learning Path introduces Java deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Java applications from x86_64 to Arm with minimal or no changes. +who_is_this_for: This is an introductory topic about Java deployment and benchmarking on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Java applications from x86_64 to Arm. learning_objectives: - - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image. - - Deploy Java on the Ubuntu Pro virtual machine. - - Perform Java baseline testing and benchmarking on both x86_64 and Arm64 virtual machines. + - Provision an Azure Arm-based Cobalt 100 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image. + - Deploy Java on the Azure Arm64 virtual machine. + - Perform Java baseline testing and benchmarking on the Arm64 virtual machines. prerequisites: - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6). - - Basic understanding of Linux command line. - - Familiarity with the [Java platform](https://openjdk.org/) and deployment practices on Arm64 platforms. -author: Jason Andrews +author: Pareena Verma ### Tags -skilllevels: Advanced +skilllevels: Introductory subjects: Performance and Architecture cloud_service_providers: Microsoft Azure diff --git a/content/learning-paths/servers-and-cloud-computing/java-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/java-on-azure/baseline.md index 8625a0c451..11d22b0197 100644 --- a/content/learning-paths/servers-and-cloud-computing/java-on-azure/baseline.md +++ b/content/learning-paths/servers-and-cloud-computing/java-on-azure/baseline.md @@ -7,14 +7,20 @@ layout: learningpathall --- -### Deploy a Java application with Tomcat-like operation -Apache Tomcat is a Java-based web application server (technically, a Servlet container) that executes Java web applications. It's widely used to host Java servlets, JSP (JavaServer Pages), -and RESTful APIs written in Java. +### Deploy a Java application with a Tomcat-like operation +Apache Tomcat is a widely used Java web application server. Technically, it is a Servlet container, responsible for executing Java servlets and supporting technologies like: -The below Java class simulates the generation of a basic HTTP response and measures the time taken to construct it, mimicking a lightweight Tomcat-like operation. It measures how long it -takes to build the response string, helping evaluate raw Java execution efficiency before deploying heavier frameworks like Tomcat. + * JSP (JavaServer Pages): Java-based templates for dynamic web content. + * RESTful APIs: Lightweight endpoints for modern microservices. -Create a file named `HttpSingleRequestTest.java`, and add the below content to it: +In production, frameworks like Tomcat introduce additional complexity (request parsing, thread management, I/O handling). Before layering those components, it's useful to measure how efficiently raw Java executes simple request/response logic on Azure Cobalt 100 Arm-based instances. 
+
+In this section, you will run a minimal Tomcat-like simulation. It won't launch a real server, but instead it:
+ * Constructs a basic HTTP response string in memory.
+ * Measures the time taken to build that response, acting as a microbenchmark.
+ * Provides a baseline for raw string and I/O handling performance in Java.
+
+Using a file editor of your choice, create a file named `HttpSingleRequestTest.java` and add the content below to it:
 
 ```java
 public class HttpSingleRequestTest {
@@ -45,7 +51,7 @@ java -Xms128m -Xmx256m -XX:+UseG1GC HttpSingleRequestTest
 - -Xmx256m sets the maximum heap size for the JVM to 256 MB.
 - -XX:+UseG1GC enables the G1 Garbage Collector (Garbage First GC), designed for low pause times and better performance in large heaps.
 
-You should see an output similar to:
+You should see output similar to:
 ```output
 java -Xms128m -Xmx256m -XX:+UseG1GC HttpSingleRequestTest
 Response Generated:
@@ -56,8 +62,11 @@ Content-Length: 29
 Tomcat baseline test on Arm64
 Response generation took 12901.53 microseconds.
 ```
-Output summary:
+Output breakdown:
+
+- **Generated Response**: the program generates a fake HTTP 200 OK response with headers and a custom body string.
+- **Timing Result**: the program prints how long it took (in microseconds) to build that response. In this example, it took ~12,901 µs (~12.9 ms). Your result will vary depending on CPU load, JVM warm-up, and environment.
 
-- The program generated a fake HTTP 200 OK response with a custom message.
-- It then measured and printed the time taken to generate that response (22125.79 microseconds).
-- This serves as a basic baseline performance test of string formatting and memory handling on the JVM running on an Azure Arm64 instance.
+This provides you with a baseline measurement of how Java handles simple string operations and memory allocation on Cobalt 100 (Arm64) instances.
+It serves as a lightweight proxy for Tomcat-style request handling before adding the full complexity of a servlet container.
diff --git a/content/learning-paths/servers-and-cloud-computing/java-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/java-on-azure/benchmarking.md
index cf4105f0cc..99de415742 100644
--- a/content/learning-paths/servers-and-cloud-computing/java-on-azure/benchmarking.md
+++ b/content/learning-paths/servers-and-cloud-computing/java-on-azure/benchmarking.md
@@ -6,10 +6,13 @@ weight: 6
 layout: learningpathall
 ---
 
-Now that you’ve built and run the Tomcat-like response, you can use it to test the JVM performance using JMH. You can also use it to test the performance difference between Cobalt 100 instances and other similar D series x86_64 based instances.
-## Run the performance tests using JMH
+Now that you have built and run a Tomcat-like response in Java, the next step is to benchmark it using a reliable, JVM-aware framework.
 
-JMH (Java Microbenchmark Harness) is a Java benchmarking framework developed by the JVM team at Oracle to measure the performance of small code snippets with high precision. It accounts for JVM optimizations like JIT and warm-up to ensure accurate and reproducible results. It measures the throughput, average latency, or execution time. Below steps help benchmark the Tomcat-like operation:
+## Run performance tests using JMH
+
+JMH (Java Microbenchmark Harness) is a Java benchmarking framework developed by the JVM team at Oracle to measure the performance of small code snippets with high precision. It accounts for JVM optimizations like JIT and warm-up to ensure accurate and reproducible results. You can measure throughput (ops/sec), average execution time, or percentiles for latency.
+
+Follow these steps to benchmark the Tomcat-like operation with JMH:

Install Maven:

@@ -17,7 +20,7 @@ Install Maven:

```console
sudo apt install maven -y
```
-Create Benchmark Project:
+Once Maven is installed, create a JMH benchmark project using the official archetype provided by OpenJDK:

```console
mvn archetype:generate \
@@ -30,8 +33,31 @@ mvn archetype:generate \
  -Dversion=1.0
cd jmh-benchmark
```
+The output should look like:
+
+```output
+[INFO] ----------------------------------------------------------------------------
+[INFO] Using following parameters for creating project from Archetype: jmh-java-benchmark-archetype:1.37
+[INFO] ----------------------------------------------------------------------------
+[INFO] Parameter: groupId, Value: com.example
+[INFO] Parameter: artifactId, Value: jmh-benchmark
+[INFO] Parameter: version, Value: 1.0
+[INFO] Parameter: package, Value: com.example
+[INFO] Parameter: packageInPathFormat, Value: com/example
+[INFO] Parameter: package, Value: com.example
+[INFO] Parameter: groupId, Value: com.example
+[INFO] Parameter: artifactId, Value: jmh-benchmark
+[INFO] Parameter: version, Value: 1.0
+[INFO] Project created from Archetype in dir: /home/azureuser/jmh-benchmark
+[INFO] ------------------------------------------------------------------------
+[INFO] BUILD SUCCESS
+[INFO] ------------------------------------------------------------------------
+[INFO] Total time:  3.474 s
+[INFO] Finished at: 2025-09-15T18:28:15Z
+[INFO] ------------------------------------------------------------------------
+```

-Edit the `src/main/java/com/example/MyBenchmark.java` file and add the below code on it:
+Now edit the `src/main/java/com/example/MyBenchmark.java` file in the generated project. Replace the placeholder `testMethod()` function with the following code:

```java
package com.example;
@@ -56,23 +82,36 @@ public class MyBenchmark {
    }
}
```
-This simulates HTTP response generation similar to Tomcat.
+This mirrors the Tomcat-like simulation you created earlier but now runs under JMH.

-Build the Benchmark:
+Build the Benchmark JAR:

```console
mvn clean install
```
-After the build is complete, the JMH benchmark jar will be in the target/ directory.
+The output from this command should look like:
+
+```output
+[INFO] Installing /home/azureuser/jmh-benchmark/target/jmh-benchmark-1.0.jar to /home/azureuser/.m2/repository/com/example/jmh-benchmark/1.0/jmh-benchmark-1.0.jar
+[INFO] Installing /home/azureuser/jmh-benchmark/pom.xml to /home/azureuser/.m2/repository/com/example/jmh-benchmark/1.0/jmh-benchmark-1.0.pom
+[INFO] ------------------------------------------------------------------------
+[INFO] BUILD SUCCESS
+[INFO] ------------------------------------------------------------------------
+[INFO] Total time:  5.420 s
+[INFO] Finished at: 2025-09-15T18:31:32Z
+```
+
+After the build is complete, the JMH benchmark JAR will be located in the target directory.

Run the Benchmark:

```console
java -jar target/benchmarks.jar
```
+This will execute the `benchmarkHttpResponse()` method under JMH and report its throughput (operations per second).
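+
+While you are iterating, you can shorten runs with JMH's standard command-line options; the values below are illustrative:
+
+```console
+# 3 warmup iterations, 5 measurement iterations, a single forked JVM
+java -jar target/benchmarks.jar -wi 3 -i 5 -f 1
+```
+Fewer warmup iterations and forks give faster but less rigorous numbers. For the full default run shown below, omit these flags.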
-You should see an output similar to: +You should see output similar to: ```output # JMH version: 1.37 # VM version: JDK 21.0.8, OpenJDK 64-Bit Server VM, 21.0.8+9-Ubuntu-0ubuntu124.04.1 @@ -160,6 +199,9 @@ Result "com.example.MyBenchmark.benchmarkHttpResponse": # Run complete. Total time: 00:08:21 +JMH runs warmup iterations so the JVM has a chance to JIT-compile and optimize the code before the real measurement begins. +Each iteration shows how many times per second your `benchmarkHttpResponse()` method ran. You get an aggregate summary of the result at the end. In this example, on average the JVM executed ~35.6 million response constructions per second. + REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure @@ -178,12 +220,12 @@ MyBenchmark.benchmarkHttpResponse thrpt 25 35659618.044 ± 686946.011 ops/s ### Benchmark Metrics Explained -- **Run Count**: The total number of benchmark iterations executed. A higher run count increases statistical reliability and reduces the effect of outliers. -- **Average Throughput**: The mean number of operations executed per second across all iterations. This metric represents the overall sustained performance of the benchmarked workload. +- **Run Count**: The total number of benchmark iterations that JMH executed. More runs improve statistical reliability and help smooth out anomalies caused by the JVM or OS. +- **Average Throughput**: The mean number of operations completed per second across all measured iterations. This is the primary indicator of sustained performance for the benchmarked code. - **Standard Deviation**: Indicates the amount of variation or dispersion from the average throughput. A smaller standard deviation means more consistent performance. -- **Confidence Interval (99.9%)**: The statistical range within which the true average throughput is expected to fall, with 99.9% certainty. Narrow intervals imply more reliable results. -- **Min Throughput**: The lowest throughput observed across all iterations, reflecting the worst-case performance scenario. -- **Max Throughput**: The highest throughput observed across all iterations, reflecting the best-case performance scenario. +- **Confidence Interval (99.9%)**: The statistical range in which the true average throughput is expected to fall with 99.9% certainty. Narrow confidence intervals suggest more reliable and repeatable measurements. +- **Min Throughput**: The lowest observed throughput across all iterations, representing a worst-case scenario under the current test conditions. +- **Max Throughput**: The highest observed throughput across all iterations, representing the best-case performance under the current test conditions. ### Benchmark summary on Arm64 @@ -198,26 +240,12 @@ Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pr | **Min Throughput** | 33.53M ops/sec | | **Max Throughput** | 36.99M ops/sec | -### Benchmark summary on x86 - -Here is a summary of benchmark results collected on x86 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**. 
-
-| Metric | Value |
-|--------------------------------|---------------------------|
-| **Java Version** | OpenJDK 21.0.8 |
-| **Run Count** | 25 iterations |
-| **Average Throughput** | 16.78M ops/sec |
-| **Standard Deviation** | ±0.06M ops/sec |
-| **Confidence Interval (99.9%)**| [16.74M, 16.83M] ops/sec |
-| **Min Throughput** | 16.64M ops/sec |
-| **Max Throughput** | 16.88M ops/sec |
-
-### Benchmark comparison insights
-When comparing the results on Arm64 vs x86_64 virtual machines:
+### Key insights from the results

-- **High Throughput:** Achieved an average of **35.66M ops/sec**, with peak performance reaching **36.99M ops/sec**.
-- **Stable Performance:** Standard deviation of **±0.92M ops/sec**, with results tightly bounded within the 99.9% confidence interval **[34.97M, 36.34M]**.
-- **Consistent Efficiency:** Demonstrates the reliability of Arm64 architecture for sustaining high-throughput Java workloads on Azure Ubuntu Pro environments.
+- **Strong throughput performance:** The benchmark sustained around 35.6 million operations per second, demonstrating efficient string construction and memory handling on the Arm64 JVM.
+- **Consistency across runs:** With a standard deviation under 1 million ops/sec, results were tightly clustered. This suggests stable system performance without significant noise from background processes.
+- **High statistical confidence:** The narrow 99.9% confidence interval ([34.97M, 36.34M]) indicates reliable, repeatable results.
+- **Predictable performance envelope:** The difference between min (33.5M) and max (37.0M) throughput is modest (~10%), which suggests the workload performed consistently without extreme slowdowns or spikes.

-You have now benchmarked Java on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64.
+The Arm-based Azure `D4ps_v6` VM provides stable and efficient performance for Java workloads, even in microbenchmark scenarios. These results establish a baseline you can now compare directly against x86_64 instances to evaluate relative performance.
diff --git a/content/learning-paths/servers-and-cloud-computing/java-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/java-on-azure/create-instance.md
index 9571395aa2..87ecb87ef0 100644
--- a/content/learning-paths/servers-and-cloud-computing/java-on-azure/create-instance.md
+++ b/content/learning-paths/servers-and-cloud-computing/java-on-azure/create-instance.md
@@ -6,15 +6,13 @@ weight: 3

layout: learningpathall
---

-## Introduction
+## Create an Azure Cobalt 100 Arm64 VM using the Azure portal

-There are several ways to create an Arm-based Cobalt 100 virtual machine : the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). This guide will use the Azure console to create a virtual machine with Arm-based Cobalt 100 Processor.
+You can create an Azure Cobalt 100 Arm64 virtual machine in several ways, including the Azure portal, the Azure CLI, or an Infrastructure as Code (IaC) tool.

-This learning path focuses on the general-purpose virtual machine of the D series. Please read the guide on [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure.
+In this Learning Path, you’ll use the Azure portal to create a VM with the Cobalt 100 processor, following a process similar to creating any other virtual machine in Azure.
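+
+If you prefer the CLI, an equivalent VM can be created with `az vm create`. Here is a minimal sketch; the resource group is assumed to already exist, and the image URN placeholder must be replaced with the Ubuntu Pro 24.04 Arm64 URN for your region (you can find it with `az vm image list`):
+
+```console
+az vm create \
+  --resource-group my-resource-group \
+  --name cobalt-vm \
+  --size Standard_D4ps_v6 \
+  --image <ubuntu-pro-24.04-arm64-image-urn> \
+  --admin-username azureuser \
+  --generate-ssh-keys
+```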
-If you have never used the Microsoft Cloud Platform before, please review the microsoft [guide to Create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu).
-
-#### Create an Arm-based Azure Virtual Machine
+## Step-by-step: create the virtual machine

Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines".
1. Select "Create", and click on "Virtual Machine" from the drop-down list.
@@ -35,7 +33,7 @@ Creating a virtual machine based on Azure Cobalt 100 is no different from creati

![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM")

-10. Finally, when you are confident about your selection, click on the "Create" button, and click on the "Download Private key and Create Resources" button.
+10. Finally, click on the "Create" button, and click on the "Download Private key and Create Resources" button.

![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources")
diff --git a/content/learning-paths/servers-and-cloud-computing/java-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/java-on-azure/deploy.md
index 0a0096f224..3508bc16bd 100644
--- a/content/learning-paths/servers-and-cloud-computing/java-on-azure/deploy.md
+++ b/content/learning-paths/servers-and-cloud-computing/java-on-azure/deploy.md
@@ -9,25 +9,24 @@ layout: learningpathall

## Java Installation on Azure Ubuntu Pro virtual machine

-Install Java on Ubuntu Pro virtual machine by updating the system and installing `default-jdk`, which includes both JRE and JDK. Verify the installation using `java -version` and `javac -version`, then set the `JAVA_HOME` environment variable for Arm-based systems.
+In this section, you will install Java on your Arm-based Ubuntu Pro virtual machine. The goal is to ensure you have both the Java Runtime Environment (JRE) for running Java applications and the Java Development Kit (JDK) for compiling code and running benchmarks.

### Install Java
+You will install Java using the Ubuntu package manager. `default-jdk` installs both the default JRE and JDK provided by the Azure Ubuntu Pro machine.

```console
sudo apt update
sudo apt install -y default-jdk
```
-`default-jdk` installs both the default JRE and JDK provided by Azure Ubuntu Pro machine.
-
-Check to ensure that the JRE is properly installed:
+Verify your JRE installation:

```console
java -version
```
-You should see an output similar to:
+You should see the JRE version printed:

```output
openjdk version "21.0.8" 2025-07-15
@@ -40,13 +39,13 @@ Check to ensure that the JDK is properly installed:

```console
javac -version
```
-You should see an output similar to:
+The output should look similar to:

```output
javac 21.0.8
```

-Set Java Environment Variable for Arm:
+Set the Java environment variables to point to the root directory of your JDK installation:

```console
export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-arm64
@@ -58,7 +57,7 @@ source ~/.bashrc

Ubuntu Pro 24.04 LTS offers the default JDK version 21.0.8. It’s important to ensure that your version of OpenJDK for Arm is at least 11.0.9.
There is a large performance gap between OpenJDK 11.0.8 and OpenJDK 11.0.9. A patch added in 11.0.9 reduces false-sharing cache contention. For more information, you can view this [Arm community blog](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/java-performance-on-neoverse-n1).
-The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) also recommends Java/OpenJDK version 11.0.9 as minimum recommended on the Arm platforms.
+You can also refer to the [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) for software package version recommendations on Arm Neoverse Linux machines.

{{% /notice %}}

-Java installation is complete. You can now proceed with the baseline testing.
+Your Java environment has been successfully configured. You may now proceed with baseline testing.
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_index.md
new file mode 100644
index 0000000000..1f64240ea5
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_index.md
@@ -0,0 +1,59 @@
+---
+title: Deploy NGINX on Microsoft Azure Cobalt 100 processors
+
+draft: true
+cascade:
+  draft: true
+
+minutes_to_complete: 30
+
+who_is_this_for: This Learning Path introduces NGINX deployment on a Microsoft Azure Cobalt 100 (Arm-based) virtual machine. It is intended for system administrators and developers looking to deploy and benchmark NGINX on Arm-based instances.
+
+learning_objectives:
+  - Start an Azure Arm64 virtual machine using the Azure console and Ubuntu Pro 24.04 LTS as the base image.
+  - Deploy the NGINX web server on the Azure Arm64 virtual machine.
+  - Configure and test a static website using NGINX on the virtual machine.
+  - Perform baseline testing and benchmarking of NGINX in the Ubuntu Pro 24.04 LTS Arm64 virtual machine environment.
+
+
+prerequisites:
+  - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
+
+author: Pareena Verma
+
+### Tags
+skilllevels: Introductory
+subjects: Web
+cloud_service_providers: Microsoft Azure
+
+armips:
+  - Neoverse
+
+tools_software_languages:
+  - NGINX
+  - Apache Bench
+
+operatingsystems:
+  - Linux
+
+further_reading:
+  - resource:
+      title: NGINX official documentation
+      link: https://nginx.org/en/docs/
+      type: documentation
+  - resource:
+      title: Apache Bench official documentation
+      link: https://httpd.apache.org/docs/2.4/programs/ab.html
+      type: documentation
+  - resource:
+      title: NGINX on Azure
+      link: https://docs.nginx.com/nginx/deployment-guides/microsoft-azure/virtual-machines-for-nginx/
+      type: documentation
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1 # _index.md always has weight of 1 to order correctly
+layout: "learningpathall" # All files under learning paths have this same wrapper
+learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_next-steps.md
new file mode 100644
index 0000000000..c3db0de5a2
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/backgroud.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/backgroud.md
new file mode 100644
index 0000000000..9363127800
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/backgroud.md
@@ -0,0 +1,22 @@
+---
+title: "Overview"
+
+weight: 2
+
+layout: "learningpathall"
+---
+
+## Cobalt 100 Arm-based processor
+
+Cobalt 100 is Microsoft's first-generation, in-house Arm-based processor. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+
+To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
+
+## NGINX
+
+NGINX is a high-performance, open-source web server, reverse proxy, load balancer, and HTTP cache. Originally developed by Igor Sysoev, NGINX is known for its event-driven, asynchronous architecture, which enables it to handle high concurrency with low resource usage.
+
+There are three main variants of NGINX:
+- **NGINX Open Source**: the free, [open-source version available at nginx.org](https://nginx.org).
+- **NGINX Plus**: the [commercial edition of NGINX](https://www.nginx.com/products/nginx/) with features like dynamic reconfiguration, active health checks, and monitoring.
+- **NGINX Unit**: a lightweight, dynamic application server that complements NGINX. [Learn more at unit.nginx.org](https://unit.nginx.org/).
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/baseline.md
new file mode 100644
index 0000000000..1119102c40
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/baseline.md
@@ -0,0 +1,150 @@
+---
+title: NGINX Baseline Testing
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+### Baseline testing with a static website on NGINX
+Once NGINX is installed and serving the default welcome page, the next step is to verify that it can serve your own content. A baseline test using a simple static HTML site ensures that NGINX is correctly configured and working as expected on your Ubuntu Pro 24.04 LTS virtual machine.
+
+1. Create a Static Website Directory:
+
+Prepare a folder to host your HTML content.
+```console
+mkdir -p /var/www/my-static-site
+cd /var/www/my-static-site
+```
+2. Create an HTML file and Web page:
+
+Create a simple HTML file to replace the default NGINX welcome page. Using a file editor of your choice, create the file `index.html` with the content below:
+
+```html
+<!DOCTYPE html>
+<html>
+<head>
+    <title>Welcome to NGINX on Azure Ubuntu Pro</title>
+</head>
+<body>
+    <h1>Welcome to NGINX on Azure Ubuntu Pro 24.04 LTS!</h1>
+    <p>Your static site is running beautifully on ARM64</p>
+</body>
+</html>
+```
+3. Adjust Permissions:
+
+Ensure that NGINX (running as the www-data user) can read the files in your custom site directory:
+
+```console
+sudo chown -R www-data:www-data /var/www/my-static-site
+```
+This sets the ownership of the directory and files so that the NGINX process can serve them without permission issues.
+
+4. Update NGINX Configuration:
+
+Point NGINX to serve files from your new directory by creating a dedicated configuration file under /etc/nginx/conf.d/.
+
+```console
+sudo nano /etc/nginx/conf.d/static-site.conf
+```
+Add the following configuration to it:
+
+```console
+server {
+    listen 80 default_server;
+    listen [::]:80 default_server;
+    server_name _;
+
+    root /var/www/my-static-site;
+    index index.html;
+
+    location / {
+        try_files $uri $uri/ =404;
+    }
+
+    access_log /var/log/nginx/static-access.log;
+    error_log /var/log/nginx/static-error.log;
+}
+```
+This configuration block tells NGINX to:
+ - Listen on port 80 (both IPv4 and IPv6).
+ - Serve files from /var/www/my-static-site.
+ - Use index.html as the default page.
+ - Log access and errors to dedicated log files for easier troubleshooting.
+
+Make sure the path to your `index.html` file is correct before saving.
+
+5. Disable the default site:
+
+By default, NGINX comes with a packaged default site configuration. Since you have created a custom config, it is good practice to disable the default to avoid conflicts:
+
+```console
+sudo unlink /etc/nginx/sites-enabled/default
+```
+
+6. Test the NGINX Configuration:
+
+Before applying your changes, always test the configuration to make sure there are no syntax errors:
+
+```console
+sudo nginx -t
+```
+You should see output similar to:
+```output
+nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
+nginx: configuration file /etc/nginx/nginx.conf test is successful
+```
+If you see both lines, your configuration is valid.
+
+7. Reload or Restart NGINX:
+
+After testing the configuration, apply your changes by reloading or restarting the NGINX service:
+```console
+sudo nginx -s reload
+sudo systemctl restart nginx
+```
+
+8. Test the Static Website in a browser:
+
+Access your website at your public IP on port 80:
+
+```console
+http://<your-vm-public-ip>/
+```
+
+9. You should see your custom static website, confirming a successful deployment:
+
+![Static Website Screenshot](images/nginx-web.png)
+
+This verifies the basic functionality of your NGINX installation, and you can now proceed to benchmarking the performance of NGINX on your Arm-based Azure VM.
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/benchmarking.md
new file mode 100644
index 0000000000..1e9b3a0129
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/benchmarking.md
@@ -0,0 +1,150 @@
+---
+title: NGINX Benchmarking
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## NGINX Benchmarking by ApacheBench
+
+To understand how your NGINX deployment performs under load, you can benchmark it using ApacheBench (ab). ApacheBench is a lightweight command-line tool for benchmarking HTTP servers. It measures performance metrics like requests per second, response time, and throughput under concurrent load.
+
+
+1. Install ApacheBench
+
+On **Ubuntu Pro 24.04 LTS**, ApacheBench is available as part of the `apache2-utils` package:
+```console
+sudo apt update
+sudo apt install apache2-utils -y
+```
+
+2. Verify Installation
+
+```console
+ab -V
+```
+You should see output similar to:
+
+```output
+This is ApacheBench, Version 2.3 <$Revision: 1923142 $>
+Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
+Licensed to The Apache Software Foundation, http://www.apache.org/
+```
+
+3. Basic Benchmark Syntax
+
+The general syntax for running an ApacheBench test is:
+
+```console
+ab -n <total-requests> -c <concurrency> <url>
+```
+
+Now run an example:
+
+```console
+ab -n 1000 -c 50 http://localhost/
+```
+This sends **1000 total requests** with **50 concurrent connections** to `http://localhost/`.
+
+You should see output similar to:
+```output
+This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
+Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
+Licensed to The Apache Software Foundation, http://www.apache.org/
+
+Benchmarking localhost (be patient)
+Completed 100 requests
+Completed 200 requests
+Completed 300 requests
+Completed 400 requests
+Completed 500 requests
+Completed 600 requests
+Completed 700 requests
+Completed 800 requests
+Completed 900 requests
+Completed 1000 requests
+Finished 1000 requests
+
+
+Server Software:        nginx/1.24.0
+Server Hostname:        localhost
+Server Port:            80
+
+Document Path:          /
+Document Length:        890 bytes
+
+Concurrency Level:      50
+Time taken for tests:   0.032 seconds
+Complete requests:      1000
+Failed requests:        0
+Total transferred:      1132000 bytes
+HTML transferred:       890000 bytes
+Requests per second:    31523.86 [#/sec] (mean)
+Time per request:       1.586 [ms] (mean)
+Time per request:       0.032 [ms] (mean, across all concurrent requests)
+Transfer rate:          34848.65 [Kbytes/sec] received
+
+Connection Times (ms)
+              min  mean[+/-sd] median   max
+Connect:        0    1   0.1      1       1
+Processing:     0    1   0.1      1       1
+Waiting:        0    1   0.2      1       1
+Total:          1    2   0.1      2       2
+
+Percentage of the requests served within a certain time (ms)
+  50%      2
+  66%      2
+  75%      2
+  80%      2
+  90%      2
+  95%      2
+  98%      2
+  99%      2
+ 100%      2 (longest request)
+```
+
+### Interpret Benchmark Results:
+
+ApacheBench outputs several metrics. Key ones to focus on include:
+
+ - Requests per second: Average throughput.
+ - Time per request: Latency per request.
+ - Failed requests: Should ideally be zero.
+ - Transfer rate: Bandwidth used by the responses.
+
+### Benchmark summary on Arm64:
+Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**.
+
+| **Category**              | **Metric**                                      | **Value**                     |
+|---------------------------|-------------------------------------------------|-------------------------------|
+| **General Info**          | Server Software                                 | nginx/1.24.0                  |
+|                           | Server Hostname                                 | localhost                     |
+|                           | Server Port                                     | 80                            |
+|                           | Document Path                                   | /                             |
+|                           | Document Length                                 | 890 bytes                     |
+| **Test Setup**            | Concurrency Level                               | 50                            |
+|                           | Time Taken for Tests                            | 0.032 sec                     |
+|                           | Complete Requests                               | 1000                          |
+|                           | Failed Requests                                 | 0                             |
+| **Transfer Stats**        | Total Transferred                               | 1,132,000 bytes               |
+|                           | HTML Transferred                                | 890,000 bytes                 |
+|                           | Requests per Second                             | 31,523.86 [#/sec]             |
+|                           | Time per Request (mean)                         | 1.586 ms                      |
+|                           | Time per Request (across all)                   | 0.032 ms                      |
+|                           | Transfer Rate                                   | 34,848.65 KB/sec              |
+| **Connection Times (ms)** | Connect (min / mean / stdev / median / max)     | 0 / 1 / 0.1 / 1 / 1           |
+|                           | Processing (min / mean / stdev / median / max)  | 0 / 1 / 0.1 / 1 / 1           |
+|                           | Waiting (min / mean / stdev / median / max)     | 0 / 1 / 0.2 / 1 / 1           |
+|                           | Total (min / mean / stdev / median / max)       | 1 / 2 / 0.1 / 2 / 2           |
+
+### Analysis of results from NGINX benchmarking on Arm-based Azure Cobalt 100
+
+These benchmark results highlight the strong performance characteristics of NGINX running on Arm64-based Azure VMs (such as the D4ps_v6 instance type):
+
+- **High requests per second** (31,523.86 requests/sec), demonstrating high throughput under concurrent load.
+- Response time per request averaged **1.586 ms**, indicating efficient handling of requests with minimal delay.
+- **Zero failed requests**, confirming stability and reliability during testing.
+- Consistently low **connection and processing times** (mean ≈ 1 ms), ensuring smooth performance.
+
+Overall, these results illustrate that NGINX on Arm64 machines provides a highly performant solution for web workloads on Azure. You can also use the same benchmarking framework to compare results on equivalent x86-based Azure instances, which provides useful insight into relative performance and cost efficiency across architectures.
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/deploy.md
new file mode 100644
index 0000000000..97fb26ce45
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/deploy.md
@@ -0,0 +1,147 @@
+---
+title: Install NGINX
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+
+## NGINX Installation on Ubuntu Pro 24.04 LTS
+
+In this section, you will install and configure NGINX, a high-performance web server and reverse proxy, on your Arm-based Azure instance. NGINX is widely used to serve static content, handle large volumes of connections efficiently, and act as a load balancer. Running it on your Azure Cobalt 100 virtual machine will allow you to serve web traffic securely and reliably.
+
+### Install NGINX
+
+Run the following commands to install and enable NGINX:
+
+```console
+sudo apt update
+sudo apt install -y nginx
+sudo systemctl enable nginx
+sudo systemctl start nginx
+```
+
+### Verify NGINX
+
+Check the installed version of NGINX:
+
+```console
+nginx -v
+```
+The output should look like:
+
+```output
+nginx version: nginx/1.24.0 (Ubuntu)
+```
+{{% notice Note %}}
+
+The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) lists NGINX version 1.20.1 as the minimum recommended version on Arm platforms.
+
+{{% /notice %}}
+
+You can confirm that NGINX is running correctly by checking its systemd service status:
+```console
+sudo systemctl status nginx
+```
+You should see output similar to:
+
+```output
+● nginx.service - A high performance web server and a reverse proxy server
+     Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; preset: enabled)
+     Active: active (running) since Mon 2025-09-08 04:26:39 UTC; 20s ago
+       Docs: man:nginx(8)
+   Main PID: 1940 (nginx)
+      Tasks: 5 (limit: 19099)
+     Memory: 3.6M (peak: 8.1M)
+        CPU: 23ms
+     CGroup: /system.slice/nginx.service
+             ├─1940 "nginx: master process /usr/sbin/nginx -g daemon on; master_process on;"
+             ├─1942 "nginx: worker process"
+             ├─1943 "nginx: worker process"
+             ├─1944 "nginx: worker process"
+             └─1945 "nginx: worker process"
+```
+If you see `Active: active (running)`, NGINX is successfully installed and running.
+
+
+### Validation with curl
+Validation with `curl` confirms that NGINX is correctly installed, running, and serving **HTTP** responses.
+
+Run the following command to send a HEAD request to the local NGINX server:
+```console
+curl -I http://localhost/
+```
+The `-I` option tells `curl` to request only the HTTP response headers, without downloading the page body.
+
+You should see output similar to:
+
+```output
+HTTP/1.1 200 OK
+Server: nginx/1.24.0 (Ubuntu)
+Date: Mon, 08 Sep 2025 04:27:20 GMT
+Content-Type: text/html
+Content-Length: 615
+Last-Modified: Mon, 08 Sep 2025 04:26:39 GMT
+Connection: keep-alive
+ETag: "68be5aff-267"
+Accept-Ranges: bytes
+```
+
+Output summary:
+- HTTP/1.1 200 OK: Confirms that NGINX is responding successfully.
+- Server: nginx/1.24.0: Shows that the server is powered by NGINX.
+- Content-Type, Content-Length, Last-Modified, ETag: Provide details about the served file and its metadata.
+
+This step verifies that your NGINX installation is functional at the system level, even before exposing it to external traffic. It’s a quick diagnostic check that is useful when troubleshooting connectivity issues.
+
+### Allowing HTTP Traffic
+
+When you created your VM instance earlier, you configured the Azure Network Security Group (NSG) to allow inbound HTTP (port 80) traffic. This means the Azure-side firewall is already open for web requests.
+On the VM itself, you still need to make sure that the Uncomplicated Firewall (UFW), which is used to manage firewall rules on Ubuntu, allows web traffic. Run:
+
+
+```console
+sudo ufw allow 80/tcp
+sudo ufw enable
+```
+The output from this command should look like:
+
+```output
+sudo ufw enable
+Rules updated
+Rules updated (v6)
+Command may disrupt existing ssh connections. Proceed with operation (y|n)? y
+Firewall is active and enabled on system startup
+```
+You can verify that HTTP is now allowed with:
+
+```console
+sudo ufw status
+```
+You should see output similar to:
+```output
+Status: active
+
+To                         Action      From
+--                         ------      ----
+8080/tcp                   ALLOW       Anywhere
+80/tcp                     ALLOW       Anywhere
+8080/tcp (v6)              ALLOW       Anywhere (v6)
+80/tcp (v6)                ALLOW       Anywhere (v6)
+```
+This ensures that both Azure and the VM-level firewalls are aligned to permit HTTP requests.
+
+### Accessing the NGINX Default Page
+
+You can now access the NGINX default page from your virtual machine’s public IP address. Run the following command to display your public URL:
+
+```console
+echo "http://$(curl -s ifconfig.me)/"
+```
+Copy the printed URL and open it in your browser.
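+
+If you want a quick check before switching to the browser, you can request the same URL with `curl` from the VM (this assumes outbound internet access, since `ifconfig.me` is used to discover the public IP):
+
+```console
+curl -I "http://$(curl -s ifconfig.me)/"
+```
+An `HTTP/1.1 200 OK` response confirms that the public request path works end to end.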
+You should see the default NGINX welcome page, which confirms a successful installation and that HTTP traffic is reaching your VM.
+
+![NGINX](images/nginx-browser.png)
+
+At this stage, your NGINX installation is complete. You are now ready to proceed with baseline testing and further configuration.
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/final-vm.png b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/final-vm.png
new file mode 100644
index 0000000000..5207abfb41
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/final-vm.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance.png b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance.png
new file mode 100644
index 0000000000..285cd764a5
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance1.png b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance1.png
new file mode 100644
index 0000000000..b9d22c352d
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance1.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance4.png b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance4.png
new file mode 100644
index 0000000000..2a0ff1e3b0
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/instance4.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/nginx-browser.png b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/nginx-browser.png
new file mode 100644
index 0000000000..6a97f1e10b
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/nginx-browser.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/nginx-web.png b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/nginx-web.png
new file mode 100644
index 0000000000..152bf727a6
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/nginx-web.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/ubuntu-pro.png b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/ubuntu-pro.png
new file mode 100644
index 0000000000..d54bd75ca6
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/images/ubuntu-pro.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/instance.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/instance.md
new file mode 100644
index 0000000000..16d2b8546f
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/instance.md
@@ -0,0 +1,50 @@
+---
+title: Create an Arm-based cloud virtual machine using the Microsoft Cobalt 100 CPU
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Introduction
+
+There are several ways to create an Arm-based Cobalt 100 virtual machine: the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). In this section, you will use the Azure console to create a virtual machine with the Arm-based Azure Cobalt 100 processor.
+
+This Learning Path focuses on the general-purpose virtual machines of the D series. Please read the guide on [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure.
+
+While the steps to create this instance are included here for your convenience, you can also refer to the [Deploy a Cobalt 100 Virtual Machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/).
+
+#### Create an Arm-based Azure Virtual Machine
+
+Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines".
+1. Select "Create", and click on "Virtual Machine" from the drop-down list.
+2. Inside the "Basic" tab, fill in the Instance details such as "Virtual machine name" and "Region".
+3. Choose the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select “Arm64” as the VM architecture.
+4. In the “Size” field, click on “See all sizes” and select the D-Series v6 family of virtual machines. Select “D4ps_v6” from the list.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Figure 1: Select the D-Series v6 family of virtual machines")
+
+5. Select "SSH public key" as an Authentication type. Azure will automatically generate an SSH key pair for you and allow you to store it for future use. It is a fast, simple, and secure way to connect to your virtual machine.
+6. Fill in the Administrator username for your VM.
+7. Select "Generate new key pair", and select "RSA SSH Format" as the SSH Key Type. RSA could offer better security with keys longer than 3072 bits. Give a Key pair name to your SSH key.
+8. In the "Inbound port rules", select HTTP (80) and SSH (22) as the inbound ports. The default port for NGINX when handling standard web traffic (HTTP) is 80.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Figure 2: Allow inbound port rules")
+
+9. Click on the "Review + Create" tab and review the configuration for your virtual machine. It should look like the following:
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM")
+
+10. Finally, when you are confident about your selection, click on the "Create" button, and click on the "Download Private key and Create Resources" button.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources")
+
+11. Your virtual machine should be ready and running in no time. You can SSH into the virtual machine using the private key, along with the Public IP details.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal")
+
+{{% notice Note %}}
+
+To learn more about Arm-based virtual machines in Azure, refer to “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure).
+
+{{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/postgresql_tune/kernel_comp_lib.md b/content/learning-paths/servers-and-cloud-computing/postgresql_tune/kernel_comp_lib.md
index b13cf53642..21c6db977c 100644
--- a/content/learning-paths/servers-and-cloud-computing/postgresql_tune/kernel_comp_lib.md
+++ b/content/learning-paths/servers-and-cloud-computing/postgresql_tune/kernel_comp_lib.md
@@ -12,7 +12,7 @@ layout: "learningpathall"

The underlying storage technology and the file system format can impact performance significantly. In general, locally attached SSD storage will perform best. However, network based storage systems can perform well. As always, performance is dependent on the request profile coming from clients. You should spend some time studying and experimenting with different storage technologies and configuration options.

-Aside from the storage technology, it is also worth testing different file system formats with `PostgreSQL`. The `xfs` file system is a good starting point. The `ext4` file system is another good alternative.
+Aside from the storage technology, it is also worth testing different file system formats with `PostgreSQL`. The `xfs` file system is a good starting point. The `ext4` file system is another good alternative.

## Kernel configuration

@@ -20,7 +20,7 @@

### Linux-PAM limits

-Linux-PAM limits can be changed in the `/etc/security/limits.conf` file, or by using the `ulimit` command.
+Linux-PAM limits can be changed in the `/etc/security/limits.conf` file, or by using the `ulimit` command.

If you want more information about how to display and modify parameters, check the documentation of the `ulimit` command.

@@ -41,9 +41,9 @@ Some of the key settings that can affect performance are:

### Linux virtual memory subsystem

-Making changes to the Linux Virtual Memory subsystem can also improve performance.
+Making changes to the Linux Virtual Memory subsystem can also improve performance.

-These settings can be changed in the `/etc/sysctl.conf` file, or by using the `sysctl` command.
+These settings can be changed in the `/etc/sysctl.conf` file, or by using the `sysctl` command.

If you want more information about how to display and modify virtual memory parameters, check the documentation of the `sysctl` command.

@@ -57,9 +57,9 @@ sudo sysctl -a

### Overcommit memory

-The overcommit policy is set via the sysctl `vm.overcommit_memory' setting.
+The overcommit policy is set via the sysctl `vm.overcommit_memory` setting.

-The recommended setting for `vm.overcommit_memory` is 2 according to the [PostgreSQL documentation](https://www.postgresql.org/docs/15/kernel-resources.html).
+The recommended setting for `vm.overcommit_memory` is 2 according to the [PostgreSQL documentation](https://www.postgresql.org/docs/15/kernel-resources.html).

To set the overcommit_memory parameter to 2 temporarily, run the following command:

@@ -75,7 +75,7 @@ This tells Linux to never over commit memory. Setting `vm.overcommit_memory` to

### Huge memory pages

-`PostgreSQL` benefits from using huge memory pages. Huge pages reduce how often virtual memory pages are mapped to physical memory.
+`PostgreSQL` benefits from using huge memory pages. Huge pages reduce how often virtual memory pages are mapped to physical memory.
To see the current memory page configuration, run the following command on the host:

@@ -94,9 +94,9 @@ Hugepagesize: 2048 kB
Hugetlb: 0 kB
```

-Huge pages are not being used if `HugePages_Total` is 0 (this is the default).
+Huge pages are not being used if `HugePages_Total` is 0 (this is the default).

-Also note that `Hugepagesize` is 2MB which is the typical default for huge pages on Linux.
+Also note that `Hugepagesize` is 2MB, which is the typical default for huge pages on Linux.

You can modify the huge page values.

@@ -106,9 +106,9 @@ The setting that enables huge pages is shown below:
vm.nr_hugepages
```

-This parameter sets the number of huge pages you want the kernel to make available to applications.
+This parameter sets the number of huge pages you want the kernel to make available to applications.

-The total amount of memory that will be used for huge pages will be this number (defaulted to 0) times the `Hugepagesize`.
+The total amount of memory that will be used for huge pages will be this number (defaulted to 0) times the `Hugepagesize`.

As an example, if you want a total of 1GB of huge page space, then you should set `vm.nr_hugepages` to 500 (500x2MB=1GB).

@@ -124,7 +124,7 @@ sudo sh -c 'echo "vm.nr_hugepages=500" >> /etc/sysctl.conf'

### Selecting the number of huge pages to use

-You should set `vm.nr_hugepages` to a value that gives a total huge page space slightly bigger than the `PostgreSQL` shared buffer size (discussed later).
+You should set `vm.nr_hugepages` to a value that gives a total huge page space slightly bigger than the `PostgreSQL` shared buffer size (discussed later).

Make it slightly larger than the shared buffer because `PostgreSQL` will use additional memory for things like connection management.

@@ -135,19 +135,15 @@ More information on the different parameters that affect the configuration of hu

`PostgreSQL` writes data to files like any Linux process does. The behavior of the page cache can affect performance. There are two sysctl parameters that control how often the kernel flushes the page cache data to disk.

- `vm.dirty_background_ratio=5`
-- `vm.dirty_ratio=80`
+- `vm.dirty_ratio=20`

-The `vm.dirty_background_ratio` sets the percentage of the page cache that needs to be dirty in order for a flush to disk to start in the background.
+The `vm.dirty_background_ratio` sets the percentage of the page cache that needs to be dirty in order for a flush to disk to start in the background.

-Setting this value to lower than the default (typically 10) helps write heavy workloads. This is because by lowering this threshold, you are spreading writes to storage over time. This reduces the probability of saturating storage.
+Setting this value lower than the default (typically 10) helps write-heavy workloads. This is because by lowering this threshold, you are spreading writes to storage over time, which reduces the probability of saturating storage. Setting this value to 5 can improve performance.

-Setting this value to 5 can improve performance.

+The `vm.dirty_ratio` sets the percentage of the page cache that needs to be dirty in order for threads that are writing to storage to be paused to allow flushing to catch up.

-The `vm.dirty_ratio` sets the percentage of the page cache that needs to be dirty in order for threads that are writing to storage to be paused to allow flushing to catch up.

-Setting this value higher than default (typically 10-20) helps performance when disk writes are bursty. A higher value gives the background flusher (controlled by `vm.dirty_background_ratio`) more time to catch up.
-
-Setting this as high as 80 can improve performance.
+This should be set higher than `vm.dirty_background_ratio`. The OS default is typically around 20, which is usually fine. In some cases it may be beneficial to set this value higher, as that gives the background flusher (controlled by `vm.dirty_background_ratio`) more time to catch up if disk writes are very bursty. In general, you can leave this at the OS default, but it may be worth experimenting with higher values on specific workload profiles.

## Compiler Considerations
diff --git a/content/learning-paths/servers-and-cloud-computing/postgresql_tune/tuning.md b/content/learning-paths/servers-and-cloud-computing/postgresql_tune/tuning.md
index d919604cb4..817c7d5804 100644
--- a/content/learning-paths/servers-and-cloud-computing/postgresql_tune/tuning.md
+++ b/content/learning-paths/servers-and-cloud-computing/postgresql_tune/tuning.md
@@ -10,9 +10,9 @@ layout: "learningpathall"

## PostgreSQL configuration

-There are different ways to set configuration parameters for `PostgreSQL`.
+There are different ways to set configuration parameters for `PostgreSQL`.

-This is discussed in the [Setting Parameters documentation](https://www.postgresql.org/docs/current/config-setting.html).
+This is discussed in the [Setting Parameters documentation](https://www.postgresql.org/docs/current/config-setting.html).

The configurations below can be directly pasted into a `PostgreSQL` configuration file.

@@ -27,11 +27,11 @@ max_prepared_transactions = 1000 # Default 0

Keep in mind that more client connections means more resources will be consumed (especially memory). Setting this to something higher is completely dependent on use case and requirements.

-`max_prepared_transactions` is 0 by default.
+`max_prepared_transactions` is 0 by default.

This means that stored procedures and functions cannot be used out of the box. It must be enabled by setting `max_prepared_transactions` to a value greater than 0. If this is set to a number larger than 0, a good number to start with would be at least as large as `max_connections`. In a test or development environment, it doesn't hurt to set it to an even larger value (10000) to avoid errors.

-Using procedures and functions can greatly improve performance.
+Using procedures and functions can greatly improve performance.

### Memory related configuration

@@ -42,7 +42,7 @@ work_mem = 32MB # default is 4MB
maintenance_work_mem = 2GB # Default is 64MB
```

-Turning on `huge_pages` is not required because the default is `try`.
+Turning on `huge_pages` is not required because the default is `try`.

However, you can explicitly set it to `on` because errors will be produced if huge pages are not enabled in Linux.

@@ -59,7 +59,7 @@ deadlock_timeout = 10s # Default is 1s
max_worker_processes = <number of cores> # Default is 8
```

-`deadlock_timeout` sets a polling interval for checking locks. The [documentation](https://www.postgresql.org/docs/15/runtime-config-locks.html) states that this check is expensive from a CPU cycles standpoint, and that the default of 1s is probably the smallest that should be used. Consider raising this timeout much higher to save some CPU cycles.
+`deadlock_timeout` sets a polling interval for checking locks.
The [documentation](https://www.postgresql.org/docs/15/runtime-config-locks.html) states that this check is expensive from a CPU cycles standpoint, and that the default of 1s is probably the smallest that should be used. Consider raising this timeout much higher to save some CPU cycles.

`max_worker_processes` is a key parameter for performance. It's the total number of background processes allowed. A good starting point is to set this to the number of cores present on the PostgreSQL node.

@@ -69,18 +69,15 @@ max_worker_processes = <number of cores> # Default is 8

synchronous_commit = off # Default is on
max_wal_size = 20GB # Default is 1GB
min_wal_size = 1GB # Default is 80MB
-wal_recycle = off # Default is on
```

If `synchronous_commit` is on (default), it tells the WAL processor to wait until more of the log is applied before reporting success to clients. Turning this off means that the PostgreSQL instance will report success to clients sooner. This will result in a performance improvement. It is safe to turn this off in most cases, but keep in mind that it will increase the risk of losing transactions if there is a crash. However, it will not increase the risk of data corruption.

In high load scenarios, check pointing can happen very often. In fact, in testing with HammerDB, there may be so much check pointing that PostgreSQL reports warnings. One way to reduce how often check pointing occurs is to increase the `max_wal_size` of the WAL log. Setting it to 20GB can make the excessive check pointing warnings go away. `min_wal_size` can also be increased to help absorb spikes in WAL log usage under high load.

-`wal_recycle` does not impact performance. However, in scenarios where a large amount of data is being loaded (for example, restoring a database), turning this off will speed up the data load and reduce the chances of replication errors to occur if streaming replication is used.
-
### Planner/Optimizer configuration

-The optimizer (also called planner) is responsible for taking statistics about the execution of previous queries, and using that information to figure out what is the fastest way to process new queries. Some of these statistics include shared buffer hit/miss rate, execution time of sequential scans, and execution time of index scans. Below are some parameters that affect the optimizer.
+The optimizer (also called planner) is responsible for taking statistics about the execution of previous queries, and using that information to figure out what is the fastest way to process new queries. Some of these statistics include shared buffer hit/miss rate, execution time of sequential scans, and execution time of index scans. Below are some parameters that affect the optimizer.

```output
effective_cache_size = <80% of system memory> # Default is 4GB
@@ -91,7 +88,7 @@ One key piece of information that a `PostgreSQL` instance will not have access t

**How does `effective_cache_size` affect the optimizer and help performance?**

-When data is loaded into the PostgreSQL shared buffer, the same data may also be present in the page cache. It is also possible that data that isn't in the shared buffer is present in the page cache. This second case creates a scenario where tuning `effective_cache_size` can help improve performance.
+When data is loaded into the PostgreSQL shared buffer, the same data may also be present in the page cache. It is also possible that data that isn't in the shared buffer is present in the page cache.
This second case creates a scenario where tuning `effective_cache_size` can help improve performance. Sometimes `PostgreSQL` needs to read data that is not in the shared buffer, but it is in the page cache. From the perspective of `PostgreSQL`, there will be a shared buffer miss when it tries to read the data. When this happens, the `PostgreSQL` instance will assume that reading this data will be slow because it will come from disk. It assumes the data will come from disk because `PostgreSQL` has no way to know if the data is in the page cache. However, if it turns out that the data is present in the page cache, the data will be read faster than if it was read from disk. diff --git a/data/stats_current_test_info.yml b/data/stats_current_test_info.yml index 38e4eef117..5349ad49e1 100644 --- a/data/stats_current_test_info.yml +++ b/data/stats_current_test_info.yml @@ -182,7 +182,8 @@ sw_categories: tests_and_status: [] postgresql_tune: readable_title: Learn how to Tune PostgreSQL - tests_and_status: [] + tests_and_status: + - ubuntu:latest: passed ran: readable_title: Get started with the Arm 5G RAN Acceleration Library (ArmRAL) tests_and_status: []