diff --git a/.github/workflows/test-lp.yml b/.github/workflows/test-lp.yml index adb71197d7..47b3f60270 100644 --- a/.github/workflows/test-lp.yml +++ b/.github/workflows/test-lp.yml @@ -17,7 +17,7 @@ jobs: uses: tj-actions/changed-files@v46 with: files: | - **.md + content/**/**.md - name: Check for capital letters or spaces in content directory run: | echo "Checking for capital letters or spaces in content directory paths (excluding file extensions)..." diff --git a/content/install-guides/dcperf.md b/content/install-guides/dcperf.md index 42f21f6f3e..024965bb54 100644 --- a/content/install-guides/dcperf.md +++ b/content/install-guides/dcperf.md @@ -183,7 +183,7 @@ These metrics help you evaluate the performance and reliability of the system un ## Next steps -These are some activites you might like to try next: +These are some activities you might like to try next: * Use the results to compare performance across different systems, hardware configurations, or after making system changes, such as kernel, compiler, or driver updates. diff --git a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md index db39cfc760..4ec73b992b 100644 --- a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md @@ -24,6 +24,7 @@ subjects: ML armips: - Cortex-A - Cortex-M + - Ethos-U operatingsystems: - Linux diff --git a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_index.md b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_index.md index 59fa4fd0e8..9f90e9f987 100644 --- a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_index.md @@ -1,24 +1,20 @@ --- -title: Run and Debug a Linux Software Stack on Arm Virtual Platforms +title: Debug Trusted Firmware-A and the Linux kernel on Arm FVP with Arm Development Studio -minutes_to_complete: 180 +minutes_to_complete: 60 -who_is_this_for: This introductory topic is designed for developers interested in running Linux on Arm Fixed Virtual Platforms (FVPs) and debugging Trusted Firmware-A and the Linux Kernel using Arm Development Studio. +who_is_this_for: This topic is for developers who want to run Linux on Arm Fixed Virtual Platforms (FVPs) and debug both Trusted Firmware-A and the Linux kernel using Arm Development Studio. learning_objectives: - - Run a Linux software stack using Arm Fixed Virtual Platforms. - - Debug the firmware and Linux kernel using Arm Development Studio. - + - Boot and run a Linux software stack on an Arm Fixed Virtual Platform (FVP). + - Debug Trusted Firmware-A and the Linux kernel using Arm Development Studio. prerequisites: - - A Linux computer with Arm Development Studio installed (works only on x86-64). - - Basic knowledge of Assembly and C language. + - A Linux-based x86-64 host computer with Arm Development Studio installed. + - Basic understanding of Assembly and C programming. 
+ author: Qixiang Xu -draft: true -cascade: - draft: true - ### Tags skilllevels: Introductory subjects: Embedded Linux diff --git a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_next-steps.md b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_next-steps.md index c3db0de5a2..10f093701b 100644 --- a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_next-steps.md +++ b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/_next-steps.md @@ -3,6 +3,6 @@ # FIXED, DO NOT MODIFY THIS FILE # ================================================================================ weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. -title: "Next Steps" # Always the same, html page title. +title: "Next steps" # Always the same, html page title. layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. --- diff --git a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/debug.md b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/debug.md index cfde22743a..a064b3006c 100644 --- a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/debug.md +++ b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/debug.md @@ -1,20 +1,20 @@ --- -title: Debug Software Stack +title: Debug the software stack weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Debug the Software Stack with Arm Development Studio +## Debug the software stack with Arm Development Studio Once your software stack is running on the FVP, you can debug Trusted Firmware-A and the Linux kernel using Arm Development Studio (Arm DS). -### Step 1: Install Arm Development Studio +## Install Arm Development Studio Download and install the latest version from the [Arm Development Studio download page](https://developer.arm.com/downloads/view/DS000B). -DWARF 5 is enabled by default in GCC 11 and later. Arm DS v2022.2 or newer is recommended to support DWARF 5 debug information. +DWARF 5 is enabled by default in GCC 11 and later. Arm DS v2022.2 or later is recommended to support DWARF 5 debug information. @@ -24,11 +24,14 @@ Launch Arm DS: ``` -### Step 2: Create a Debug Configuration -1. Open Arm DS, go to Run > Debug Configurations. -2. Select Generic Arm C/C++ Application and create a new configuration. -3. In the Connection tab: - - Choose your FVP model (e.g., Base_A55x4). +## Create a debug configuration + +To create a debug configuration, follow these steps: + +1. Open Arm DS, go to **Run** > **Debug Configurations**. +2. Select **Generic Arm C/C++ Application** and create a new configuration. +3. In the **Connection** tab: + - Choose your FVP model (for example, Base_A55x4) - Enter model parameters: ```output @@ -50,11 +53,11 @@ Launch Arm DS: --data cluster0.cpu0=/output/aemfvp-a/aemfvp-a/fvp-base-revc.dtb@0x83000000 ``` -### Step 3: Load Debug Symbols +## Load debug symbols -In the Debugger tab: -- Select “Connect only to the target.” -- Enable Execute debugger commands and add: +In the **Debugger** tab: +- Select **Connect only to the target**. 
+- Enable Execute debugger commands, and add: ```output add-symbol-file "~/arm/sw/cpufvp-a/arm-tf/build/fvp/debug/bl1/bl1.elf" EL3:0 add-symbol-file "~/arm/sw/cpufvp-a/arm-tf/build/fvp/debug/bl2/bl2.elf" EL1S:0 @@ -62,17 +65,17 @@ add-symbol-file "~/arm/sw/cpufvp-a/arm-tf/build/fvp/debug/bl31/bl31.elf" EL3:0 add-symbol-file "~/arm/sw/cpufvp-a/linux/out/aemfvp-a/defconfig/vmlinux" EL2N:0 ``` -Click Apply and then Close. +Select **Apply** and then **Close**. -### Step 4: Start Debugging +## Start debugging -1. In the Debug Control view, double-click your new configuration. -2. Wait for the target to connect and symbols to load. +1. In the **Debug Control** view, double-click your new configuration. +2. Wait for the target to connect and the symbols to load. 3. Set breakpoints, step through code, and inspect registers or memory. -You might get the following error when starting the debug connection. +You might get the following error when starting the debug connection: -![Connection Failed Screen #center](failed.png) +![Connection Failed Screen #center](failed.png "Connection Failed error message") This means your Arm FVP is not provided by default in the Arm DS installation. Set the `PATH` in this case: @@ -86,4 +89,4 @@ Ensure your FVP instance is running and matches the model and parameters selecte After these steps, you can debug the software stack as shown in the following figure: -![FVP running #center](Select_target.png) +![FVP running #center](Select_target.png "Debug interface in GUI") diff --git a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/intro.md b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/intro.md index 5845609060..e9590f5331 100644 --- a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/intro.md +++ b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/intro.md @@ -6,29 +6,41 @@ weight: 2 layout: learningpathall --- -Arm Fixed Virtual Platforms (FVPs) are simulation models that let you run and test full software stacks on Arm systems before physical hardware is available. They replicate the behavior of Arm CPUs, memory, and peripherals using fast binary translation. +## What are Arm Fixed Virtual Platforms (FVPs)? -### Why Use FVPs? -FVPs are useful for developers who want to: -- Prototype software before silicon availability -- Debug firmware and kernel issues -- Simulate multicore systems +Arm Fixed Virtual Platforms (FVPs) are fast, functional simulation models of Arm hardware. They give you the ability to develop, test, and debug full software stacks. This includes firmware, bootloaders, and operating systems - all without the need for access to physical Arm silicon. FVPs replicate Arm CPU behavior, memory, and peripherals using fast binary translation. -FVPs provide a programmer's view of the hardware, making them ideal for system bring-up, kernel porting, and low-level debugging. +## Why use FVPs? +FVPs are ideal for early software development and system debugging. -### Freely Available Arm Ecosystem FVPs -Several pre-built Armv8-A FVPs can be downloaded for free from the [Arm Ecosystem Models](https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms#Downloads) page. 
Categories include: +Developers can use FVPs to do the following tasks: + +- Prototype firmware and OS code before silicon is available +- Debug complex boot sequences and kernel issues +- Simulate multi-core systems to analyze performance and thread scheduling + +FVPs provide a programmer's view of the hardware, making them ideal for the following: + +* System bring-up +* Kernel porting +* Low-level debug tasks + +## How can I get access to the Arm FVPs? + +You can download prebuilt Armv8-A FVPs at no cost from the [Arm Ecosystem Models](https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms#Downloads) page. + +Available categories include: - Architecture - Automotive - Infrastructure - IoT -A popular model is the **AEMv8-A Base Platform RevC**, which supports Armv8.7 and Armv9-A. The [Arm reference software stack](https://gitlab.arm.com/arm-reference-solutions/arm-reference-solutions-docs/-/blob/master/docs/aemfvp-a/user-guide.rst) is designed for this model. +A popular model is AEMv8-A Base Platform RevC, which simulates generic Armv8.7-A and Armv9-A CPUs and is fully supported by Arm's open-source [reference software stack](https://gitlab.arm.com/arm-reference-solutions/arm-reference-solutions-docs/-/blob/master/docs/aemfvp-a/user-guide.rst). -### CPU-Specific Arm Base FVPs -Other FVPs target specific CPU types and come pre-configured with a fixed number of cores. These are often called **CPU FVPs**. +## CPU-specific Arm-based FVPs +Some FVPs target specific CPU implementations and include fixed core counts. These are known as CPU FVPs. -Here are some examples: +Examples include: - FVP_Base_Cortex-A55x4 - FVP_Base_Cortex-A72x4 - FVP_Base_Cortex-A78x4 @@ -36,11 +48,18 @@ Here are some examples: To use these, request access via [support@arm.com](mailto:support@arm.com). -### Setting Up Your Environment -This Learning Path uses the [Arm reference software stack](https://gitlab.arm.com/arm-reference-solutions/arm-reference-solutions-docs/-/blob/master/docs/aemfvp-a/user-guide.rst). +## Set up your environment +This Learning Path uses the open-source Arm reference software stack, which includes the following: + +* Prebuilt Linux images +* Firmware +* Configuration files To get started: -1. Follow the software user guide to download the stack. -2. Set up the required toolchain and environment variables. -Once configured, you’ll be ready to run and debug Linux on your selected Arm FVP model. \ No newline at end of file +* Follow the [software user guide](https://gitlab.arm.com/arm-reference-solutions/arm-reference-solutions-docs/-/blob/master/docs/aemfvp-a/user-guide.rst) to download the stack. +* Set up your toolchain +* Export environment variables +* Verify your build dependencies + +Once setup is complete, you’ll be ready to boot and debug Linux on your selected Arm FVP model. 
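+
+As a rough, illustrative sketch of what that environment preparation can look like (the paths and toolchain name below are assumptions for illustration, not taken from the user guide):
+
+```bash
+# Illustrative only - follow the reference stack user guide for the exact steps
+export WORKSPACE=$HOME/arm/sw/aemfvp-a                      # assumed workspace location
+export PATH=$HOME/toolchains/arm-gnu-toolchain/bin:$PATH    # assumed cross-toolchain install path
+aarch64-none-linux-gnu-gcc --version                        # confirm the cross-compiler is on your PATH
+```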
\ No newline at end of file diff --git a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/modify.md b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/modify.md index 7a4f4ea097..74c2557e78 100644 --- a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/modify.md +++ b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/modify.md @@ -1,39 +1,45 @@ --- -title: Modify device tree for Linux +title: Modify the device tree for CPU FVPs weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Modify the Device Tree for CPU FVPs +## Ensure the device tree matches your FVP model -To run Linux on Arm CPU FVPs, you need to adjust the device tree to match the hardware features of these platforms. This involves removing unsupported nodes (like SMMU (System Memory Management Unit)and PCI (Peripheral Component Interconnect)) and ensuring CPU affinity values are set correctly. +To run Linux on Arm CPU FVPs, you need to adjust the device tree to match the hardware features of these platforms. This involves removing unsupported nodes, such as the System Memory Management Unit (SMMU) and Peripheral Component Interconnect (PCI), and ensuring that the CPU affinity values are set correctly. -### Step 1: Remove PCI and SMMU Nodes +### Remove PCI and SMMU nodes -CPU FVPs don't support PCI and SMMU. If you don't remove these nodes, Linux will crash at boot with a kernel panic. +CPU FVPs don't support PCI or SMMU. If you leave these nodes in the device tree, Linux will crash at boot with a kernel panic. + +So to workaround this, you need to remove PCI and SMMU nodes: + +Open the device tree file in a text editor: -1. Open the device tree file in a text editor: ```bash vim linux/arch/arm64/boot/dts/arm/fvp-base-revc.dts ``` -2. Delete the following two blocks: +Remove the following nodes: + - `pci@40000000` - `iommu@2b400000` -{{% notice warning %}} -If you skip this, you’ll get an error like: +{{% notice Warning %}} +If you skip this step, you might encounter an error like: ```output Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ``` {{% /notice %}} -### Step 2: Set CPU Affinity Values +### Set CPU affinity values + +Each FVP model uses specific CPU affinity values. If these don’t match the values in the device tree, some of the CPU cores won’t boot. + +Find the correct affinities: -Each FVP model uses specific CPU affinity values. If these don’t match what’s in the device tree, some CPU cores won’t boot. -1. Find the correct affinities: ```bash FVP_Base_Cortex-A55x4 -l | grep pctl.CPU-affinities ``` @@ -43,19 +49,19 @@ Example output: pctl.CPU-affinities=0.0.0.0, 0.0.1.0, 0.0.2.0, 0.0.3.0 ``` -2. Convert each to hex for the reg field: +Convert each to hex for the `reg` field: ```output 0x0, 0x0100, 0x0200, 0x0300 ``` -3. Update the CPU nodes in your device tree file to use these reg values. +Update the CPU nodes in your device tree file to use these `reg` values. -{{% notice tip %}} -To avoid boot errors like psci: failed to boot CPUx (-22), make sure every cpu@xxx entry matches the FVP layout. +{{% notice Tip %}} +To avoid boot errors such as `psci: failed to boot CPUx (-22)`, make sure every `cpu@xxx` entry matches the FVP layout. 
{{% /notice %}} -### Step 3: Rebuild Linux +### Rebuild Linux After editing the device tree, rebuild Linux: diff --git a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/run.md b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/run.md index 4c5f4ac836..cdbfda2db0 100644 --- a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/run.md +++ b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/run.md @@ -1,5 +1,5 @@ --- -title: Run Software Stack +title: Run the Linux software stack on an FVP weight: 5 ### FIXED, DO NOT MODIFY @@ -7,18 +7,20 @@ layout: learningpathall --- -## Run the Linux Software Stack on an FVP +## Launch the Linux software stack on an FVP Once you've built the Linux stack with the correct configuration, you're ready to run it on an Arm CPU Fixed Virtual Platform (FVP). -### Step 1: Verify the Build Output +Replace with the root path to your workspace, and with the location where you want to save the UART output logs. + +## Verify the build output After building, check the output directory to make sure the expected files were generated: ```bash tree output/aemfvp-a/aemfvp-a/ ``` -Expected output: +The expected output is: ```output output/aemfvp-a/aemfvp-a/ @@ -35,9 +37,10 @@ output/aemfvp-a/aemfvp-a/ └── uefi.bin -> ../components/aemfvp-a/uefi.bin ``` -### Step 2: Run the Software Stack +## Run the software stack To launch the software stack on the FVP, use a command like the following: + ```bash FVP_Base_Cortex-A55x4 \ -C pctl.startup=0.0.0.0 \ @@ -59,15 +62,19 @@ FVP_Base_Cortex-A55x4 \ ``` This will boot Trusted Firmware-A, UEFI/U-Boot, Linux, and BusyBox in sequence. -### Step 3: Troubleshoot FVP Launch Issues +## Troubleshoot FVP launch issues -Different FVP models use different CPU instance names. If you see an error like: +Different FVP models use different CPU instance names. + +If you see an error like: ```output -Warning: target instance not found: 'FVP_Base_Cortex_A65AEx4_Cortex_A76AEx4.cluster0.cpu0' (data: 'output/aemfvp-a/aemfvp-a/fvp-base-revc.dtb') +Warning: target instance not found: 'FVP_Base_Cortex_A65AEx4_Cortex_A76AEx4.cluster0.cpu0' (data: 'output/aemfvp-a/aemfvp-afvp-base-revc.dtb') ``` -You need to identify the correct instance name for your platform. Run: +Identify the correct instance name for your platform. + +Run: ```bash FVP_Base_Cortex-A65AEx4+Cortex-A76AEx4 -l | grep RVBARADDR | grep cpu0 @@ -86,12 +93,12 @@ Update your --data parameters accordingly: --data cluster0.subcluster0.cpu0.thread0=/output/aemfvp-a/aemfvp-a/fvp-base-revc.dtb@0x83000000 ``` -{{% notice tip %}} -Always confirm the CPU instance name when switching between different FVP models. +{{% notice Tip %}} +Always check the name of the CPU instance when switching between different FVP models. 
{{% /notice %}} -### Optional: Use the GUI +## Use the GUI (optional) You can also run the FVP using its graphical user interface: -![GUI #center](FVP.png) +![GUI #center](FVP.png "View of the FVP GUI") diff --git a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/steps.md b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/steps.md index 3ab3891c1d..c7f2e46a82 100644 --- a/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/steps.md +++ b/content/learning-paths/embedded-and-microcontrollers/linux-on-fvp/steps.md @@ -1,43 +1,43 @@ --- -title: Use TF-A extra build options to build cpu_ops into images +title: Configure Trusted Firmware-A build flags to include cpu_ops support weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Build TF-A with CPU Operations Support +## Build TF-A with cpu_ops support -Some Arm FVPs require CPU-specific initialization routines to boot properly. These routines are part of the TF-A `cpu_ops` framework. +Some Arm Fixed Virtual Platforms (FVPs) require CPU-specific initialization routines to boot Linux successfully. The Trusted Firmware-A `cpu_ops` framework provides these routines. -### What are cpu_ops? +## What are cpu_ops? The `cpu_ops` framework in Trusted Firmware-A contains functions to: + - Handle CPU resets - Manage power states - Apply errata workarounds -Each CPU type has its own implementation, defined in files like: +Each CPU type has its own implementation, defined in files such as: ```output lib/cpus/aarch64/cortex_a55.S lib/cpus/aarch64/cortex_a53.S -... etc. ``` -## Why you need this +## Why are cpu_ops required? -If the firmware is built without proper cpu_ops, you’ll hit an assertion failure like: +If the firmware is built without the proper cpu_ops, you’ll hit an assertion failure like: ```output ASSERT: File lib/cpus/aarch64/cpu_helpers.S Line 00035 ``` -This means the required CPU operation routines are missing from the build. +This means that the required CPU operation routines are missing from the build. -## Step-by-Step: Add TF-A Build Flags +## How do I include the correct cpu_ops? -To include the correct `cpu_ops`, you need to set TF-A build options depending on the CPU. +To include the correct `cpu_ops`, you need to set TF-A build options depending on the CPU, using the build flags. -### Example: A55 CPU FVP +### For the A55 CPU FVP Add the following line to your TF-A build script: @@ -45,7 +45,12 @@ Add the following line to your TF-A build script: ARM_TF_BUILD_FLAGS="$ARM_TF_BUILD_FLAGS HW_ASSISTED_COHERENCY=1 USE_COHERENT_MEM=0" ``` -### Example: A78 CPU FVP +These flags enable hardware-assisted cache coherency and disable use of coherent memory, which is typical for Cortex-A55 FVPs. + +### For the A78 CPU FVP + +Add the following line to your TF-A build script: + ```output ARM_TF_BUILD_FLAGS="$ARM_TF_BUILD_FLAGS HW_ASSISTED_COHERENCY=1 USE_COHERENT_MEM=0 CTX_INCLUDE_AARCH32_REGS=0" ``` @@ -53,7 +58,9 @@ ARM_TF_BUILD_FLAGS="$ARM_TF_BUILD_FLAGS HW_ASSISTED_COHERENCY=1 USE_COHERENT_MEM USE_COHERENT_MEM=1 cannot be used with HW_ASSISTED_COHERENCY=1. {{% /notice %}} -## Rebuild and Package +This configuration disables 32-bit context registers (specific to AArch64-only CPUs like Cortex-A78) and applies the same coherency settings as above. 
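+
+If you switch between CPU FVPs frequently, you can select the flags from a single build script. The sketch below is illustrative only, and the `TARGET_CPU` variable is an assumption rather than part of the reference build scripts:
+
+```bash
+# Hypothetical helper: pick the TF-A build flags for the target CPU FVP
+TARGET_CPU="${TARGET_CPU:-a55}"   # assumed variable; set to a55 or a78
+
+case "$TARGET_CPU" in
+  a55)
+    ARM_TF_BUILD_FLAGS="$ARM_TF_BUILD_FLAGS HW_ASSISTED_COHERENCY=1 USE_COHERENT_MEM=0"
+    ;;
+  a78)
+    ARM_TF_BUILD_FLAGS="$ARM_TF_BUILD_FLAGS HW_ASSISTED_COHERENCY=1 USE_COHERENT_MEM=0 CTX_INCLUDE_AARCH32_REGS=0"
+    ;;
+esac
+```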
+ +## Rebuild and package Run the following commands to rebuild TF-A and integrate it into the BusyBox image: ```bash @@ -61,3 +68,6 @@ Run the following commands to rebuild TF-A and integrate it into the BusyBox ima ./build-scripts/build-arm-tf.sh -p aemfvp-a -f busybox build ./build-scripts/aemfvp-a/build-test-busybox.sh -p aemfvp-a package ``` +Once the build completes, your firmware will include the correct CPU operation routines, allowing Linux to boot correctly on the target FVP. + +After packaging, you can boot the updated firmware on your FVP and verify that Linux reaches userspace without triggering early boot assertions. \ No newline at end of file diff --git a/content/learning-paths/embedded-and-microcontrollers/yocto_qemu/yocto_build.md b/content/learning-paths/embedded-and-microcontrollers/yocto_qemu/yocto_build.md index 39d3afb11e..8fe1a8777a 100644 --- a/content/learning-paths/embedded-and-microcontrollers/yocto_qemu/yocto_build.md +++ b/content/learning-paths/embedded-and-microcontrollers/yocto_qemu/yocto_build.md @@ -17,19 +17,23 @@ Developers can configure their custom builds of Yocto using a set of recipes. In Poky is a reference distribution of the Yocto Project. It is a great starting point to build your own custom distribution as it contains both the build system and the the baseline functional distribution. Along with containing recipes for real target boards, it also contains the recipes for building the image, for example 64-bit Arm machines supported in QEMU. The example 64-bit machine emulated by QEMU does not emulate any particular board but is a great starting point to learn and try the basics of running this distribution. -The first step is to install the packages required to build and run Yocto: +The first step is to install the packages required to build and run Yocto. + +For Ubuntu 22.04 and later: ```bash sudo apt update -sudo apt-get install -y gawk wget git-core diffstat unzip texinfo build-essential chrpath socat cpio python3 python3-pip python3-pexpect xz-utils debianutils iputils-ping python3-git python3-jinja2 libegl1-mesa libsdl1.2-dev pylint xterm python3-subunit mesa-common-dev lz4 +sudo apt-get install -y gawk wget git-core diffstat unzip texinfo build-essential chrpath socat cpio python3 python3-pip python3-pexpect xz-utils debianutils iputils-ping python3-git python3-jinja2 libgl1 libglx-mesa0 libsdl1.2-dev pylint xterm python3-subunit mesa-common-dev lz4 ``` -Now download the Poky reference distribution and checkout the branch/tag you wish to build. You will build `yocto-4.0.6` in this example. + +Now download the Poky reference distribution and checkout the branch/tag you wish to build. You will build `yocto-5.0.10` in this example. ```bash git clone git://git.yoctoproject.org/poky cd poky -git checkout tags/yocto-4.0.6 -b yocto-4.0.6-local +git checkout tags/yocto-5.0.10 -b yocto-5.0.10-local ``` + Next source the script as shown below to initialize your build environment for your 64-bit Arm example machine QEMU target: ```bash @@ -80,6 +84,14 @@ You will now be in the `build-qemu-arm64` directory which is your build director ```bash { cwd="poky" } sed -i '/qemuarm64/s/^#//g' conf/local.conf ``` +{{% notice Note %}} +On Ubuntu systems with apparmor you will need to allow unprivileged users to run bitbake. +{{% /notice %}} + +```bash +echo 0 | sudo tee /proc/sys/kernel/apparmor_restrict_unprivileged_userns +``` + With the right machine now selected, proceed to building the minimal core image for your target. 
```bash { cwd="poky",env_source="poky/oe-init-build-env build-qemu-arm64" } diff --git a/content/learning-paths/laptops-and-desktops/_index.md b/content/learning-paths/laptops-and-desktops/_index.md index 045d90596b..b3aa2da78f 100644 --- a/content/learning-paths/laptops-and-desktops/_index.md +++ b/content/learning-paths/laptops-and-desktops/_index.md @@ -34,8 +34,8 @@ tools_software_languages_filter: - C/C++: 4 - CCA: 1 - Clang: 11 -- CMake: 2 - cmake: 1 +- CMake: 2 - Coding: 16 - CSS: 1 - Daytona: 1 diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md index 2fb1f02f50..f6de0e33c7 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md @@ -12,7 +12,7 @@ learning_objectives: prerequisites: - A computer running Arm Linux or macOS with Docker installed. -author: Arnaud de Grandmaison. +author: Arnaud de Grandmaison test_images: - ubuntu:latest diff --git a/content/learning-paths/servers-and-cloud-computing/_index.md b/content/learning-paths/servers-and-cloud-computing/_index.md index 94addb9e43..b58279463c 100644 --- a/content/learning-paths/servers-and-cloud-computing/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/_index.md @@ -8,7 +8,7 @@ key_ip: maintopic: true operatingsystems_filter: - Android: 2 -- Linux: 150 +- Linux: 152 - macOS: 10 - Windows: 14 pinned_modules: @@ -18,12 +18,12 @@ pinned_modules: - providers - migration subjects_filter: -- CI-CD: 5 +- CI-CD: 6 - Containers and Virtualization: 28 - Databases: 15 - Libraries: 9 - ML: 27 -- Performance and Architecture: 59 +- Performance and Architecture: 60 - Storage: 1 - Web: 10 subtitle: Optimize cloud native apps on Arm for performance and cost @@ -39,6 +39,7 @@ tools_software_languages_filter: - Arm Compiler for Linux: 1 - Arm Development Studio: 3 - Arm ISA: 1 +- Arm Performance Libraries: 1 - armclang: 1 - armie: 1 - ArmRAL: 1 @@ -53,8 +54,8 @@ tools_software_languages_filter: - AWS Graviton: 1 - Azure CLI: 1 - Azure Portal: 1 -- bash: 2 - Bash: 1 +- bash: 2 - Bastion: 3 - BOLT: 2 - bpftool: 1 @@ -83,7 +84,7 @@ tools_software_languages_filter: - Fortran: 1 - FunASR: 1 - FVP: 4 -- GCC: 21 +- GCC: 22 - gdb: 1 - Geekbench: 1 - GenAI: 11 @@ -108,6 +109,7 @@ tools_software_languages_filter: - Kubernetes: 10 - Lambda: 1 - libbpf: 1 +- Libmath: 1 - Linaro Forge: 1 - Litmus7: 1 - LLM: 9 @@ -128,6 +130,7 @@ tools_software_languages_filter: - Ollama: 1 - ONNX Runtime: 1 - OpenBLAS: 1 +- OpenShift: 1 - OrchardCore: 1 - PAPI: 1 - perf: 5 @@ -150,6 +153,7 @@ tools_software_languages_filter: - SVE: 5 - SVE2: 2 - Sysbench: 1 +- Tekton: 1 - Telemetry: 1 - TensorFlow: 2 - Terraform: 11 diff --git a/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/_index.md b/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/_index.md new file mode 100644 index 0000000000..c73b42c745 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/_index.md @@ -0,0 +1,62 @@ +--- +title: Build multi-architecture applications with Red Hat OpenShift Pipelines on AWS + +draft: true +cascade: + draft: true + +minutes_to_complete: 30 + +who_is_this_for: This topic is for OpenShift administrators interested in migrating their applications to Arm. + + +learning_objectives: + - Migrate existing OpenShift applications to Arm. 
+ +prerequisites: + - An AWS account with an OpenShift 4.18 cluster with x86 nodes installed and configured. + - Red Hat OpenShift Pipelines (Tekton) operator installed in your cluster. + - Familiarity with Red Hat OpenShift (oc CLI), container concepts, and basic Tekton principles (Task, Pipeline, PipelineRun). + - Access to your Red Hat OpenShift cluster with cluster-admin or equivalent privileges for node configuration and pipeline setup. + +author: Jeff Young + +### Tags +skilllevels: Advanced +subjects: CI-CD +armips: + - Neoverse +tools_software_languages: + - Tekton + - OpenShift +operatingsystems: + - Linux + +further_reading: + - resource: + title: Red Hat OpenShift Documentation + link: https://docs.openshift.com/container-platform/latest/welcome/index.html + type: documentation + - resource: + title: OpenShift Pipelines (Tekton) Documentation + link: https://docs.openshift.com/container-platform/latest/cicd/pipelines/understanding-openshift-pipelines.html + type: documentation + - resource: + title: OpenShift Multi-Architecture Compute Machines + link: https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/postinstallation_configuration/configuring-multi-architecture-compute-machines-on-an-openshift-cluster + type: documentation + - resource: + title: OpenShift ImageStreams Documentation + link: https://docs.openshift.com/container-platform/latest/openshift_images/image-streams-managing.html + type: documentation + - resource: + title: Migrating to Multi-Architecture Compute Machines + link: https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html-single/updating_clusters/#migrating-to-multi-payload + type: documentation + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/migrationsteps.md b/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/migrationsteps.md new file mode 100644 index 0000000000..3ebb0a6ecf --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/arm-cost-savings/migrationsteps.md @@ -0,0 +1,207 @@ +--- +title: Migrate an x86 workload to Arm on AWS +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +Migrating workloads from x86 to Arm on AWS can help reduce costs and improve performance. 
The steps below explain how to assess your workload compatibility, enable multi-architecture support in Red Hat OpenShift, configure arm64 nodes, rebuild and verify container images, and safely transition your deployments. + +This example assumes you have the [OpenShift Pipelines Tutorial](https://github.com/openshift/pipelines-tutorial) built and running on x86. + +### 1. Assess Workload Compatibility + +Before migrating, determine whether your applications can run on 64-bit Arm architecture. Most modern applications built with portable runtimes such as Java, Go, Python, or Node.js can run seamlessly on 64-bit Arm with little or no modifications. Check your container images and dependencies for 64-bit Arm compatibility. + +To check if your container images support multiple architectures (such as arm64 and amd64), you can use tools like [KubeArchInspect](https://learn.arm.com/learning-paths/servers-and-cloud-computing/kubearchinspect/) to analyze images in your Kubernetes cluster. + +Additionally, you can use the Python script provided in [Learn how to use Docker](https://learn.arm.com/learning-paths/cross-platform/docker/check-images/) to inspect images for multi-architecture support. + +The OpenShift Pipelines Tutorial supports arm64 and doesn't have any architecture restrictions. + +### 2. Enable Multi-Arch Support in Red Hat OpenShift + +Red Hat OpenShift supports multi-architecture workloads, allowing you to run both x86 and Arm based nodes in the same cluster. + +Red Hat OpenShift's [documentation](https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/postinstallation_configuration/configuring-multi-architecture-compute-machines-on-an-openshift-cluster#multi-architecture-verifying-cluster-compatibility_creating-multi-arch-compute-nodes-aws) provides full details for the process. + +To check if your cluster is multi-architecture compatible, use the OpenShift CLI and run: + +```bash +oc adm release info -o jsonpath="{ .metadata.metadata }" +``` + +If the output includes `"release.openshift.io/architecture": "multi"`, your cluster supports multi-architecture compute nodes. + +If your cluster is not multi-architecture compatible, you must migrate it to use the multi-architecture release payload. This involves updating your OpenShift cluster to a version that supports multi-architecture and switching to the multi-architecture payload. For step-by-step instructions, see the OpenShift documentation on [migrating to a cluster with multi-architecture compute machines](https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html-single/updating_clusters/#migrating-to-multi-payload). Once migration is complete, you can add compute nodes with different architectures. + +### 3. Add 64-bit Arm MachineSets + +To take advantage of Arm-based compute, you need to add new MachineSets to your OpenShift cluster that use Arm (Graviton) EC2 instances. This step enables your cluster to schedule workloads on Arm nodes, allowing you to run and test your applications on the target architecture while maintaining your existing x86 nodes for a smooth migration. + +#### Decide on a scheduling strategy + +When introducing Arm nodes into your OpenShift cluster, you need to control which workloads are scheduled onto these new nodes. There are two main approaches: + +- **Manual scheduling with Taints and Tolerations:** By applying a taint to your Arm nodes, you ensure that only workloads with a matching toleration are scheduled there. 
This gives you precise control over which applications run on Arm, making it easier to test and migrate workloads incrementally. +- **Automated scheduling with the Multiarch Tuning Operator:** This operator helps automate the placement of workloads on the appropriate architecture by managing node affinity and tolerations for you. This is useful for larger environments or when you want to simplify multi-architecture workload management. + +For scenarios with a single workload in the build pipeline, the manual taint and toleration method can be used. The following taint can be added to new Arm machine sets: + +``` + taints: + - effect: NoSchedule + key: newarch + value: arm64 +``` + +This prevents existing x86 workloads from being scheduled to the Arm nodes, ensuring only workloads that explicitly tolerate this taint will run on Arm. + +#### Reimport needed ImageStreams with import-mode set to PreserveOriginal + +When running workloads on Arm nodes, you may need to ensure that the required container images are available for the arm64 architecture. OpenShift uses ImageStreams to manage container images, and by default, these may only include x86 (amd64) images. + +To make arm64 images available, you should reimport the necessary ImageStreams with the `PreserveOriginal` import mode. This ensures that all available architectures for an image are imported and preserved, allowing your Arm nodes to pull the correct image variant. + +For example, to reimport the `php` and `python` ImageStreams: + +```bash +oc import-image php -n openshift --all --confirm --import-mode='PreserveOriginal' +oc import-image python -n openshift --all --confirm --import-mode='PreserveOriginal' +``` + +This step is important to avoid image pull errors when deploying workloads to Arm nodes. + +### 4. Rebuild and Verify Container Images + +To build 64-bit Arm compatible images, the OpenShift Pipelines Tutorial has been modified to patch deployments with the Tekton Task's podTemplate information. This will allow you to pass a podTemplate for building and deploying your newly built application on the target architecture. It also makes it easy to revert back to 64-bit x86 by re-running the pipeline without the template. + +{{% notice Note %}} +Red Hat OpenShift only supports native architecture container builds. Cross-architecture container builds are not supported. +{{% /notice %}} + +Create a podTemplate defining a toleration and a node affinity to make the builds deploy on Arm machines. + +Save the code below in a file named `arm64.yaml` + +```yaml +tolerations: + - key: "newarch" + value: "arm64" + operator: "Equal" + effect: "NoSchedule" +affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: "kubernetes.io/arch" + operator: "In" + values: + - "arm64" + - key: "kubernetes.io/os" + operator: "In" + values: + - "linux" +``` + +Next the `02_update_deployment_task.yaml` file needs to be updated. This includes extract patching to include the podTemplate's nodeAffinity/tolerations. 
+
+Modify the file `02_update_deployment_task.yaml` to contain the information below:
+
+```yaml
+apiVersion: tekton.dev/v1
+kind: Task
+metadata:
+  name: update-deployment
+spec:
+  params:
+    - name: deployment
+      description: The name of the deployment patch the image
+      type: string
+    - name: IMAGE
+      description: Location of image to be patched with
+      type: string
+    - name: taskrun-name
+      type: string
+      description: Name of the current TaskRun (injected from context)
+  steps:
+    - name: patch
+      image: image-registry.openshift-image-registry.svc:5000/openshift/cli:latest
+      command: ["/bin/bash", "-c"]
+      args:
+        - |-
+          oc patch deployment $(inputs.params.deployment) --patch='{"spec":{"template":{"spec":{
+            "containers":[{
+              "name": "$(inputs.params.deployment)",
+              "image":"$(inputs.params.IMAGE)"
+            }]
+          }}}}'
+          # Find my own TaskRun name
+          MY_TASKRUN_NAME="$(params.taskrun-name)"
+          echo "TaskRun name: $MY_TASKRUN_NAME"
+          # Fetch the podTemplate
+          PODTEMPLATE_JSON=$(kubectl get taskrun "$MY_TASKRUN_NAME" -o jsonpath='{.spec.podTemplate}')
+          if [ -z "$PODTEMPLATE_JSON" ]; then
+            echo "No podTemplate found in TaskRun...Remove tolerations and affinity."
+            oc patch deployment "$(inputs.params.deployment)" \
+              --type merge \
+              -p "{\"spec\": {\"template\": {\"spec\": {\"tolerations\": null, \"affinity\": null}}}}"
+          else
+            echo "Found podTemplate:"
+            echo "$PODTEMPLATE_JSON"
+            oc patch deployment "$(inputs.params.deployment)" \
+              --type merge \
+              -p "{\"spec\": {\"template\": {\"spec\": $PODTEMPLATE_JSON }}}"
+          fi
+          # issue: https://issues.redhat.com/browse/SRVKP-2387
+          # images are deployed with tag. on rebuild of the image tags are not updated, hence redeploy is not happening
+          # as a workaround update a label in template, which triggers redeploy pods
+          # target label: "spec.template.metadata.labels.patched_at"
+          # NOTE: this workaround works only if the pod spec has imagePullPolicy: Always
+          patched_at_timestamp=`date +%s`
+          oc patch deployment $(inputs.params.deployment) --patch='{"spec":{"template":{"metadata":{
+            "labels":{
+              "patched_at": '\"$patched_at_timestamp\"'
+            }
+          }}}}'
+```
+
+Finally, the `04_pipeline.yaml` file needs to be updated to pass the taskrun-name to the update-deployment task. The modifications are below:
+
+```yaml
+- name: update-deployment
+  taskRef:
+    name: update-deployment
+  params:
+    - name: deployment
+      value: $(params.deployment-name)
+    - name: IMAGE
+      value: $(params.IMAGE)
+    - name: taskrun-name             # add these
+      value: $(context.taskRun.name) # two lines
+```
+
+Now the UI and API can be redeployed using the `arm64.yaml` podTemplate. This will force all parts of the build pipeline and deployment to the tainted Arm nodes.
+
+Run the `tkn` command:
+
+```bash
+tkn pipeline start build-and-deploy \
+--prefix-name build-deploy-api-pipelinerun-arm64 \
+-w name=shared-workspace,volumeClaimTemplateFile=https://raw.githubusercontent.com/openshift/pipelines-tutorial/master/01_pipeline/03_persistent_volume_claim.yaml \
+-p deployment-name=pipelines-vote-api \
+-p git-url=https://github.com/openshift/pipelines-vote-api.git \
+-p IMAGE=image-registry.openshift-image-registry.svc:5000/pipelines-tutorial/pipelines-vote-api-arm64 \
+--use-param-defaults \
+--pod-template arm64.yaml
+```
+
+Once the pods are up and running, you can safely remove the x86 worker nodes from the cluster, and remove the taints from the Arm worker nodes (if you choose to do so).
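+
+If you later decide to clear the taint from the Arm nodes, a command like the following removes it (the node name is a placeholder; the trailing `-` removes the taint):
+
+```bash
+oc adm taint nodes <arm-node-name> newarch=arm64:NoSchedule-
+```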
+ +### Conclusion + +Automating native builds for different architectures using Red Hat OpenShift Pipelines on Red Hat OpenShift 4.18 on AWS streamlines the development and deployment of versatile applications. + +By setting up distinct pipelines that leverage nodeSelector to build on x86 and Arm nodes, you ensure that your application components are optimized for their target environments. This approach provides a clear and manageable way to embrace multi-architecture computing in the cloud. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md index 8bb5274402..69a7b05852 100644 --- a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md @@ -34,8 +34,8 @@ Configure the build for debug: ```bash mkdir build && cd build -cmake .. -DCMAKE_C_FLAGS="-O3 -mcpu=neoverse-n2 -Wno-enum-constexpr-conversion -fno-reorder-blocks-and-partition" \ - -DCMAKE_CXX_FLAGS="-O3 -mcpu=neoverse-n2 -Wno-enum-constexpr-conversion -fno-reorder-blocks-and-partition" \ +cmake .. -DCMAKE_C_FLAGS="-O3 -march=native -Wno-enum-constexpr-conversion -fno-reorder-blocks-and-partition" \ + -DCMAKE_CXX_FLAGS="-O3 -march=native -Wno-enum-constexpr-conversion -fno-reorder-blocks-and-partition" \ -DCMAKE_CXX_LINK_FLAGS="-Wl,--emit-relocs" -DCMAKE_C_LINK_FLAGS="-Wl,--emit-relocs" -G Ninja \ -DWITH_BOOST=$HOME/boost -DDOWNLOAD_BOOST=On -DWITH_ZLIB=bundled -DWITH_LZ4=system -DWITH_SSL=system ``` @@ -91,83 +91,104 @@ The partial output is: If the symbols are missing, rebuild the binary with debug info and no stripping. -## Instrument the binary with BOLT -Use `llvm-bolt` to create an instrumented version of the binary: +## Prepare MySQL server before running workloads + +Before running the workload, you may need to initialize a new data directory if this is your first run: ```bash -llvm-bolt $HOME/mysql-server/build/runtime_output_directory/mysqld \ - -instrument \ - -o $HOME/mysql-server/build/runtime_output_directory/mysqld.instrumented \ - --instrumentation-file=$HOME/mysql-server/build/profile-readonly.fdata \ - --instrumentation-sleep-time=5 \ - --instrumentation-no-counters-clear \ - --instrumentation-wait-forks +# Initialize a new data directory +# Run this from the root of your MySQL source directory (e.g. $HOME/mysql-server). This creates an empty database in the data/ directory. +bin/mysqld --initialize-insecure --datadir=data ``` -### Explanation of key options - -- `-instrument`: Enables profile generation instrumentation -- `--instrumentation-file`: Path where the profile output will be saved -- `--instrumentation-wait-forks`: Ensures the instrumentation continues through forks (important for daemon processes) - - -## Start the instrumented MySQL server - -Before running the workload, start the instrumented MySQL server in a separate terminal. You may need to initialize a new data directory if this is your first run: +Start the instrumented server. On an 8-core system, use available cores (e.g., 2 for mysqld, 7 for sysbench). Run the command from build directory. 
```bash -# Initialize a new data directory (if needed) -$HOME/mysql-server/build/runtime_output_directory/mysqld.instrumented --initialize-insecure --datadir=$HOME/mysql-bolt-data - -# Start the instrumented server -# On an 8-core system, use available cores (e.g., 6 for mysqld, 7 for sysbench) -taskset -c 6 $HOME/mysql-server/build/runtime_output_directory/mysqld.instrumented \ - --datadir=$HOME/mysql-bolt-data \ - --socket=$HOME/mysql-bolt.sock \ - --port=3306 \ - --user=$(whoami) & +taskset -c 2 ./bin/mysqld \ + --datadir=data \ + --max-connections=64 \ + --back-log=10000 \ + --innodb-buffer-pool-instances=128 \ + --innodb-file-per-table \ + --innodb-sync-array-size=1024 \ + --innodb-flush-log-at-trx-commit=1 \ + --innodb-io-capacity=5000 \ + --innodb-io-capacity-max=10000 \ + --tmp-table-size=16M \ + --max-heap-table-size=16M \ + --log-bin=1 \ + --sync-binlog=1 \ + --innodb-stats-persistent \ + --innodb-read-io-threads=4 \ + --innodb-write-io-threads=4 \ + --key-buffer-size=16M \ + --max-allowed-packet=16M \ + --max-prepared-stmt-count=2000000 \ + --innodb-flush-method=fsync \ + --innodb-log-buffer-size=64M \ + --read-buffer-size=262144 \ + --read-rnd-buffer-size=524288 \ + --binlog-format=MIXED \ + --innodb-purge-threads=1 \ + --table-open-cache=8000 \ + --table-open-cache-instances=16 \ + --open-files-limit=1048576 \ + --default-authentication-plugin=mysql_native_password ``` Adjust `--datadir`, `--socket`, and `--port` as needed for your environment. Make sure the server is running and accessible before proceeding. -With the database running, open a second terminal to run the client commands. +With the database running, open a second terminal to create a benchmark User and third terminal to run the client commands. -## Install sysbench +In the new terminal, navigate to the build directory: -You will need sysbench to generate workloads for MySQL. On most Arm Linux distributions, you can install it using your package manager: +```bash +cd $HOME/mysql-server/build +``` +## Create Benchmark User and Database + +Run once after initializing MySQL for the first time: ```bash -sudo apt update -sudo apt install -y sysbench +bin/mysql -u root <<< " +CREATE USER 'bench'@'localhost' IDENTIFIED BY 'bench'; +CREATE DATABASE bench; +GRANT ALL PRIVILEGES ON *.* TO 'bench'@'localhost' WITH GRANT OPTION; +FLUSH PRIVILEGES;" ``` -Alternatively, see the [sysbench GitHub page](https://github.com/akopytov/sysbench) for build-from-source instructions if a package is not available for your platform. +This sets up the bench user and the bench database with full privileges. Do not repeat this before every test — it is only required once. -## Create a test database and user +## Reset Benchmark Database Between Runs -For sysbench to work, you need a test database and user. Connect to the MySQL server as the root user (or another admin user) and run: +This clears all existing tables and data from the bench database, giving you a clean slate for sysbench prepare without needing to recreate the user or reinitialize the datadir. ```bash -mysql -u root --socket=$HOME/mysql-bolt.sock +bin/mysql -u root <<< "DROP DATABASE bench; CREATE DATABASE bench;" ``` -Then, in the MySQL shell: +## Install and build sysbench -```sql -CREATE DATABASE IF NOT EXISTS bench; -CREATE USER IF NOT EXISTS 'bench'@'localhost' IDENTIFIED BY 'bench'; -GRANT ALL PRIVILEGES ON bench.* TO 'bench'@'localhost'; -FLUSH PRIVILEGES; -EXIT; +In a third terminal, run the commands below if you have not run sysbench yet. 
+ +```bash +git clone https://github.com/akopytov/sysbench.git +cd sysbench +./autogen.sh +./configure +make -j$(nproc) +export LD_LIBRARY_PATH=/usr/local/mysql/lib/ ``` -## Run the instrumented binary under a feature-specific workload +Use `./src/sysbench` for running benchmarks unless installed globally. + +## Create a dataset with sysbench Run `sysbench` with the `prepare` option: ```bash -sysbench \ +./src/sysbench \ --db-driver=mysql \ --mysql-host=127.0.0.1 \ --mysql-db=bench \ @@ -177,13 +198,62 @@ sysbench \ --tables=8 \ --table-size=10000 \ --threads=1 \ - /usr/share/sysbench/oltp_read_only.lua prepare + src/lua/oltp_read_write.lua prepare ``` +## Shutdown MySQL and snapshot dataset for fast reuse + +Do these steps once at the start from MySQL source directory + +```bash +bin/mysqladmin -u root shutdown +mv data data-orig +``` + +This saves the populated dataset before benchmarking. + +```bash +rm -rf /dev/shm/dataset +cp -R data-orig/ /dev/shm/dataset +``` + +From MySQL source directory, + +```bash +ln -s /dev/shm/dataset/ data +``` + +This links the MySQL --datadir to a fast in-memory copy, ensuring every test starts from a clean, identical state. + +## Instrument the binary with BOLT + +Use `llvm-bolt` to create an instrumented version of the binary: + +```bash +llvm-bolt $HOME/mysql-server/build/bin/mysqld \ + -instrument \ + -o $HOME/mysql-server/build/bin/mysqldreadonly.instrumented \ + --instrumentation-file=$HOME/mysql-server/build/profile-readonly.fdata \ + --instrumentation-sleep-time=5 \ + --instrumentation-no-counters-clear \ + --instrumentation-wait-forks \ + 2>&1 | tee $HOME/mysql-server/bolt-instrumentation-readonly.log +``` + +### Explanation of key options + +- `-instrument`: Enables profile generation instrumentation +- `--instrumentation-file`: Path where the profile output will be saved +- `--instrumentation-wait-forks`: Ensures the instrumentation continues through forks (important for daemon processes) + +## Run the instrumented binary under a feature-specific workload + +Start the MySQL instrumented binary in first terminal. + Use a workload generator to stress the binary in a feature-specific way. For example, to simulate **read-only traffic** with sysbench: ```bash -taskset -c 7 sysbench \ +taskset -c 7 ./src/sysbench \ --db-driver=mysql \ --mysql-host=127.0.0.1 \ --mysql-db=bench \ @@ -192,16 +262,30 @@ taskset -c 7 sysbench \ --mysql-port=3306 \ --tables=8 \ --table-size=10000 \ + --forced-shutdown \ + --report-interval=60 \ + --rand-type=uniform \ + --time=5 \ --threads=1 \ - /usr/share/sysbench/oltp_read_only.lua run + --simple-ranges=1 \ + --distinct-ranges=1 \ + --sum-ranges=1 \ + --order-ranges=1 \ + --point-selects=10 \ + src/lua/oltp_read_only.lua run ``` {{% notice Note %}} -On an 8-core system, cores are numbered 0-7. Adjust the `taskset -c` values as needed for your system. Avoid using the same core for both mysqld and sysbench to reduce contention. +On an 8-core system, cores are numbered 0-7. Adjust the `taskset -c` values as needed for your system. Avoid using the same core for both mysqld and sysbench to reduce contention. You can increase this time (e.g., --time=5 or --time=300) for more statistically meaningful profiling and better .fdata data. {{% /notice %}} The `.fdata` file defined in `--instrumentation-file` will be populated with runtime execution data. +After completing each benchmark run (e.g. 
after sysbench run), you must cleanly shut down the MySQL server and reset the dataset to ensure the next test starts from a consistent state. +```bash +bin/mysqladmin -u root shutdown ; rm -rf /dev/shm/dataset ; cp -R data/ /dev/shm/dataset +``` + ## Verify the profile was created After running the workload: diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md index 4c4f141d14..0e739accc7 100644 --- a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md @@ -10,11 +10,25 @@ Next, you will collect profile data for a **write-heavy** workload and merge the ## Run Write-Only Workload for Application Binary -Use the same BOLT-instrumented MySQL binary and drive it with a write-only workload to capture `profile-writeonly.fdata`: +Use the same BOLT-instrumented MySQL binary and drive it with a write-only workload to capture `profile-writeonly.fdata` + +For this you can reuse the existing instrumented binary, rename .fdata appropriately for read and write workloads or run llvm-bolt with new file target. +```bash +llvm-bolt $HOME/mysql-server/build/bin/mysqld \ + -instrument \ + -o $HOME/mysql-server/build/bin/mysqldwriteonly.instrumented \ + --instrumentation-file=$HOME/mysql-server/build/profile-writeonly.fdata \ + --instrumentation-sleep-time=5 \ + --instrumentation-no-counters-clear \ + --instrumentation-wait-forks \ + 2>&1 | tee $HOME/mysql-server/bolt-instrumentation-writeonly.log +``` + +Run sysbench again with the write-only workload: ```bash # On an 8-core system, use available cores (e.g., 7 for sysbench) -taskset -c 7 sysbench \ +taskset -c 7 ./src/sysbench \ --db-driver=mysql \ --mysql-host=127.0.0.1 \ --mysql-db=bench \ @@ -23,13 +37,25 @@ taskset -c 7 sysbench \ --mysql-port=3306 \ --tables=8 \ --table-size=10000 \ + --forced-shutdown \ + --report-interval=60 \ + --rand-type=uniform \ + --time=5 \ --threads=1 \ - /usr/share/sysbench/oltp_write_only.lua run + --simple-ranges=1 \ + --distinct-ranges=1 \ + --sum-ranges=1 \ + --order-ranges=1 \ + --point-selects=10 \ + src/lua/oltp_write_only.lua run ``` Make sure that the `--instrumentation-file` is set appropriately to save `profile-writeonly.fdata`. - +After completing each benchmark run (e.g. after sysbench run), you must cleanly shut down the MySQL server and reset the dataset to ensure the next test starts from a consistent state. 
+```bash +./bin/mysqladmin -u root shutdown ; rm -rf /dev/shm/dataset ; cp -R data/ /dev/shm/dataset +``` ### Verify the Second Profile Was Generated ```bash @@ -72,24 +98,17 @@ ls -lh $HOME/mysql-server/build/profile-merged.fdata Use LLVM-BOLT to generate the final optimized binary using the merged `.fdata` file: ```bash -llvm-bolt $HOME/mysql-server/build/runtime_output_directory/mysqld \ - -instrument \ - -o $HOME/mysql-server/build/runtime_output_directory/mysqld.instrumented \ - --instrumentation-file=$HOME/mysql-server/build/profile-readonly.fdata \ - --instrumentation-sleep-time=5 \ - --instrumentation-no-counters-clear \ - --instrumentation-wait-forks - -llvm-bolt $HOME/mysql-server/build/runtime_output_directory/mysqld \ - -o $HOME/mysql-server/build/mysqldreadwrite_merged.bolt_instrumentation \ - -data=$HOME/mysql-server/build/prof-instrumentation-readwritemerged.fdata \ +llvm-bolt $HOME/mysql-server/build/bin/mysqld \ + -o $HOME/mysql-server/build/bin/mysqldreadwrite_merged.bolt_instrumentation \ + -data=$HOME/mysql-server/build/profile-merged.fdata \ -reorder-blocks=ext-tsp \ -reorder-functions=hfsort \ -split-functions \ -split-all-cold \ -split-eh \ -dyno-stats \ - --print-profile-stats 2>&1 | tee bolt_orig.log + --print-profile-stats \ + 2>&1 | tee $HOME/mysql-server/build/bolt-readwritemerged-opt.log ``` This command optimizes the binary layout based on the merged workload profile, creating a single binary (`mysqldreadwrite_merged.bolt_instrumentation`) that is optimized across both features. diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md index a237c7d4cc..65da71ece2 100644 --- a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md @@ -29,10 +29,11 @@ llvm-bolt $HOME/bolt-libs/openssl/lib/libssl.so.3 \ --instrumentation-file=$HOME/bolt-libs/openssl/lib/libssl-readwrite.fdata \ --instrumentation-sleep-time=5 \ --instrumentation-no-counters-clear \ - --instrumentation-wait-forks + --instrumentation-wait-forks \ + 2>&1 | tee $HOME/mysql-server/bolt-instrumentation-libssl.log ``` -Then launch MySQL using the **instrumented shared library** and run a **read+write** sysbench test to populate the profile: +Then launch MySQL using the **instrumented shared library** and run a **read+write** sysbench test to populate the profile ### Optimize libssl using the profile @@ -50,7 +51,8 @@ llvm-bolt $HOME/bolt-libs/openssl/lib/libssl.so.3 \ -split-all-cold \ -split-eh \ -dyno-stats \ - --print-profile-stats + --print-profile-stats \ + 2>&1 | tee $HOME/mysql-server/build/bolt-libssl.log ``` ### Replace the library at runtime @@ -58,8 +60,19 @@ llvm-bolt $HOME/bolt-libs/openssl/lib/libssl.so.3 \ Copy the optimized version over the original and export the path: ```bash -cp $HOME/bolt-libs/openssl/lib/libssl.so.optimized $HOME/bolt-libs/openssl/lib/libssl.so.3 +# Set LD_LIBRARY_PATH in the terminal before launching mysqld in order for mysqld to pick the optimized library. 
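+
+# Optional step (an added suggestion, not part of the original instructions):
+# keep a copy of the original library so you can roll back if needed.
+cp $HOME/bolt-libs/openssl/libssl.so.3 $HOME/bolt-libs/openssl/libssl.so.3.orig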
+cp $HOME/bolt-libs/openssl/libssl.so.optimized $HOME/bolt-libs/openssl/libssl.so.3
 export LD_LIBRARY_PATH=$HOME/bolt-libs/openssl/lib
+
+# You can confirm that mysqld is loading your optimized library with:
+export LD_LIBRARY_PATH=$HOME/bolt-libs/openssl/
+ldd build/bin/mysqld | grep libssl
+```
+
+It should show:
+
+```output
+libssl.so.3 => /home/ubuntu/bolt-libs/openssl/libssl.so.3
 ```
 
 This ensures MySQL will dynamically load the optimized `libssl.so`.
@@ -70,7 +83,7 @@ Start the BOLT-optimized MySQL binary and link it against the optimized `libssl.
 
 ```bash
 # On an 8-core system, use available cores (e.g., 7 for sysbench)
-taskset -c 7 sysbench \
+taskset -c 7 ./src/sysbench \
   --db-driver=mysql \
   --mysql-host=127.0.0.1 \
   --mysql-db=bench \
@@ -79,8 +92,17 @@ taskset -c 7 sysbench \
   --mysql-port=3306 \
   --tables=8 \
   --table-size=10000 \
+  --forced-shutdown \
+  --report-interval=60 \
+  --rand-type=uniform \
+  --time=5 \
   --threads=1 \
-  /usr/share/sysbench/oltp_read_write.lua run
+  --simple-ranges=1 \
+  --distinct-ranges=1 \
+  --sum-ranges=1 \
+  --order-ranges=1 \
+  --point-selects=10 \
+  src/lua/oltp_read_write.lua run
 ```
 
 
@@ -90,37 +112,23 @@ In the next step, you'll optimize an additional critical external library (`libc
 
 Follow these steps to instrument and optimize `libcrypto.so`:
 
-#### Instrument `libcrypto.so`:
+### Instrument libcrypto
 
 ```bash
-llvm-bolt $HOME/bolt-libs/openssl/lib/libcrypto.so.3 \
+llvm-bolt $HOME/bolt-libs/openssl/libcrypto.so.3 \
   -instrument \
   -o $HOME/bolt-libs/openssl/lib/libcrypto.so.3.instrumented \
   --instrumentation-file=$HOME/bolt-libs/openssl/lib/libcrypto-readwrite.fdata \
   --instrumentation-sleep-time=5 \
   --instrumentation-no-counters-clear \
-  --instrumentation-wait-forks
-```
-
-Run MySQL under the read-write workload to populate `libcrypto-readwrite.fdata`:
-
-```bash
-export LD_LIBRARY_PATH=/path/to/libcrypto-instrumented
-taskset -c 7 sysbench \
-  --db-driver=mysql \
-  --mysql-host=127.0.0.1 \
-  --mysql-db=bench \
-  --mysql-user=bench \
-  --mysql-password=bench \
-  --mysql-port=3306 \
-  --tables=8 \
-  --table-size=10000 \
-  --threads=1 \
-  /usr/share/sysbench/oltp_read_write.lua run
+  --instrumentation-wait-forks \
+  2>&1 | tee $HOME/mysql-server/bolt-instrumentation-libcrypto.log
 ```
+Then launch MySQL using the instrumented shared library and run a read+write sysbench test to populate the profile.
+### Optimize libcrypto using the profile
+After running the read+write test, ensure `libcrypto-readwrite.fdata` is populated.
 
-#### Optimize the crypto library
-
+Run BOLT on the uninstrumented `libcrypto.so` with the collected read-write profile:
 ```bash
 llvm-bolt $HOME/bolt-libs/openssl/lib/libcrypto.so.3 \
   -o $HOME/bolt-libs/openssl/lib/libcrypto.so.optimized \
@@ -131,15 +139,47 @@ llvm-bolt $HOME/bolt-libs/openssl/lib/libcrypto.so.3 \
   -split-all-cold \
   -split-eh \
   -dyno-stats \
-  --print-profile-stats
+  --print-profile-stats \
+  2>&1 | tee $HOME/mysql-server/build/bolt-libcrypto.log
 ```
 
 Replace the original at runtime:
 
 ```bash
-cp $HOME/bolt-libs/openssl/lib/libcrypto.so.optimized $HOME/bolt-libs/openssl/lib/libcrypto.so.3
-export LD_LIBRARY_PATH=$HOME/bolt-libs/openssl/lib
+# Set LD_LIBRARY_PATH in the terminal before launching mysqld so that mysqld picks up the optimized library.
+cp $HOME/bolt-libs/openssl/libcrypto.so.optimized $HOME/bolt-libs/openssl/libcrypto.so.3 +export LD_LIBRARY_PATH=$HOME/bolt-libs/openssl/ + +# You can confirm that mysqld is loading your optimized library with: +LD_LIBRARY_PATH=$HOME/bolt-libs/openssl/ ldd build/bin/mysqld | grep libcrypto ``` -Run a final validation workload to ensure functionality and measure performance improvements. +It should show: +```output +libcrypto.so.3 => /home/ubuntu/bolt-libs/openssl/libcrypto.so.3 +``` + +Run a final validation workload to ensure functionality and measure performance improvements. +```bash +taskset -c 7 ./src/sysbench \ + --db-driver=mysql \ + --mysql-host=127.0.0.1 \ + --mysql-db=bench \ + --mysql-user=bench \ + --mysql-password=bench \ + --mysql-port=3306 \ + --tables=8 \ + --table-size=10000 \ + --forced-shutdown \ + --report-interval=60 \ + --rand-type=uniform \ + --time=5 \ + --threads=1 \ + --simple-ranges=1 \ + --distinct-ranges=1 \ + --sum-ranges=1 \ + --order-ranges=1 \ + --point-selects=10 \ + src/lua/oltp_read_write.lua run +``` diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md index 8c2b963995..2c96c74b9f 100644 --- a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md @@ -8,6 +8,9 @@ layout: learningpathall This step presents the performance comparisons across various BOLT optimization scenarios. You'll see how baseline performance compares with BOLT-optimized binaries using merged profiles and bolted external libraries. +For all test cases shown in the table below, sysbench was configured with --time=0 --events=10000. +This means each test ran until exactly 10,000 requests were completed per thread, rather than running for a fixed duration. + ### 1. Baseline Performance (No BOLT) | Metric | Read-Only (Baseline) | Write-Only (Baseline) | Read+Write (Baseline) | diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/_index.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/_index.md index 820701deea..4d708ab03c 100644 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/_index.md @@ -1,22 +1,18 @@ --- -title: Go Benchmarks with Sweet and Benchstat - -draft: true -cascade: - draft: true +title: Benchmark Go performance with Sweet and Benchstat minutes_to_complete: 60 -who_is_this_for: This is an introductory topic for developers who are interested in measuring the performance of Go-based applications on Arm-based servers. +who_is_this_for: This introductory topic is for developers who want to measure and compare the performance of Go applications on Arm-based servers. -learning_objectives: - - Learn how to start up Arm64 and x64 instances of GCP VMs - - Install Go, benchmarks, benchstat, and sweet on the two VMs - - Use sweet and benchstat to compare the performance of Go applications on the two VMs +learning_objectives: + - Provision Arm64 and x86_64 VM instances on Google Cloud + - Install Go, Sweet, and Benchstat on each VM instance + - Run benchmarks and use Benchstat to compare Go application performance across architectures prerequisites: - - A [Google Cloud account](https://console.cloud.google.com/). 
This learning path can be run on on-prem or on any cloud provider instance, but specifically documents the process for running on Google Axion. - - A local machine with [Google Cloud CLI](/install-guides/gcloud/) installed. + - A [Google Cloud account](https://console.cloud.google.com/). This Learning Path can be run on any cloud provider or on-premises, but it focuses on Google Cloud’s Axion Arm64-based instances. + - A local machine with [Google Cloud CLI](/install-guides/gcloud/) installed author: Geremy Cohen @@ -31,7 +27,15 @@ tools_software_languages: operatingsystems: - Linux - +further_reading: + - resource: + title: Effective Go + link: https://go.dev/doc/effective_go#performance + type: blog + - resource: + title: Benchmark testing in Go + link: https://dev.to/stefanalfbo/benchmark-testing-in-go-17dc + type: blog ### FIXED, DO NOT MODIFY # ================================================================================ diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4_vm.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4_vm.md index 70b37ecf2e..f7b9ad3a57 100644 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4_vm.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4_vm.md @@ -1,32 +1,31 @@ --- -title: Launching a Intel Emerald Rapids Instance +title: Launch an Intel Emerald Rapids c4-standard-8 instance weight: 30 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Section Overview -In this section, you will set up the second benchmarking system, an Intel Emerald Rapids `c4-standard-8` instance. +In this section, you'll set up the second benchmarking system: an Intel-based Emerald Rapids `c4-standard-8` instance on Google Cloud (referred to as **c4**). -## Creating the Instance +## Create the c4-standard-8 instance -To create the second system, follow the previous lesson's c4a install instructions, but make the following changes: +Follow the same steps from the previous section where you launched the c4a instance, but make the following changes for the Intel-based c4-standard-8: -1. **Name your instance:** For the `Name` field, enter "c4". +* In the **Name** field, enter "c4". +* In the **Machine types for common workloads** section, select the **c4** radio button. +![alt-text#center](images/launch_c4/3.png "Select the c4 radio button") -2. **Select machine series:** Scroll down to the Machine series section, and select the C4 radio button. +* In the **Machine configuration** section, open the dropdown select `c4-standard-8`. -![](images/launch_c4/3.png) +![alt-text#center](images/launch_c4/4.png "Open the dropdown and select `c4-standard-8`") -3. **View machine types:** Scroll down to the Machine type dropdown, and click it to show all available options. +* In the **Machine type** section, open the dropdown and select `c4-standard-8` under the **Standard** tab. -![](images/launch_c4/4.png) +![alt-text#center](images/launch_c4/5.png "Select `c4-standard-8`") -4. **Choose machine size:** Select "c4-standard-8" under the Standard tab. - -![](images/launch_c4/5.png) - -{{% notice Note %}} Don't forget to set the disk size for this c4 to 1000GB under the "OS and Storage" tab like you did for the c4a.{{% /notice %}} +{{% notice Note %}} +Be sure to set the disk size to **1000 GB** in the **OS and Storage** tab, just as you did for the `c4a` instance. 
+{{% /notice %}} After the c4 instance starts up, you are ready to continue to the next section, where you'll install the benchmarking software. diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4a_vm.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4a_vm.md index 106352dc7c..e05e7222e9 100644 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4a_vm.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/add_c4a_vm.md @@ -1,67 +1,62 @@ --- -title: Launching a Google Axion Instance +title: Launch an Arm-based c4a-standard-4 instance weight: 20 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Overview -In this section, you'll learn how to spin up the first of two different VMs used for benchmarking Go tests, an Arm-based Google Axion c4a-standard-4 (c4a for short). +In this section, you'll launch the first of two VMs used for benchmarking Go applications: the Arm-based c4a-standard-4 instance on Google Cloud, (referred to as "c4a"). -## Creating the c4a-standard-4 Instance +## Create the c4a-standard-4 instance -1. **Access Google Cloud Console:** Navigate to [https://console.cloud.google.com/welcome](https://console.cloud.google.com/welcome) +Go to the Google Cloud console: [https://console.cloud.google.com/welcome](https://console.cloud.google.com/welcome). -2. **Search for VM instances:** Click into the Search field. +In the search bar at the top, start typing `vm`, then select **VM instances** when it appears. -3. **Find VM Instances:** Start typing `vm` until the UI auto-completes `VM Instances`, then click it. +![alt-text#center](images/launch_c4a/3.png "Select VM instances") -![](images/launch_c4a/3.png) + On the **VM instances** page, click **Create instance**. -The VM Instances page appears. +![alt-text#center](images/launch_c4a/4.png "Select Create instance") -4. **Create a new instance:** Click `Create instance` + In the **Name** field, enter the name of the instance - here it should be `c4a`. -![](images/launch_c4a/4.png) +![alt-text#center](images/launch_c4a/5.png "Enter name of the instance") -The Machine configuration page appears. +Now select the machine series by scrolling down to the Machine series section, and selecting the **C4A** radio button. -5. **Name your instance:** Click the `Name` field, and enter "c4a" for the `Name`. +![alt-text#center](images/launch_c4a/7.png "Select C4A radio button") -![](images/launch_c4a/5.png) +To view machine types, scroll down to the **Machine type** dropdown, and select it to show all available options. -6. **Select machine series:** Scroll down to the Machine series section, and select the C4A radio button. +![alt-text#center](images/launch_c4a/8.png "Select Machine type dropdown") -![](images/launch_c4a/7.png) +Now choose machine size by selecting **c4a-standard-4** under the **Standard** tab. -7. **View machine types:** Scroll down to the Machine type dropdown, and click it to show all available options. +![alt-text#center](images/launch_c4a/9.png "Select machine size") -![](images/launch_c4a/8.png) +To configure storage, select the **OS and Storage** tab. -8. **Choose machine size:** Select "c4a-standard-4" under the Standard tab. +![alt-text#center](images/launch_c4a/10.png "Configure storage") -![](images/launch_c4a/9.png) +To modify storage settings, select **Change**. -9. **Configure storage:** Click the "OS and Storage" tab. 
+![alt-text#center](images/launch_c4a/11.png "Modify storage settings") -![](images/launch_c4a/10.png) +To set disk size, select the **Size (GB)** field and enter "1000" for the value. -10. **Modify storage settings:** Click "Change" +![alt-text#center](images/launch_c4a/16.png "Enter value in the Size (GB) field") -![](images/launch_c4a/11.png) +Now confirm storage settings by selecting **Select** to continue. -11. **Set disk size:** Double-click the "Size (GB)" field, then enter "1000" for the value. +![alt-text#center](images/launch_c4a/18.png "Confirm the selection of settings with the Select button") -![](images/launch_c4a/16.png) +To launch the instance, select **Create** to bring up the instance. -12. **Confirm storage settings:** Click "Select" to continue. +![alt-text#center](images/launch_c4a/19.png "Select the Create button to launch the instance") -![](images/launch_c4a/18.png) +After a few seconds, your c4a instance is up and running, and you are ready to continue to the next section. -13. **Launch the instance:** Click "Create" to bring up the instance. - -![](images/launch_c4a/19.png) - -After a few seconds, your c4a instance starts up, and you are ready to continue to the next section. In the next step, you will launch the second VM, an Intel-based Emerald Rapids c4-standard-8 (c4 for short), which will serve as the comparison system for our benchmarking tests. +In the next section, you'll launch the second VM, an Intel-based Emerald Rapids c4-standard-8 (referred to as "c4"), which serves as the comparison system for benchmarking. diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/installing_go_and_sweet.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/installing_go_and_sweet.md index 9f8552fbea..c747426dae 100644 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/installing_go_and_sweet.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/installing_go_and_sweet.md @@ -1,38 +1,39 @@ --- -title: Installing Go and Sweet +title: Install Go, Sweet, and Benchstat weight: 40 ### FIXED, DO NOT MODIFY layout: learningpathall --- -In this section, you'll install Go, Sweet, and the Benchstat comparison tool on both VMs. +In this section, you'll install Go, Sweet, and Benchstat on both virtual machines: -## Installation Script - -Sweet is a Go benchmarking tool that provides a standardized way to run performance tests across different systems. Benchstat is a companion tool that analyzes and compares benchmark results, helping you understand performance differences between systems. Together, these tools will allow you to accurately measure and compare Go performance on Arm and x86 architectures. +* Sweet is a Go benchmarking tool that provides a standardized way to run performance tests across systems. +* Benchstat is a companion tool that compares benchmark results to highlight meaningful performance differences. +Together, these tools help you evaluate Go performance on both Arm and x86 architectures. {{% notice Note %}} -Subsequent steps in the learning path assume you are running this script (installing) from your home directory (`$HOME`), resulting in the creation of a `$HOME/benchmarks/sweet` final install path. If you decide to install elsewhere, you will need to adjust the path accordingly when prompted to run the benchmark logic later in the learning path. 
+Subsequent steps in this Learning Path assume you are running this script (installing) from your home directory (`$HOME`), resulting in the creation of a `$HOME/benchmarks/sweet` final install path. If you install to a different directory, update the paths in later steps to match your custom location. {{% /notice %}} +## Installation script -Start by copying and pasting the script below on **both** of your GCP VMs. This script checks the architecture of your running VM, installs the required Go package on your VM. It then installs sweet, benchmarks, and the benchstat tools. +Start by copying and pasting the script below on both of your GCP VMs. This script automatically detects your system architecture, installs the appropriate Go version, and sets up Sweet, Benchstat, and the Go benchmark suite. -**You don't need to run it after pasting**, just paste it into your home directory and press enter to install all needed dependencies: +Paste the full block into your terminal. This creates and runs an installer script directly from your home directory: ```bash #!/usr/bin/env bash -# Write the script to filesystem using a HEREDOC +# Write the install script to filesystem using a HEREDOC cat <<'EOF' > install_go_and_sweet.sh sudo apt-get -y update sudo apt-get -y install git build-essential # Detect architecture - this allows the same script to work on both -# our Arm (c4a) and x86 (c4) VMs without modification +# Arm (c4a) and x86 (c4) VMs without modification ARCH=$(uname -m) case "$ARCH" in arm64|aarch64) @@ -90,7 +91,9 @@ chmod 755 install_go_and_sweet.sh ``` -The end of the output should look like: +## Expected output from sweet get + +When `sweet get` completes successfully, you’ll see output similar to: ```output Sweet v0.3.0: Go Benchmarking Suite @@ -109,7 +112,7 @@ Usage: sweet get [flags] ``` -## Verify Installation +## Verify installation To test that everything is installed correctly, set the environment variables shown below on each VM: diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchmark.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchmark.md index 281d6bc3a8..c98ac9273c 100644 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchmark.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchmark.md @@ -1,34 +1,44 @@ --- -title: Manually running benchmarks +title: Manually run benchmarks weight: 51 ### FIXED, DO NOT MODIFY layout: learningpathall --- -In this section, you'll download the results of the benchmark you ran manually in the previous sections from each VM. You will use these results to understand how `sweet` and `benchstat` work together. +In this section, you'll download the benchmark results you ran manually in previous steps from each VM. You will use these results to understand how `sweet` and `benchstat` work together. -## Download Benchmark Results from each VM -Lets walk through the steps to manually download the sweet benchmark results from your initial run on each VM. +## Download benchmark results from each VM +Start by retrieving the results generated by Sweet from your earlier benchmark runs. -1. 
**Locate results:** Change directory to the `results/markdown` directory and list the files to see the `arm-benchmarks.result` file: +### Locate results + +Change directory to the `results/markdown` directory and list the files to see the `arm-benchmarks.result` file: ```bash cd results/markdown ls -d $PWD/* ``` -2. **Copy result path:** Copy the absolute pathname of `arm-benchmarks.result`. +### Copy result path + +Copy the absolute pathname of `arm-benchmarks.result`. You'll need this to initiate the download. + +### Download results + +Select `DOWNLOAD FILE` in your GCP terminal interface. Paste the absolute pathname you copied into the dialog and confirm the download. This downloads the benchmark results to your local machine. + + ![alt-text#center](images/run_manually/6.png "Download the results") -3. **Download results:** Click `DOWNLOAD FILE`, and paste the **ABSOLUTE PATHNAME** you just copied for the filename, and then click `Download`. This will download the benchmark results to your local machine. +### Rename the file - ![](images/run_manually/6.png) +After downloading the file to your local machine, rename it to `c4a.result` to distinguish it from the x86 results you'll download next. This naming convention helps you clearly identify which architecture each result came from. You'll know the download was successful if you see the file named `c4a.result` in your Downloads folder and receive a confirmation in your browser. -4. **Rename the file:** Once downloaded, on your local machine, rename this file to `c4a.result` so you can distinguish it from the x86 results you'll download later. This naming convention will help you clearly identify which results came from which architecture. You'll know the file downloaded successfully if you see the file in your Downloads directory with the name `c4a.result`, as well as the confirmation dialog in your browser: + ![alt-text#center](images/run_manually/7.png "A successful download") - ![](images/run_manually/7.png) +### Repeat for the second VM -5. **Repeat for c4 instance:** Repeat steps 2-8 with your `c4` (x86) instance. Do everything the same, except after downloading the c4's `arm-benchmarks.result` file, rename it to `c4.result`. +Repeat the same process with your c4 (x86) VM. Use the same results/markdown directory and download the `arm-benchmarks.result` file. This time, rename it to `c4.result` after downloading. -Now that you have the results from both VMs, in the next section, you'll learn how to use benchstat to analyze these results and understand the performance differences between the two architectures. +Now that you have the results from both VMs, in the next section, you'll learn how to use Benchstat to analyze these results and understand the performance differences between the two architectures. diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchstat.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchstat.md index 66ff075f26..d3949feeb3 100755 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchstat.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/manual_run_benchstat.md @@ -1,47 +1,55 @@ --- -title: Manually running benchstat +title: Manually run Benchstat weight: 52 ### FIXED, DO NOT MODIFY layout: learningpathall --- -You've successfully run and downloaded the benchmark results from both your Arm-based and x86-based VMs. 
In this section, you'll compare them to each other using the benchstat tool. +You've successfully run and downloaded the benchmark results from both your Arm-based and x86-based VMs. In this section, you'll use Benchstat to compare performance between the two instances. -## Inspecting the Results Files +## Inspect the results files -With the results files downloaded to your local machine, if you're curious to what they look like, you can inspect them to understand better what `benchstat` is analyzing. +To understand what Benchstat analyzes, open the results files to view the raw benchmark output. -1. **View raw results:** Open the `c4a.result` file in a text editor, and you'll see something like this: +Open the `c4a.result` file in a text editor. You should see something like this: - ![](images/run_manually/11.png) + ![alt-text#center](images/run_manually/11.png "A results file") The file contains the results of the `markdown` benchmark run on the Arm-based c4a VM, showing time and memory stats taken for each iteration. If you open the `c4.result` file, you'll see similar results for the x86-based c4 VM. -2. **Close the editor:** Close the text editor when done. +Close the text editor when done. -## Running Benchstat to Compare Results +## Run Benchstat to compare results -To compare the results, you'll use `benchstat` to analyze the two result files you downloaded. Since all the prerequisites are already installed on the `c4` and `c4a` instances, benchstat will be run from one of those instances. +To compare the results, you'll now use Benchstat to analyze the two result files you downloaded. Since all the prerequisites are already installed on the `c4` and `c4a` instances, Benchstat will be run from one of those instances. -1. **Create working directory:** Make a temporary benchstat directory to hold the results files on either the c4a or c4 instance, and change directory into it: +### Create working directory + +Make a temporary benchstat directory to hold the results files on either the c4a or c4 instance, and change directory into it: ```bash mkdir benchstat_results cd benchstat_results ``` -2. **Upload result files:** Click the `UPLOAD FILE` button in the GCP console, and upload the `c4a.results` AND `c4.results` files you downloaded earlier. (This uploads them to your home directory, not to the current directory.) +### Upload results files + +Click the `UPLOAD FILE` button in the GCP console, and upload the `c4a.results` AND `c4.results` files you downloaded earlier. (This uploads them to your home directory, not to the current directory.) + + ![alt-text#center](images/run_manually/16.png "Upload results file") - ![](images/run_manually/16.png) +### Verify upload -3. **Verify upload:** You'll know it worked correctly via the confirmation dialog in your terminal: +You'll know it worked correctly via the confirmation dialog in your terminal: - ![](images/run_manually/17.png) + ![alt-text#center](images/run_manually/17.png "Confirmation dialog in terminal") -4. **Move files to working directory:** Move the results files to the `benchstat_results` directory, and confirm their presence: +### Move files to working directory + +Move the results files to the `benchstat_results` directory, and confirm their presence: ```bash mv ~/c4a.results ~/c4.results . @@ -54,7 +62,9 @@ To compare the results, you'll use `benchstat` to analyze the two result files y c4.results c4a.results ``` -5. 
**Run benchstat:** Now you can run `benchstat` to compare the two results files: +### Run benchstat + +Now you can run `benchstat` to compare the two results files: ```bash export GOPATH=$HOME/go @@ -63,7 +73,9 @@ To compare the results, you'll use `benchstat` to analyze the two result files y benchstat c4a.results c4.results > c4a_vs_c4.txt ``` -6. **View comparison results:** Run the `cat` command to view the results: +### View comparison results + +Run the `cat` command to view the results: ```bash cat c4a_vs_c4.txt @@ -114,7 +126,7 @@ To compare the results, you'll use `benchstat` to analyze the two result files y In this example, you can see that the c4a (Arm) instance completed the markdown benchmark in 143.9m seconds, while the c4 (x86) instance took 158.3m seconds, indicating better performance on the Arm system for this particular workload. - If you wanted the results in CSV format, you could run the `benchstat` command with the `-format csv` option instead. + If you want the results in CSV format, you can run the `benchstat` command with the `-format csv` option instead. At this point, you can download the `c4a_vs_c4.txt` for further analysis or reporting. You can also run the same or different benchmarks with the same, or different combinations of VMs, and continue comparing results using `benchstat`. diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/overview.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/overview.md index 4a5995608f..591976d467 100644 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/overview.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/overview.md @@ -6,28 +6,32 @@ weight: 10 layout: learningpathall --- -# Go Benchmarking Overview +## Overview of Go benchmarking tools -In this section, you will learn how to measure, collect, and compare Go performance data across different CPU architectures. This knowledge is essential for developers and system architects who need to make informed decisions about infrastructure choices for their Go applications. +This section shows you how to measure, collect, and compare Go performance data across different CPU architectures. These techniques help developers and system architects make informed infrastructure decisions for their Go applications. You'll gain hands-on experience with: -- **Go Benchmarks**, a collection of pre-written benchmark definitions that standardizes performance tests for popular Go applications, leveraging Go's built-in benchmark support. +- **Go Benchmarks** - standardized definitions for popular Go applications, using Go’s built-in testing framework. -- **Sweet**, a benchmark runner that automates running Go benchmarks across multiple environments, collecting and formatting results for comparison. +- **Sweet** - a benchmark runner that automates execution and formats results for comparison across multiple environments. -- **Benchstat**, a statistical comparison tool that analyzes benchmark results to identify meaningful performance differences between systems. +- **Benchstat** - a statistical comparison tool that analyzes benchmark results to identify meaningful performance differences between systems. 
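+
+To give a sense of the framework these tools build on, the sketch below shows what a minimal Go benchmark looks like when written directly against the standard `testing` package. The `fib` function and package name are hypothetical and are not part of the Sweet suite; Sweet ships its own benchmark definitions, so you do not need to write any benchmark code in this Learning Path.
+
+```go
+package fib
+
+import "testing"
+
+// fib is a deliberately simple function that gives the benchmark real work to measure.
+func fib(n int) int {
+	if n < 2 {
+		return n
+	}
+	return fib(n-1) + fib(n-2)
+}
+
+// BenchmarkFib uses the standard benchmark signature; the testing package picks b.N
+// so that the measured loop runs long enough to produce a stable result.
+// Run it with: go test -bench=.
+func BenchmarkFib(b *testing.B) {
+	for i := 0; i < b.N; i++ {
+		fib(20)
+	}
+}
+```
+
+Sweet runs curated suites of benchmarks like this against real applications and collects the results in a form Benchstat can compare.
+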
-Benchmarking is critical for modern software development because it allows you to:
-- Quantify the performance impact of code changes
-- Compare different hardware platforms objectively
-- Make data-driven decisions about infrastructure investments
-- Identify optimization opportunities in your applications
+Benchmarking is critical for modern software development because it allows you to do the following:
+- Quantify the impact of code changes
+- Compare performance across hardware architectures
+- Make data-driven decisions about infrastructure
+- Identify optimization opportunities in your application code
 
-You'll use Intel c4-standard-8 and Arm-based c4a-standard-4 (both four-core) instances running on GCP to run and compare benchmarks using these tools.
+In this Learning Path, you'll compare performance using two four-core GCP instance types:
+
+* The Arm-based c4a-standard-4
+* The Intel-based c4-standard-8
 
 {{% notice Note %}}
-Arm-based c4a-standard-4 instances and Intel-based c4-standard-8 instances both utilize four cores. Both instances are categorized by GCP as members of the **consistently high performing** series; the main difference between the two is that the c4a has 16 GB of RAM, while the c4 has 30 GB of RAM. We've chosen to keep CPU cores equivalent across the two instances of the same series to keep the comparison as close as possible.
+Arm-based c4a-standard-4 instances and Intel-based c4-standard-8 instances both utilize four cores. Both instances are categorized by GCP as members of a series that delivers consistently high performance.
+The main difference between the two is that c4a has 16 GB of RAM, while c4 has 30 GB of RAM. This Learning Path uses equivalent core counts as an example of performance comparison.
 {{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_install.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_install.md
index 86882fcfd8..abc9d441ed 100644
--- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_install.md
+++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_install.md
@@ -1,59 +1,70 @@
 ---
-title: Installing the Automated Benchmark and Benchstat Runner
+title: Install the automated benchmark and Benchstat runner
 weight: 53
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-In the last section, you learned how to run benchmarks and benchstat manually. Now you'll learn how to run them automatically, with enhanced visualization of the results.
+In the last section, you learned how to run benchmarks and Benchstat manually. Now you'll automate that process and generate visual reports using a tool called `rexec_sweet`.
 
-## Introducing rexec_sweet.py
+## What is rexec_sweet?
 
-The `rexec_sweet.py` script is a powerful automation tool that simplifies the benchmarking workflow. This tool connects to your GCP instances, runs the benchmarks, collects the results, and generates comprehensive reports—all in one seamless operation. It provides several key benefits:
+`rexec_sweet` is a Python project available on GitHub that automates the benchmarking workflow. It connects to your GCP instances, runs benchmarks, collects results, and generates HTML reports - all in one step.
+ +It provides several key benefits: - **Automation**: Runs benchmarks on multiple VMs without manual SSH connections - **Consistency**: Ensures benchmarks are executed with identical parameters - **Visualization**: Generates HTML reports with interactive charts for easier analysis -The only dependency you are responsible for satisfying before the script runs is completion of the "Installing Go and Sweet" sections of this learning path. Additional dependencies are dynamically loaded at install time by the install script. +Before running the tool, ensure you've completed the "Install Go, Sweet, and Benchstat" step. All other dependencies are installed automatically by the installer. + +## Set up rexec_sweet -## Setting up rexec_sweet +Follow the steps below to set up `rexec_sweet`. -1. **Create a working directory:** On your local machine, open a terminal, then create and change into a directory to store the `rexec_sweet.py` script and related files: +### Create a working directory - ```bash - mkdir rexec_sweet - cd rexec_sweet - ``` +On your local machine, open a terminal, and create a new directory: + +```bash +mkdir rexec_sweet +cd rexec_sweet +``` -2. **Clone the repository inside the directory:** Get the `rexec_sweet.py` script from the GitHub repository: +### Clone the repository + +Get `rexec_sweet` from GitHub: - ```bash - git clone https://github.com/geremyCohen/go_benchmarks.git - cd go_benchmarks - ``` +```bash +git clone https://github.com/geremyCohen/go_benchmarks.git +cd go_benchmarks +``` -3. **Run the installer:** Copy and paste this command into your terminal to run the installer: +### Run the installer - ```bash - ./install.sh - ``` +Copy and paste this command into your terminal to run the installer: - If the install.sh script detects that you already have dependencies installed, it may ask you if you wish to reinstall them with the following prompt as shown: +```bash +./install.sh +``` - ```output - pyenv: /Users/gercoh01/.pyenv/versions/3.9.22 already exists - continue with installation? (y/N) - ``` +If the installer detects that you already have dependencies installed, it might ask you if you want to reinstall them: - If you see this prompt, enter `N` (not `Y`!) to continue with the installation without modifying the existing installed dependencies. +```output +pyenv: /Users/gercoh01/.pyenv/versions/3.9.22 already exists +continue with installation? (y/N) +``` -4. **Verify VM status:** Make sure the GCP VM instances you created in the previous section are running. If not, start them now, and give them a few minutes to come up. +If you see this prompt, enter `N` to continue with the installation without modifying the existing installed dependencies. + +### Verify VM status + +Make sure the GCP VM instances you created in the previous section are running. If not, start them now, and wait a few minutes for them to finish booting. {{% notice Note %}} -The install script will prompt you to authenticate with Google Cloud Platform (GCP) using the gcloud command-line tool at the end of install. If after installing you have issues running the script and/or get GCP authentication errors, you can manually authenticate with GCP by running the following command: `gcloud auth login` +The installer prompts you to authenticate with Google Cloud Platform (GCP) using the gcloud command-line tool at the end of install. 
If after installing you have issues running the tool or you get GCP authentication errors, you can manually authenticate with GCP by running the following command: `gcloud auth login`
 {{% /notice %}}
 
-
-Continue on to the next section to run the script and see how it simplifies the benchmarking process.
+Continue on to the next section to run the tool and see how it simplifies the benchmarking process.
diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_run.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_run.md
index b2cfbf4ba5..f5fbf751ca 100644
--- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_run.md
+++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/rexec_sweet_run.md
@@ -1,22 +1,24 @@
 ---
-title: Running the Automated Benchmark and Benchstat Runner
+title: Run the automated benchmark and Benchstat runner
 weight: 54
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-With `rexec_sweet` installed, your benchmarking instances running, and your localhost authenticated with GCP, you'll now see how to run benchmarks in an automated fashion.
+With `rexec_sweet` installed, your benchmarking instances running, and your local machine authenticated with GCP, you're ready to run automated benchmarks across your configured environments.
 
-## Run an Automated Benchmark and Analysis
+## Run an automated benchmark and generate results
 
-1. **Run the script:** Execute the `rexec_sweet` script from your local terminal:
+To begin, open a terminal on your local machine and run:
 
 ```bash
-rexec_sweet
+rexec-sweet
 ```
 
-2. **Select a benchmark:** The script will prompt you for the name of the benchmark you want to run. Press enter to run the default benchmark, which is `markdown` (this is the recommended benchmark to run the first time.)
+The tool will prompt you to choose a benchmark.
+
+Press **Enter** to run the default benchmark, `markdown`, which is a good starting point for your first run.
 
 ```bash
 Available benchmarks:
@@ -33,7 +35,7 @@ Available benchmarks:
 Enter number (1-10) [default: markdown]:
 ```
 
-3. **Select instances:** The script will proceed and call into GCP to detect all running VMs. You should see the script output:
+The tool then detects your running GCP instances and displays them. You’ll be asked whether you want to use the first two instances it finds and the default install paths.
 
 ```output
 Available instances:
@@ -42,14 +44,11 @@ Available instances:
 
 Do you want to run the first two instances found with default install directories? [Y/n]:
 ```
+You can accept the defaults by pressing **Enter**, which uses the instances listed and assumes Go and Sweet were installed to `~/benchmarks/sweet`.
 
-4. **Choose your configuration:** You have two options:
-
-   - **Use default settings:** If you want to run benchmarks on the instances labeled with "will be used as nth instance", and you installed Go and Sweet into the default directories as noted in the tutorial, you can press Enter to accept the defaults.
+If you're running more than two instances or installed Go and Sweet to a non-default location, enter `n` and follow the prompts to manually select instances and specify custom install paths.
 
-   - **Custom configuration:** If you are running more than two instances, and the script doesn't suggest the correct two to autorun, or you installed Go and Sweet to non-default folders, select "n" and press Enter.
The script will then prompt you to select the instances and runtime paths. - -In this example, we'll manually select the instances and paths as shown below: +In this example, you'll manually select the instances and paths as shown below: ```output Available instances: @@ -73,9 +72,9 @@ Output directory: /private/tmp/a/go_benchmarks/results/c4-c4a-markdown-20250610T ... ``` -Upon entering instance names and paths for the VMs, the script will automatically: - - Run the benchmark on both VMs - - Run `benchstat` to compare the results +After selecting instances and paths, the tool will: + - Run the selected benchmark on both VMs + - Use `benchstat` to compare the results - Push the results to your local machine ```output @@ -88,15 +87,17 @@ Running benchmarks on the selected instances... Report generated in results/c4-c4a-markdown-20250610T190407 ``` -5. **View the report:** Once on your local machine, `rexec_sweet` will generate an HTML report that will open automatically in your web browser. +### View the report + +Once on your local machine, `rexec_sweet` will generate an HTML report that opens automatically in your web browser. - If you close the tab or browser, you can always reopen the report by navigating to the `results` subdirectory of the current working directory of the `rexec_sweet.py` script, and opening `report.html`. +If you close the report, you can reopen it by navigating to the `results` subdirectory and opening report.html in your browser. -![](images/run_auto/2.png) +![alt-text#center](images/run_auto/2.png "Sample HTML report") {{% notice Note %}} -If you see output messages from `rexec_sweet.py` similar to "geomeans may not be comparable" or "Dn: ratios must be >0 to compute geomean", this is expected and can be ignored. These messages indicate that the benchmark sets differ between the two VMs, which is common when running benchmarks on different hardware or configurations. +If you see output messages similar to "geomeans may not be comparable" or "Dn: ratios must be >0 to compute geomean", this is expected and can be ignored. These warnings typically appear when benchmark sets differ slightly between the two VMs. {{% /notice %}} -6. **Analyze results:** Upon completion, the script will generate a report in the `results` subdirectory of the current working directory of the `rexec_sweet.py` script, which opens automatically in your web browser to view the benchmark results and comparisons. +Upon completion, the tool generates a report in the `results` subdirectory of the current working directory, which opens automatically in your web browser to view the benchmark results and comparisons. diff --git a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/running_benchmarks.md b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/running_benchmarks.md index 8ddf05bec3..e017f83787 100644 --- a/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/running_benchmarks.md +++ b/content/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/running_benchmarks.md @@ -1,24 +1,25 @@ --- -title: Benchmark Types and Metrics +title: Benchmark types and metrics weight: 50 ### FIXED, DO NOT MODIFY layout: learningpathall --- -With setup complete, you can now run and analyze the benchmarks. Before you do, it's good to understand all the different pieces in more detail. 
+Now that setup is complete, it's important to understand the benchmarks you’ll run and the performance metrics you’ll use to evaluate results across systems. -## Choosing a Benchmark to Run +## Available benchmarks Whether running manually or automatically, the benchmarking process consists of two main steps: -1. **Running benchmarks with Sweet**: `sweet` executes the benchmarks on each VM, generating raw performance data +1. **Running benchmarks with Sweet**: `sweet` executes the benchmarks on each VM, generating raw performance data. 2. **Analyzing results with Benchstat**: `benchstat` compares the results from different VMs to identify performance differences. Benchstat can output results in text format (default) or CSV format. The text format provides a human-readable tabular view, while CSV allows for further processing with other tools. Sweet comes ready to run with the following benchmarks: -| Benchmark | Description | Command | + +| Benchmark | Description | Example command | |-----------------|-------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------| | **biogo-igor** | Processes pairwise alignment data using the biogo library, grouping repeat feature families and outputting results in JSON format. | `sweet run -count 10 -run="biogo-igor" config.toml` | | **biogo-krishna** | Pure-Go implementation of the PALS algorithm for pairwise sequence alignment, measuring alignment runtime performance. | `sweet run -count 10 -run="biogo-krishna" config.toml` | @@ -31,18 +32,18 @@ Sweet comes ready to run with the following benchmarks: | **markdown** | Parses and renders Markdown documents to HTML using a Go-based markdown library to evaluate parsing and rendering throughput. | `sweet run -count 10 -run="markdown" config.toml` | | **tile38** | Stress-tests a Tile38 geospatial database with WITHIN, INTERSECTS, and NEARBY queries to measure spatial query performance. | `sweet run -count 10 -run="tile38" config.toml` | -## Metrics Summary +## Performance metrics When running benchmarks, several key metrics are collected to evaluate performance. The following summarizes the most common metrics and their significance: -### Seconds per Operation - Lower is better +### Seconds per operation (lower is better) This metric measures the time taken to complete a single operation, indicating the raw speed of execution. It directly reflects the performance efficiency of a system for a specific task, making it one of the most fundamental benchmarking metrics. A system with lower seconds per operation completes tasks faster. This metric primarily reflects CPU performance but can also be influenced by memory access speeds and I/O operations. If seconds per operation is the only metric showing significant difference while memory metrics are similar, the performance difference is likely CPU-bound. -### Operations per Second - Higher is better +### Operations per second (higher is better) This metric provides a clear measure of system performance capacity, making it essential for understanding raw processing power and scalability potential. A system performing more operations per second has greater processing capacity. This metric reflects overall system performance including CPU speed, memory access efficiency, and I/O capabilities. 
@@ -51,7 +52,7 @@ If operations per second is substantially higher while memory usage remains prop This metric is essentially the inverse of "seconds per operation" and provides a more intuitive way to understand throughput capacity. -### Average RSS Bytes - Lower is better +### Average RSS bytes (lower is better) Resident Set Size (RSS) represents the portion of memory occupied by a process that is held in RAM (not swapped out). It shows the typical memory footprint during operation, indicating memory efficiency and potential for scalability. @@ -59,7 +60,7 @@ Lower average RSS indicates more efficient memory usage. A system with lower ave If one VM has significantly higher seconds per operation but lower RSS, it may be trading speed for memory efficiency. Systems with similar CPU performance but different RSS values indicate different memory optimization approaches; lower RSS with similar CPU performance suggests better memory management, which is a critical indicator of performance in memory-constrained environments. -### Peak RSS Bytes - Lower is better +### Peak RSS bytes (lower is better) Peak RSS bytes is the maximum Resident Set Size reached during execution, representing the worst-case memory usage scenario. The peak RSS metric helps to understand memory requirements and potential for memory-related bottlenecks during intensive operations. @@ -67,7 +68,7 @@ Lower peak RSS indicates better handling of memory-intensive operations. High pe Large differences between average and peak RSS suggest memory usage volatility. A system with lower peak RSS but similar performance is better suited for memory-constrained environments. -### Peak VM Bytes - Lower is better +### Peak VM bytes (lower is better) Peak VM Bytes is the maximum Virtual Memory size used, including both RAM and swap space allocated to the process. @@ -75,7 +76,7 @@ Lower peak VM indicates more efficient use of the total memory address space. Hi If peak VM is much higher than peak RSS, the system is relying heavily on virtual memory management. Systems with similar performance but different VM usage patterns may have different memory allocation strategies. High VM with performance degradation suggests potential memory-bound operations due to excessive paging. -## Summary of Efficiency Indicators +## Summary of efficiency indicators When comparing metrics across two systems, keep the following in mind: @@ -84,28 +85,28 @@ A system is likely CPU-bound if seconds per operation differs significantly whil A system is likely memory-bound if performance degrades as memory metrics increase, especially when peak RSS approaches available physical memory. -### Efficiency Indicators +### Efficiency indicators The ideal system shows lower values across all metrics - faster execution with smaller memory footprint. Systems with similar seconds per operation but significantly different memory metrics indicate different optimization priorities. -### Scalability Potential +### Scalability potential Lower memory metrics (especially peak values) suggest better scalability for concurrent workloads. Systems with lower seconds per operation but higher memory usage may perform well for single tasks but scale poorly. -### Optimization Targets +### Optimization targets Large gaps between average and peak memory usage suggest opportunities for memory optimization. High seconds per operation with low memory usage suggests CPU optimization potential. 
-## Best Practices when benchmarking across different instance types +## Best practices when benchmarking across different instance types Here are some general tips to keep in mind as you explore benchmarking across different apps and instance types: -- Unlike Intel and AMD processors that use hyper-threading, Arm processors provide single-threaded cores without hyper-threading. A four-core Arm processor has four independent cores running four threads, while an four-core Intel processor provides eight logical cores through hyper-threading. This means each Arm vCPU represents a full physical core, while each Intel/AMD vCPU represents half a physical core. For fair comparison, this learning path uses a 4-vCPU Arm instance against an 8-vCPU Intel instance. When scaling up instance sizes during benchmarking, make sure to keep a 2:1 Intel/AMD:Arm vCPU ratio if you wish to keep parity on CPU resources. +- On Intel and AMD processors with hyper-threading, each vCPU corresponds to a logical core (hardware thread), and two vCPUs share a single physical core. On Arm processors (which do not use hyper-threading), each vCPU corresponds to a full physical core. For comparison, this Learning Path uses a 4-vCPU Arm instance against an 8-vCPU Intel instance, maintaining a 2:1 Intel:Arm vCPU ratio to keep parity on physical CPU resources. + +- Run each benchmark at least 10 times to account for outliers and produce statistically meaningful results. -- It's suggested to run each benchmark at least 10 times (specified via the `count` parameter) to handle outlier/errant runs and ensure statistical significance. +- Results can be bound by CPU, memory, or I/O performance. If you see significant differences in one metric but not others, it might indicate a bottleneck in that area; running the same benchmark with different configurations (for example, using more CPU cores or more memory) can help identify the bottleneck. -- Results may be bound by CPU, memory, or I/O performance. If you see significant differences in one metric but not others, it may indicate a bottleneck in that area; running the same benchmark with different configurations (e.g., more CPU cores, more memory) can help identify the bottleneck. - diff --git a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/before-you-begin.md b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/before-you-begin.md index 79c351e6e2..0d45aa6fa6 100644 --- a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/before-you-begin.md +++ b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/before-you-begin.md @@ -22,7 +22,7 @@ Make sure you can connect to your Kubernetes cluster using `kubectl`. 
For Arm Linux, download the KubeArchInspect package from GitHub: ```console -wget https://github.com/ArmDeveloperEcosystem/kubearchinspect/releases/download/v0.4.0/kubearchinspect_Linux_arm64.tar.gz +wget https://github.com/ArmDeveloperEcosystem/kubearchinspect/releases/download/v0.7.0/kubearchinspect_Linux_arm64.tar.gz ``` Extract the files from the release package: diff --git a/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/_index.md b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/_index.md new file mode 100644 index 0000000000..b64edf6edb --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/_index.md @@ -0,0 +1,53 @@ +--- +title: Understanding Libamath's vector accuracy modes + +draft: true +cascade: + draft: true + +minutes_to_complete: 20 +author: Joana Cruz + +who_is_this_for: This is an introductory topic for software developers who want to learn how to use the different accuracy modes present in Libamath, a component of Arm Performance Libraries. + +learning_objectives: + - Understand how accuracy is defined in Libamath. + - Pick an appropriate accuracy mode for your application. + +prerequisites: + - An Arm computer running Linux with [Arm Performance Libraries](https://learn.arm.com/install-guides/armpl/) version 25.04 or newer installed. + +### Tags +skilllevels: Introductory +subjects: Performance and Architecture +armips: + - Neoverse +tools_software_languages: +- Arm Performance Libraries +- GCC +- Libmath +operatingsystems: + - Linux + +further_reading: + - resource: + title: ArmPL Libamath Documentation + link: https://developer.arm.com/documentation/101004/2410/General-information/Arm-Performance-Libraries-math-functions + type: documentation +# - resource: +# title: PLACEHOLDER BLOG +# link: PLACEHOLDER BLOG LINK +# type: blog + - resource: + title: ArmPL Installation Guide + link: https://learn.arm.com/install-guides/armpl/ + type: website + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/examples.md b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/examples.md new file mode 100644 index 0000000000..b622d0edae --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/examples.md @@ -0,0 +1,93 @@ +--- +title: Arm Performance Libraries example +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +# Arm Performance Libraries example + +Here is an example invoking all accuracy modes of the Neon single-precision exp function. The file `ulp_error.h` is the one you created in the ULP Error and Accuracy section. + +Make sure you have [Arm Performance Libraries](https://learn.arm.com/install-guides/armpl/) installed. + +Use a text editor to save the code below in a file named `example.c`. + +```C { line_numbers = "true" } +#include <stdio.h> +#include <math.h> +#include <arm_neon.h> +#include <amath.h> // assumed name of the ArmPL header declaring the Libamath vector functions + +#include "ulp_error.h" + +void check_accuracy(float32x4_t (__attribute__((aarch64_vector_pcs)) *vexp_fun)(float32x4_t), float arg, const char *label) { + float32x4_t varg = vdupq_n_f32(arg); + float32x4_t vres = vexp_fun(varg); + double want = exp((double)arg); + float got = vgetq_lane_f32(vres, 0); + + printf(label, arg); + printf("\n got = %a\n", got); + printf(" (float)want = %a\n", (float)want); + printf(" want = %.12a\n", want); + printf(" ULP error = %.4f\n\n", ulp_error(got, want)); +} + +int main(void) { + // Inputs that trigger worst-case errors for each accuracy mode + printf("Libamath example:\n"); + printf("-----------------------------------------------\n"); + printf(" // Display worst-case ULP error in expf for each\n"); + printf(" // accuracy mode, along with approximate (`got`) and exact results (`want`)\n\n"); + + check_accuracy (armpl_vexpq_f32_u10, 0x1.ab312p+4, "armpl_vexpq_f32_u10(%a) delivers error under 1.0 ULP"); + check_accuracy (armpl_vexpq_f32, 0x1.8163ccp+5, "armpl_vexpq_f32(%a) delivers error under 3.5 ULP"); + check_accuracy (armpl_vexpq_f32_umax, -0x1.5b7322p+6, "armpl_vexpq_f32_umax(%a) delivers result with half correct bits"); + + return 0; +} +``` + +Compile the program with: + +```bash +gcc -O2 -o example example.c -lamath -lm +``` + +Run the example: + +```bash +./example +``` + +The output is: + +```output +Libamath example: +----------------------------------------------- + // Display worst-case ULP error in expf for each + // accuracy mode, along with approximate (`got`) and exact results (`want`) + +armpl_vexpq_f32_u10(0x1.ab312p+4) delivers error under 1.0 ULP + got = 0x1.6ee554p+38 + (float)want = 0x1.6ee556p+38 + want = 0x1.6ee555bb01d1p+38 + ULP error = 0.8652 + +armpl_vexpq_f32(0x1.8163ccp+5) delivers error under 3.5 ULP + got = 0x1.6a09ep+69 + (float)want = 0x1.6a09e4p+69 + want = 0x1.6a09e3e3d585p+69 + ULP error = 1.9450 + +armpl_vexpq_f32_umax(-0x1.5b7322p+6) delivers result with half correct bits + got = 0x1.9b56bep-126 + (float)want = 0x1.9b491cp-126 + want = 0x1.9b491b9376d3p-126 + ULP error = 1745.2120 +``` + +The inputs used for each variant correspond to the worst-case scenario known to date (the argmax of the ULP error). +This means the ULP error you observe should not exceed the values shown here, so the results stay within the defined thresholds for each accuracy mode.
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/floating-point-rep.md b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/floating-point-rep.md new file mode 100644 index 0000000000..1dff7d8364 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/floating-point-rep.md @@ -0,0 +1,139 @@ +--- +title: Floating Point Representation +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Floating-Point Representation Basics + +Floating Point numbers are a finite and discrete approximation of the real numbers, allowing us to implement and compute functions in the continuous domain with an adequate (but limited) resolution. + +A Floating Point number is typically expressed as: + +```output ++/-d.dddd...d x B^e +``` + +where: +* B is the base; +* e is the exponent; +* d.dddd...d is the mantissa (or significand). It is a p-bit word, where p represents the precision; +* +/- is the sign, which is usually stored separately. + +If the leading digit is non-zero, the representation is normalized and the number is called a normal number. + +{{% notice Example 1 %}} +Fixing `B=2, p=24` + +`0.1 = 1.10011001100110011001101 × 2^-4` is a normalized representation of 0.1 + +`0.1 = 0.000110011001100110011001 × 2^0` is a non-normalized representation of 0.1 + +{{% /notice %}} + +Usually a Floating Point number has multiple non-normalized representations, but only one normalized representation (assuming the leading digit is strictly smaller than the base) when fixing a base and a precision. + +### Building a Floating-Point Ruler + +Given a base `B`, a precision `p`, a maximum exponent `emax` and a minimum exponent `emin`, we can create the set of all the normalized values in this system. + +{{% notice Example 2 %}} +`B=2, p=3, emax=2, emin=-1` + +| Significand | × 2⁻¹ | × 2⁰ | × 2¹ | × 2² | +|-------------|-------|------|------|------| +| 1.00 (1.0) | 0.5 | 1.0 | 2.0 | 4.0 | +| 1.01 (1.25) | 0.625 | 1.25 | 2.5 | 5.0 | +| 1.10 (1.5) | 0.75 | 1.5 | 3.0 | 6.0 | +| 1.11 (1.75) | 0.875 | 1.75 | 3.5 | 7.0 | + + +{{% /notice %}} + +Note that, for any given integer n, numbers are evenly spaced between 2ⁿ and 2ⁿ⁺¹. But the gap between them (also called [ULP](/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/ulp/), which is explained in more detail in the next section) grows as the exponent increases. So the spacing between floating-point numbers gets larger as the numbers get bigger. + +### The Floating-Point bitwise representation + +Since there are `B^p` possible mantissas, and `emax-emin+1` possible exponents, `log2(B^p) + log2(emax-emin+1) + 1` bits are needed (the final `+ 1` is the sign bit) to represent a given Floating Point number in such a system. + +In Example 2, 3+2+1=6 bits are needed. + +Based on this, the floating-point bitwise representation is defined to be: + +``` +b0 b1 b2 b3 b4 b5 +``` + +where + +```output +b0 -> sign (S) +b1, b2 -> exponent (E) +b3, b4, b5 -> mantissa (M) +``` + +However, this mapping alone is not enough. In this bitwise definition, the possible values of E are 0, 1, 2, 3. +But in the system being defined, only the integer values in the range [-1, 2] are of interest. + +For this reason, E is called the biased exponent, and in order to retrieve the value it is trying to represent (i.e.
the unbiased exponent) an offset must be added or subtracted (in this case, subtract 1): + +```output +x = (-1)^S x M x 2^(E-1) +``` + +## IEEE-754 Single Precision + +Single precision (also called float) is a 32-bit format defined by the [IEEE-754 Floating Point Standard](https://ieeexplore.ieee.org/document/8766229). + +In this standard, the sign uses 1 bit, the exponent uses 8 bits, and the mantissa uses 23 bits. + +The value of a (normalized) Floating Point number in IEEE-754 can be represented as: + +```output +x = (-1)^S x 1.M x 2^(E-127) +``` + +The exponent bias of 127 allows storage of exponents from -126 to +127. In normalized numbers the leading digit is an implicit 1, which gives 24 bits of precision. + +{{% notice Special Cases in IEEE-754 Single Precision %}} +With 8 bits of storage, E ranges between 0 and 2^8-1=255. However, not all of these 256 values are used for normal numbers. + +If the exponent E is: +* 0, then we are either in the presence of a denormalized number or a 0 (if M is 0 as well); +* 1 to 254, then we are in the normalized range; +* 255, then we are in the presence of Inf (if M==0) or NaN (if M!=0). + +Subnormal numbers (also called denormal numbers) are special floating-point values defined by the IEEE-754 standard. + +They allow the representation of numbers very close to zero, smaller than what is normally possible with the standard exponent range. + +Subnormal numbers do not have a leading 1 in their representation, and their exponent field E is 0. + +The value of a denormal Floating Point number in IEEE-754 is given by: + +```output +x = (-1)^S x 0.M x 2^(-126) +``` + + + + +{{% /notice %}} + +If you're interested in diving deeper into this subject, [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) by David Goldberg is a good place to start. + diff --git a/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/multi-accuracy.md b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/multi-accuracy.md new file mode 100644 index 0000000000..807b338b49 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/multi-accuracy.md @@ -0,0 +1,112 @@ +--- +title: Accuracy modes in Libamath +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + + +## The 3 accuracy modes of Libamath + +Libamath vector functions come in several accuracy modes for the same mathematical function. +This means that some of its functions allow users and compilers to choose between: +- **High accuracy** (≤ 1 ULP) +- **Default accuracy** (≤ 3.5 ULP) +- **Low accuracy / max performance** (approx. ≤ 4096 ULP) + + +## How accuracy modes are encoded in Libamath + +You can recognize the accuracy mode of a function by inspecting the **suffix** in its symbol; a minimal calling sketch follows this list: + +- **`_u10`** → High accuracy + For instance, `armpl_vcosq_f32_u10` + Ensures results stay within **1 Unit in the Last Place (ULP)**. + +- *(no suffix)* → Default accuracy + For instance, `armpl_vcosq_f32` + Keeps errors within **3.5 ULP** — a sweet spot for many workloads. + +- **`_umax`** → Low accuracy + For instance, `armpl_vcosq_f32_umax` + Prioritizes speed, tolerating errors up to **4096 ULP**, or roughly **11 correct bits** in single-precision.
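+
+Here is a minimal sketch showing how the three suffixed variants of the same function can be called side by side. It uses the Neon `cosf` variants named above; the header name `amath.h` and the file name `sketch.c` are assumptions for illustration, so adjust them to match your installation, and link with `-lamath -lm` as in the full example later in this Learning Path.
+
+```C
+// Minimal sketch (assumed header name): call the three accuracy modes of the
+// Neon single-precision cosine and print the results side by side.
+#include <stdio.h>
+#include <arm_neon.h>
+#include <amath.h>   // assumed to declare armpl_vcosq_f32, _u10, and _umax variants
+
+int main(void) {
+    float32x4_t x = vdupq_n_f32(0.5f);
+
+    float hi  = vgetq_lane_f32(armpl_vcosq_f32_u10(x), 0);  // <= 1.0 ULP
+    float def = vgetq_lane_f32(armpl_vcosq_f32(x), 0);      // <= 3.5 ULP
+    float lo  = vgetq_lane_f32(armpl_vcosq_f32_umax(x), 0); // <= 4096 ULP
+
+    printf("u10:     %a\ndefault: %a\numax:    %a\n", hi, def, lo);
+    return 0;
+}
+```
+
+Compiling with something like `gcc -O2 sketch.c -o sketch -lamath -lm` and comparing the three printed values shows how the low-accuracy result may differ from the other two in the low-order bits.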
+ + +## Applications + +Selecting an appropriate accuracy level helps avoid unnecessary compute cost while preserving output quality where it matters. + + +### High Accuracy (≤ 1 ULP) + +Use this mode when **near-exact numerical correctness** is a priority. These routines involve precise algorithms (such as high-degree polynomials, careful range reduction, or FMA usage) and are ideal for: + +- **Scientific computing** + such as simulations or finite element analysis +- **Signal processing pipelines** [1,2] + particularly recursive filters or transforms +- **Validation & reference implementations** + +While slower, these functions provide **near-bitwise reproducibility** — critical in sensitive domains. + + +### Default Accuracy (≤ 3.5 ULP) + +The default mode strikes a **practical balance** between performance and numerical fidelity. It’s optimized for: + +- **General-purpose math libraries** +- **Analytics workloads** [3] + such as log or sqrt during feature extraction +- **Inference pipelines** [4] + especially on edge devices where latency matters + +Also suitable for many **scientific workloads** that can tolerate modest error in exchange for **faster throughput**. + + +### Low Accuracy / Max Performance (≤ 4096 ULP) + +This mode trades precision for speed — aggressively. It's designed for: + +- **Games, graphics, and shaders** [5] + such as approximating sin or cos for animation curves +- **Monte Carlo simulations** + where statistical convergence outweighs per-sample accuracy [6] +- **Genetic algorithms, audio processing, and embedded DSP** + +Avoid this mode in control-flow-critical code or where **errors amplify**. + + +## Summary + +| Accuracy Mode | Libamath example | Approx. Error | Performance | Typical Applications | +|---------------|------------------------|------------------|-------------|-----------------------------------------------------------| +| `_u10` | _ZGVnN4v_cosf_u10 | ≤1.0 ULP | Low | Scientific computing, backpropagation, validation | +| *(default)* | _ZGVnN4v_cosf | ≤3.5 ULP | Medium | General compute, analytics, inference | +| `_umax` | _ZGVnN4v_cosf_umax | ≤4096 ULP | High | Real-time graphics, DSP, approximations, simulations | + + + +{{% notice Tip %}} +If your workload has mixed precision needs, you can *selectively call different accuracy modes* for different parts of your pipeline. Libamath lets you tailor precision where it matters — and boost performance where it doesn’t. +{{% /notice %}} + + +#### References +1. Higham, N. J. (2002). *Accuracy and Stability of Numerical Algorithms* (2nd ed.). SIAM. + +2. Texas Instruments. Overflow Avoidance Techniques in Cascaded IIR Filter Implementations on the TMS320 DSPs. Application Report SPRA509, 1999. +https://www.ti.com/lit/pdf/spra509 + +3. Ma, S., & Huai, J. (2019). Approximate Computation for Big Data Analytics. arXiv:1901.00232. +https://arxiv.org/pdf/1901.00232 + +4. Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML), PMLR 37. +https://proceedings.mlr.press/v37/gupta15.html + +5. Unity Technologies. *Precision Modes*. Unity Shader Graph Documentation. +[https://docs.unity3d.com/Packages/com.unity.shadergraph@17.1/manual/Precision-Modes.html](https://docs.unity3d.com/Packages/com.unity.shadergraph@17.1/manual/Precision-Modes.html) + +6. Croci, M., Gorman, G. J., & Giles, M. B. (2021). Rounding Error using Low Precision Approximate Random Variables. arXiv:2012.09739.
+https://arxiv.org/abs/2012.09739 + diff --git a/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/ulp-error.md b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/ulp-error.md new file mode 100644 index 0000000000..8f253905f9 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/ulp-error.md @@ -0,0 +1,113 @@ +--- +title: ULP Error and Accuracy +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +# ULP Error and Accuracy + +In the development of Libamath, a metric called ULP error is used to assess the accuracy of functions. +This metric measures the distance between two numbers, a reference (`want`) and an approximation (`got`), in terms of how many floating-point “steps” (ULPs) the two numbers are apart. + +It can be calculated by: + +``` +ulp_err = | want - got | / ULP(want) +``` + +Because this is a relative measure in terms of floating-point spacing (ULPs)—that is, this metric is scale-aware—it is ideal for comparing accuracy across magnitudes. Otherwise, error measures would be heavily biased by the uneven distribution of floating-point numbers. + + +# ULP Error Implementation + +In practice, however, the above expression may take different forms to account for sources of error that may occur during the computation of the error itself. + +In the implementation used here, that extra rounding error is captured by a term called `tail`: + +``` +ulp_err = | (got - want) / ULP(want) - tail | +``` + +This term accounts for the error introduced by rounding the high-precision reference `want_l` to the working-precision value `want`. It is expressed in ULPs of `want` and keeps its sign: + +``` +tail = (want_l - want) / ULP(want) +``` + +Here is a simplified version of the ULP error computation. It uses the same `ulp.h` from the previous section. + +Use a text editor to copy the code below into a new file `ulp_error.h`. + +```C +// Defines ulpscale(x) +#include "ulp.h" + +// Compute the ULP error given: +// - got: computed result (float) +// - want_l: high-precision reference (double), rounded to float as want +double ulp_error(float got, double want_l) { + + float want = (float) want_l; + + // Early exit for exact match + if (want_l == (double)want && got == want) { + return 0.0; + } + + int ulp_exp = ulpscale(want); + + // Fractional tail from float rounding + double tail = scalbn(want_l - (double)want, -ulp_exp); + + // Difference between computed and rounded reference + double diff = (double)got - (double)want; + + // Return total ULP error with bias correction + return fabs(scalbn(diff, -ulp_exp) - tail); +} +``` +Note that the final scaling is done with respect to the rounded reference. + +In this implementation, it is possible to get exactly 0.0 ULP error if and only if: + +* The high-precision reference (`want_l`, a double) is exactly representable as a float, and +* The computed result (`got`) is bitwise equal to that float representation. + +Below is a small example to check this implementation. + +Save the code below into a file named `ulp_error.c`. + +```C +#include <stdio.h> +#include "ulp_error.h" + +int main() { + float got = 1.0000001f; + double want_l = 1.0; + double ulp = ulp_error(got, want_l); + printf("ULP error: %f\n", ulp); + return 0; +} +``` + +Compile the program with GCC.
+ +```bash +gcc -O2 ulp_error.c -o ulp_error +``` + +Run the program: + +```bash +./ulp_error +``` + +The output should be: + +```output +ULP error: 1.000000 +``` + +If you are interested in diving into the full implementation of the ULP error, you can consult the [tester](https://github.com/ARM-software/optimized-routines/tree/master/math/test) tool in [AOR](https://github.com/ARM-software/optimized-routines/tree/master), with particular focus on the [ulp.h](https://github.com/ARM-software/optimized-routines/blob/master/math/test/ulp.h) file. Note that this tool also handles special cases and accounts for the effect of different rounding modes on the ULP error. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/ulp.md b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/ulp.md new file mode 100644 index 0000000000..d37302d6bd --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multi-accuracy-libamath/ulp.md @@ -0,0 +1,159 @@ +--- +title: Units in the Last Place (ULP) +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +# ULP + +Units in the Last Place (ULP) is the distance between two adjacent floating-point numbers at a given value, representing the smallest possible change in that number's representation. + +It is a property of a number and can be calculated with the following expression: + +```output +ULP(x) = nextafter(x, +inf) - x +``` + +Building on the example shown in the previous section: + +Fixing `B=2, p=3, emax=2, emin=-1` + +| Significand | × 2⁻¹ | × 2⁰ | × 2¹ | × 2² | +|-------------|-------|------|------|------| +| 1.00 (1.0) | 0.5 | 1.0 | 2.0 | 4.0 | +| 1.01 (1.25) | 0.625 | 1.25 | 2.5 | 5.0 | +| 1.10 (1.5) | 0.75 | 1.5 | 3.0 | 6.0 | +| 1.11 (1.75) | 0.875 | 1.75 | 3.5 | 7.0 | + +Based on the above definition, the ULP value for the numbers in this set can be computed as follows: + +``` +ULP(0.625) = nextafter(0.625, +inf) - 0.625 = 0.75-0.625 = 0.125 +``` +``` +ULP(4.0) = 1.0 +``` + +As the exponent of `x` grows, `ULP(x)` also increases exponentially; that is, the spacing between floating-point numbers becomes larger. + +Numbers with the same exponent have the same ULP. + +For normalized IEEE-754 floats, a similar behavior is observed: the distance between two adjacent representable values — i.e., ULP(x) — is a power of two that depends only on the exponent of x. + +Hence, another expression used to calculate the ULP of normalized Floating Point numbers is: + +``` +ULP(x) = 2^(e-p+1) +``` + +where: +* `e` is the exponent (in the IEEE-754 definition of single precision this is `E-127`) +* `p` is the precision + +For IEEE-754 single-precision floats (`p=24`), this expression becomes: +``` +ULP(x) = 2^(e-23) +``` +This form is often used when computing ULPs in code because it can be evaluated directly from the exponent bits. + + +{{% notice ULP of Denormal Numbers %}} +Note that for denormal numbers, the latter expression does not apply. + +In single precision as defined in IEEE-754, the smallest positive subnormal is: + +``` +min_pos_denormal = 2^-23 x 2^-126 = 2^-149 +``` + +The second smallest is: +``` +second_min_pos_denormal = 2^-22 x 2^-126 = 2^-148 = 2*2^-149 +``` +and so on... + +The denormal numbers are evenly spaced by `2^-149`. + +{{% /notice %}} + + +## ULP implementation + +Below is an example implementation of the ULP function. + +Use a text editor to save the code below in a file named `ulp.h`.
+ +```C +#include <stdint.h> +#include <string.h> +#include <math.h> + +// Bit cast float to uint32_t +static inline uint32_t asuint(float x) { + uint32_t u; + memcpy(&u, &x, sizeof(u)); + return u; +} + +// Compute exponent of ULP spacing at x +static inline int ulpscale(float x) { + // recover the biased exponent E + int e = asuint(x) >> 23 & 0xff; + if (e == 0) + e++; // handle subnormals + + // get the exponent of the ULP + // e - p + 1 = (E - 127) - 23 + return e - 127 - 23; +} + +// Compute ULP spacing at x using ulpscale and scalbnf +static float ulp(float x) { + return scalbnf(1.0f, ulpscale(x)); +} +``` + +There are three key functions in this implementation: +* The `asuint(x)` function reinterprets the bit pattern of a float as a 32-bit unsigned integer, allowing the extraction of specific bit fields such as the exponent. +* The `ulpscale(x)` function returns the base-2 exponent of the ULP spacing at a given float value x, which is the result of `log2(ULP(x))`. The `e` variable in this function corresponds to the quantity E previously mentioned (the bitwise value of the exponent). +* The `scalbnf(m, n)` function (a standard function declared in math.h) efficiently evaluates `m x 2^n`. + + +Below is an example that uses the `ulp()` function. + +Use a text editor to save the code below in a file named `ulp.c`. + +```C +#include <stdio.h> +#include "ulp.h" + +int main() { + float x = 1.00000001f; + float spacing = ulp(x); + + printf("ULP of %.8f is %.a\n", x, spacing); + return 0; +} +``` + +Compile the program with GCC. + +```bash +gcc -O2 ulp.c -o ulp +``` + +Run the program: + +```bash +./ulp +``` + +On most systems, the output will print: + +```output +ULP of 1.00000000 is 0x1p-23 +``` + +This is the correct ULP spacing for values near 1.0f in IEEE-754 single-precision format. \ No newline at end of file diff --git a/data/stats_weekly_data.yml b/data/stats_weekly_data.yml index 6bc46cd4f4..196860c0d1 100644 --- a/data/stats_weekly_data.yml +++ b/data/stats_weekly_data.yml @@ -6447,3 +6447,113 @@ avg_close_time_hrs: 0 num_issues: 14 percent_closed_vs_total: 0.0 +- a_date: '2025-06-30' + content: + automotive: 2 + cross-platform: 33 + embedded-and-microcontrollers: 41 + install-guides: 102 + iot: 6 + laptops-and-desktops: 38 + mobile-graphics-and-gaming: 34 + servers-and-cloud-computing: 124 + total: 380 + contributions: + external: 97 + internal: 505 + github_engagement: + num_forks: 30 + num_prs: 8 + individual_authors: + adnan-alsinan: 2 + alaaeddine-chakroun: 2 + albin-bernhardsson: 1 + alex-su: 1 + alexandros-lamprineas: 1 + andrew-choi: 2 + andrew-kilroy: 1 + annie-tallund: 4 + arm: 3 + arnaud-de-grandmaison: 4 + arnaud-de-grandmaison.: 1 + aude-vuilliomenet: 1 + avin-zarlez: 1 + barbara-corriero: 1 + basma-el-gaabouri: 1 + ben-clark: 1 + bolt-liu: 2 + brenda-strech: 1 + chaodong-gong: 1 + chen-zhang: 1 + christophe-favergeon: 1 + christopher-seidl: 7 + cyril-rohr: 1 + daniel-gubay: 1 + daniel-nguyen: 2 + david-spickett: 2 + dawid-borycki: 33 + diego-russo: 2 + dominica-abena-o.-amanfo: 1 + elham-harirpoush: 2 + florent-lebeau: 5 + "fr\xE9d\xE9ric--lefred--descamps": 2 + gabriel-peterson: 5 + gayathri-narayana-yegna-narayanan: 1 + georgios-mermigkis: 1 + geremy-cohen: 2 + gian-marco-iodice: 1 + graham-woodward: 1 + han-yin: 1 + iago-calvo-lista: 1 + james-whitaker: 1 + jason-andrews: 103 + joe-stech: 6 + johanna-skinnider: 2 + jonathan-davies: 2 + jose-emilio-munoz-lopez: 1 + julie-gaskin: 5 + julio-suarez: 6 + jun-he: 1 + kasper-mecklenburg: 1 + kieran-hejmadi: 12 + koki-mitsunami: 2 + konstantinos-margaritis: 8 +
kristof-beyls: 1 + leandro-nunes: 1 + liliya-wu: 1 + mark-thurman: 1 + masoud-koleini: 1 + mathias-brossard: 1 + michael-hall: 5 + na-li: 1 + nader-zouaoui: 2 + nikhil-gupta: 1 + nina-drozd: 1 + nobel-chowdary-mandepudi: 6 + odin-shen: 7 + owen-wu: 2 + pareena-verma: 46 + paul-howard: 3 + peter-harris: 1 + pranay-bakre: 5 + preema-merlin-dsouza: 1 + przemyslaw-wirkus: 2 + rin-dobrescu: 1 + roberto-lopez-mendez: 2 + ronan-synnott: 45 + shuheng-deng: 1 + thirdai: 1 + tianyu-li: 2 + tom-pilar: 1 + uma-ramalingam: 1 + varun-chari: 2 + visualsilicon: 1 + willen-yang: 1 + ying-yu: 2 + yiyang-fan: 1 + zach-lasiuk: 2 + zhengjun-xing: 2 + issues: + avg_close_time_hrs: 0 + num_issues: 17 + percent_closed_vs_total: 0.0