diff --git a/.wordlist.txt b/.wordlist.txt index c6a6a55f98..0b5f05a1cd 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -4210,4 +4210,29 @@ iperf normals svcntb svmatch -tc \ No newline at end of file +tc +Alexa +BLERP +Cepstral +Datagram +Edudzi +GDPR +Gbit +Gershon +HIPAA +HVAC +Kordorwu +MCUs +MFCC +Mbit +PDM +Situnayake +accelerometers +iPerf +libcrypto +libray +libssl +misclassification +retransmission +subquery +uninstrumented \ No newline at end of file diff --git a/content/learning-paths/automotive/openadkit1_container/3_setup_openadkit.md b/content/learning-paths/automotive/openadkit1_container/3_setup_openadkit.md index df396181c4..bb8abd517b 100644 --- a/content/learning-paths/automotive/openadkit1_container/3_setup_openadkit.md +++ b/content/learning-paths/automotive/openadkit1_container/3_setup_openadkit.md @@ -32,7 +32,7 @@ Docker version 28.0.4, build b8034c0 Clone the demo repository using: ```bash -git clone https://github.com/autowarefoundation/openadkit_demo.autoware.git +git clone https://github.com/odincodeshen/openadkit_demo.autoware.git ``` The project is containerized in three Docker images, so you do not need to install any additional software. diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Connect and set up arduino.md b/content/learning-paths/embedded-and-microcontrollers/Egde/Connect and set up arduino.md new file mode 100644 index 0000000000..0983867b0c --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/Egde/Connect and set up arduino.md @@ -0,0 +1,71 @@ +--- +title: Board Connection and IDE setup +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +### Arduino Nano RP2040 + +To get started with your first **TinyML project**, a great option is the **Arduino Nano RP2040 Connect**. Built by Arduino, it uses the powerful **RP2040 microcontroller** and is fully supported by the Arduino core package. The board comes with built-in Wi-Fi, Bluetooth, and an onboard IMU—features that make it ideal for deploying machine learning models at the edge. + +![example image alt-text#center](Images/nano.png "Arduino Nano RP2040") + +Its compatibility with popular tools like Edge Impulse and the Arduino IDE makes it a beginner-friendly yet powerful choice for TinyML applications. You can learn more about the Arduino Nano RP2040 Connect on the [official Arduino website](https://store.arduino.cc/products/arduino-nano-rp2040-connect-with-headers?_gl=1*1laabar*_up*MQ..*_ga*MTk1Nzk5OTUwMS4xNzQ2NTc2NTI4*_ga_NEXN8H46L5*czE3NDY1NzY1MjUkbzEkZzEkdDE3NDY1NzY5NTkkajAkbDAkaDE1MDk0MDg0ODc.). + +## Put everything together + +### Step 1: Connect the LED to the Arduino Nano RP2040 + +To visualize the output of the voice command model, we will use a simple LED circuit. + +### Components Needed + +- Arduino Nano RP2040 Connect +- 1x LED +- 1x 220Ω resistor +- Breadboard and jumper wires + +#### Circuit Diagram + +- **Anode (long leg) of the LED** → Connect to **GPIO pin D2** via the 220Ω resistor +- **Cathode (short leg)** → Connect to **GND** + +![example image alt-text#center](Images/LED_Connection.png "Figure 14. Circuit Connection") + +![example image alt-text#center](Images/LED_Connection_Schematic.png "Figure 15. Circuit Schematic Connection") + +### Step 2: Set Up the Arduino IDEs + +To program and deploy your trained model to the Arduino Nano RP2040, you first need to configure your development environment. 
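+This learning path uses the Arduino IDE throughout. If you also have the `arduino-cli` tool installed, you can optionally use it later to confirm that the board support package installed correctly and that your board is detected. The commands below are only a sketch of this optional check (they assume `arduino-cli` is already installed); they are not required for the steps that follow.
+
+```console
+# Install the Arduino Mbed OS Nano Boards core used by the Nano RP2040 Connect
+arduino-cli core install arduino:mbed_nano
+
+# List connected boards to confirm the Nano RP2040 Connect is detected
+arduino-cli board list
+```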
+ +Follow the detailed setup instructions provided in the following learning path: + +[Arduino Nano RP2040 Setup Guide](https://learn.arm.com/install-guides/arduino-pico/) + +This guide will walk you through: + +- Installing the Arduino IDE +- Adding the board support package for the Nano RP2040 + +{{% notice Note %}} +**Note:** Follow every instruction in the guide **except** `How do I set up the Raspberry Pi Pico W?`, as it is not needed for this project. +{{% /notice %}} + +### Step 3: Select Your Board and Port in the Arduino IDE + +First, open the **Arduino IDE**. + +To select your board: + +1. Go to **Tools** > **Board**. +2. From the list, choose **Arduino Nano RP2040 Connect**. + +To select your port: + +1. Connect your Arduino board to your computer using a USB cable. +2. Go to **Tools** > **Port**. +3. Select the port labeled with your board’s name, e.g., `COM4 (Arduino Nano RP2040 Connect)` or `/dev/cu.usbmodem...` on macOS. + +*Your Arduino IDE is now ready to upload code to the Arduino Nano RP2040.* diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/1.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/1.png new file mode 100644 index 0000000000..395465d841 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/1.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/10.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/10.png new file mode 100644 index 0000000000..c29ce5ddf1 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/10.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/11.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/11.png new file mode 100644 index 0000000000..289c9cb116 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/11.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/12.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/12.png new file mode 100644 index 0000000000..07f4ea140b Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/12.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/13.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/13.png new file mode 100644 index 0000000000..6f9e54834d Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/13.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/14.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/14.png new file mode 100644 index 0000000000..c031886e1b Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/14.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/15.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/15.png new file mode 100644 index 0000000000..217bf1a7d6 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/15.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/16.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/16.png new file mode 100644 index 0000000000..4cfefaad0e Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/16.png 
differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/17.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/17.png new file mode 100644 index 0000000000..ad4e26d2fa Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/17.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/2.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/2.png new file mode 100644 index 0000000000..c479b6f9a8 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/2.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/3.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/3.png new file mode 100644 index 0000000000..b2a545c191 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/3.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/3b.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/3b.png new file mode 100644 index 0000000000..27b9b0a623 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/3b.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/4.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/4.png new file mode 100644 index 0000000000..199fb6331b Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/4.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/5.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/5.png new file mode 100644 index 0000000000..73534bfedd Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/5.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/6.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/6.png new file mode 100644 index 0000000000..9a846b8cc6 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/6.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/7.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/7.png new file mode 100644 index 0000000000..9f18570ac9 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/7.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/8.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/8.png new file mode 100644 index 0000000000..11b278ed2f Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/8.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/9.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/9.png new file mode 100644 index 0000000000..af4a34ef1e Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/9.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/LED_Connection.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/LED_Connection.png new file mode 100644 index 0000000000..ad88957013 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/LED_Connection.png differ diff --git 
a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/LED_Connection_Schematic.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/LED_Connection_Schematic.png new file mode 100644 index 0000000000..793a359ac9 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/LED_Connection_Schematic.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/Serial_monitor.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/Serial_monitor.png new file mode 100644 index 0000000000..2baa857394 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/Serial_monitor.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Images/nano.png b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/nano.png new file mode 100644 index 0000000000..cd5592e129 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/Egde/Images/nano.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Overview.md b/content/learning-paths/embedded-and-microcontrollers/Egde/Overview.md new file mode 100644 index 0000000000..580524ad61 --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/Egde/Overview.md @@ -0,0 +1,69 @@ +--- +title: Overview +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +# Edge AI +Edge AI refers to artificial intelligence models that run directly on edge devices, processing data locally rather than relying on cloud computing. These models are optimized for real-time decision-making on resource-constrained devices, such as microcontrollers, embedded systems, and IoT sensors. + +**TinyML (Tiny Machine Learning)** is a subset of Edge AI that focuses specifically on deploying machine learning models on ultra-low-power microcontrollers and resource-constrained devices. These microcontrollers typically have limited computational resources — often less than 1 MB of flash memory and only a few hundred kilobytes of RAM — and are designed to run on minimal power, sometimes for years on a single coin-cell battery. Despite these constraints, TinyML enables such devices to perform on-device inference, allowing them to make intelligent decisions in real time without needing to send data to the cloud. This opens the door for smart functionality in low-cost, battery-powered devices used in applications such as environmental monitoring, wearables, smart homes, industrial sensors, and more. + +## Key Characteristics of Edge AI and TinyML + +Key features of Edge AI and TinyML include; + +- **Low Power Consumption**: Designed to run on batteries or harvested energy for months or years. + +- **Small Model Size**: Models are optimized (e.g., quantized or pruned) to fit into a few kilobytes or megabytes. + +- **Limited Compute & Memory** : Typically operates with <1MB RAM and very limited storage. + +- **Real-Time Inference** : Enables immediate local decision-making (e.g., wake-word detection). + +- **Low Latency** : No reliance on cloud – inference is performed on-device. + +- **Applications** : Often used in audio classification, gesture detection, anomaly detection, etc. + +- **Example Devices** : Arduino Nano 33 BLE Sense, STM32 MCUs, Raspberry Pi Pico, Arduino Nano RP2040 Connect, and more. + +## Running AI Models on Resource-Constrained Devices + +Running AI on edge devices presents several challenges. 
These devices often lack high-performance CPUs or GPUs, making computational power a limiting factor. Limited RAM and storage require careful memory management, and since many edge devices run on batteries, energy efficiency is a critical concern. To overcome these constraints, models are optimized through techniques such as quantization, pruning, and knowledge distillation, which reduce model size while maintaining accuracy.
+
+## Edge AI Implementation Workflow
+
+The process of implementing Edge AI begins with data collection using sensors, such as cameras, microphones, or motion detectors. This data is then used to train machine learning models on high-performance machines, such as cloud servers or workstations. Once trained, the models undergo optimization to reduce size and computational requirements before being deployed on microcontrollers or Arm-based microprocessors. Finally, inference takes place, where the model processes real-time data directly on the device to make decisions.
+
+## Applications of Edge AI
+
+Edge AI is used in a wide range of applications. In smart homes, voice assistants like Amazon Alexa rely on on-device speech recognition to process wake words. Security systems use AI-driven cameras to detect motion and identify anomalies, while energy management systems optimize power usage by analyzing real-time data from HVAC units.
+
+Wearable devices also benefit from Edge AI. Smartwatches monitor health by detecting heart rate irregularities, and fitness trackers use AI-powered motion analysis to improve exercise tracking.
+
+In industrial settings, predictive maintenance applications rely on IoT sensors to monitor vibrations and temperatures, helping prevent machinery failures. Smart agriculture systems use soil condition sensors to optimize irrigation and fertilization, while autonomous vehicles process sensor data for real-time navigation and obstacle detection.
+
+## Importance of Edge AI
+
+To understand the benefits of **Edge AI**, remember the acronym **BLERP**. BLERP highlights the critical aspects of deploying machine learning models on edge devices: **Bandwidth, Latency, Economics, Reliability, and Privacy**. These components are key to understanding the advantages of processing data on-device rather than relying on the cloud. The table below provides an overview of each component and its importance in Edge AI applications (Situnayake, 2023).
+
+| Area | Description |
+|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| B – Bandwidth | Edge AI reduces the amount of data that needs to be sent to the cloud. This is critical when working with high-volume data like video or sensor streams. Processing locally helps avoid congestion and dependency on internet speed. |
+| L – Latency | Edge devices can make real-time decisions faster because they don't rely on cloud round trips. One of the significant benefits of Edge AI is low latency: processing occurs on-device without needing to send data to the cloud. This is crucial for applications requiring real-time decision-making, such as self-driving cars or medical monitoring devices. Additionally, Edge AI allows devices to function in offline environments, making it ideal for remote locations with limited connectivity. |
+| E – Economics | Running models locally on low-power edge devices is often cheaper in the long run. It reduces cloud compute costs, data transmission costs, and energy consumption. |
+| R – Reliability | Edge AI systems can continue functioning even with limited or no internet connection. This makes them more robust in remote areas, mission-critical applications, or offline settings. |
+| P – Privacy | Data can be processed locally without being transmitted to external servers, reducing the risk of data breaches and helping with compliance with privacy regulations like GDPR or HIPAA. |
+
+## Why Learn Edge AI?
+
+Edge AI is transforming multiple industries. In healthcare, AI-powered medical diagnostics assist in early disease detection, while remote patient monitoring improves access to care. In agriculture, AI-driven sensors optimize soil conditions and pest control, leading to higher yields and resource efficiency. The manufacturing sector benefits from predictive maintenance and quality inspection, reducing downtime and improving productivity.
+
+## Next Steps
+
+To build effective TinyML and Edge AI projects, you need more than just data: both **software** and **hardware** play a critical role in the development process. While data forms the foundation for training machine learning models, the software enables data processing, model development, and deployment, and the hardware provides the physical platform for running these models at the edge.
+
+In this learning path, you will build a model that recognizes specific voice commands, which will be used to **control an LED on the Arduino Nano RP2040 Connect**. In the following steps, both the software and hardware components are discussed in detail.
+
diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Program and deployment.md b/content/learning-paths/embedded-and-microcontrollers/Egde/Program and deployment.md
new file mode 100644
index 0000000000..c44a096124
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/Egde/Program and deployment.md
@@ -0,0 +1,332 @@
+---
+title: Program your first tinyML device
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+# Programming your first tinyML device
+
+This Learning Path provides a complete sketch that you can upload onto your Arduino Nano RP2040. Follow the steps below to get started.
+
+## Step 1: Create a New Sketch
+
+1. Open the **Arduino IDE**.
+2. Go to **File** > **New**.
+3. A new sketch (blank code window) opens, ready for you to start writing your code.
+4. Save your sketch by going to **File** > **Save**. Choose a name and location for your file.
+
+## Step 2: Upload the Arduino Library from Edge Impulse
+
+1. After creating and saving your sketch, go to **Sketch** > **Include Library** > **Add .ZIP Library**.
+2. In the file dialog that opens, navigate to the location of the **ZIP file** you exported from Edge Impulse in [Set up your environment](/learning-paths/embedded-and-microcontrollers/egde/software_edge_impulse/).
+3. Select the **ZIP file** and click **Open**.
+
+## Step 3: Include the Library in Your Sketch
+
+Finally, to include the library and model in your sketch, go to **Sketch** > **Include Library** and select the newly installed library and model from the list.
+
+{{% notice Note %}}
+The library should be of the form `Name_of_your_library_inferencing.h`.
+{{% /notice %}}
+
+# Code walk-through
+
+Before running the code, it's important to understand what each part does.
+
+Take a few minutes to read through the comments and logic in the sketch before uploading it to your board.
The complete sketch can also be downloaded from the [companion repository](https://github.com/e-dudzi/Learning-Path.git).
+
+## Include Necessary Libraries and Define Data Structure for Inference
+
+This block sets up the core dependencies for running Edge Impulse inference on audio input. It includes the necessary libraries and defines a structure `inference_t` that holds the audio buffer and the state needed to manage sampling and inferencing.
+
+```c
+#include <Name_of_your_library_inferencing.h> // Include the Edge Impulse inference SDK for running the model (use the header name of the library you exported)
+#include <PDM.h>                              // Include the Pulse Density Modulation (PDM) library for audio input
+
+// Define a structure to store inference-related audio data
+typedef struct {
+    int16_t *buffer;    // Pointer to the audio sample buffer (16-bit signed integers)
+    uint8_t buf_ready;  // Flag to indicate if the buffer is ready for inference (1 = ready)
+    uint32_t buf_count; // Number of audio samples currently in the buffer
+    uint32_t n_samples; // Total number of samples required for a complete inference
+} inference_t;
+```
+
+## Code to Define Global Variables for Inference and Sample Buffer
+
+Declares and initializes global variables used for storing inference data, raw audio samples, the debug mode flag, and a recording readiness flag. These are essential for coordinating the data collection and inference process.
+
+```c
+static inference_t inference;              // Global instance of the inference structure to manage audio data
+static signed short sampleBuffer[2048];    // Buffer to temporarily store raw audio samples (16-bit signed)
+static bool debug_nn = false;              // Flag to enable/disable detailed neural network debug output
+static volatile bool record_ready = false; // Flag indicating when the system is ready to start recording (volatile due to use in ISR)
+```
+
+## Setup Function for Initializing the Serial and Microphone
+
+The `setup()` function is called once when the program starts. It initializes serial communication, sets up the LED pin, prints the configuration details, and starts the microphone buffer for audio data collection.
+
+```c
+// DEFINE THE MACRO FOR YOUR LED PIN HERE
+#define LED_PIN 2 // Update the pin number to match the digital pin you intend to use for the LED
+
+void setup() {
+    Serial.begin(115200);     // Start serial communication at 115200 baud rate
+    while (!Serial);          // Wait for the serial port to be ready
+    pinMode(LED_PIN, OUTPUT); // Configure the LED pin as output (to control an LED)
+    Serial.println("Edge Impulse Inferencing Demo"); // Print a startup message to the serial monitor
+
+    // Print inference configuration settings
+    ei_printf("Inferencing settings:\n");
+    ei_printf("\tInterval: ");
+    ei_printf_float((float)EI_CLASSIFIER_INTERVAL_MS); // Print the interval between inference cycles
+    ei_printf(" ms.\n");
+    ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE); // Print frame size for DSP processing
+    ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16); // Print sample length in ms
+    ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0])); // Print the number of classes
+
+    // Start the microphone buffer for audio input and check if it was successful
+    if (!microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT)) {
+        ei_printf("ERR: Could not allocate audio buffer\n"); // Error message if allocation fails
+        return; // Exit the setup function
+    }
+}
+```
+{{% notice Note %}}
+The macro `#define LED_PIN 2` specifies the pin number to which the LED is connected.
You can change this value to any available digital pin on your board. +{{% /notice %}} + +## Main Loop to Handle Inference and Control LED + +Defines the `loop()` function, which runs continuously after `setup()`. It performs the inference process, including recording audio, running the classifier, and controlling an LED based on the inference result. + +```c +void loop() { + ei_printf("Starting inferencing in 2 seconds...\n"); // Print message indicating a 2-second delay before starting inference + delay(2000); // Wait for 2 seconds before beginning the next operation + + ei_printf("Recording...\n"); // Print message to indicate that recording is starting + if (!microphone_inference_record()) { // Start recording audio data + ei_printf("ERR: Failed to record audio...\n"); // Print error if recording fails + return; // Exit loop if recording fails + } + + ei_printf("Recording done\n"); // Print message after recording is done + + // Set up the signal structure to hold the audio data and provide a function for data retrieval + signal_t signal; + signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT; // Set the total length of the recorded signal + signal.get_data = microphone_audio_signal_get_data; // Specify the function to get the data from the microphone + + ei_impulse_result_t result = {0}; // Initialize the result structure to store the classifier's output + + // Run the classifier continuously on the signal and store the results + if (run_classifier_continuous(&signal, &result, debug_nn) != EI_IMPULSE_OK) { + ei_printf("ERR: Failed to run classifier\n"); // Print error if classifier fails + return; // Exit loop if classifier fails + } + + print_inference_result(result); // Print the inference results (e.g., classification) + + // LED logic based on classifier result (turn on/off the LED based on inference) + for (uint16_t i = 0; i < EI_CLASSIFIER_LABEL_COUNT; i++) { + // If the classification result is "on", turn the LED on + if (strcmp(ei_classifier_inferencing_categories[i], "on") == 0 && result.classification[i].value > 0.5) { + digitalWrite(LED_PIN, HIGH); // Turn LED on + } + // If the classification result is "off", turn the LED off + else if (strcmp(ei_classifier_inferencing_categories[i], "off") == 0 && result.classification[i].value > 0.4) { + digitalWrite(LED_PIN, LOW); // Turn LED off + } + } +} +``` + +{{% notice Note %}} +The values 0.5 and 0.4 in the code above represent threshold levels that you can adjust to test and optimize the performance of the machine learning model. These thresholds determine when the LED should turn on or off based on the model's inference results. + +In this example, the LED will turn on if the model predicts the "on" label with a confidence of 50% or higher (i.e., a threshold of 0.5). Similarly, it will turn off if the model predicts the "off" label with a confidence of 40% or higher (i.e., a threshold of 0.4). + +You are free to modify these threshold values to better suit your application or improve model response. +{{% /notice %}} + +## PDM Data Ready Callback for Buffer Management + +Defines the `pdm_data_ready_inference_callback()` function, which is triggered when the Pulse Density Modulation (PDM) buffer is full. It reads available audio data and stores it in the inference buffer for processing. 
+ +```c +/* PDM buffer full callback */ +static void pdm_data_ready_inference_callback(void) { + int bytesAvailable = PDM.available(); // Check how many bytes are available in the PDM buffer + int bytesRead = PDM.read((char *)&sampleBuffer[0], bytesAvailable); // Read the available data into the sampleBuffer + + // If the buffer is not yet full and recording is ready, store the incoming data in the inference buffer + if ((inference.buf_ready == 0) && (record_ready == true)) { + // Loop through the bytes read and store them in the inference buffer + for (int i = 0; i < bytesRead >> 1; i++) { + inference.buffer[inference.buf_count++] = sampleBuffer[i]; // Store each 16-bit audio sample + if (inference.buf_count >= inference.n_samples) { // Check if enough samples have been collected + inference.buf_count = 0; // Reset the sample count + inference.buf_ready = 1; // Mark the buffer as ready for inference + break; // Exit loop once the buffer is full + } + } + } +} +``` + +## Initialize Microphone for Inference + +Allocates memory for audio sampling, configures the PDM microphone, and prepares the system to begin audio inference. + +```c +static bool microphone_inference_start(uint32_t n_samples) { + // Allocate memory to store the audio samples + inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t)); + if (inference.buffer == NULL) { + return false; // Return false if memory allocation fails + } + + // Initialize inference buffer settings + inference.buf_count = 0; // Reset sample count + inference.n_samples = n_samples; // Store number of required samples + inference.buf_ready = 0; // Mark buffer as not ready yet + + // Set the callback function that will be triggered when PDM data is available + PDM.onReceive(pdm_data_ready_inference_callback); + + // Set the size of the internal PDM buffer + PDM.setBufferSize(2048); + + // Give the microphone some time to initialize + delay(250); + + // Begin capturing from PDM microphone with 1 channel at classifier frequency + if (!PDM.begin(1, EI_CLASSIFIER_FREQUENCY)) { + ei_printf("ERR: Failed to start PDM!"); // Print error if microphone fails to start + microphone_inference_end(); // Clean up resources + return false; // Return failure + } + + return true; // Microphone successfully started +} + +``` + +## Microphone Inference Record Function + +Defines the `microphone_inference_record()` function, which waits for the microphone buffer to be filled with audio samples. Once the buffer is ready, it resets the buffer state and prepares for the next recording cycle. + +```c +static bool microphone_inference_record(void) { + record_ready = true; // Set the flag indicating that the system is ready to start recording + while (inference.buf_ready == 0) { // Wait until the buffer is full (i.e., ready for processing) + delay(10); // Brief delay to avoid busy-waiting + } + inference.buf_ready = 0; // Reset the buffer ready flag after processing the data + record_ready = false; // Reset the recording flag, indicating recording is no longer active + return true; // Return true indicating the recording was successful +} +``` + +## Microphone Data Handling and Cleanup + +Defines two functions: `microphone_audio_signal_get_data()`, which converts raw audio data from the buffer into a float format, and `microphone_inference_end()`, which cleans up by stopping the PDM and freeing memory allocated for the audio buffer. 
+ +```c +// Function to retrieve audio data from the inference buffer and convert it to float format +static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr) { + numpy::int16_to_float(&inference.buffer[offset], out_ptr, length); // Convert int16 audio data to float format + return 0; // Return 0 to indicate success +} + +// Function to clean up the microphone inference by stopping the PDM and freeing memory +static void microphone_inference_end(void) { + PDM.end(); // Stop the Pulse Density Modulation (PDM) interface + ei_free(inference.buffer); // Free the memory allocated for the inference buffer +} +``` + +## Function to Print the Inference Results + +Defines the `print_inference_result()` function, which outputs the results of the inference, including the timing of different stages — **DSP**, classification, anomaly detection — and the prediction values for each class. + +```c +void print_inference_result(ei_impulse_result_t result) { + // Print timing information for the DSP, inference, and anomaly stages + ei_printf("Timing: DSP %d ms, inference %d ms, anomaly %d ms\n", + result.timing.dsp, // Time taken for DSP processing + result.timing.classification, // Time taken for classification + result.timing.anomaly); // Time taken for anomaly detection + + ei_printf("Predictions:\n"); + // Loop through each class and print the classification result + for (uint16_t i = 0; i < EI_CLASSIFIER_LABEL_COUNT; i++) { + ei_printf(" %s: %.5f\n", // Print class name and its predicted value + ei_classifier_inferencing_categories[i], + result.classification[i].value); + } + + // If anomaly detection is enabled, print the anomaly prediction +#if EI_CLASSIFIER_HAS_ANOMALY == 1 + ei_printf("Anomaly prediction: %.3f\n", result.anomaly); // Print the anomaly score +#endif +} +``` + +{{% notice Note %}} +The `ei_printf` command is a custom logging function from the Edge Impulse SDK, used for printing debug or inference-related information to the serial monitor, optimized for embedded systems. It works similarly to `printf` but is tailored for the Edge Impulse environment. You can download the complete [Code_Sample.ino](https://github.com/e-dudzi/Learning-Path.git) and try it out yourself. +{{% /notice %}} + +# Run Your Code + +Now that you have a good understanding of the code, you should run it on your device. With your **Arduino Nano RP2040** plugged into your computer, and the correct [board and port](http://localhost:1313/learning-paths/embedded-and-microcontrollers/egde/connect-and-set-up-arduino/) selected in the Arduino IDE, follow these steps: + +#### If you're using the **Upload Button** + +1. Click the **right-facing arrow** at the top-left of the Arduino IDE window. +2. The IDE will compile your code and upload it to your board. +3. Wait for the message **“Done uploading.”** to appear at the bottom of the IDE. + +#### If you're using the **Sketch Menu** + +1. Go to **Sketch** > **Upload**. +2. The IDE will compile and upload your sketch to the board. +3. Once the upload is complete, you’ll see **“Done uploading.”** at the bottom. + +Your board should now start running the uploaded code automatically. + +### Verify Your Code is Running + +To further confirm that your code is running properly: + +1. Go to **Tools** > **Serial Monitor** in the Arduino IDE. +2. Set the baud rate to **115200** (if it's not already). +3. 
Observe the output messages: + - Start and end of recording + - Inference process + - Predictions for each label + +These messages indicate that your model is working and processing voice input as expected. + +### Recording Your Voice to Toggle the LED + +1. Wait for the **"Recording..."** message to appear on the Serial Monitor. This indicates that the system is ready to record your voice input. + +2. Speak your command (e.g., "on" or "off") quickly, as the system only records for a brief 1-second window. Once that window closes, inference will take place, and the system will process the voice command. + +3. The system will make a prediction based on the input, toggling the LED accordingly. + +4. You can adjust the **threshold** for prediction accuracy in the code to fine-tune when the LED should toggle, based on the prediction confidence level. This helps control how sensitive the system is to voice commands. + +### Serial Monitor Output + +Your Serial Monitor should look like the image below. + +![example image alt-text#center](Images/Serial_monitor.png "Figure 16. Circuit Connection") + +{{% notice Congratulations %}} +You’ve successfully programmed your first TinyML microcontroller! You've also built a functional, smart system to control an LED with your voice. +{{% /notice %}} diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/Software_Edge_Impulse.md b/content/learning-paths/embedded-and-microcontrollers/Egde/Software_Edge_Impulse.md new file mode 100644 index 0000000000..62fe07e8ae --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/Egde/Software_Edge_Impulse.md @@ -0,0 +1,163 @@ +--- +title: Set up your environment +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- +# Using Edge Impulse to Train TinyML Models + +Now that the foundational concepts of TinyML and Edge AI are clear, it's time to move from theory to practice. One of the most accessible and easy to use platforms for training TinyML models is **Edge Impulse**. It provides an intuitive, end-to-end pipeline for collecting data, designing features, training models, and deploying them to edge devices. In this section, we will explore how Edge Impulse is used to train models specifically for ultra-low-power microcontrollers, bridging the gap between machine learning and real-world embedded applications. + +## What is Edge Impulse? + +**Edge Impulse** is a development platform designed to simplify the process of building, training, and deploying machine learning (ML) models on **embedded systems and edge devices**, such as microcontrollers, sensors, and single-board computers (e.g., Raspberry Pi, Arduino). + +## Key Features of Edge Impulse + +| Feature | Description | +|-----------------------|-----------------------------------------------------------------------------------------------------| +| **Data Collection** | Collects data from sensors (e.g., accelerometers, microphones, cameras) in real time. | +| **Preprocessing** | Provides tools for signal processing and feature extraction tailored for embedded systems. | +| **Model Training** | Supports built-in ML algorithms and integrates with frameworks like TensorFlow. | +| **Edge Optimization** | Automatically optimizes models to run efficiently on low-power edge hardware. | +| **Deployment** | Enables seamless deployment to microcontrollers, RTOS-based boards, and Linux devices. | +| **Web-Based Platform**| Fully browser-based interface for managing projects and workflows without needing local setup. 
| + +--- + +## Why It’s Important in Edge AI and TinyML + +- **Bridges the gap** between machine learning and embedded development. +- **Accelerates prototyping** and deployment of AI features directly on hardware. +- **Supports TinyML** applications that run on devices with very limited memory and compute power. +- Works with popular hardware platforms like **Arduino**, **Raspberry Pi**, **Nordic**, **STMicroelectronics**, and more. + +## Getting Started with Edge Impulse + +To begin working with TinyML models, visit the **[Edge Impulse](https://edgeimpulse.com)**. You’ll need to create a free account to access the full platform. In the following sections, you will walk through each key page on the Edge Impulse platform using the attached snapshots as guide. These will help you understand what actions to take and how each part of the interface contributes to building and deploying your machine learning model. + +![example image alt-text#center](Images/1.png "Figure 1. Home Page of Edge Impulse") + +### Step 1: Create a New Project + +Once you’ve created your account and logged in, the first step is to **create a new project**. Give your project a name that clearly reflects its purpose—this helps with easy identification, especially if you plan to build multiple models later on. For example, if you're building a keyword spotting model, you might name it "Wake Word Detection". You’ll also need to select the appropriate **project type** and **project setting**, as shown in the snapshot below. + +![example image alt-text#center](Images/3.png "Figure 2. New Project Setup") + +### Step 2: Configure the Target Device + +After creating your project, the next step is to **configure the target device**. Since we are using the **Arduino Nano RP2040 Connect**, click the highlighted button to begin device configuration, as shown in the snapshot below. This ensures that the data collection, model training, and deployment steps are optimized for your specific hardware. + +The specifications of the Arduino Nano RP2040 Connect board can be found on [Arduino’s official page](https://store.arduino.cc/products/arduino-nano-rp2040-connect). + +Follow the exact settings in the attached snapshot to complete the configuration. + +![example image alt-text#center](Images/4.png "Figure 3. Configure Arduino Nano RP2040") + +### Step 3: Add the Dataset + +With your device configured, the next step is to **add your dataset** to the project. Click on the **"Add existing data"** button and follow the configuration settings shown in the attached snapshot. This allows you to upload pre-recorded data instead of collecting it live, which can save time during the development phase. + +The dataset for this project can be downloaded from the following link: [Download Dataset](https://github.com/e-dudzi/Learning-Path.git). The Dataset has already been split into **training** and **testing**. + +![example image alt-text#center](Images/6.png "Figure 4. Add Existing Data") + +{{% notice Note %}} +Do **not** check the **Green** highlighted area during upload. The dataset already includes metadata. Enabling that option may result in **much slower upload times** and is unnecessary for this project. +{{% /notice %}} + +![example image alt-text#center](Images/7.png "Figure 5. Dataset Overview") + +### Dataset Uploaded Successfully + +This is what you should see after the dataset has been successfully uploaded. The data samples will appear in the **Data acquisition** tab, categorized by their respective labels. 
You can click on each sample to inspect the raw signal, view metadata, and even **listen to the audio recordings** directly within the Edge Impulse interface. This helps verify that the uploaded data is accurate and usable for training. + +{{% notice Note %}} +This dataset is made up of **four labels**: `on`, `off`, `noise`, and `unknown`. +{{% /notice %}} + +![example image alt-text#center](Images/8.png "Figure 6. Dataset Overview") + +### Step 4: Create the Impulse + +Now that your data is ready, it's time to create the **impulse**, which defines the flow of data from input to output through processing blocks. Click on the **"Create Impulse"** button in the menu and configure it exactly as shown in the snapshot below. This typically includes setting the input data type (e.g., audio), adding a **processing block** (such as MFCC for audio), and a **learning block** (such as a neural network classifier). + +After configuring everything, **don’t forget to save your impulse**. + +![example image alt-text#center](Images/9.png "Figure 7. Create Impulse") + +### Step 5: Configure the MFCC Block + +Next, you'll configure the **MFCC (Mel Frequency Cepstral Coefficients)** processing block, which transforms the raw audio data into features suitable for model training. Click on **"MFCC"** in the left-hand menu under the **"Impulse Design"** section. + +Set the parameters exactly as shown in the snapshot below. These settings determine how the audio input is broken down and analyzed. Once you're done, be sure to **save the parameters**. These parameters are chosen for this path. Modifications can be made once you are familiar with Edge Impulse. + +![example image alt-text#center](Images/10.png "Figure 8. MFCC Block Configuration") + +{{% notice Note %}} +The **green highlighted section** on the MFCC configuration page gives an estimate of how the model will perform **on the target device**. This includes information like memory usage (RAM/Flash) and latency, helping you ensure the model fits within the constraints of your hardware. +{{% /notice %}} + +### Step 6: Generate Features + +After saving the MFCC parameters, the next step is to generate features from your dataset. Click on the **"Generate features"** button highlighted. Edge Impulse will process all your data samples using the MFCC configuration and create a set of features suitable for training a machine learning model. + +Once the feature generation is complete, you'll see a **2D visualization plot** that shows how the dataset is distributed across the four labels: `on`, `off`, `noise`, and `unknown`. This helps to visually confirm whether the different classes are well-separated and learnable by the model. + +![example image alt-text#center](Images/12.png "Figure 9. Feature Explorer") + +### Step 7: Setting Up the Classifier + +Now it's time to configure the **neural network classifier**, which will learn to recognize the different audio commands. Click on the **"Classifier"** button in the left-hand menu under **Impulse Design** and set the parameters exactly as shown in the snapshot below. + +{{% notice Note %}} +For this learning path, a learning rate of `0.002` was chosen, although the snapshot shows a value of `0.005`. You are free to experiment with different values to improve model accuracy. However, using `0.002` is recommended as a good starting point. +{{% /notice %}} + +Once all the parameters are set, click on **"Save and train"** to start training your model. + +![example image alt-text#center](Images/13.png "Figure 10. 
Classifier Settings") + +### Step 8: Reviewing Model Performance + +After the training process is complete, Edge Impulse will display the **model's performance**, including its overall **accuracy**, **loss**, and a **confusion matrix**. + +![example image alt-text#center](Images/14.png "Figure 11. Model Performance") + +- **Accuracy** reflects how often the model predicts the correct label. +- **Loss** indicates how far the model’s predictions are from the actual labels during training — a lower loss generally means better performance. +- The **confusion matrix** shows how well the model predicted each of the four labels (`on`, `off`, `noise`, `unknown`), and can help identify patterns of misclassification. + +Review these metrics to determine if the model is learning effectively. If needed, adjust the model parameters or revisit earlier steps to improve performance. + +**On-Device Performance (EON Compiler - RAM Optimized):** + +| Metric | Value | +|--------------------|-----------| +| Inference Time | 6 ms | +| Peak RAM Usage | 12.5 KB | +| Flash Usage | 49.7 KB | + +![example image alt-text#center](Images/15.png "Figure 12. Model Performance") + +You can also [download](https://github.com/e-dudzi/Learning-Path.git) a pre-trained model and continue from here. + +### Final Step: Deploying the Model + +To use the trained model on your Arduino Nano RP2040, follow the steps below to export it as an Arduino library. + +1. Click on the **Deployment** tab from the menu. +2. In the **search bar**, type **"Arduino"** to filter the export options. +3. Select **Arduino library** from the list. +4. The export process will start automatically, and the model will be downloaded as a `.zip` file. + +![example image alt-text#center](Images/16.png "Figure 13. Model Deployment") + +## Next Steps + +In the following steps, you will move from model training to real-world deployment. Specifically, we will: + +- Connect an **LED** to the **Arduino Nano RP2040** board. +- Set up the **Arduino IDE** for development. +- Program the board and **deploy the trained model** to recognize voice commands which will be used to turn `ON` and `OFF` the LED diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/_index.md b/content/learning-paths/embedded-and-microcontrollers/Egde/_index.md new file mode 100644 index 0000000000..72e4f00957 --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/Egde/_index.md @@ -0,0 +1,64 @@ +--- +title: Learn How to Run AI on Edge Devices- Arduino Nano RP2040 + +draft: true +cascade: + draft: true + +minutes_to_complete: 90 + +who_is_this_for: This learning path is for beginners in Edge AI and TinyML, including developers, engineers, hobbyists, AI/ML enthusiasts, and researchers working with embedded AI and IoT. + +learning_objectives: + - Understand Edge AI and TinyML basics. + - Collect and preprocess audio data using Edge Impulse. + - Train and deploy an audio classification model on Arduino Nano RP2040 + - Interface with LEDs to switch them on and off . + +prerequisites: + - Explore this [learning path](https://learn.arm.com/learning-paths/embedded-and-microcontrollers/arduino-pico/) if you are an absolute beginner. + - An [Edge Impulse](https://edgeimpulse.com/) Studio account. 
+ - The [Arduino IDE with the RP2040 board support package](https://learn.arm.com/install-guides/arduino-pico/) installed on your computer + - An Arduino Nano RP2040 Connect [board](https://store.arduino.cc/products/arduino-nano-rp2040-connect-with-headers?_gl=1*9t4cti*_up*MQ..*_ga*NTA1NTQwNzgxLjE3NDYwMjIyODk.*_ga_NEXN8H46L5*MTc0NjAyMjI4Ny4xLjEuMTc0NjAyMjMxOC4wLjAuMjA3MjA2NTUzMA..). + +author: Bright Edudzi Gershon Kordorwu +### Tags +skilllevels: Introductory +subjects: tinyML +armips: + - Cortex-M + +tools_software_languages: + - Edge Impulse + - tinyML + - Edge AI + - Arduino +operatingsystems: + - Baremetal + + + + +further_reading: + + - resource: + title: TinyML Brings AI to Smallest Arm Devices + link: https://newsroom.arm.com/blog/tinyml + type: blog + - resource: + title: What is edge AI? + link: https://docs.edgeimpulse.com/nordic/concepts/edge-ai/what-is-edge-ai + type: blog + - resource: + title: Edge Impulse for Beginners + link: https://docs.edgeimpulse.com/docs/readme/for-beginners + type: doc + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/embedded-and-microcontrollers/Egde/_next-steps.md b/content/learning-paths/embedded-and-microcontrollers/Egde/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/Egde/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/embedded-and-microcontrollers/_index.md b/content/learning-paths/embedded-and-microcontrollers/_index.md index 3c103cb3c0..eeb8fd0e51 100644 --- a/content/learning-paths/embedded-and-microcontrollers/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/_index.md @@ -10,7 +10,7 @@ key_ip: maintopic: true operatingsystems_filter: - Android: 1 -- Baremetal: 29 +- Baremetal: 30 - Linux: 28 - macOS: 6 - RTOS: 9 @@ -29,7 +29,7 @@ subtitle: Learn best practices for microcontroller development title: Embedded and Microcontrollers tools_software_languages_filter: - AI: 1 -- Arduino: 1 +- Arduino: 2 - Arm Compiler for Embedded: 7 - Arm Compiler for Linux: 1 - Arm Compute Library: 1 @@ -51,6 +51,8 @@ tools_software_languages_filter: - DetectNet: 1 - Docker: 9 - DSTREAM: 2 +- Edge AI: 1 +- Edge Impulse: 1 - ExecuTorch: 2 - Fixed Virtual Platform: 9 - FPGA: 1 @@ -88,7 +90,7 @@ tools_software_languages_filter: - STM32: 2 - TensorFlow: 3 - TensorRT: 1 -- tinyML: 1 +- tinyML: 2 - Trusted Firmware: 3 - TrustZone: 2 - TVMC: 1 diff --git a/content/learning-paths/embedded-and-microcontrollers/mlek/_index.md b/content/learning-paths/embedded-and-microcontrollers/mlek/_index.md index dbc9d5c226..dd94259436 100644 --- a/content/learning-paths/embedded-and-microcontrollers/mlek/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/mlek/_index.md @@ -40,7 +40,7 @@ tools_software_languages: further_reading: - resource: title: ML Evaluation Kit Quick Start Guide - link: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit/+/HEAD/docs/quick_start.md + link: https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit/-/blob/main/docs/quick_start.md type: documentation - resource: title: Creating ML applications for embedded devices on Arm Virtual Hardware diff --git a/content/learning-paths/embedded-and-microcontrollers/mlek/build.md b/content/learning-paths/embedded-and-microcontrollers/mlek/build.md index ced3fb262b..a88507a7d6 100644 --- a/content/learning-paths/embedded-and-microcontrollers/mlek/build.md +++ b/content/learning-paths/embedded-and-microcontrollers/mlek/build.md @@ -7,7 +7,7 @@ weight: 2 # 1 is first, 2 is second, etc. # Do not modify these elements layout: "learningpathall" --- -The [Arm ML Evaluation Kit (MLEK)](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit) provides a number of ready-to-use ML applications. These allow you to investigate the embedded software stack and evaluate performance on the Cortex-M55 and Ethos-U85 processors. +The [Arm ML Evaluation Kit (MLEK)](https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit) provides a number of ready-to-use ML applications. These allow you to investigate the embedded software stack and evaluate performance on the Cortex-M55 and Ethos-U85 processors. You can use the MLEK source code to build sample applications and run them on the [Corstone reference systems](https://www.arm.com/products/silicon-ip-subsystems/), for example the [Corstone-320](https://developer.arm.com/Processors/Corstone-320) Fixed Virtual Platform (FVP). @@ -53,7 +53,7 @@ You can review the installation guides for further details. 
Clone the ML Evaluation Kit repository, and navigate into the new directory: ```bash -git clone "https://review.mlplatform.org/ml/ethos-u/ml-embedded-evaluation-kit" +git clone "https://git.gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit.git" cd ml-embedded-evaluation-kit git submodule update --init ``` diff --git a/content/learning-paths/embedded-and-microcontrollers/mlek/run.md b/content/learning-paths/embedded-and-microcontrollers/mlek/run.md index 387dbde15e..a0f20db865 100644 --- a/content/learning-paths/embedded-and-microcontrollers/mlek/run.md +++ b/content/learning-paths/embedded-and-microcontrollers/mlek/run.md @@ -72,7 +72,7 @@ The application executes and identifies words spoken within audio files. Repeat with any of the other built applications. -Full instructions are provided in the evaluation kit [documentation](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit/+/HEAD/docs/quick_start.md). +Full instructions are provided in the evaluation kit [documentation](https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit/-/blob/main/docs/quick_start.md). ## Addendum: Speed up FVP execution diff --git a/content/learning-paths/embedded-and-microcontrollers/nav-mlek/sw.md b/content/learning-paths/embedded-and-microcontrollers/nav-mlek/sw.md index 1f129e9d63..a5451c2dac 100644 --- a/content/learning-paths/embedded-and-microcontrollers/nav-mlek/sw.md +++ b/content/learning-paths/embedded-and-microcontrollers/nav-mlek/sw.md @@ -11,7 +11,7 @@ layout: "learningpathall" You should use an `x86_64` development machine running Windows or Linux for the best experience. -The [Arm ML Evaluation Kit (MLEK)](https://git.mlplatform.org/ml/ethos-u/ml-embedded-evaluation-kit.git/) is not fully supported on Windows. Some of the required tools work only on Linux. Linux is recommended if you plan to use MLEK extensively. +The [Arm ML Evaluation Kit (MLEK)](https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit) is not fully supported on Windows. Some of the required tools work only on Linux. Linux is recommended if you plan to use MLEK extensively. There are some ML examples which can be developed using Windows tools. @@ -52,7 +52,7 @@ You may want to use [Docker](/install-guides/docker) to simplify ML development As an example, clone the MLEK repository and look at the `Dockerfile` at the top of the repository to see one way to use Docker in ML application development: ```console -git clone "https://review.mlplatform.org/ml/ethos-u/ml-embedded-evaluation-kit" +git clone "https://git.gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit.git" cd ml-embedded-evaluation-kit git submodule update --init ``` @@ -96,9 +96,9 @@ Resources for learning about ML applications are listed below for you to investi ### Arm ML Evaluation Kit (MLEK) -The [MLEK](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit) provides a number of example ML applications. +The [MLEK](https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit) provides a number of example ML applications. -[The Quick Start Guide](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit/+/HEAD/docs/quick_start.md) guides you through running an example application. 
+[The Quick Start Guide](https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit/-/blob/main/docs/quick_start.md) guides you through running an example application. ### Micro speech diff --git a/content/learning-paths/iot/iot-sdk/openiot.md b/content/learning-paths/iot/iot-sdk/openiot.md index 244fce5e59..6d9a793edf 100644 --- a/content/learning-paths/iot/iot-sdk/openiot.md +++ b/content/learning-paths/iot/iot-sdk/openiot.md @@ -7,7 +7,7 @@ weight: 2 # 1 is first, 2 is second, etc. # Do not modify these elements layout: "learningpathall" --- -[Arm Total Solutions for IoT](https://www.arm.com/markets/iot/total-solutions-iot) provide reference software stacks, integrating various Arm technologies, such as [Arm Trusted Firmware-M](https://developer.arm.com/Tools%20and%20Software/Trusted%20Firmware-M) and the [Arm ML Evaluation Kit (MLEK)](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit). +[Arm Total Solutions for IoT](https://www.arm.com/markets/iot/total-solutions-iot) provide reference software stacks, integrating various Arm technologies, such as [Arm Trusted Firmware-M](https://developer.arm.com/Tools%20and%20Software/Trusted%20Firmware-M) and the [Arm ML Evaluation Kit (MLEK)](https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit). The [Open-IoT-SDK](https://github.com/ARM-software/open-iot-sdk) is designed to be used with [Arm Virtual Hardware (AVH)](https://www.arm.com/products/development-tools/simulation/virtual-hardware), which provides [Corstone-300](https://developer.arm.com/Processors/Corstone-300) Virtual Hardware. diff --git a/content/learning-paths/laptops-and-desktops/docker-models/_index.md b/content/learning-paths/laptops-and-desktops/docker-models/_index.md index 1c8661c8e9..e2eb7bc7e0 100644 --- a/content/learning-paths/laptops-and-desktops/docker-models/_index.md +++ b/content/learning-paths/laptops-and-desktops/docker-models/_index.md @@ -1,22 +1,19 @@ --- -title: Learn how to use Docker Model Runner in AI applications +title: Run AI models with Docker Model Runner -draft: true -cascade: - draft: true minutes_to_complete: 45 -who_is_this_for: This is for software developers and AI enthusiasts who want to run AI models using Docker Model Runner. +who_is_this_for: This is for software developers and AI enthusiasts who want to run pre-trained AI models locally using Docker Model Runner. learning_objectives: - Run AI models locally using Docker Model Runner. - - Easily build containerized applications with LLMs. + - Build containerized applications that integrate Large Language Models (LLMs). prerequisites: - - A computer with at least 16GB of RAM (recommended) and Docker Desktop installed (version 4.40 or later). - - Basic understanding of Docker. - - Familiarity with Large Language Model (LLM) concepts. + - Docker Desktop (version 4.40 or later) installed on a system with at least 16GB of RAM (recommended). + - Basic understanding of Docker CLI and concepts. + - Familiarity with LLM concepts. author: Jason Andrews diff --git a/content/learning-paths/laptops-and-desktops/docker-models/compose.md b/content/learning-paths/laptops-and-desktops/docker-models/compose.md index fcc3657c08..d37779cf82 100644 --- a/content/learning-paths/laptops-and-desktops/docker-models/compose.md +++ b/content/learning-paths/laptops-and-desktops/docker-models/compose.md @@ -4,15 +4,13 @@ weight: 3 layout: "learningpathall" --- -Docker Compose makes it easy to run multi-container applications. 
Docker Compose can also include AI models in your project. +Docker Compose makes it easy to run multi-container applications, and it can also include those that include local AI inference services. -In this section, you'll learn how to use Docker Compose to deploy a web-based AI chat application that uses Docker Model Runner as the backend for AI inference. +In this section, you'll use Docker Compose to deploy a simple web-based AI chat application. The frontend is a Flask web app, and the backend uses Docker Model Runner to serve AI responses. ## Clone the example project -The example project, named [docker-model-runner-chat](https://github.com/jasonrandrews/docker-model-runner-chat) is available on GitHub. It provides a simple web interface to interact with local AI models such as Llama 3.2 or Gemma 3. - -First, clone the example repository: +Clone the [docker-model-runner-chat](https://github.com/jasonrandrews/docker-model-runner-chat) repository from GitHub. This project provides a simple web interface to interact with local AI models such as Llama 3.2 or Gemma 3. ```console git clone https://github.com/jasonrandrews/docker-model-runner-chat.git @@ -21,7 +19,7 @@ cd docker-model-runner-chat ## Review the Docker Compose file -The `compose.yaml` file defines how the application is deployed using Docker Compose. +The `compose.yaml` file defines defines how Docker Compose sets up and connects the services. It sets up two services: @@ -60,21 +58,21 @@ From the project directory, start the app with: docker compose up --build ``` -Docker Compose will build the web app image and start both services. +Docker Compose builds the web app image and starts both services. ## Access the chat interface -Open your browser and copy and paste the local URL below: +Once running, open your browser and copy-and-paste the local URL below: ```console http://localhost:5000 ``` -You can now chat with the AI model using the web interface. Enter your prompt and view the response in real time. +You’ll see a simple chat UI. Enter a prompt and get real-time responses from the AI model. -![Compose #center](compose-app.png) +![Compose #center](compose-app.png "Docker Model Chat") -## Configuration +## Configure the model You can change the AI model or endpoint by editing the `vars.env` file before starting the containers. The file contains environment variables used by the web application: @@ -88,15 +86,20 @@ BASE_URL=http://model-runner.docker.internal/engines/v1/ MODEL=ai/gemma3 ``` -To use a different model, change the `MODEL` value. For example: +To use a different model or API endpoint, change the `MODEL` value. For example: ```console MODEL=ai/llama3.2 ``` -Make sure to change the model in the `compose.yaml` file also. +Be sure to also update the model name in the `compose.yaml` under the `ai-runner` service. + +## Optional: customize generation parameters + +You can edit `app.py` to adjust parameters such as: -You can also change the `temperature` and `max_tokens` values in `app.py` to further customize the application. 
+* `temperature`: controls randomness (higher is more creative) +* `max_tokens`: controls the length of responses ## Stop the application @@ -112,12 +115,13 @@ docker compose down Use the steps below if you have any issues running the application: -- Ensure Docker and Docker Compose are installed and running -- Make sure port 5000 is not in use by another application -- Check logs with: +* Ensure Docker and Docker Compose are installed and running +* Make sure port 5000 is not in use by another application +* Check logs with: ```console docker compose logs ``` +## What you've learned In this section, you learned how to use Docker Compose to run a containerized AI chat application with a web interface and local model inference from Docker Model Runner. diff --git a/content/learning-paths/laptops-and-desktops/docker-models/models.md b/content/learning-paths/laptops-and-desktops/docker-models/models.md index 3b8a2897cf..6bb71324d9 100644 --- a/content/learning-paths/laptops-and-desktops/docker-models/models.md +++ b/content/learning-paths/laptops-and-desktops/docker-models/models.md @@ -4,11 +4,13 @@ weight: 2 layout: "learningpathall" --- -Docker Model Runner is an official Docker extension that allows you to run Large Language Models (LLMs) on your local computer. It provides a convenient way to deploy and use AI models across different environments, including Arm-based systems, without complex setup or cloud dependencies. +## Simplified Local LLM Inference + +Docker Model Runner is an official Docker extension that allows you to run Large Language Models (LLMs) directly on your local computer. It provides a convenient way to deploy and use AI models across different environments, including Arm-based systems, without complex framework setup or cloud dependencies. Docker uses [llama.cpp](https://github.com/ggml-org/llama.cpp), an open source C/C++ project developed by Georgi Gerganov that enables efficient LLM inference on a variety of hardware, but you do not need to download, build, or install any LLM frameworks. -Docker Model Runner provides a easy to use CLI that is familiar to Docker users. +Docker Model Runner provides a easy-to-use CLI interface that is familiar to Docker users. ## Before you begin @@ -18,21 +20,21 @@ Verify Docker is running with: docker version ``` -You should see output showing your Docker version. +You should see your Docker version shown in the output. -Confirm the Docker Desktop version is 4.40 or above, for example: +Confirm that Docker Desktop is version 4.40 or above, for example: ```output Server: Docker Desktop 4.41.2 (191736) ``` -Make sure the Docker Model Runner is enabled. +Make sure the Docker Model Runner is enabled: ```console docker model --help ``` -You should see the usage message: +You should see this output: ```output Usage: docker model COMMAND @@ -52,27 +54,28 @@ Commands: version Show the Docker Model Runner version ``` -If Docker Model Runner is not enabled, enable it using the [Docker Model Runner documentation](https://docs.docker.com/model-runner/). +If Docker Model Runner is not enabled, enable it by following the [Docker Model Runner documentation](https://docs.docker.com/model-runner/). -You should also see the Models icon in your Docker Desktop sidebar. +You should also see the **Models** tab and icon appear in your Docker Desktop sidebar. 
-![Models #center](models-tab.png) +![Models #center](models-tab.png "Docker Models UI") -## Running your first AI model with Docker Model Runner +## Run your first AI model with Docker Model Runner Docker Model Runner is an extension for Docker Desktop that simplifies running AI models locally. Docker Model Runner automatically selects compatible model versions and optimizes performance for the Arm architecture. -You can try Docker Model Runner by using an LLM from Docker Hub. +You can try Model Runner by downloading and running a model from Docker Hub. -The example below uses the [SmolLM2 model](https://hub.docker.com/r/ai/smollm2), a compact language model with 360 million parameters, designed to run efficiently on-device while performing a wide range of language tasks. You can explore additional [models in Docker Hub](https://hub.docker.com/u/ai). +The example below uses the [SmolLM2 model](https://hub.docker.com/r/ai/smollm2), a compact LLM with ~360 million parameters, designed for efficient on-device inference while performing a wide range of language tasks. You can explore further models in [Docker Hub](https://hub.docker.com/u/ai). -Download the model using: +1. Download the model ```console docker model pull ai/smollm2 ``` +2. Run the model interactively For a simple chat interface, run the model: @@ -96,10 +99,9 @@ int main() { return 0; } ``` +To exit the chat, use the `/bye` command. -You can ask more questions and continue to chat. - -To exit the chat use the `/bye` command. +3. View downloaded models You can print the list of models on your computer using: @@ -119,7 +121,9 @@ ai/llama3.2 3.21 B IQ2_XXS/Q4_K_M llama 436bb282b419 2 months ag ## Use the OpenAI endpoint to call the model -From your host computer you can access the model using the OpenAI endpoint and a TCP port. +Docker Model Runner exposes a REST endpoint compatible with OpenAI's API spec. + +From your host computer, you can access the model using the OpenAI endpoint and a TCP port. First, enable the TCP port to connect with the model: @@ -155,7 +159,7 @@ Run the shell script: bash ./curl-test.sh | jq ``` -If you don't have `jq` installed, you eliminate piping the output. +If you don't have `jq` installed, you can eliminate piping the output. The output, including the performance information, is shown below: @@ -193,5 +197,14 @@ The output, including the performance information, is shown below: } } ``` +You now have a fully functioning OpenAI-compatible inference endpoint running locally. + +## What you've learned + +In this section, you learned: + +* How to verify and use Docker Model Runner on Docker Desktop +* How to run a model interactively from the CLI +* How to connect to a model using a local OpenAI-compatible API -In this section you learned how to run AI models using Docker Model Runner. Continue to see how to use Docker Compose to build an application with a built-in AI model. +In the next section, you'll use Docker Compose to deploy a web-based AI chat interface powered by Docker Model Runner. 
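As an aside on the OpenAI-compatible endpoint described above: the contents of `curl-test.sh` are not reproduced in this Learning Path, so the request below is an illustrative sketch only. The path follows the `/engines/v1/` base shown in `vars.env` plus the standard OpenAI `chat/completions` suffix, and the host port `12434` is the default documented for Docker Model Runner's TCP endpoint; treat both as assumptions and substitute the port you actually enabled.

```console
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ai/smollm2",
        "messages": [
          { "role": "user", "content": "Give a one-line description of Docker Model Runner." }
        ]
      }' | jq
```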
diff --git a/content/learning-paths/mobile-graphics-and-gaming/get-started-with-arm-asr/02-ue.md b/content/learning-paths/mobile-graphics-and-gaming/get-started-with-arm-asr/02-ue.md index 0e1d217a14..3801d49c43 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/get-started-with-arm-asr/02-ue.md +++ b/content/learning-paths/mobile-graphics-and-gaming/get-started-with-arm-asr/02-ue.md @@ -9,21 +9,21 @@ layout: learningpathall ## Objective -This Learning Path describes how to get started with Arm® Accuracy Super Resolution™ (Arm ASR) using an example project in Unreal Engine. +This Learning Path describes how to get started with Arm® Accuracy Super Resolution™ (Arm ASR) using an example project in Unreal Engine. -It is for Unreal Engine developers who want to apply upscaling techniques to their projects. +It is for Unreal Engine developers who want to apply upscaling techniques to their projects. You will walk through the processes of installing Arm ASR performing some of the common setup tasks. ## Before you begin -It is recommended that you use Unreal Engine versions 5.3-5.5 through this tutorial. +It is recommended that you use Unreal Engine versions 5.3-5.5 through this tutorial. ## Installing the Arm ASR plugin Follow these steps to install the Arm ASR plugin in Unreal Engine: -1. Open the Unreal Engine project you plan to use with Arm ASR. +1. Open the Unreal Engine project you plan to use with Arm ASR. The Third Person pack is available as an example, see below: @@ -35,11 +35,16 @@ The Third Person pack is available as an example, see below: git clone https://github.com/arm/accuracy-super-resolution-for-unreal ``` -3. Navigate to the `UE` directory in the cloned repository. +3. Check out the branch corresponding to your Unreal Engine version. - The repository base contains directories containing the plugin for each supported version of Unreal Engine. Navigate to the folder corresponding to your version. For example, use `550` for Unreal Engine 5.5. + The repository contains branches containing the plugin for each supported version of Unreal Engine. For example, use branch `5.5` for Unreal Engine 5.5. -4. From the directory for your version of Unreal Engine, copy the Arm ASR plugin into the `Plugins` folder in the game directory. + ``` + cd accuracy-super-resolution-for-unreal + git checkout 5.5 + ``` + +4. From the directory for your version of Unreal Engine, copy the Arm ASR plugin into the `Plugins` folder in the game directory. See below: @@ -71,8 +76,8 @@ After reopening the Unreal Engine project, ensure that the Arm ASR plugin is ena ![Change Anti-Aliasing Method](images/change_anti_aliasing_method.png "Set the Anti-Aliasing Method") -3. To verify that Arm ASR is enabled and active, use the `ShowFlag.VisualizeTemporalUpscaler 1` console command. - +3. To verify that Arm ASR is enabled and active, use the `ShowFlag.VisualizeTemporalUpscaler 1` console command. + {{%notice%}} The debug views produced by this command are generated by Unreal Engine's TAA, not directly by Arm ASR.{{%/notice%}} @@ -127,6 +132,6 @@ You can configure Arm ASR's behavior through the following plugin-specific conso ## Next steps -You are now ready to use Arm ASR in your Unreal Engine projects. +You are now ready to use Arm ASR in your Unreal Engine projects. You can use [Arm Performance Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio) tools to measure the performance of your game as it runs on a mobile device, allowing you to monitor the effect of Arm ASR. 
diff --git a/content/learning-paths/servers-and-cloud-computing/_index.md b/content/learning-paths/servers-and-cloud-computing/_index.md index 260563c025..dcef07a3e8 100644 --- a/content/learning-paths/servers-and-cloud-computing/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/_index.md @@ -8,7 +8,7 @@ key_ip: maintopic: true operatingsystems_filter: - Android: 2 -- Linux: 146 +- Linux: 147 - macOS: 10 - Windows: 14 pinned_modules: @@ -23,7 +23,7 @@ subjects_filter: - Databases: 15 - Libraries: 9 - ML: 27 -- Performance and Architecture: 56 +- Performance and Architecture: 57 - Storage: 1 - Web: 10 subtitle: Optimize cloud native apps on Arm for performance and cost @@ -54,7 +54,7 @@ tools_software_languages_filter: - Bash: 1 - bash: 2 - Bastion: 3 -- BOLT: 1 +- BOLT: 2 - bpftool: 1 - C: 4 - C#: 2 @@ -98,7 +98,7 @@ tools_software_languages_filter: - Hugging Face: 9 - InnoDB: 1 - Intrinsics: 1 -- iperf3: 1 +- iPerf3: 1 - Java: 3 - JAX: 1 - Kafka: 1 @@ -127,7 +127,7 @@ tools_software_languages_filter: - ONNX Runtime: 1 - OpenBLAS: 1 - PAPI: 1 -- perf: 4 +- perf: 5 - Perf: 1 - PostgreSQL: 4 - Python: 27 @@ -136,7 +136,7 @@ tools_software_languages_filter: - Redis: 3 - Remote.It: 2 - RME: 6 -- Runbook: 70 +- Runbook: 71 - Rust: 2 - snappy: 1 - Snort3: 1 diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md new file mode 100644 index 0000000000..e5bbe5ecbb --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -0,0 +1,66 @@ +--- +# User change +title: "Optimize bitmap scanning in databases with SVE and NEON on Arm servers" + +weight: 2 + +layout: "learningpathall" +--- +## Overview + +Bitmap scanning is a core operation in many database systems. It's essential for powering fast filtering in bitmap indexes, Bloom filters, and column filters. However, these scans can become performance bottlenecks in complex analytical queries. + +In this Learning Path, you’ll learn how to accelerate bitmap scanning using Arm’s vector processing technologies - NEON and SVE - on Neoverse V2–based servers like AWS Graviton4. + +Specifically, you will: + +* Explore how to use SVE instructions on Arm Neoverse V2–based servers like AWS Graviton4 to optimize bitmap scanning +* Compare scalar, NEON, and SVE implementations to demonstrate the performance benefits of specialized vector instructions + +## What is bitmap scanning in databases? + +Bitmap scanning involves searching through a bit vector to find positions where bits are set (`1`) or unset (`0`). + +In database systems, bitmaps are commonly used to represent: + +* **Bitmap indexes**: each bit represents whether a row satisfies a particular condition +* **Bloom filters**: probabilistic data structures used to test set membership +* **Column filters**: bit vectors indicating which rows match certain predicates + +The operation of scanning a bitmap to find set bits is often in the critical path of query execution, making it a prime candidate for optimization. 
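To make the operation concrete before any optimization, here is a minimal illustration of what a bitmap scan does: walk a byte array and report the positions whose bits are set. This sketch is for illustration only; the real data structure and the scalar, NEON, and SVE scanners are built step by step in the following sections.

```c
#include <stdint.h>
#include <stdio.h>

// Illustration only: report which row IDs are selected by a small bitmap.
// Bit i set to 1 means "row i matches the predicate".
int main(void) {
    uint8_t bitmap[] = { 0x2D, 0x01 };   // bits 0, 2, 3, 5 and bit 8 are set
    size_t num_bits = 16;

    for (size_t i = 0; i < num_bits; i++) {
        if (bitmap[i / 8] & (1u << (i % 8))) {
            printf("row %zu matches\n", i);
        }
    }
    return 0;
}
```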
+ +## The evolution of vector processing for bitmap scanning + +Here's how vector processing has evolved to improve bitmap scanning performance: + +* **Generic scalar processing**: traditional bit-by-bit processing with conditional branches +* **Optimized scalar processing**: byte-level skipping to avoid processing empty bytes +* **NEON**: fixed-width 128-bit SIMD processing with vector operations +* **SVE**: scalable vector processing with predication and specialized instructions like MATCH + +## Set up your Arm development environment + +To follow this Learning Path, you will need: + +* An AWS Graviton4 instance running `Ubuntu 24.04`. +* A GCC compiler with SVE support + +First, install the required development tools: + +```bash +sudo apt-get update +sudo apt-get install -y build-essential gcc g++ +``` +{{% notice Tip %}} +An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. For best performance, use the latest available GCC version with SVE support. This Learning Path was tested with GCC 13, the default on Ubuntu 24.04. Newer versions should also work. +{{% /notice %}} + + +Create a directory for your implementations: +```bash +mkdir -p bitmap_scan +cd bitmap_scan +``` + +## Next up: build the bitmap scanning foundation +With your development environment set up, you're ready to dive into the core of bitmap scanning. In the next section, you’ll define a minimal bitmap data structure and implement utility functions to set, clear, and inspect individual bits. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md new file mode 100644 index 0000000000..9d3d1d4ed2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md @@ -0,0 +1,88 @@ +--- +# User change +title: "Build and manage a bit vector in C" + +weight: 3 + +layout: "learningpathall" + +--- +## Bitmap data structure + +Now let's define a simple bitmap data structure that serves as the foundation for the different implementations. The bitmap implementation uses a simple structure with three key components: + - A byte array to store the actual bits + - Tracking of the physical size (bytes) + - Tracking of the logical size (bits) + +For testing the different implementations in this Learning Path, you also need functions to generate and analyze the bitmaps. 
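One detail the code snippets in this Learning Path leave implicit is the set of header includes the finished file needs. The list below is an assumption based on the library calls and intrinsics used in the snippets (standard C library plus the NEON and SVE intrinsics headers); place it at the top of the benchmark source file before adding the code that follows.

```c
#include <stdbool.h>   // bool
#include <stdint.h>    // uint8_t, uint32_t
#include <stdio.h>     // printf
#include <stdlib.h>    // malloc, calloc, free, rand, srand
#include <time.h>      // time, clock_gettime, struct timespec
#include <arm_neon.h>  // NEON intrinsics used by the NEON scanner
#include <arm_sve.h>   // SVE intrinsics used by the SVE scanner
```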
+ +Use a file editor of your choice and then copy the code below into `bitvector_scan_benchmark.c`: + +```c +// Define a simple bit vector structure +typedef struct { + uint8_t* data; + size_t size_bytes; + size_t size_bits; +} bitvector_t; + +// Create a new bit vector +bitvector_t* bitvector_create(size_t size_bits) { + bitvector_t* bv = (bitvector_t*)malloc(sizeof(bitvector_t)); + bv->size_bits = size_bits; + bv->size_bytes = (size_bits + 7) / 8; + bv->data = (uint8_t*)calloc(bv->size_bytes, 1); + return bv; +} + +// Free bit vector resources +void bitvector_free(bitvector_t* bv) { + free(bv->data); + free(bv); +} + +// Set a bit in the bit vector +void bitvector_set_bit(bitvector_t* bv, size_t pos) { + if (pos < bv->size_bits) { + bv->data[pos / 8] |= (1 << (pos % 8)); + } +} + +// Get a bit from the bit vector +bool bitvector_get_bit(bitvector_t* bv, size_t pos) { + if (pos < bv->size_bits) { + return (bv->data[pos / 8] & (1 << (pos % 8))) != 0; + } + return false; +} + +// Generate a bit vector with specified density +bitvector_t* generate_bitvector(size_t size_bits, double density) { + bitvector_t* bv = bitvector_create(size_bits); + + // Set bits according to density + size_t num_bits_to_set = (size_t)(size_bits * density); + + for (size_t i = 0; i < num_bits_to_set; i++) { + size_t pos = rand() % size_bits; + bitvector_set_bit(bv, pos); + } + + return bv; +} + +// Count set bits in the bit vector +size_t bitvector_count_scalar(bitvector_t* bv) { + size_t count = 0; + for (size_t i = 0; i < bv->size_bits; i++) { + if (bitvector_get_bit(bv, i)) { + count++; + } + } + return count; +} +``` + +## Next up: implement and benchmark your first scalar bitmap scan + +With your bit vector infrastructure in place, you're now ready to scan it for set bits—the core operation that underpins all bitmap-based filters in database systems. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md new file mode 100644 index 0000000000..320549b0f6 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md @@ -0,0 +1,89 @@ +--- +# User change +title: "Implement scalar bitmap scanning in C" + +weight: 4 + +layout: "learningpathall" + + +--- +## Bitmap scanning implementations + +Bitmap scanning is a fundamental operation in performance-critical systems such as databases, search engines, and filtering pipelines. It involves identifying the positions of set bits (`1`s) in a bit vector, which is often used to represent filtered rows, bitmap indexes, or membership flags. + +In this section, you'll implement multiple scalar approaches to bitmap scanning in C, starting with a simple per-bit baseline, followed by an optimized version that reduces overhead for sparse data. + +Now, let’s walk through the scalar versions of this operation that locate all set bit positions. + +### Generic scalar implementation + +This is the most straightforward implementation, checking each bit individually. It serves as the baseline for comparison against the other implementations to follow. 
+ +Copy the code below into the same file: + +```c +// Generic scalar implementation of bit vector scanning (bit-by-bit) +size_t scan_bitvector_scalar_generic(bitvector_t* bv, uint32_t* result_positions) { + size_t result_count = 0; + + for (size_t i = 0; i < bv->size_bits; i++) { + if (bitvector_get_bit(bv, i)) { + result_positions[result_count++] = i; + } + } + + return result_count; +} +``` + +You might notice that this generic C implementation processes every bit, even when most bits are not set. It has high per-bit function call overhead and does not take advantage of any vector instructions. + +In the following implementations, you can address these inefficiencies with more optimized techniques. + +### Optimized scalar implementation + +This implementation adds byte-level skipping to avoid processing empty bytes. + +Copy this optimized C scalar implementation code into the same file: + +```c +// Optimized scalar implementation of bit vector scanning (byte-level) +size_t scan_bitvector_scalar(bitvector_t* bv, uint32_t* result_positions) { +size_t result_count = 0; + + for (size_t byte_idx = 0; byte_idx < bv->size_bytes; byte_idx++) { + uint8_t byte = bv->data[byte_idx]; + + // Skip empty bytes + if (byte == 0) { + continue; + } + + // Process each bit in the byte + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte & (1 << bit_pos)) { + size_t global_pos = byte_idx * 8 + bit_pos; + if (global_pos < bv->size_bits) { + result_positions[result_count++] = global_pos; + } + } + } + } + + return result_count; +} +``` +Instead of iterating through each bit individually, this implementation processes one byte (8 bits) at a time. The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely. For sparse bitmaps, this can dramatically reduce the number of bit checks. + +## Next up: accelerate bitmap scanning with NEON and SVE + +You’ve now implemented two scalar scanning routines: + +* A generic per-bit loop for correctness and simplicity + +* An optimized scalar version that improves performance using byte-level skipping + +These provide a solid foundation and performance baseline—but scalar methods can only take you so far. To unlock real throughput gains, it’s time to leverage SIMD (Single Instruction, Multiple Data) execution. + +In the next section, you’ll explore how to use Arm NEON and SVE vector instructions to accelerate bitmap scanning. These approaches will process multiple bytes at once and significantly outperform scalar loops—especially on modern Arm-based CPUs like AWS Graviton4. diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md new file mode 100644 index 0000000000..3de8fba739 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md @@ -0,0 +1,164 @@ +--- +# User change +title: "Vectorized bitmap scanning with NEON and SVE" + +weight: 5 + +layout: "learningpathall" + + +--- +Modern Arm CPUs like Neoverse V2 support SIMD (Single Instruction, Multiple Data) extensions that allow processing multiple bytes in parallel. In this section, you'll explore how NEON and SVE vector instructions can dramatically accelerate bitmap scanning by skipping over large regions of unset data and reducing per-bit processing overhead. 
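Before diving into the implementations, you can confirm that your toolchain and target expose SVE. The short standalone program below is a side check, separate from `bitvector_scan_benchmark.c`; it relies only on the standard ACLE feature macro `__ARM_FEATURE_SVE` and the `svcntb()` intrinsic.

```c
#include <stdio.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

int main(void) {
#if defined(__ARM_FEATURE_SVE)
    // svcntb() returns the SVE vector length in bytes.
    printf("SVE enabled, vector length: %lu bytes\n",
           (unsigned long)svcntb());
#else
    printf("SVE is not enabled for this build.\n");
#endif
    return 0;
}
```

Compile it with the same `-march=armv9-a+sve2` flag used later in this Learning Path; on Graviton4 it should report 16 bytes, matching the 128-bit SVE vectors discussed in the SVE implementation notes.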
+ +## NEON implementation + +This implementation uses NEON SIMD (Single Instruction, Multiple Data) instructions to process 16 bytes (128 bits) at a time, significantly accelerating the scanning process. + +Copy the NEON implementation shown below into the same file: + +```c +// NEON implementation of bit vector scanning +size_t scan_bitvector_neon(bitvector_t* bv, uint32_t* result_positions) { + size_t result_count = 0; + + // Process 16 bytes at a time using NEON + size_t i = 0; + for (; i + 16 <= bv->size_bytes; i += 16) { + uint8x16_t data = vld1q_u8(&bv->data[i]); + + // Quick check if all bytes are zero + uint8x16_t zero = vdupq_n_u8(0); + uint8x16_t cmp = vceqq_u8(data, zero); + uint64x2_t cmp64 = vreinterpretq_u64_u8(cmp); + + // If all bytes are zero (all comparisons are true/0xFF), skip this chunk + if (vgetq_lane_u64(cmp64, 0) == UINT64_MAX && + vgetq_lane_u64(cmp64, 1) == UINT64_MAX) { + continue; + } + + // Process each byte + uint8_t bytes[16]; + vst1q_u8(bytes, data); + + for (int j = 0; j < 16; j++) { + uint8_t byte = bytes[j]; + + // Skip empty bytes + if (byte == 0) { + continue; + } + + // Process each bit in the byte + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte & (1 << bit_pos)) { + size_t global_pos = (i + j) * 8 + bit_pos; + if (global_pos < bv->size_bits) { + result_positions[result_count++] = global_pos; + } + } + } + } + } + + // Handle remaining bytes with scalar code + for (; i < bv->size_bytes; i++) { + uint8_t byte = bv->data[i]; + + // Skip empty bytes + if (byte == 0) { + continue; + } + + // Process each bit in the byte + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte & (1 << bit_pos)) { + size_t global_pos = i * 8 + bit_pos; + if (global_pos < bv->size_bits) { + result_positions[result_count++] = global_pos; + } + } + } + } + + return result_count; +} +``` +This NEON implementation processes 16 bytes at a time with vector instructions. For sparse bitmaps, entire 16-byte chunks can be skipped at once, providing a significant speedup over byte-level skipping. After vector processing, it falls back to scalar code for any remaining bytes that don't fill a complete 16-byte chunk. + +## SVE implementation + +This implementation uses SVE instructions which are available in the Arm Neoverse V2 based AWS Graviton 4 processor. + +Copy this SVE implementation into the same file: + +```c +// SVE implementation using svcmp_u8, PNEXT, and LASTB +size_t scan_bitvector_sve2_pnext(bitvector_t* bv, uint32_t* result_positions) { + size_t result_count = 0; + size_t sve_len = svcntb(); + svuint8_t zero = svdup_n_u8(0); + + // Process the bitvector to find all set bits + for (size_t offset = 0; offset < bv->size_bytes; offset += sve_len) { + svbool_t pg = svwhilelt_b8((uint64_t)offset, (uint64_t)bv->size_bytes); + svuint8_t data = svld1_u8(pg, bv->data + offset); + + // Prefetch next chunk + if (offset + sve_len < bv->size_bytes) { + __builtin_prefetch(bv->data + offset + sve_len, 0, 0); + } + + // Find non-zero bytes + svbool_t non_zero = svcmpne_u8(pg, data, zero); + + // Skip if all bytes are zero + if (!svptest_any(pg, non_zero)) { + continue; + } + + // Create an index vector for byte positions + svuint8_t indexes = svindex_u8(0, 1); // 0, 1, 2, 3, ... 
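+
+    // Note on the idiom used below: svpnext_b8(non_zero, next) returns a
+    // predicate with only the next active (non-zero) byte after the last
+    // byte already selected in `next`, and svlastb_u8(next, vec) extracts
+    // the value of that single active lane, so each loop iteration visits
+    // exactly one non-zero byte of the current chunk.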
+ + // Initialize next with false predicate + svbool_t next = svpfalse_b(); + + // Find the first non-zero byte + next = svpnext_b8(non_zero, next); + + // Process each non-zero byte using PNEXT + while (svptest_any(pg, next)) { + // Get the index of this byte + uint8_t byte_idx = svlastb_u8(next, indexes); + + // Get the actual byte value + uint8_t byte_value = svlastb_u8(next, data); + + // Calculate the global byte position + size_t global_byte_pos = offset + byte_idx; + + // Process each bit in the byte using scalar code + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte_value & (1 << bit_pos)) { + size_t global_bit_pos = global_byte_pos * 8 + bit_pos; + if (global_bit_pos < bv->size_bits) { + result_positions[result_count++] = global_bit_pos; + } + } + } + + // Find the next non-zero byte + next = svpnext_b8(non_zero, next); + } + } + + return result_count; +} +``` +The SVE implementation efficiently scans bitmaps by using `svcmpne_u8` to identify non-zero bytes and `svpnext_b8` to iterate through them sequentially. It extracts byte indices and values with `svlastb_u8`, then processes individual bits using scalar code. This hybrid vector-scalar approach maintains great performance across various bitmap densities. On Graviton4, SVE vectors are 128 bits (16 bytes), allowing processing of 16 bytes at once. + +## Next up: apply vectorized scanning to database workloads + +With both NEON and SVE implementations in place, you’ve now unlocked the full power of Arm’s vector processing capabilities for bitmap scanning. These SIMD techniques allow you to process large bitvectors more efficiently—especially when filtering sparse datasets or skipping over large blocks of empty rows. + +In the next section, you’ll learn how to apply these optimizations in the context of real database operations like bitmap index scans, Bloom filter probes, and column filtering. You’ll also explore best practices for selecting the right implementation based on bit density, and tuning for maximum performance on AWS Graviton4. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md new file mode 100644 index 0000000000..0c5edf7bba --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md @@ -0,0 +1,225 @@ +--- +# User change +title: "Benchmarking bitmap scanning across implementations" + +weight: 6 + +layout: "learningpathall" + + +--- +## Benchmarking code + +Now that you've created four different bitmap scanning implementations, let’s build a benchmarking framework to compare their performance. 
+ +Copy the code shown below into `bitvector_scan_benchmark.c` : + +```c +// Timing function for bit vector scanning +double benchmark_scan(size_t (*scan_func)(bitvector_t*, uint32_t*), + bitvector_t* bv, uint32_t* result_positions, + int iterations, size_t* found_count) { + struct timespec start, end; + *found_count = 0; + + clock_gettime(CLOCK_MONOTONIC, &start); + + for (int iter = 0; iter < iterations; iter++) { + size_t count = scan_func(bv, result_positions); + if (iter == 0) { + *found_count = count; + } + } + + clock_gettime(CLOCK_MONOTONIC, &end); + + double elapsed = (end.tv_sec - start.tv_sec) * 1000.0 + + (end.tv_nsec - start.tv_nsec) / 1000000.0; + return elapsed / iterations; +} +``` + +## Main function +The main function of your program is responsible for setting up the test environment, running the benchmarking code for the four different implementations across various bit densities, and reporting the results. In the context of bitmap scanning, bit density refers to the percentage or proportion of bits that are set (have a value of 1) in the bitmap. + +Copy the main function code below into `bitvector_scan_benchmark.c`: + +```C +int main() { + srand(time(NULL)); + + printf("Bit Vector Scanning Performance Benchmark\n"); + printf("========================================\n\n"); + + // Parameters + size_t bitvector_size = 10000000; // 10 million bits + int iterations = 10; // 10 iterations for timing + + // Test different densities + double densities[] = {0.0, 0.0001, 0.001, 0.01, 0.1}; + int num_densities = sizeof(densities) / sizeof(densities[0]); + + printf("Bit vector size: %zu bits\n", bitvector_size); + printf("Iterations: %d\n\n", iterations); + + // Allocate result array + uint32_t* result_positions = (uint32_t*)malloc(bitvector_size * sizeof(uint32_t)); + + printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", + "Density", "Set Bits", "Scalar Gen (ms)", "Scalar Opt (ms)", "NEON (ms)", "SVE (ms)"); + printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", + "-------", "--------", "--------------", "--------------", "--------", "---------------"); + + for (int d = 0; d < num_densities; d++) { + double density = densities[d]; + + // Generate bit vector with specified density + bitvector_t* bv = generate_bitvector(bitvector_size, density); + + // Count actual set bits + size_t actual_set_bits = bitvector_count_scalar(bv); + + // Benchmark implementations + size_t found_scalar_gen, found_scalar, found_neon, found_sve2; + + double scalar_gen_time = benchmark_scan(scan_bitvector_scalar_generic, bv, result_positions, + iterations, &found_scalar_gen); + + double scalar_time = benchmark_scan(scan_bitvector_scalar, bv, result_positions, + iterations, &found_scalar); + + double neon_time = benchmark_scan(scan_bitvector_neon, bv, result_positions, + iterations, &found_neon); + + double sve2_time = benchmark_scan(scan_bitvector_sve2_pnext, bv, result_positions, + iterations, &found_sve2); + + // Print results + printf("%-10.4f %-15zu %-15.3f %-15.3f %-15.3f %-15.3f\n", + density, actual_set_bits, scalar_gen_time, scalar_time, neon_time, sve2_time); + + // Print speedups for this density + printf("Speedups at %.4f density:\n", density); + printf(" Scalar Opt vs Scalar Gen: %.2fx\n", scalar_gen_time / scalar_time); + printf(" NEON vs Scalar Gen: %.2fx\n", scalar_gen_time / neon_time); + printf(" SVE vs Scalar Gen: %.2fx\n", scalar_gen_time / sve2_time); + printf(" NEON vs Scalar Opt: %.2fx\n", scalar_time / neon_time); + printf(" SVE vs Scalar Opt: %.2fx\n", scalar_time / sve2_time); + printf(" 
SVE vs NEON: %.2fx\n\n", neon_time / sve2_time); + + // Verify results match + if (found_scalar_gen != found_scalar || found_scalar_gen != found_neon || found_scalar_gen != found_sve2) { + printf("WARNING: Result mismatch at %.4f density!\n", density); + printf(" Scalar Gen found %zu bits\n", found_scalar_gen); + printf(" Scalar Opt found %zu bits\n", found_scalar); + printf(" NEON found %zu bits\n", found_neon); + printf(" SVE found %zu bits\n\n", found_sve2); + } + + // Clean up + bitvector_free(bv); + } + + free(result_positions); + + return 0; +} +``` + +## Compiling and running + +You are now ready to compile and run your bitmap scanning implementations. + +To compile the bitmap scanning implementations with the appropriate flags, run: + +```bash +gcc -O3 -march=armv9-a+sve2 -o bitvector_scan_benchmark bitvector_scan_benchmark.c -lm +``` + +## Performance results + +When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results should look similar to: + +### Execution time (ms) + +| Density | Set Bits | Scalar Generic | Scalar Optimized | NEON | SVE | +|---------|----------|----------------|------------------|-------|------------| +| 0.0000 | 0 | 7.169 | 0.456 | 0.056 | 0.093 | +| 0.0001 | 1,000 | 7.176 | 0.477 | 0.090 | 0.109 | +| 0.0010 | 9,996 | 7.236 | 0.591 | 0.377 | 0.249 | +| 0.0100 | 99,511 | 7.821 | 1.570 | 2.252 | 1.353 | +| 0.1000 | 951,491 | 12.817 | 8.336 | 9.106 | 6.770 | + +### Speed-up vs generic scalar + +| Density | Scalar Optimized | NEON | SVE | +|---------|------------------|---------|------------| +| 0.0000 | 15.72x | 127.41x | 77.70x | +| 0.0001 | 15.05x | 80.12x | 65.86x | +| 0.0010 | 12.26x | 19.35x | 29.07x | +| 0.0100 | 5.02x | 3.49x | 5.78x | +| 0.1000 | 1.54x | 1.40x | 1.90x | + +## Understanding the results + +The benchmarking results reveal how different bitmap scanning implementations perform across a range of bit densities—from completely empty vectors to those with millions of set bits. Understanding these trends is key to selecting the most effective approach for your specific use case. 
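As a quick sanity check on how the two tables relate: the speed-up columns are simply ratios of the timing rows above. For the 0.0010 density row, for example, the generic scalar scan takes 7.236 ms and the SVE scan 0.249 ms, so the SVE speed-up is 7.236 / 0.249, roughly 29x, in line with the 29.07x entry in the second table.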
+ +### Generic scalar vs optimized scalar + +The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: + +* **Byte-level Skipping**: avoiding processing empty bytes +* **Reduced Function Calls**: accessing bits directly rather than through function calls +* **Better Cache Utilization**: more sequential memory access patterns + +### Optimized scalar vs NEON + +The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: + +* **Chunk-level Skipping**: quickly skipping 16 empty bytes at once +* **Vectorized Comparison**: checking multiple bytes in parallel +* **Early Termination**: quickly determining if a chunk contains any set bits + +### NEON vs SVE + +The performance comparison between NEON and SVE depends on the bit density: + +* **Very Sparse Bit Vectors (0% - 0.01% density)**: + - NEON performs better for empty bitvectors due to lower overhead + - NEON achieves up to 127.41x speedup over generic scalar + - SVE performs better for very sparse bitvectors (0.001% density) + - SVE achieves up to 29.07x speedup over generic scalar at 0.001% density + +* **Higher Density Bit Vectors (0.1% - 10% density)**: + - SVE consistently outperforms NEON + - SVE achieves up to 1.66x speedup over NEON at 0.01% density + +### Key optimizations in SVE implementation + +The SVE implementation includes several key optimizations: + +* **Efficient Non-Zero Byte Detection**: using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. + +* **Byte-Level Processing**: using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. + +* **Value Extraction**: using `svlastb_u8` to extract both the index and value of non-zero bytes. + +* **Hybrid Vector-Scalar Approach**: combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. + +* **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. + +## Next up: apply what you’ve learned to real-world workloads + +Now that you’ve benchmarked all four bitmap scanning implementations—scalar (generic and optimized), NEON, and SVE—you have a data-driven understanding of how vectorization impacts performance across different bitmap densities. + +In the next section, you’ll explore how to apply these techniques in real-world database workloads, including: + +* Bitmap index scans + +* Bloom filter checks + +* Column-level filtering in analytical queries + +You’ll also learn practical guidelines for choosing the right implementation based on bit density, and discover optimization tips that go beyond the code to help you get the most out of Arm-based systems like Graviton4. + + diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md new file mode 100644 index 0000000000..1f5bdc05dd --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md @@ -0,0 +1,50 @@ +--- +# User change +title: "Applications and optimization best practices" + +weight: 7 + +layout: "learningpathall" +--- +## Applications to database systems + +Optimized bitmap scanning can accelerate several core operations in modern database engines, particularly those used for analytical and vectorized workloads. 
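To make the connection to query execution concrete, here is a small, hypothetical sketch of how a scanner from this Learning Path could drive a column-filter step. The `apply_filter` and `fetch_row` names are invented for illustration; only `bitvector_t` and `scan_bitvector_neon` come from the benchmark code.

```c
// Hypothetical column-filter step built on the scanners from this
// Learning Path: the bit vector marks rows that passed a predicate, and
// the scan turns it into an explicit list of row IDs to process.
size_t apply_filter(bitvector_t* matches, uint32_t* row_ids,
                    void (*fetch_row)(uint32_t row_id)) {
    // Any of the four implementations would work here; the NEON scanner
    // is used purely as an example.
    size_t n = scan_bitvector_neon(matches, row_ids);

    for (size_t i = 0; i < n; i++) {
        fetch_row(row_ids[i]);   // only touch the rows that matched
    }
    return n;
}
```

The same shape applies to bitmap index scans and Bloom filter probes: the scan converts a dense bit representation into an explicit candidate list for the next stage of the query.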
+ +### Bitmap index scans +Bitmap indexes are widely used in analytical databases to accelerate queries with multiple filter predicates across large datasets. The NEON and SVE implementations can significantly speed up the scanning of these bitmap indexes, especially for queries with low selectivity. + +### Bloom filter checks + +Bloom filters are probabilistic structures used to test set membership, commonly employed in join filters or subquery elimination. Vectorized scanning via NEON or SVE accelerates these checks by quickly rejecting rows that don’t match, reducing the workload on subsequent stages of the query. + +### Column filtering + +Columnar databases frequently use bitmap filters to track which rows satisfy filter conditions. These bitmaps can be scanned in a vectorized fashion using NEON or SVE instructions, substantially speeding up predicate evaluation and minimizing CPU cycles spent on row selection. + +## Best practices + +Based on the benchmark results, here are some best practices for optimizing bitmap scanning operations: + +* Choose the right implementation based on the expected bit density**: + - For empty bit vectors: NEON is optimal + - For very sparse bit vectors (0.001% - 0.1% set bits): SVE is optimal due to efficient skipping + - For medium to high densities (> 0.1% density): SVE still outperforms NEON + +* Implement Early Termination**: Always include a fast path for the no-hits case, as this can provide dramatic performance improvements. + +* Use Byte-level Skipping**: Even in scalar implementations, skipping empty bytes can provide significant performance improvements. + +* Consider Memory Access Patterns**: Optimize memory access patterns to improve cache utilization. + +* Leverage Vector Instructions**: Use NEON or SVE/SVE2 instructions to process multiple bytes in parallel. + +## Conclusion + +Scalable Vector Extension (SVE) instructions provide a powerful and portable way to accelerate bitmap scanning in modern database systems. When implemented on Arm Neoverse V2–based servers like AWS Graviton4, they deliver substantial performance improvements across a wide range of bit densities. + +The SVE implementation shows particularly impressive performance for sparse bitvectors (0.001% - 0.1% density), where it outperforms both scalar and NEON implementations. For higher densities, it maintains a performance advantage by amortizing scan costs across wider vectors. + +These performance improvements can translate directly to faster query execution times, especially for analytical workloads that involve multiple bitmap operations. + + + diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md index e1f27e22ea..9da8cbe6b4 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md @@ -1,13 +1,9 @@ --- title: Accelerate Bitmap Scanning with NEON and SVE Instructions on Arm servers -draft: true -cascade: - draft: true - minutes_to_complete: 20 -who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances. +who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone interested in optimizing data processing workloads on Arm-based cloud instances. 
learning_objectives: diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md deleted file mode 100644 index 2b7eb102b3..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md +++ /dev/null @@ -1,570 +0,0 @@ ---- -# User change -title: "Compare performance of different Bitmap Scanning implementations" - -weight: 2 - -layout: "learningpathall" - - ---- - -## Introduction - -Bitmap scanning is a fundamental operation in database systems, particularly for analytical workloads. It's used in bitmap indexes, bloom filters, and column filtering operations. The performance of bitmap scanning can significantly impact query execution times, especially for large datasets. - -In this learning path, you will explore how to use SVE instructions available on Arm Neoverse V2 based servers like AWS Graviton4 to optimize bitmap scanning operations. You will compare the performance of scalar, NEON, and SVE implementations to demonstrate the significant performance benefits of using specialized vector instructions. - -## What is Bitmap Scanning? - -Bitmap scanning involves searching through a bit vector to find positions where bits are set (1) or unset (0). In database systems, bitmaps are commonly used to represent: - -1. **Bitmap Indexes**: Where each bit represents whether a row satisfies a particular condition -2. **Bloom Filters**: Probabilistic data structures used to test set membership -3. **Column Filters**: Bit vectors indicating which rows match certain predicates - -The operation of scanning a bitmap to find set bits is often in the critical path of query execution, making it a prime candidate for optimization. - -## The Evolution of Vector Processing for Bitmap Scanning - -Let's look at how vector processing has evolved for bitmap scanning: - -1. **Generic Scalar Processing**: Traditional bit-by-bit processing with conditional branches -2. **Optimized Scalar Processing**: Byte-level skipping to avoid processing empty bytes -3. **NEON**: Fixed-length 128-bit SIMD processing with vector operations -4. **SVE**: Scalable vector processing with predication and specialized instructions - -## Set Up Your Environment - -To follow this learning path, you will need: - -1. An AWS Graviton4 instance running `Ubuntu 24.04`. -2. GCC compiler with SVE support - -Let's start by setting up our environment: - -```bash -sudo apt-get update -sudo apt-get install -y build-essential gcc g++ -``` -An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. This Learning path was tested with GCC 13 which is the default version on `Ubuntu 24.04` but you can run it with newer versions of GCC as well. - -Create a directory for your implementations: -```bash -mkdir -p bitmap_scan -cd bitmap_scan -``` - -## Bitmap Data Structure - -First, let's define a simple bitmap data structure that will serve as the foundation for the different implementations. The bitmap implementation uses a simple structure with three key components: - - A byte array to store the actual bits - - Tracking of the physical size(bytes) - - Tracking of the logical size(bits) - -For testing the different implementations in this Learning Path, you will also need functions to generate and analyze the bitmaps. 
- -Use a file editor of your choice and the copy the code below into `bitvector_scan_benchmark.c`: - -```c -// Define a simple bit vector structure -typedef struct { - uint8_t* data; - size_t size_bytes; - size_t size_bits; -} bitvector_t; - -// Create a new bit vector -bitvector_t* bitvector_create(size_t size_bits) { - bitvector_t* bv = (bitvector_t*)malloc(sizeof(bitvector_t)); - bv->size_bits = size_bits; - bv->size_bytes = (size_bits + 7) / 8; - bv->data = (uint8_t*)calloc(bv->size_bytes, 1); - return bv; -} - -// Free bit vector resources -void bitvector_free(bitvector_t* bv) { - free(bv->data); - free(bv); -} - -// Set a bit in the bit vector -void bitvector_set_bit(bitvector_t* bv, size_t pos) { - if (pos < bv->size_bits) { - bv->data[pos / 8] |= (1 << (pos % 8)); - } -} - -// Get a bit from the bit vector -bool bitvector_get_bit(bitvector_t* bv, size_t pos) { - if (pos < bv->size_bits) { - return (bv->data[pos / 8] & (1 << (pos % 8))) != 0; - } - return false; -} - -// Generate a bit vector with specified density -bitvector_t* generate_bitvector(size_t size_bits, double density) { - bitvector_t* bv = bitvector_create(size_bits); - - // Set bits according to density - size_t num_bits_to_set = (size_t)(size_bits * density); - - for (size_t i = 0; i < num_bits_to_set; i++) { - size_t pos = rand() % size_bits; - bitvector_set_bit(bv, pos); - } - - return bv; -} - -// Count set bits in the bit vector -size_t bitvector_count_scalar(bitvector_t* bv) { - size_t count = 0; - for (size_t i = 0; i < bv->size_bits; i++) { - if (bitvector_get_bit(bv, i)) { - count++; - } - } - return count; -} -``` - -## Bitmap Scanning Implementations - -Now, let's implement four versions of a bitmap scanning operation that finds all positions where a bit is set: - -### 1. Generic Scalar Implementation - -This is the most straightforward implementation, checking each bit individually. It serves as our baseline for comparison against the other implementations to follow. Copy the code below into the same file: - -```c -// Generic scalar implementation of bit vector scanning (bit-by-bit) -size_t scan_bitvector_scalar_generic(bitvector_t* bv, uint32_t* result_positions) { - size_t result_count = 0; - - for (size_t i = 0; i < bv->size_bits; i++) { - if (bitvector_get_bit(bv, i)) { - result_positions[result_count++] = i; - } - } - - return result_count; -} -``` - -You will notice this generic C implementation processes every bit, even when most bits are not set. It has high function call overhead and does not advantage of vector instructions. - -In the following implementations, you will address these inefficiencies with more optimized techniques. - -### 2. Optimized Scalar Implementation - -This implementation adds byte-level skipping to avoid processing empty bytes. 
Copy this optimized C scalar implementation code into the same file: - -```c -// Optimized scalar implementation of bit vector scanning (byte-level) -size_t scan_bitvector_scalar(bitvector_t* bv, uint32_t* result_positions) { -size_t result_count = 0; - - for (size_t byte_idx = 0; byte_idx < bv->size_bytes; byte_idx++) { - uint8_t byte = bv->data[byte_idx]; - - // Skip empty bytes - if (byte == 0) { - continue; - } - - // Process each bit in the byte - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte & (1 << bit_pos)) { - size_t global_pos = byte_idx * 8 + bit_pos; - if (global_pos < bv->size_bits) { - result_positions[result_count++] = global_pos; - } - } - } - } - - return result_count; -} -``` -Instead of iterating through each bit, this implementation processes one byte(8 bits) at a time. The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely, For sparse bitmaps, this can dramatically reduce the number of bit checks. - -### 3. NEON Implementation - -This implementation uses NEON SIMD (Single Instruction, Multiple Data) instructions to process 16 bytes (128 bits) at a time, significantly accelerating the scanning process. Copy the NEON implementation shown below into the same file: -```c -// NEON implementation of bit vector scanning -size_t scan_bitvector_neon(bitvector_t* bv, uint32_t* result_positions) { - size_t result_count = 0; - - // Process 16 bytes at a time using NEON - size_t i = 0; - for (; i + 16 <= bv->size_bytes; i += 16) { - uint8x16_t data = vld1q_u8(&bv->data[i]); - - // Quick check if all bytes are zero - uint8x16_t zero = vdupq_n_u8(0); - uint8x16_t cmp = vceqq_u8(data, zero); - uint64x2_t cmp64 = vreinterpretq_u64_u8(cmp); - - // If all bytes are zero (all comparisons are true/0xFF), skip this chunk - if (vgetq_lane_u64(cmp64, 0) == UINT64_MAX && - vgetq_lane_u64(cmp64, 1) == UINT64_MAX) { - continue; - } - - // Process each byte - uint8_t bytes[16]; - vst1q_u8(bytes, data); - - for (int j = 0; j < 16; j++) { - uint8_t byte = bytes[j]; - - // Skip empty bytes - if (byte == 0) { - continue; - } - - // Process each bit in the byte - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte & (1 << bit_pos)) { - size_t global_pos = (i + j) * 8 + bit_pos; - if (global_pos < bv->size_bits) { - result_positions[result_count++] = global_pos; - } - } - } - } - } - - // Handle remaining bytes with scalar code - for (; i < bv->size_bytes; i++) { - uint8_t byte = bv->data[i]; - - // Skip empty bytes - if (byte == 0) { - continue; - } - - // Process each bit in the byte - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte & (1 << bit_pos)) { - size_t global_pos = i * 8 + bit_pos; - if (global_pos < bv->size_bits) { - result_positions[result_count++] = global_pos; - } - } - } - } - - return result_count; -} -``` -This NEON implementation processes 16 bytes at a time with vector instructions. For sparse bitmaps, entire 16-byte chunks can be skipped at once, providing a significant speedup over byte-level skipping. After vector processing, it falls back to scalar code for any remaining bytes that don't fill a complete 16-byte chunk. - -### 4. SVE Implementation - -This implementation uses SVE instructions which are available in the Arm Neoverse V2 based AWS Graviton 4 processor. 
Copy this SVE implementation into the same file: - -```c -// SVE implementation using svcmp_u8, PNEXT, and LASTB -size_t scan_bitvector_sve2_pnext(bitvector_t* bv, uint32_t* result_positions) { - size_t result_count = 0; - size_t sve_len = svcntb(); - svuint8_t zero = svdup_n_u8(0); - - // Process the bitvector to find all set bits - for (size_t offset = 0; offset < bv->size_bytes; offset += sve_len) { - svbool_t pg = svwhilelt_b8((uint64_t)offset, (uint64_t)bv->size_bytes); - svuint8_t data = svld1_u8(pg, bv->data + offset); - - // Prefetch next chunk - if (offset + sve_len < bv->size_bytes) { - __builtin_prefetch(bv->data + offset + sve_len, 0, 0); - } - - // Find non-zero bytes - svbool_t non_zero = svcmpne_u8(pg, data, zero); - - // Skip if all bytes are zero - if (!svptest_any(pg, non_zero)) { - continue; - } - - // Create an index vector for byte positions - svuint8_t indexes = svindex_u8(0, 1); // 0, 1, 2, 3, ... - - // Initialize next with false predicate - svbool_t next = svpfalse_b(); - - // Find the first non-zero byte - next = svpnext_b8(non_zero, next); - - // Process each non-zero byte using PNEXT - while (svptest_any(pg, next)) { - // Get the index of this byte - uint8_t byte_idx = svlastb_u8(next, indexes); - - // Get the actual byte value - uint8_t byte_value = svlastb_u8(next, data); - - // Calculate the global byte position - size_t global_byte_pos = offset + byte_idx; - - // Process each bit in the byte using scalar code - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte_value & (1 << bit_pos)) { - size_t global_bit_pos = global_byte_pos * 8 + bit_pos; - if (global_bit_pos < bv->size_bits) { - result_positions[result_count++] = global_bit_pos; - } - } - } - - // Find the next non-zero byte - next = svpnext_b8(non_zero, next); - } - } - - return result_count; -} -``` -The SVE implementation efficiently scans bitmaps by using `svcmpne_u8` to identify non-zero bytes and `svpnext_b8` to iterate through them sequentially. It extracts byte indices and values with `svlastb_u8`, then processes individual bits using scalar code. This hybrid vector-scalar approach maintains great performance across various bitmap densities. On Graviton4, SVE vectors are 128 bits (16 bytes), allowing processing of 16 bytes at once. - -## Benchmarking Code - -Now, that you have created four different implementations of a bitmap scanning algorithm, let's create a benchmarking framework to compare the performance of our implementations. Copy the code shown below into `bitvector_scan_benchmark.c` : - -```c -// Timing function for bit vector scanning -double benchmark_scan(size_t (*scan_func)(bitvector_t*, uint32_t*), - bitvector_t* bv, uint32_t* result_positions, - int iterations, size_t* found_count) { - struct timespec start, end; - *found_count = 0; - - clock_gettime(CLOCK_MONOTONIC, &start); - - for (int iter = 0; iter < iterations; iter++) { - size_t count = scan_func(bv, result_positions); - if (iter == 0) { - *found_count = count; - } - } - - clock_gettime(CLOCK_MONOTONIC, &end); - - double elapsed = (end.tv_sec - start.tv_sec) * 1000.0 + - (end.tv_nsec - start.tv_nsec) / 1000000.0; - return elapsed / iterations; -} -``` - -## Main Function -The main function of your program is responsible for setting up the test environment, running the benchmarking code for the four different implementations across various bit densities, and reporting the results. 
In the context of bitmap scanning, bit density refers to the percentage or proportion of bits that are set (have a value of 1) in the bitmap. Copy the main function code below into `bitvector_scan_benchmark.c`: - -```C -int main() { - srand(time(NULL)); - - printf("Bit Vector Scanning Performance Benchmark\n"); - printf("========================================\n\n"); - - // Parameters - size_t bitvector_size = 10000000; // 10 million bits - int iterations = 10; // 10 iterations for timing - - // Test different densities - double densities[] = {0.0, 0.0001, 0.001, 0.01, 0.1}; - int num_densities = sizeof(densities) / sizeof(densities[0]); - - printf("Bit vector size: %zu bits\n", bitvector_size); - printf("Iterations: %d\n\n", iterations); - - // Allocate result array - uint32_t* result_positions = (uint32_t*)malloc(bitvector_size * sizeof(uint32_t)); - - printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", - "Density", "Set Bits", "Scalar Gen (ms)", "Scalar Opt (ms)", "NEON (ms)", "SVE (ms)"); - printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", - "-------", "--------", "--------------", "--------------", "--------", "---------------"); - - for (int d = 0; d < num_densities; d++) { - double density = densities[d]; - - // Generate bit vector with specified density - bitvector_t* bv = generate_bitvector(bitvector_size, density); - - // Count actual set bits - size_t actual_set_bits = bitvector_count_scalar(bv); - - // Benchmark implementations - size_t found_scalar_gen, found_scalar, found_neon, found_sve2; - - double scalar_gen_time = benchmark_scan(scan_bitvector_scalar_generic, bv, result_positions, - iterations, &found_scalar_gen); - - double scalar_time = benchmark_scan(scan_bitvector_scalar, bv, result_positions, - iterations, &found_scalar); - - double neon_time = benchmark_scan(scan_bitvector_neon, bv, result_positions, - iterations, &found_neon); - - double sve2_time = benchmark_scan(scan_bitvector_sve2_pnext, bv, result_positions, - iterations, &found_sve2); - - // Print results - printf("%-10.4f %-15zu %-15.3f %-15.3f %-15.3f %-15.3f\n", - density, actual_set_bits, scalar_gen_time, scalar_time, neon_time, sve2_time); - - // Print speedups for this density - printf("Speedups at %.4f density:\n", density); - printf(" Scalar Opt vs Scalar Gen: %.2fx\n", scalar_gen_time / scalar_time); - printf(" NEON vs Scalar Gen: %.2fx\n", scalar_gen_time / neon_time); - printf(" SVE vs Scalar Gen: %.2fx\n", scalar_gen_time / sve2_time); - printf(" NEON vs Scalar Opt: %.2fx\n", scalar_time / neon_time); - printf(" SVE vs Scalar Opt: %.2fx\n", scalar_time / sve2_time); - printf(" SVE vs NEON: %.2fx\n\n", neon_time / sve2_time); - - // Verify results match - if (found_scalar_gen != found_scalar || found_scalar_gen != found_neon || found_scalar_gen != found_sve2) { - printf("WARNING: Result mismatch at %.4f density!\n", density); - printf(" Scalar Gen found %zu bits\n", found_scalar_gen); - printf(" Scalar Opt found %zu bits\n", found_scalar); - printf(" NEON found %zu bits\n", found_neon); - printf(" SVE found %zu bits\n\n", found_sve2); - } - - // Clean up - bitvector_free(bv); - } - - free(result_positions); - - return 0; -} -``` - -## Compiling and Running - -You are now ready to compile and run your bitmap scanning implementations. 
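Before compiling, make sure the top of `bitvector_scan_benchmark.c` contains the headers that the code above relies on. The snippet below is a minimal set based on the functions used in this Learning Path (standard C library calls, NEON intrinsics, and SVE intrinsics); add it once at the very top of the file if you have not already done so:

```c
#include <stdio.h>      // printf
#include <stdlib.h>     // malloc, calloc, free, rand, srand
#include <stdint.h>     // uint8_t, uint32_t, uint64_t, UINT64_MAX
#include <stdbool.h>    // bool
#include <time.h>       // clock_gettime, struct timespec, time
#include <arm_neon.h>   // NEON intrinsics: vld1q_u8, vceqq_u8, vst1q_u8, ...
#include <arm_sve.h>    // SVE intrinsics: svld1_u8, svcmpne_u8, svpnext_b8, svlastb_u8, ...
```
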
- -To compile our bitmap scanning implementations with the appropriate flags, run: - -```bash -gcc -O3 -march=armv9-a+sve2 -o bitvector_scan_benchmark bitvector_scan_benchmark.c -lm -``` - -## Performance Results - -When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results should look similar to: - -### Execution Time (ms) - -| Density | Set Bits | Scalar Generic | Scalar Optimized | NEON | SVE | -|---------|----------|----------------|------------------|-------|------------| -| 0.0000 | 0 | 7.169 | 0.456 | 0.056 | 0.093 | -| 0.0001 | 1,000 | 7.176 | 0.477 | 0.090 | 0.109 | -| 0.0010 | 9,996 | 7.236 | 0.591 | 0.377 | 0.249 | -| 0.0100 | 99,511 | 7.821 | 1.570 | 2.252 | 1.353 | -| 0.1000 | 951,491 | 12.817 | 8.336 | 9.106 | 6.770 | - -### Speedup vs Generic Scalar - -| Density | Scalar Optimized | NEON | SVE | -|---------|------------------|---------|------------| -| 0.0000 | 15.72x | 127.41x | 77.70x | -| 0.0001 | 15.05x | 80.12x | 65.86x | -| 0.0010 | 12.26x | 19.35x | 29.07x | -| 0.0100 | 5.02x | 3.49x | 5.78x | -| 0.1000 | 1.54x | 1.40x | 1.90x | - -## Understanding the Performance Results - -### Generic Scalar vs Optimized Scalar - -The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: - -1. **Byte-level Skipping**: Avoiding processing empty bytes -2. **Reduced Function Calls**: Accessing bits directly rather than through function calls -3. **Better Cache Utilization**: More sequential memory access patterns - -### Optimized Scalar vs NEON - -The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: - -1. **Chunk-level Skipping**: Quickly skipping 16 empty bytes at once -2. **Vectorized Comparison**: Checking multiple bytes in parallel -3. **Early Termination**: Quickly determining if a chunk contains any set bits - -### NEON vs SVE - -The performance comparison between NEON and SVE depends on the bit density: - -1. **Very Sparse Bit Vectors (0% - 0.01% density)**: - - NEON performs better for empty bitvectors due to lower overhead - - NEON achieves up to 127.41x speedup over generic scalar - - SVE performs better for very sparse bitvectors (0.001% density) - - SVE achieves up to 29.07x speedup over generic scalar at 0.001% density - -2. **Higher Density Bit Vectors (0.1% - 10% density)**: - - SVE consistently outperforms NEON - - SVE achieves up to 1.66x speedup over NEON at 0.01% density - -# Key Optimizations in SVE Implementation - -The SVE implementation includes several key optimizations: - -1. **Efficient Non-Zero Byte Detection**: Using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. - -2. **Byte-Level Processing**: Using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. - -3. **Value Extraction**: Using `svlastb_u8` to extract both the index and value of non-zero bytes. - -4. **Hybrid Vector-Scalar Approach**: Combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. - -5. **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. - - -## Application to Database Systems - -These bitmap scanning optimizations can be applied to various database operations: - -### 1. Bitmap Index Scans - -Bitmap indexes are commonly used in analytical databases to accelerate queries with multiple filter conditions. 
The NEON and SVE implementations can significantly speed up the scanning of these bitmap indexes, especially for queries with low selectivity. - -### 2. Bloom Filter Checks - -Bloom filters are probabilistic data structures used to test set membership. They are often used in database systems to quickly filter out rows that don't match certain conditions. The NEON and SVE implementations can accelerate these bloom filter checks. - -### 3. Column Filtering - -In column-oriented databases, bitmap filters are often used to represent which rows match certain predicates. The NEON and SVE implementation can speed up the scanning of these bitmap filters, improving query performance. - -## Best Practices - -Based on our benchmark results, here are some best practices for optimizing bitmap scanning operations: - -1. **Choose the Right Implementation**: Select the appropriate implementation based on the expected bit density: - - For empty bit vectors: NEON is optimal - - For very sparse bit vectors (0.001% - 0.1% density): SVE is optimal - - For higher densities (> 0.1% density): SVE still outperforms NEON - -2. **Implement Early Termination**: Always include a fast path for the no-hits case, as this can provide dramatic performance improvements. - -3. **Use Byte-level Skipping**: Even in scalar implementations, skipping empty bytes can provide significant performance improvements. - -4. **Consider Memory Access Patterns**: Optimize memory access patterns to improve cache utilization. - -5. **Leverage Vector Instructions**: Use NEON or SVE/SVE2 instructions to process multiple bytes in parallel. - -## Conclusion - -The SVE instructions provides a powerful way to accelerate bitmap scanning operations in database systems. By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your database workloads. - -The SVE implementation shows particularly impressive performance for sparse bitvectors (0.001% - 0.1% density), where it outperforms both scalar and NEON implementations. For higher densities, it continues to provide substantial speedups over traditional approaches. - -These performance improvements can translate directly to faster query execution times, especially for analytical workloads that involve multiple bitmap operations. diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/_index.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_index.md new file mode 100644 index 0000000000..616f37d088 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_index.md @@ -0,0 +1,55 @@ +--- +title: Optimizing Arm binaries and libraries with LLVM-BOLT and profile merging + +draft: true +cascade: + draft: true + +minutes_to_complete: 30 + +who_is_this_for: Performance engineers, software developers working on Arm platforms who want to optimize both application binaries and shared libraries using LLVM-BOLT. + +learning_objectives: + - Instrument and optimize binaries for individual workload features using LLVM-BOLT. + - Collect separate BOLT profiles and merge them for comprehensive code coverage. + - Optimize shared libraries independently. + - Integrate optimized shared libraries into applications. + - Evaluate and compare application and library performance across baseline, isolated, and merged optimization scenarios. + +prerequisites: + - An Arm based system running Linux with BOLT and Linux Perf installed. The Linux kernel should be version 5.15 or later. 
+ - (Optional) A second, more powerful Linux system to build the software executable and run BOLT. + +author: Gayathri Narayana Yegna Narayanan + +### Tags +skilllevels: Introductory +subjects: Performance and Architecture +armips: + - Neoverse + - Cortex-A +tools_software_languages: + - BOLT + - perf + - Runbook +operatingsystems: + - Linux + +further_reading: + - resource: + title: BOLT README + link: https://github.com/llvm/llvm-project/tree/main/bolt + type: documentation + - resource: + title: BOLT - A Practical Binary Optimizer for Data Centers and Beyond + link: https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/ + type: website + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- + diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/example-picture.png b/content/learning-paths/servers-and-cloud-computing/bolt-merge/example-picture.png new file mode 100644 index 0000000000..c69844bed4 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/bolt-merge/example-picture.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-1.md new file mode 100644 index 0000000000..1d80f6e6e7 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-1.md @@ -0,0 +1,27 @@ +--- +title: Overview of BOLT Merge +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +[BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/README.md) is a post-link binary optimizer that uses Linux Perf data to re-order the executable code layout to reduce memory overhead and improve performance. 
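Before you start, you can optionally confirm that the BOLT tools from your LLVM build are on your `PATH`. The exact install location depends on how you built or installed BOLT, so adjust as needed:

```bash
llvm-bolt --version
merge-fdata --help
```

Both commands should print LLVM tool output rather than a "command not found" error.
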
+ +In this Learning Path, you'll learn how to: +- Collect and merge BOLT profiles from multiple workload features (e.g., read-only and write-only) +- Independently optimize application binaries and external user-space libraries (e.g., `libssl.so`, `libcrypto.so`) +- Link the final optimized binary with the separately bolted libraries to deploy a fully optimized runtime stack + +While MySQL and sysbench are used as examples, this method applies to **any feature-rich application** that: +- Exhibits multiple runtime paths +- Uses dynamic libraries +- Requires full-stack binary optimization for performance-critical deployment + +The workflow includes: +1. Profiling each workload feature separately +2. Profiling external libraries independently +3. Merging profiles for broader code coverage +4. Applying BOLT to each binary and library +5. Linking bolted libraries with the merged-profile binary + diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md new file mode 100644 index 0000000000..c67ed17850 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md @@ -0,0 +1,89 @@ +--- +title: BOLT Optimization - First feature +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +In this step, you will instrument an application binary (such as `mysqld`) with BOLT to collect runtime profile data for a specific feature — for example, a **read-only workload**. + +The collected profile will later be merged with others and used to optimize the application's code layout. + +### Step 1: Build or obtain the uninstrumented binary + +Make sure your application binary is: + +- Built from source (e.g., `mysqld`) +- Unstripped, with symbol information available +- Compiled with frame pointers enabled (`-fno-omit-frame-pointer`) + +You can verify this with: + +```bash +readelf -s /path/to/mysqld | grep main +``` + +If the symbols are missing, rebuild the binary with debug info and no stripping. + +--- + +### Step 2: Instrument the binary with BOLT + +Use `llvm-bolt` to create an instrumented version of the binary: + +```bash +llvm-bolt /path/to/mysqld \\ + -instrument \\ + -o /path/to/mysqld.instrumented \\ + --instrumentation-file=/path/to/profile-readonly.fdata \\ + --instrumentation-sleep-time=5 \\ + --instrumentation-no-counters-clear \\ + --instrumentation-wait-forks +``` + +### Explanation of key options + +- `-instrument`: Enables profile generation instrumentation +- `--instrumentation-file`: Path where the profile output will be saved +- `--instrumentation-wait-forks`: Ensures the instrumentation continues through forks (important for daemon processes) + +--- + +### Step 3: Run the instrumented binary under a feature-specific workload + +Use a workload generator to stress the binary in a feature-specific way. For example, to simulate **read-only traffic** with sysbench: + +```bash +taskset -c 9 ./src/sysbench \\ + --db-driver=mysql \\ + --mysql-host=127.0.0.1 \\ + --mysql-db=bench \\ + --mysql-user=bench \\ + --mysql-password=bench \\ + --mysql-port=3306 \\ + --tables=8 \\ + --table-size=10000 \\ + --threads=1 \\ + src/lua/oltp_read_only.lua run +``` + +> Adjust this command as needed for your workload and CPU/core binding. + +The `.fdata` file defined in `--instrumentation-file` will be populated with runtime execution data. 
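Keep in mind that sysbench only generates the load; the profile is written by the instrumented server process itself, so the `mysqld` you start before running the workload must be the instrumented binary. The commands below are a minimal sketch of starting it for a local source build — the data directory, socket, and port shown are assumptions, so substitute the values from your own MySQL setup (and stop any other running `mysqld` first):

```bash
# Placeholder paths - replace datadir, socket, and port with your own values
/path/to/mysqld.instrumented \
  --datadir=/path/to/datadir \
  --socket=/tmp/mysql.sock \
  --port=3306 &
```
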
+ +--- + +### Step 4: Verify the profile was created + +After running the workload: + +```bash +ls -lh /path/to/profile-readonly.fdata +``` + +You should see a non-empty file. This file will later be merged with other profiles (e.g., for write-only traffic) to generate a complete merged profile. + +--- + + diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md new file mode 100644 index 0000000000..f1ea41f09c --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md @@ -0,0 +1,100 @@ +--- +title: BOLT Optimization - Second Feature & BOLT Merge to combine +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +In this step, you'll collect profile data for a **write-heavy** workload and also **instrument external libraries** such as `libcrypto.so` and `libssl.so` used by the application (e.g., MySQL). + + +### Step 1: Run Write-Only Workload for Application Binary + +Use the same BOLT-instrumented MySQL binary and drive it with a write-only workload to capture `profile-writeonly.fdata`: + +```bash +taskset -c 9 ./src/sysbench \\ + --db-driver=mysql \\ + --mysql-host=127.0.0.1 \\ + --mysql-db=bench \\ + --mysql-user=bench \\ + --mysql-password=bench \\ + --mysql-port=3306 \\ + --tables=8 \\ + --table-size=10000 \\ + --threads=1 \\ + src/lua/oltp_write_only.lua run +``` + +Make sure that the `--instrumentation-file` is set appropriately to save `profile-writeonly.fdata`. +--- +### Step 2: Verify the Second Profile Was Generated + +```bash +ls -lh /path/to/profile-writeonly.fdata +``` + +Both `.fdata` files should now exist and contain valid data: + +- `profile-readonly.fdata` +- `profile-writeonly.fdata` + +--- + +### Step 3: Merge the Feature Profiles + +Use `merge-fdata` to combine the feature-specific profiles into one comprehensive `.fdata` file: + +```bash +merge-fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata \\ + -o /path/to/profile-merged.fdata +``` + +**Example command from an actual setup:** + +```bash +/home/ubuntu/llvm-latest/build/bin/merge-fdata prof-instrumentation-readonly.fdata prof-instrumentation-writeonly.fdata \\ + -o prof-instrumentation-readwritemerged.fdata +``` + +Output: + +``` +Using legacy profile format. +Profile from 2 files merged. +``` + +This creates a single merged profile (`profile-merged.fdata`) covering both read-only and write-only workload behaviors. + +--- + +### Step 4: Verify the Merged Profile + +Check the merged `.fdata` file: + +```bash +ls -lh /path/to/profile-merged.fdata +``` + +--- +### Step 5: Generate the Final Binary with the Merged Profile + +Use LLVM-BOLT to generate the final optimized binary using the merged `.fdata` file: + +```bash +llvm-bolt build/bin/mysqld \\ + -o build/bin/mysqldreadwrite_merged.bolt_instrumentation \\ + -data=/home/ubuntu/mysql-server-8.0.33/sysbench/prof-instrumentation-readwritemerged.fdata \\ + -reorder-blocks=ext-tsp \\ + -reorder-functions=hfsort \\ + -split-functions \\ + -split-all-cold \\ + -split-eh \\ + -dyno-stats \\ + --print-profile-stats 2>&1 | tee bolt_orig.log +``` + +This command optimizes the binary layout based on the merged workload profile, creating a single binary (`mysqldreadwrite_merged.bolt_instrumentation`) that is optimized across both features. 
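If you later profile additional features, you do not need to merge profiles pairwise: `merge-fdata` accepts any number of input files. The sketch below assumes a hypothetical third profile named `prof-instrumentation-updateindex.fdata` alongside the two profiles collected above:

```bash
merge-fdata prof-instrumentation-readonly.fdata \
            prof-instrumentation-writeonly.fdata \
            prof-instrumentation-updateindex.fdata \
            -o prof-instrumentation-all-features.fdata
```

You can then rerun the `llvm-bolt` command above with `-data=` pointing at the new merged file.
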
+ + diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md new file mode 100644 index 0000000000..376c249164 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md @@ -0,0 +1,154 @@ +--- +title: BOLT the Libraries separately +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- +### Step 1: Instrument Shared Libraries (e.g., libcrypto, libssl) + +If system libraries like `/usr/lib/libssl.so` are stripped, rebuild OpenSSL from source with relocations: + +```bash +git clone https://github.com/openssl/openssl.git +cd openssl +./config -O2 -Wl,--emit-relocs --prefix=$HOME/bolt-libs/openssl +make -j$(nproc) +make install +``` + +--- + +### Step 2: BOLT-Instrument libssl.so.3 + +Use `llvm-bolt` to instrument `libssl.so.3`: + +```bash +llvm-bolt $HOME/bolt-libs/openssl/lib/libssl.so.3 \\ + -instrument \\ + -o $HOME/bolt-libs/openssl/lib/libssl.so.3.instrumented \\ + --instrumentation-file=libssl-readwrite.fdata \\ + --instrumentation-sleep-time=5 \\ + --instrumentation-no-counters-clear \\ + --instrumentation-wait-forks +``` + +Then launch MySQL using the **instrumented shared library** and run a **read+write** sysbench test to populate the profile: + +--- + +### Step 3: Optimize 'libssl.so' Using Its Profile + +After running the read+write test, ensure `libssl-readwrite.fdata` is populated. + + +Run BOLT on the uninstrumented `libssl.so` with the collected read-write profile: + +```bash +llvm-bolt /path/to/libssl.so.3 \\ + -o /path/to/libssl.so.optimized \\ + -data=/path/to/prof-instrumentation-libssl-readwrite.fdata \\ + -reorder-blocks=ext-tsp \\ + -reorder-functions=hfsort \\ + -split-functions \\ + -split-all-cold \\ + -split-eh \\ + -dyno-stats \\ + --print-profile-stats +``` + +--- + +### Step 3: Replace the Library at Runtime + +Copy the optimized version over the original and export the path: + +```bash +cp /path/to/libssl.so.optimized /path/to/libssl.so.3 +export LD_LIBRARY_PATH=/path/to/ +``` + +This ensures MySQL will dynamically load the optimized `libssl.so`. + +--- + +### Step 4: Run Final Workload and Validate Performance + +Start the BOLT-optimized MySQL binary and link it against the optimized `libssl.so`. Run the combined workload: + +```bash +taskset -c 9 ./src/sysbench \\ + --db-driver=mysql \\ + --mysql-host=127.0.0.1 \\ + --mysql-db=bench \\ + --mysql-user=bench \\ + --mysql-password=bench \\ + --mysql-port=3306 \\ + --tables=8 \\ + --table-size=10000 \\ + --threads=1 \\ + src/lua/oltp_read_write.lua run +``` + +--- + +In the next step, you'll optimize an additional critical external library (`libcrypto.so`) using BOLT, following a similar process as `libssl.so`. Afterward, you'll interpret performance results to validate and compare optimizations across baseline and merged + scenarios. 
+ +### Step 1: BOLT optimization for 'libcrypto.so' + +Follow these steps to instrument and optimize `libcrypto.so`: + +#### Instrument `libcrypto.so`: + +```bash +llvm-bolt /path/to/libcrypto.so.3 \\ + -instrument \\ + -o /path/to/libcrypto.so.3.instrumented \\ + --instrumentation-file=libcrypto-readwrite.fdata \\ + --instrumentation-sleep-time=5 \\ + --instrumentation-no-counters-clear \\ + --instrumentation-wait-forks +``` + +Run MySQL under the read-write workload to populate `libcrypto-readwrite.fdata`: + +```bash +export LD_LIBRARY_PATH=/path/to/libcrypto-instrumented +taskset -c 9 ./src/sysbench \\ + --db-driver=mysql \\ + --mysql-host=127.0.0.1 \\ + --mysql-db=bench \\ + --mysql-user=bench \\ + --mysql-password=bench \\ + --mysql-port=3306 \\ + --tables=8 \\ + --table-size=10000 \\ + --threads=1 \\ + src/lua/oltp_read_write.lua run +``` + +#### Optimize the `libcrypto.so` library: + +```bash +llvm-bolt /path/to/original/libcrypto.so.3 \\ + -o /path/to/libcrypto.so.optimized \\ + -data=libcrypto-readwrite.fdata \\ + -reorder-blocks=ext-tsp \\ + -reorder-functions=hfsort \\ + -split-functions \\ + -split-all-cold \\ + -split-eh \\ + -dyno-stats \\ + --print-profile-stats +``` + +Replace the original at runtime: + +```bash +cp /path/to/libcrypto.so.optimized /path/to/libcrypto.so.3 +export LD_LIBRARY_PATH=/path/to/ +``` + +Run a final validation workload to ensure functionality and measure performance improvements. + diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md new file mode 100644 index 0000000000..07cd298c5f --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md @@ -0,0 +1,68 @@ +--- +title: Performance Results - Baseline, BOLT Merge, and Full Optimization +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +This step presents the performance comparisons across various BOLT optimization scenarios. You'll see how baseline performance compares with BOLT-optimized binaries using merged profiles and bolted external libraries. + +### 1. Baseline Performance (No BOLT) + +| Metric | Read-Only (Baseline) | Write-Only (Baseline) | Read+Write (Baseline) | +|---------------------------|----------------------|------------------------|------------------------| +| Transactions/sec (TPS) | 1006.33 | 2113.03 | 649.15 | +| Queries/sec (QPS) | 16,101.24 | 12,678.18 | 12,983.09 | +| Latency avg (ms) | 0.99 | 0.47 | 1.54 | +| Latency 95th % (ms) | 1.04 | 0.83 | 1.79 | +| Total time (s) | 9.93 | 4.73 | 15.40 | + +--- + +### 2. 
Performance Comparison: Merged vs Non-Merged Instrumentation + +| Metric | Regular BOLT R+W (No Merge, system libssl) | Merged BOLT (BOLTed Read+Write + BOLTed libssl) | +|---------------------------|---------------------------------------------|-------------------------------------------------| +| Transactions/sec (TPS) | 850.32 | 879.18 | +| Queries/sec (QPS) | 17,006.35 | 17,583.60 | +| Latency avg (ms) | 1.18 | 1.14 | +| Latency 95th % (ms) | 1.52 | 1.39 | +| Total time (s) | 11.76 | 11.37 | + +Second run: + +| Metric | Regular BOLT R+W (No Merge, system libssl) | Merged BOLT (BOLTed Read+Write + BOLTed libssl) | +|---------------------------|---------------------------------------------|-------------------------------------------------| +| Transactions/sec (TPS) | 853.16 | 887.14 | +| Queries/sec (QPS) | 17,063.22 | 17,742.89 | +| Latency avg (ms) | 1.17 | 1.13 | +| Latency 95th % (ms) | 1.39 | 1.37 | +| Total time (s) | 239.9 | 239.9 | + +--- + +### 3. BOLTed READ, BOLTed WRITE, MERGED BOLT (Read+Write+BOLTed Libraries) + +| Metric | Bolted Read-Only | Bolted Write-Only | Merged BOLT (Read+Write+libssl) | Merged BOLT (Read+Write+libcrypto) | Merged BOLT (Read+Write+libssl+libcrypto) | +|---------------------------|---------------------|-------------------|----------------------------------|------------------------------------|-------------------------------------------| +| Transactions/sec (TPS) | 1348.47 | 3170.92 | 887.14 | 896.58 | 902.98 | +| Queries/sec (QPS) | 21575.45 | 19025.52 | 17742.89 | 17931.57 | 18059.52 | +| Latency avg (ms) | 0.74 | 0.32 | 1.13 | 1.11 | 1.11 | +| Latency 95th % (ms) | 0.77 | 0.55 | 1.37 | 1.34 | 1.34 | +| Total time (s) | 239.8 | 239.72 | 239.9 | 239.9 | 239.9 | + +--- + +### Key Metrics to Analyze + +- **TPS (Transactions Per Second)**: Higher is better. +- **QPS (Queries Per Second)**: Higher is better. +- **Latency (Average and 95th Percentile)**: Lower is better. + +--- + +### Conclusion +- BOLT substantially improves performance over non-optimized binaries due to better instruction cache utilization and reduced execution path latency. +- Merging feature-specific profiles does not negatively affect performance; instead, it captures a broader set of runtime behaviors, making the binary better tuned for varied real-world workloads. +- Separately optimizing external user-space libraries, even though providing smaller incremental gains, further complements the overall application optimization, delivering a fully optimized execution environment. diff --git a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/_index.md b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/_index.md index 38398a2904..933c5aea16 100644 --- a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/_index.md @@ -1,20 +1,17 @@ --- -title: Get started with network microbenchmarking and tuning with iperf3 - -draft: true -cascade: - draft: true +title: Microbenchmark and tune network performance with iPerf3 and Linux traffic control minutes_to_complete: 30 -who_is_this_for: This is an introductory topic for performance engineers, Linux system administrators, or application developers who want to microbenchmark, simulate, or tune the networking performance of distributed systems. 
+who_is_this_for: This is an introductory topic for performance engineers, Linux system administrators, and application developers who want to microbenchmark, simulate, or tune the networking performance of distributed systems. learning_objectives: - - Understand how to use iperf3 and tc for network performance testing and traffic control to microbenchmark different network conditions. - - Identify and apply basic runtime parameters to tune application performance. + - Run accurate network microbenchmark tests using iPerf3. + - Simulate real-world network conditions using Linux Traffic Control (tc). + - Tune basic Linux kernel parameters to improve network performance. prerequisites: - - Foundational understanding of networking principles such as TCP/IP and UDP. + - Basic understanding of networking principles such as Transmission Control Protocol/Internet Protocol (TCP/IP) and User Datagram Protocol (UDP). - Access to two [Arm-based cloud instances](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/). author: Kieran Hejmadi @@ -25,13 +22,13 @@ subjects: Performance and Architecture armips: - Neoverse tools_software_languages: - - iperf3 + - iPerf3 operatingsystems: - Linux further_reading: - resource: - title: iperf3 user manual + title: iPerf3 user manual link: https://iperf.fr/iperf-doc.php type: documentation diff --git a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/basic-microbenchmarking.md b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/basic-microbenchmarking.md index 6cf4f0aeef..56ecf0e796 100644 --- a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/basic-microbenchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/basic-microbenchmarking.md @@ -6,17 +6,19 @@ weight: 3 layout: learningpathall --- -## Microbenchmark the TCP connection +With your systems configured and reachable, you can now use iPerf3 to microbenchmark TCP and UDP performance between your Arm-based systems. -You can microbenchmark the bandwidth between the client and server. +## Microbenchmark the TCP connection -First, start `iperf` in server mode on the server system with the following command: +Start by running `iperf` in server mode on the `SERVER` system: ```bash iperf3 -s ``` -You see the output, indicating the server is ready: +This starts the server on the default TCP port 5201. + +You should see: ```output ----------------------------------------------------------- @@ -25,20 +27,23 @@ Server listening on 5201 (test #1) ``` -The default server port is 5201. Use the `-p` flag to specify another port if it is in use. +The default server port is 5201. If it is already in use, use the `-p` flag to specify another. {{% notice Tip %}} -If you already have an `iperf3` server running, you can manually kill the process with the following command. +If you already have an `iperf3` server running, terminate it with: ```bash sudo kill $(pgrep iperf3) ``` {{% /notice %}} -Next, on the client node, run the following command to run a simple 10-second microbenchmark using the TCP protocol. +## Run a TCP test from the client + +On the client node, run the following command to run a simple 10-second microbenchmark using the TCP protocol: ```bash -iperf3 -c SERVER -V +iperf3 -c SERVER -v ``` +Replace `SERVER` with your server’s hostname or private IP address. The `-v` flag enables verbose output. 
The output is similar to: @@ -68,28 +73,47 @@ rcv_tcp_congestion cubic iperf Done. ``` +## TCP result highlights + +- The`Cwnd` column prints the control window size and corresponds to the allowed number of TCP transactions in flight before receiving an acknowledgment `ACK` from the server. This value grows as the connection stabilizes and adapts to link quality. + +- The `CPU Utilization` row shows both the usage on the sender and receiver. If you are migrating your workload to a different platform, such as from x86 to Arm, this is a useful metric. + +- The `snd_tcp_congestion cubic` and `rcv_tcp_congestion cubic` variables show the congestion control algorithm used. + +- `Bitrate` shows the throughput achieved. In this example, the the `t4g.xlarge` AWS instance saturates its 5 Gbps bandwidth available. -- The`Cwnd` column prints the control window size and corresponds to the allowed number of TCP transactions in flight before receiving an acknowledgment `ACK` from the server. This adjusts dynamically to not overwhelm the receiver and adjust for variable link connection strengths. +![instance-network-size#center](./instance-network-size.png "Instance network size") -- The `CPU Utilization` row shows both the usage on the sender and receiver. If you are migrating your workload to a different platform, such as from x86 to Arm, there may be variations. +## UDP result highlights -- The `snd_tcp_congestion cubic` abd `rcv_tcp_congestion cubic` variables show the congestion control algorithm used. +You can also microbenchmark the `UDP` protocol using the `-u` flag with iPerf3. Unlike TCP, UDP does not guarantee packet delivery which means some packets might be lost in transit. -- This `bitrate` shows the throughput achieved under this microbenchmark. As you can see, the 5 Gbps bandwidth available to the `t4g.xlarge` AWS instance is saturated. +To evaluate UDP performance, focus on the server-side statistics, particularly: -![instance-network-size](./instance-network-size.png) +* Packet loss percentage -### Microbenchmark UDP connection +* Jitter (variation in packet arrival time) -You can also microbenchmark the `UDP` protocol with the `-u` flag. As a reminder, UDP does not guarantee packet delivery with some packets being lost. As such you need to observe the statistics on the server side to see the percent of packets lost and the variation in packet arrival time (jitter). The UDP protocol is widely used in applications that need timely packet delivery, such as online gaming and video calls. +These metrics help assess reliability and responsiveness under real-time conditions. -Run the following command from the client to send 2 parallel UDP streams with the `-P 2` option. +UDP is commonly used in latency-sensitive applications such as: + +* Online gaming + +* Voice over IP (VoIP) + +* Video conferencing and streaming + +Because it avoids the overhead of retransmission and ordering, UDP is ideal for scenarios where timely delivery matters more than perfect accuracy. + +Run the following command from the client to send two parallel UDP streams with the `-P 2` option: ```bash -iperf3 -c SERVER -V -u -P 2 +iperf3 -c SERVER -v -u -P 2 ``` -Looking at the server output you observe 0% of packets where lost for the short test. 
+Look at the server output and you can see that none (0%) of packets were lost for the short test: ```output [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams @@ -98,8 +122,10 @@ Looking at the server output you observe 0% of packets where lost for the short [SUM] 0.00-10.00 sec 2.51 MBytes 2.10 Mbits/sec 0.015 ms 0/294 (0%) receiver ``` -Additionally on the client side, the 2 streams saturated 2 of the 4 cores in the system. +Additionally on the client side, the two streams saturated two of the four cores in the system: ```output CPU Utilization: local/sender 200.3% (200.3%u/0.0%s), remote/receiver 0.2% (0.0%u/0.2%s) -``` \ No newline at end of file +``` + +This demonstrates that UDP throughput is CPU-bound when pushing multiple streams. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/setup.md b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/setup.md index 29a34aa5c7..3495acf973 100644 --- a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/setup.md +++ b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/setup.md @@ -1,26 +1,36 @@ --- -title: Prepare for network performance testing +title: Set up Arm-based Linux systems for network performance testing with iPerf3 weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Configure two Arm-based Linux computers +## Environment setup and Learning Path focus -To perform network performance testing you need two Linux computers. You can use AWS EC2 instances with Graviton processors or any other Linux virtual machines from another cloud service provider. +To benchmark bandwidth and latency between Arm-based systems, you'll need to configure two Linux machines running on Arm. -You will also experiment with a local computer and a cloud instance to learn the networking performance differences compared to two cloud instances. +You can use AWS EC2 instances with Graviton processors, or Linux virtual machines from any other cloud service provider. -The instructions below use EC2 instances from AWS connected in a virtual private cloud (VPC). +This tutorial walks you through a local-to-cloud test to compare performance between: -To get started, create two Arm-based Linux instances, one system to act as the server and the other to act as the client. The instructions below use two `t4g.xlarge` instances running Ubuntu 24.04 LTS. +* Two cloud-based instances +* One local system and one cloud instance -### Install software dependencies +The setup instructions below use AWS EC2 instances connected within a Virtual Private Cloud (VPC). -Use the commands below to install `iperf3`, a powerful and flexible open-source command-line tool used for network performance measurement and tuning. It allows network administrators and engineers to actively measure the maximum achievable bandwidth on IP networks. +To get started, create two Arm-based Linux instances, with each instance serving a distinct role: -Run the following on both systems: +* One acting as a client +* One acting as a server + +The instructions below use two `t4g.xlarge` instances running Ubuntu 24.04 LTS. + +## Install software dependencies + +Use the commands below to install iPerf3, which is a powerful open-source CLI tool for measuring maximum achievable network bandwidth. 
+ +Begin by installing iPerf3 on both the client and server systems: ```bash sudo apt update @@ -28,45 +38,81 @@ sudo apt install iperf3 -y ``` {{% notice Note %}} -If you are prompted to start `iperf3` as a daemon you can answer no. +If you're prompted to run `iperf3` as a daemon, answer "no". {{% /notice %}} -## Update Security Rules +## Update security rules -If you are working in a cloud environment like AWS, you need to update the default security rules to enable specific inbound and outbound protocols. +If you're working in a cloud environment like AWS, you must update the default security rules to enable specific inbound and outbound protocols. -From the AWS console, navigate to the security tab. Edit the inbound rules to enable `ICMP`, `UDP` and `TCP` traffic to enable communication between the client and server systems. +To do this, follow these instructions below using the AWS console: -![example_traffic](./example_traffic_rules.png) +* Navigate to the **Security** tab for each instance. +* Configure the **Inbound rules** to allow the following protocols: + * `ICMP` (for ping) + * All UDP ports (for UDP tests) + * TCP port 5201 (for traffic to enable communication between the client and server systems) -{{% notice Note %}} -For additional security set the source and port ranges to the values being used. A good solution is to open TCP port 5201 and all UDP ports and use your security group as the source. This doesn't open any traffic from outside AWS. +![example_traffic#center](./example_traffic_rules.png "AWS console view") + +{{% notice Warning %}} +For secure internal communication, set the source to your instance’s security group. This avoids exposing traffic to the internet while allowing traffic between your systems. + +You can restrict the range further by: + +* Opening only TCP port 5201 + +* Allowing all UDP ports (or a specific range) {{% /notice %}} ## Update the local DNS -To avoid using IP addresses directly, add the IP address of the other system to the `/etc/hosts` file. +To avoid using IP addresses directly, add the other system's IP address to the `/etc/hosts` file. -The local IP address of the server and client can be found in the AWS dashboard. You can also use commands like `ifconfig`, `hostname -I`, or `ip address` to find your local IP address. +You can find private IPs in the AWS dashboard, or by running: + +```bash +hostname -I +ip address +ifconfig +``` +## On the client -On the client, add the IP address of the server to the `/etc/hosts` file with name `SERVER`. +Add the server's IP address, and assign it the name `SERVER`: ```output 127.0.0.1 localhost 10.248.213.104 SERVER ``` -Repeat the same thing on the server and add the IP address of the client to the `/etc/hosts` file with the name `CLIENT`. +## On the server + +Add the client's IP address, and assign it the name `CLIENT`: + +```output +127.0.0.1 localhost +10.248.213.105 CLIENT +``` -## Confirm server is reachable +| Instance Name | Role | Description | +|---------------|--------|------------------------------------| +| SERVER | Server | Runs `iperf3` in listen mode | +| CLIENT | Client | Initiates performance tests | -Finally, confirm the client can reach the server with the ping command below. As a reference you can also ping the localhost. + + + +## Confirm the server is reachable + +Finally, confirm the client can reach the server by using the ping command below. 
If required, you can also ping the localhost: ```bash ping SERVER -c 3 && ping 127.0.0.1 -c 3 ``` -The output below shows that both SERVER and localhost (127.0.0.1) are reachable. Naturally, the local host response time is ~10x faster than the server. Your results will vary depending on geographic location of the systems and other networking factors. +The output below shows that both SERVER and localhost (127.0.0.1) are reachable. + +Localhost response times are typically ~10× faster than remote systems, though actual values vary based on system location and network conditions. ```output PING SERVER (10.248.213.104) 56(84) bytes of data. @@ -87,4 +133,4 @@ PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. rtt min/avg/max/mdev = 0.022/0.027/0.032/0.004 ms ``` -Continue to the next section to learn how to measure the network bandwidth between the systems. \ No newline at end of file +Now that your systems are configured, the next step is to measure the available network bandwidth between them. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/simulating-network-conditions.md b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/simulating-network-conditions.md index 03e747efcf..590c7997be 100644 --- a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/simulating-network-conditions.md +++ b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/simulating-network-conditions.md @@ -1,22 +1,24 @@ --- -title: Simulating different network conditions +title: Simulate different network conditions weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Add a delay to the TCP connection +You can simulate latency and packet loss to test how your application performs under adverse network conditions. This is especially useful when evaluating the impact of congestion, jitter, or unreliable connections in distributed systems. -The Linux `tc` utility can be used to manipulate traffic control settings. +## Add delay to the TCP connection -First, on the client system, find the name of network interface with the following command: +The Linux `tc` (traffic control) lets you manipulate network interface behavior such as delay, loss, or reordering. + +First, on the client system, identify the name of your network interface: ```bash ip addr show ``` -The output below shows the `ens5` network interface device (NIC) is the device we want to manipulate. +The output below shows that the `ens5` network interface device (NIC) is the device that you want to manipulate. ```output 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 @@ -34,7 +36,7 @@ The output below shows the `ens5` network interface device (NIC) is the device w ``` -Run the following command on the client system to add an emulated delay of 10ms on `ens5`. +Run the following command on the client system to add an emulated delay of 10ms on `ens5`: ```bash sudo tc qdisc add dev ens5 root netem delay 10ms @@ -43,13 +45,9 @@ sudo tc qdisc add dev ens5 root netem delay 10ms Rerun the basic TCP test as before on the client: ```bash -iperf3 -c SERVER -V +iperf3 -c SERVER -v ``` -Observe that the `Cwnd` size has grew larger to compensate for the longer response time. - -Additionally, the bitrate has dropped from ~4.9 to ~2.3 `Gbit/sec`. 
- ```output [ 5] local 10.248.213.97 port 43170 connected to 10.248.213.104 port 5201 Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0 @@ -75,22 +73,29 @@ rcv_tcp_congestion cubic iperf Done. ``` +## Observations + +* The `Cwnd` size has grown larger to compensate for the longer response time. + +* The bitrate has dropped from ~4.9 to ~2.3 `Gbit/sec` - demonstrating how even modest latency impacts throughput. -### Simulate packet loss +## Simulate packet loss -To test the resiliency of a distributed application you can add a simulated packet loss of 1%. As opposed to a 10ms delay, this will result in no acknowledgment being received for 1% of packets. Given TCP is a lossless protocol a retry must be sent. +To test the resiliency of a distributed application, you can add a simulated packet loss of 1%. As opposed to a 10ms delay, this will result in no acknowledgment being received for 1% of packets. -Run these commands on the client system: +Given TCP is a lossless protocol, a retry must be sent. + +Run these commands on the client system. The first removes the delay configuration, and the second command introduces a 1% packet loss: ```bash sudo tc qdisc del dev ens5 root sudo tc qdisc add dev ens5 root netem loss 1% ``` -Rerunning the basic TCP test you see an increased number of retries (`Retr`) and a corresponding drop in bitrate. +Now rerunning the basic TCP test, and you will see an increased number of retries (`Retr`) and a corresponding drop in bitrate: ```bash -iperf3 -c SERVER -V +iperf3 -c SERVER -v ``` The output is now: @@ -102,4 +107,12 @@ Test Complete. Summary Results: [ 5] 0.00-10.00 sec 4.40 GBytes 3.78 Gbits/sec receiver ``` -Refer to the `tc` [user documentation](https://man7.org/linux/man-pages/man8/tc.8.html) for the different ways to simulate perturbation and check resiliency. +## Explore further with tc + +The tc tool can simulate: + +* Variable latency and jitter +* Packet duplication or reordering +* Bandwidth throttling + +For advanced options, refer to Refer to the [tc man page](https://man7.org/linux/man-pages/man8/tc.8.html). diff --git a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/tuning.md b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/tuning.md index dd6ddb40ec..c58b6aec3a 100644 --- a/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/tuning.md +++ b/content/learning-paths/servers-and-cloud-computing/microbenchmark-network-iperf3/tuning.md @@ -1,28 +1,33 @@ --- -title: Tuning kernel parameters +title: Tune kernel parameters weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -### Connect from a local machine +You can further optimize network performance by adjusting Linux kernel parameters and testing across different environments, including local-to-cloud scenarios. + +## Connect from a local machine You can look at ways to mitigate performance degradation due to events such as packet loss. -In this example, you will connect to the server node a local machine to demonstrate a longer response time. Check the `iperf3` [installation guide](https://iperf.fr/iperf-download.php) to install `iperf3` on other operating systems. +In this example, you will connect to the server node a local machine to demonstrate a longer response time. Check the iPerf3 [installation guide](https://iperf.fr/iperf-download.php) to install iPerf3 on other operating systems. 
+ +Before starting the test: -Make sure to set the server security group to accept the TCP connection from your local computer IP address. You will also need to use the public IP for the cloud instance. +- Update your cloud server’s **security group** to allow incoming TCP connections from your local machine’s public IP. +- Use the **public IP address** of the cloud instance when connecting. -Running `iperf3` on the local machine and connecting to the cloud server shows a longer round trip time, in this example more than 40ms. +Running iPerf3 on the local machine and connecting to the cloud server shows a longer round trip time, in this example more than 40ms. -On your local computer run: +Run this command on your local computer: ```bash -iperf3 -c -V +iperf3 -c -v ``` -Running a standard TCP client connection with `iperf3` shows an average bitrate of 157 Mbps compared to over 2 Gbps when the client and server are both in AWS. +Compared to over 2 Gbit/sec within AWS, this test shows a reduced bitrate (~157 Mbit/sec) due to longer round-trip times (for example, >40ms). ```output Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0 @@ -33,9 +38,9 @@ Test Complete. Summary Results: [ 8] 0.00-10.03 sec 187 MBytes 156 Mbits/sec receiver ``` -### Modify kernel parameters +## Modify kernel parameters on the server -On the server, your can configure Linux kernel runtime parameters with the `sysctl` command. +On the server, you can configure Linux kernel runtime parameters with the `sysctl` command. There are a plethora of values to tune that relate to performance and security. The following command can be used to list all available options. The [Linux kernel documentation](https://docs.kernel.org/networking/ip-sysctl.html#ip-sysctl) provides a more detailed description of each parameter. @@ -44,13 +49,15 @@ sysctl -a | grep tcp ``` {{% notice Note %}} -Depending on your operating system, some parameters may not be available. For example on AWS Ubuntu 22.04 LTS only the `cubic` and `reno` congestion control algorithms are available. +Depending on your operating system, some parameters might not be available. For example, on AWS Ubuntu 22.04 LTS, only the `cubic` and `reno` congestion control algorithms are supported: ```bash net.ipv4.tcp_available_congestion_control = reno cubic ``` {{% /notice %}} -You can increase the read and write max buffer sizes of the kernel on the server to enable more data to be held. This tradeoff results in increased memory utilization. +## Increase TCP buffer sizes + +You can increase the kernel's read and write buffer sizes on the server improve throughput on high-latency connections. This consumes more system memory but allows more in-flight data. To try it, run the following commands on the server: @@ -59,19 +66,19 @@ sudo sysctl net.core.rmem_max=134217728 # default = 212992 sudo sysctl net.core.wmem_max=134217728 # default = 212992 ``` -Restart the `iperf3` server. +Then, restart the iPerf3 server: ```bash iperf3 -s ``` -Run `iperf3` again on the local machine. +Now rerun iPerf3 again on your local machine: ```bash -iperf3 -c -V +iperf3 -c -v ``` -You see a significantly improved bitrate with no modification on the client side. +Without changing anything on the client, the throughput improved by over 60%. ```output Test Complete. Summary Results: @@ -81,4 +88,10 @@ Test Complete. Summary Results: ``` -You now have an introduction to networking microbenchmarking and performance tuning. 
\ No newline at end of file +You’ve now completed a guided introduction to: + +* Network performance microbenchmarking +* Simulating real-world network conditions +* Tuning kernel parameters for high-latency links + +You can now explore this area further by testing other parameters, tuning for specific congestion control algorithms, or integrating these benchmarks into CI pipelines for continuous performance evaluation. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/0-spin_up_gke_cluster.md b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/0-spin_up_gke_cluster.md index ce7d36d129..e756bb05a2 100644 --- a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/0-spin_up_gke_cluster.md +++ b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/0-spin_up_gke_cluster.md @@ -87,7 +87,7 @@ Enter the following in your command prompt (or cloud shell), and make sure to re ```bash export ZONE=us-central1 -export CLUSTER_NAME=ollama-on-multiarch +export CLUSTER_NAME=ollama-on-arm export PROJECT_ID=YOUR_PROJECT_ID gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT_ID ``` @@ -110,4 +110,4 @@ Finally, test the connection to the cluster with this command: kubectl cluster-info ``` -If you receive a non-error response, you're successfully connected to the K8s cluster. \ No newline at end of file +If you receive a non-error response, you're successfully connected to the K8s cluster. diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/1-deploy-amd64.md b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/1-deploy-amd64.md index a896df69f4..3e1ab760cb 100644 --- a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/1-deploy-amd64.md +++ b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/1-deploy-amd64.md @@ -105,7 +105,7 @@ service/ollama-amd64-svc created 2. Optionally, set the `default Namespace` to `ollama` to simplify future commands: ```bash -config set-context --current --namespace=ollama +kubectl config set-context --current --namespace=ollama ``` 3. Get the status of nodes, pods and services by running: diff --git a/content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md b/content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md index 6fed602968..8e9369652f 100644 --- a/content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md @@ -1,9 +1,6 @@ --- -title: Accelerate Search Operations with SVE2 MATCH Instruction on Arm servers +title: Accelerate search performance with SVE2 MATCH on Arm servers -draft: true -cascade: - draft: true minutes_to_complete: 20 @@ -11,13 +8,13 @@ who_is_this_for: This is an introductory topic for database developers, performa learning_objectives: - - Understand how SVE2 MATCH instructions work - - Implement search algorithms using scalar and SVE2 implementations using the MATCH instruction - - Compare performance between different implementations - - Measure performance improvements on Graviton4 instances + - Understand the purpose and function of SVE2 MATCH instructions. + - Implement a search algorithm using both scalar and SVE2-based MATCH approaches. + - Benchmark and compare performance between scalar and vectorized implementations. 
+ - Analyze speedups and efficiency gains on Arm Neoverse-based instances with SVE2. + prerequisites: -- An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate - cloud service provider. +- Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider. author: Pareena Verma diff --git a/content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md b/content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md index 2e0e3c2da5..619e785b49 100644 --- a/content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md +++ b/content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md @@ -1,6 +1,6 @@ --- # User change -title: "Compare performance of different Search implementations" +title: "Compare search performance using scalar and SVE2 MATCH on Arm Servers" weight: 2 @@ -10,54 +10,51 @@ layout: "learningpathall" --- ## Introduction -Searching for specific values in large arrays is a fundamental operation in many applications, from databases to text processing. The performance of these search operations can significantly impact overall application performance, especially when dealing with large datasets. - -In this learning path, you will learn how to use the SVE2 MATCH instructions available on Arm Neoverse V2 based AWS Graviton4 processors to optimize search operations in byte and half word arrays. You will compare the performance of scalar and SVE2 MATCH implementations to demonstrate the significant performance benefits of using specialized vector instructions. +Searching large arrays for specific values is a core task in performance-sensitive applications, from filtering records in a database to detecting patterns in text or images. On Arm Neoverse-based servers, SVE2 MATCH instructions unlock significant performance gains by vectorizing these operations. In this Learning Path, you’ll implement and benchmark both scalar and vectorized versions of search functions to see just how much faster workloads can run. ## What is SVE2 MATCH? SVE2 (Scalable Vector Extension 2) is an extension to the Arm architecture that provides vector processing capabilities with a length-agnostic programming model. The MATCH instruction is a specialized SVE2 instruction that efficiently searches for elements in a vector that match any element in another vector. -## Set Up Your Environment +## Set up your environment -To follow this learning path, you will need: +To work through these examples, you need: -1. An AWS Graviton4 instance running `Ubuntu 24.04`. -2. GCC compiler with SVE support +* A cloud instance with SVE2 support running Ubuntu 24.04 +* GCC compiler with SVE support -Let's start by setting up our environment: +Start by setting up your environment: ```bash sudo apt-get update sudo apt-get install -y build-essential gcc g++ ``` -An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most -recent compiler version. This Learning path was tested with GCC 13 which is the default version on `Ubuntu 24.04` but you -can run it with newer versions of GCC as well. +An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. 
This Learning path was tested with GCC 13 which is the default version on Ubuntu 24.04, but you can run it with newer versions of GCC as well. + +Create a directory for your implementations: -Create a directory for our implementations: ```bash mkdir -p sve2_match_demo cd sve2_match_demo ``` -## Understanding the Problem +## Understanding the problem -Our goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise. +Your goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise. This type of search operation is common in many applications: -1. **Database Systems**: Checking if a value exists in a column -2. **Text Processing**: Finding specific characters in a text -3. **Network Packet Inspection**: Looking for specific byte patterns -4. **Image Processing**: Finding specific pixel values +* **Database systems**: checking if a value exists in a column +* **Text processing**: finding specific characters in a text +* **Network packet inspection**: looking for specific byte patterns +* **Image processing**: finding specific pixel values -## Implementing Search Algorithms +## Implementing search algorithms -Let's implement three versions of our search function: +To understand the alternatives, you can implement three versions of the search function: -### 1. Generic Scalar Implementation +### 1. Generic scalar implementation -Create a generic implementation in C, checking each element individually against each key. Open a editor of your choice and copy the code shown into a file named `sve2_match_demo.c`: +Create a generic implementation in C that checks each element individually against each key. Open an editor of your choice and copy the code shown into a file named `sve2_match_demo.c`: ```c #include @@ -89,11 +86,14 @@ int search_generic_u16(const uint16_t *hay, size_t n, const uint16_t *keys, return 0; } ``` + The `search_generic_u8()` and `search_generic_u16()` functions both return 1 immediately when a match is found in the inner loop. -### 2. SVE2 MATCH Implementation +### 2. SVE2 MATCH implementation -Now create an implementation that uses SVE2 MATCH instructions to process multiple elements in parallel. Copy the code shown into the same file: +Now create an implementation that uses SVE2 MATCH instructions to process multiple elements in parallel. 
+ +Copy the code shown into the same source file: ```c int search_sve2_match_u8(const uint8_t *hay, size_t n, const uint8_t *keys, @@ -109,7 +109,7 @@ int search_sve2_match_u8(const uint8_t *hay, size_t n, const uint8_t *keys, svuint8_t block = svld1(pg, &hay[i]); if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1; } -for (; i < n; ++i) { + for (; i < n; ++i) { uint8_t v = hay[i]; for (size_t k = 0; k < nkeys; ++k) if (v == keys[k]) return 1; @@ -138,16 +138,19 @@ int search_sve2_match_u16(const uint16_t *hay, size_t n, const uint16_t *keys, return 0; } ``` + The SVE MATCH implementation with the `search_sve2_match_u8()` and `search_sve2_match_u16()` functions provide an efficient vectorized search approach with these key technical aspects: - Uses SVE2's specialized MATCH instruction to compare multiple elements against multiple keys in parallel - - Leverages hardware-specific vector length through svcntb() for scalability + - Leverages hardware-specific vector length through `svcntb()` for scalability - Prepares a vector of search keys that's compared against blocks of data - Processes data in vector-sized chunks with early termination when matches are found. Stops immediately when any element in the vector matches. - Falls back to scalar code for remainder elements -### 3. Optimized SVE2 MATCH Implementation +### 3. Optimized SVE2 MATCH implementation + +In this next SVE2 implementation you will add loop unrolling and prefetching to further improve performance. -In this next SVE2 implementation you will add loop unrolling and prefetching to further improve performance. Copy the code shown into the same source file: +Copy the code shown into the same source file: ```c int search_sve2_match_u8_unrolled(const uint8_t *hay, size_t n, const uint8_t *keys, @@ -180,7 +183,7 @@ int search_sve2_match_u8_unrolled(const uint8_t *hay, size_t n, const uint8_t *k if (svptest_any(pg, match1) || svptest_any(pg, match2) || svptest_any(pg, match3) || svptest_any(pg, match4)) return 1; -} + } // Process remaining vectors one at a time for (; i + VL <= n; i += VL) { @@ -225,7 +228,7 @@ int search_sve2_match_u16_unrolled(const uint16_t *hay, size_t n, const uint16_t svbool_t match3 = svmatch_u16(pg, block3, keyvec); svbool_t match4 = svmatch_u16(pg, block4, keyvec); -if (svptest_any(pg, match1) || svptest_any(pg, match2) || + if (svptest_any(pg, match1) || svptest_any(pg, match2) || svptest_any(pg, match3) || svptest_any(pg, match4)) return 1; } @@ -245,15 +248,18 @@ if (svptest_any(pg, match1) || svptest_any(pg, match2) || return 0; } ``` + The main highlights of this implementation are: - - Processes 4 vectors per iteration instead of just one and stops immediately when any match is found in any of the 4 vectors. - - Uses prefetching (__builtin_prefetch) to reduce memory latency - - Leverages the svmatch_u8/u16 instruction to efficiently compare each element against multiple keys in a single operation + - Processes four vectors per iteration instead of just one and stops immediately when any match is found in any of the four vectors. 
+ - Uses prefetching (`__builtin_prefetch`) to reduce memory latency + - Leverages the `svmatch_u8` and `svmatch_u16` instructions to efficiently compare each element against multiple keys in a single operation - Aligns memory to 64-byte boundaries for better memory access performance -## Benchmarking Framework +## Benchmarking framework -To compare the performance of the three implementations, you will use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing that are needed to setup the test data with controlled hit rates: +To compare the performance of the three implementations, use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing that are needed to setup the test data with controlled hit rates. + +Copy the code below into the bottom of the same source code file: ```c // Timing function @@ -341,7 +347,7 @@ int main(int argc, char **argv) { volatile int r = search_generic_u8(hay8, len, keys8, NKEYS8); (void)r; t_gen8 += nsec_now() - t0; - #if defined(__ARM_FEATURE_SVE2) +#if defined(__ARM_FEATURE_SVE2) t0 = nsec_now(); r = search_sve2_match_u8(hay8, len, keys8, NKEYS8); (void)r; t_sve8 += nsec_now() - t0; @@ -351,7 +357,7 @@ int main(int argc, char **argv) { t_sve8_unrolled += nsec_now() - t0; #endif -t0 = nsec_now(); + t0 = nsec_now(); r = search_generic_u16(hay16, len, keys16, NKEYS16); (void)r; t_gen16 += nsec_now() - t0; @@ -416,7 +422,9 @@ t0 = nsec_now(); free(hay8); free(hay16); return 0; +} ``` + ## Compiling and Running You can now compile the different search implementations: @@ -425,12 +433,13 @@ You can now compile the different search implementations: gcc -O3 -march=armv9-a+sve2 -mcpu=neoverse-v2 sve2_match_demo.c -o sve2_match_demo ``` -Now run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate: +Run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate: ```bash ./sve2_match_demo $((1<<16)) 3 0.00001 ``` -The output will look like: + +The output is similar to: ```output Haystack length : 65536 elements @@ -450,14 +459,16 @@ Average latency over 3 iterations (ns): speed‑up (orig) : 20.74x speed‑up (unroll): 27.23x ``` + You can experiment with different haystack lengths, iterations and hit probabilities. ```bash ./sve2_match_demo [length] [iterations] [hit_prob] ``` + ## Performance Results -When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will observe the following results for different hit probabilities: +When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will see different hit probabilities, as shown in the following results: ### Latency (ns per iteration) for Different Hit Rates (8-bit) @@ -498,23 +509,24 @@ When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 e ### Impact of Hit Rate on Performance The benchmark results reveal several important insights about the performance characteristics of SVE2 MATCH instructions. The most striking observation is how the performance advantage of SVE2 MATCH varies with the hit rate: -1. 
**Very Low Hit Rates (0% - 0.001%)**: +##### **Very Low Hit Rates (0% - 0.001%)**: - For 8-bit data, SVE2 MATCH Unrolled achieves an impressive 90-95x speedup - For 16-bit data, the speedup is around 27-28x - This is where SVE2 MATCH truly shines, as it can quickly process large chunks of data with few or no matches -2. **Low Hit Rates (0.01%)**: +##### **Low Hit Rates (0.01%)**: - Still excellent performance with 70x speedup for 8-bit and 27x for 16-bit - The vectorized approach continues to be highly effective -3. **Medium Hit Rates (0.1%)**: +##### **Medium Hit Rates (0.1%)**: - Good performance with 25x speedup for 8-bit and 21x for 16-bit - Early termination starts to reduce the advantage somewhat -4. **High Hit Rates (1%)**: +##### **High Hit Rates (1%)**: - Moderate speedup of 6x for 8-bit and 2x for 16-bit - With frequent matches, early termination limits the benefits of vectorization +##### **Summary** This pattern makes SVE2 MATCH particularly well-suited for applications where matches are rare but important to find, such as: - Searching for specific patterns in large datasets - Filtering operations with high selectivity @@ -524,8 +536,8 @@ This pattern makes SVE2 MATCH particularly well-suited for applications where ma The unrolled implementation consistently outperforms the basic SVE2 MATCH implementation: -1. **Low Hit Rates**: Up to 30% additional speedup -2. **Higher Hit Rates**: 5-20% additional speedup +* **Low Hit Rates**: Up to 30% additional speedup +* **Higher Hit Rates**: 5-20% additional speedup This demonstrates the value of combining algorithmic optimizations (loop unrolling, prefetching) with hardware-specific instructions for maximum performance. @@ -563,6 +575,5 @@ For image processing, MATCH can accelerate: ## Conclusion -The SVE2 MATCH instruction provides a powerful way to accelerate search operations in byte and half word arrays. By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your applications. - +The SVE2 MATCH instruction provides a powerful way to accelerate search operations in byte and half word arrays. By implementing these optimizations on cloud instances with SVE2, you can achieve significant performance improvements for your applications. 
diff --git a/data/stats_current_test_info.yml b/data/stats_current_test_info.yml index eabfc682a2..b5f27fbeb1 100644 --- a/data/stats_current_test_info.yml +++ b/data/stats_current_test_info.yml @@ -1,5 +1,5 @@ summary: - content_total: 371 + content_total: 373 content_with_all_tests_passing: 0 content_with_tests_enabled: 61 sw_categories: diff --git a/data/stats_weekly_data.yml b/data/stats_weekly_data.yml index 5efe046e36..c891aa921e 100644 --- a/data/stats_weekly_data.yml +++ b/data/stats_weekly_data.yml @@ -6117,3 +6117,113 @@ avg_close_time_hrs: 0 num_issues: 12 percent_closed_vs_total: 0.0 +- a_date: '2025-06-09' + content: + automotive: 2 + cross-platform: 33 + embedded-and-microcontrollers: 41 + install-guides: 101 + iot: 6 + laptops-and-desktops: 37 + mobile-graphics-and-gaming: 34 + servers-and-cloud-computing: 119 + total: 373 + contributions: + external: 95 + internal: 500 + github_engagement: + num_forks: 30 + num_prs: 8 + individual_authors: + adnan-alsinan: 2 + alaaeddine-chakroun: 2 + albin-bernhardsson: 1 + alex-su: 1 + alexandros-lamprineas: 1 + andrew-choi: 2 + andrew-kilroy: 1 + annie-tallund: 4 + arm: 3 + arnaud-de-grandmaison: 4 + arnaud-de-grandmaison.: 1 + aude-vuilliomenet: 1 + avin-zarlez: 1 + barbara-corriero: 1 + basma-el-gaabouri: 1 + ben-clark: 1 + bolt-liu: 2 + brenda-strech: 1 + chaodong-gong: 1 + chen-zhang: 1 + christophe-favergeon: 1 + christopher-seidl: 7 + cyril-rohr: 1 + daniel-gubay: 1 + daniel-nguyen: 2 + david-spickett: 2 + dawid-borycki: 33 + diego-russo: 2 + dominica-abena-o.-amanfo: 1 + elham-harirpoush: 2 + florent-lebeau: 5 + "fr\xE9d\xE9ric--lefred--descamps": 2 + gabriel-peterson: 5 + gayathri-narayana-yegna-narayanan: 1 + georgios-mermigkis: 1 + geremy-cohen: 2 + gian-marco-iodice: 1 + graham-woodward: 1 + han-yin: 1 + iago-calvo-lista: 1 + james-whitaker: 1 + jason-andrews: 102 + joe-stech: 4 + johanna-skinnider: 2 + jonathan-davies: 2 + jose-emilio-munoz-lopez: 1 + julie-gaskin: 5 + julio-suarez: 6 + jun-he: 1 + kasper-mecklenburg: 1 + kieran-hejmadi: 10 + koki-mitsunami: 2 + konstantinos-margaritis: 8 + kristof-beyls: 1 + leandro-nunes: 1 + liliya-wu: 1 + mark-thurman: 1 + masoud-koleini: 1 + mathias-brossard: 1 + michael-hall: 5 + na-li: 1 + nader-zouaoui: 2 + nikhil-gupta: 1 + nina-drozd: 1 + nobel-chowdary-mandepudi: 6 + odin-shen: 7 + owen-wu: 2 + pareena-verma: 44 + paul-howard: 3 + peter-harris: 1 + pranay-bakre: 5 + preema-merlin-dsouza: 1 + przemyslaw-wirkus: 2 + rin-dobrescu: 1 + roberto-lopez-mendez: 2 + ronan-synnott: 45 + shuheng-deng: 1 + thirdai: 1 + tianyu-li: 2 + tom-pilar: 1 + uma-ramalingam: 1 + varun-chari: 2 + visualsilicon: 1 + willen-yang: 1 + ying-yu: 2 + yiyang-fan: 1 + zach-lasiuk: 2 + zhengjun-xing: 2 + issues: + avg_close_time_hrs: 0 + num_issues: 14 + percent_closed_vs_total: 0.0