From c5292b63fbde8f6adacd26a10ac5a13cd57d4b56 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Wed, 11 Jun 2025 22:24:27 +0000 Subject: [PATCH 01/11] Starting content dev --- .../bitmap_scan_sve2/01-introduction.md | 53 ++++++++++++ .../02-bitmap-data-structure.md | 84 +++++++++++++++++++ .../03-scalar-implementations.md | 10 +++ .../04-vector-implementations.md | 10 +++ .../05-benchmarking-and-results.md | 10 +++ .../06-application-and-best-practices.md | 10 +++ .../bitmap_scan_sve2/_index.md | 4 - .../bitmap_scan_sve2/bitmap-scan-sve.md | 18 ++-- 8 files changed, 186 insertions(+), 13 deletions(-) create mode 100644 content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md create mode 100644 content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md create mode 100644 content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md create mode 100644 content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md create mode 100644 content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md create mode 100644 content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md new file mode 100644 index 0000000000..cd454a709e --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -0,0 +1,53 @@ +--- +# User change +title: "Introduction to Bitmap Scanning and Vectorization on Arm" + +weight: 2 + +layout: "learningpathall" +--- +## Introduction + +Bitmap scanning is a fundamental operation in database systems, particularly for analytical workloads. 
It's used in bitmap indexes, bloom filters, and column filtering operations. The performance of bitmap scanning can significantly affect query execution times, especially for large datasets. + +In this Learning Path, you will explore how to use SVE instructions available on Arm Neoverse V2 based servers like AWS Graviton4 to optimize bitmap scanning operations. You will compare the performance of scalar, NEON, and SVE implementations to demonstrate the significant performance benefits of using specialized vector instructions. + +## What is Bitmap Scanning? + +Bitmap scanning involves searching through a bit vector to find positions where bits are set (1) or unset (0). In database systems, bitmaps are commonly used to represent: + +1. **Bitmap Indexes**: Each bit represents whether a row satisfies a particular condition +2. **Bloom Filters**: Probabilistic data structures used to test set membership +3. **Column Filters**: Bit vectors indicating which rows match certain predicates + +The operation of scanning a bitmap to find set bits is often in the critical path of query execution, making it a prime candidate for optimization. + +## The Evolution of Vector Processing for Bitmap Scanning + +Let's look at how vector processing has evolved for bitmap scanning: + +1. **Generic Scalar Processing**: Traditional bit-by-bit processing with conditional branches +2. **Optimized Scalar Processing**: Byte-level skipping to avoid processing empty bytes +3. **NEON**: Fixed-length 128-bit SIMD processing with vector operations +4. **SVE**: Scalable vector processing with predication and specialized instructions + +## Set up your environment + +To follow this learning path, you will need: + +1. An AWS Graviton4 instance running `Ubuntu 24.04`. +2. 
GCC compiler with SVE support + +Let's start by setting up our environment: + +```bash +sudo apt-get update +sudo apt-get install -y build-essential gcc g++ +``` +An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. This Learning Path was tested with GCC 13, which is the default version on `Ubuntu 24.04`, but you can run it with newer versions of GCC as well. + +Create a directory for your implementations: +```bash +mkdir -p bitmap_scan +cd bitmap_scan +``` diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md new file mode 100644 index 0000000000..75d6f7c029 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md @@ -0,0 +1,84 @@ +--- +# User change +title: "Building and Managing the Bit Vector Structure" + +weight: 3 + +layout: "learningpathall" + +--- +## Bitmap Data Structure + +Now let's define a simple bitmap data structure that serves as the foundation for the different implementations. The bitmap implementation uses a simple structure with three key components: + - A byte array to store the actual bits + - Tracking of the physical size (bytes) + - Tracking of the logical size (bits) + +For testing the different implementations in this Learning Path, you also need functions to generate and analyze the bitmaps.
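
As a minimal sketch of the index arithmetic implied by those three components, the helpers below compute the physical size from the logical size and locate a single bit. The names `bytes_for_bits`, `byte_index`, and `bit_mask` are hypothetical and used only for illustration; bits are assumed to be stored least-significant-bit first within each byte.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: number of bytes needed to hold size_bits bits (rounds up).
static inline size_t bytes_for_bits(size_t size_bits) {
    return (size_bits + 7) / 8;
}

// Hypothetical helper: index of the byte that holds bit `pos`.
static inline size_t byte_index(size_t pos) {
    return pos / 8;
}

// Hypothetical helper: mask selecting bit `pos` inside its byte (LSB-first).
static inline uint8_t bit_mask(size_t pos) {
    return (uint8_t)(1u << (pos % 8));
}
```

For example, a 10-bit bitmap needs `bytes_for_bits(10)`, which is 2 bytes, and bit 13 lives in byte `byte_index(13)`, which is 1, under the mask `bit_mask(13)`, which is 0x20.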
+ +Use a file editor of your choice and copy the code below into `bitvector_scan_benchmark.c`: + +```c +// Define a simple bit vector structure +typedef struct { + uint8_t* data; + size_t size_bytes; + size_t size_bits; +} bitvector_t; + +// Create a new bit vector +bitvector_t* bitvector_create(size_t size_bits) { + bitvector_t* bv = (bitvector_t*)malloc(sizeof(bitvector_t)); + bv->size_bits = size_bits; + bv->size_bytes = (size_bits + 7) / 8; + bv->data = (uint8_t*)calloc(bv->size_bytes, 1); + return bv; +} + +// Free bit vector resources +void bitvector_free(bitvector_t* bv) { + free(bv->data); + free(bv); +} + +// Set a bit in the bit vector +void bitvector_set_bit(bitvector_t* bv, size_t pos) { + if (pos < bv->size_bits) { + bv->data[pos / 8] |= (1 << (pos % 8)); + } +} + +// Get a bit from the bit vector +bool bitvector_get_bit(bitvector_t* bv, size_t pos) { + if (pos < bv->size_bits) { + return (bv->data[pos / 8] & (1 << (pos % 8))) != 0; + } + return false; +} + +// Generate a bit vector with specified density +bitvector_t* generate_bitvector(size_t size_bits, double density) { + bitvector_t* bv = bitvector_create(size_bits); + + // Set bits according to density + size_t num_bits_to_set = (size_t)(size_bits * density); + + for (size_t i = 0; i < num_bits_to_set; i++) { + size_t pos = rand() % size_bits; + bitvector_set_bit(bv, pos); + } + + return bv; +} + +// Count set bits in the bit vector +size_t bitvector_count_scalar(bitvector_t* bv) { + size_t count = 0; + for (size_t i = 0; i < bv->size_bits; i++) { + if (bitvector_get_bit(bv, i)) { + count++; + } + } + return count; +} +``` \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md new file mode 100644 index 0000000000..5f56fdc479 --- /dev/null +++
b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md @@ -0,0 +1,10 @@ +--- +# User change +title: "Scalar Implementations of Bitmap Scanning" + +weight: 4 + +layout: "learningpathall" + + +--- \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md new file mode 100644 index 0000000000..051c30f5c3 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md @@ -0,0 +1,10 @@ +--- +# User change +title: "Accelerating Scans with NEON and SVE" + +weight: 5 + +layout: "learningpathall" + + +--- \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md new file mode 100644 index 0000000000..628a490f6e --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md @@ -0,0 +1,10 @@ +--- +# User change +title: "Benchmarking Bitmap Scanning Across Implementations" + +weight: 6 + +layout: "learningpathall" + + +--- \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md new file mode 100644 index 0000000000..304c2a73a3 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md @@ -0,0 +1,10 @@ +--- +# User change +title: "Introduction" + +weight: 7 + +layout: "learningpathall" + + +--- \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md 
b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md index e1f27e22ea..53c4d0bb83 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md @@ -1,10 +1,6 @@ --- title: Accelerate Bitmap Scanning with NEON and SVE Instructions on Arm servers -draft: true -cascade: - draft: true - minutes_to_complete: 20 who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances. diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md index 2b7eb102b3..60053458d0 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md @@ -1,8 +1,8 @@ --- # User change -title: "Compare performance of different Bitmap Scanning implementations" +title: "Accelerate Bitmap Scanning on Arm Servers with NEON and SVE" -weight: 2 +weight: 0 layout: "learningpathall" @@ -11,15 +11,15 @@ layout: "learningpathall" ## Introduction -Bitmap scanning is a fundamental operation in database systems, particularly for analytical workloads. It's used in bitmap indexes, bloom filters, and column filtering operations. The performance of bitmap scanning can significantly impact query execution times, especially for large datasets. +Bitmap scanning is a fundamental operation in database systems, particularly for analytical workloads. It's used in bitmap indexes, bloom filters, and column filtering operations. The performance of bitmap scanning can significantly affect query execution times, especially for large datasets. 
-In this learning path, you will explore how to use SVE instructions available on Arm Neoverse V2 based servers like AWS Graviton4 to optimize bitmap scanning operations. You will compare the performance of scalar, NEON, and SVE implementations to demonstrate the significant performance benefits of using specialized vector instructions. +In this Learning Path, you will explore how to use SVE instructions available on Arm Neoverse V2 based servers like AWS Graviton4 to optimize bitmap scanning operations. You will compare the performance of scalar, NEON, and SVE implementations to demonstrate the significant performance benefits of using specialized vector instructions. ## What is Bitmap Scanning? Bitmap scanning involves searching through a bit vector to find positions where bits are set (1) or unset (0). In database systems, bitmaps are commonly used to represent: -1. **Bitmap Indexes**: Where each bit represents whether a row satisfies a particular condition +1. **Bitmap Indexes**: Each bit represents whether a row satisfies a particular condition 2. **Bloom Filters**: Probabilistic data structures used to test set membership 3. **Column Filters**: Bit vectors indicating which rows match certain predicates @@ -34,7 +34,7 @@ Let's look at how vector processing has evolved for bitmap scanning: 3. **NEON**: Fixed-length 128-bit SIMD processing with vector operations 4. **SVE**: Scalable vector processing with predication and specialized instructions -## Set Up Your Environment +## Set up your environment To follow this learning path, you will need: @@ -57,12 +57,12 @@ cd bitmap_scan ## Bitmap Data Structure -First, let's define a simple bitmap data structure that will serve as the foundation for the different implementations. The bitmap implementation uses a simple structure with three key components: +Now let's define a simple bitmap data structure that serves as the foundation for the different implementations. 
The bitmap implementation uses a simple structure with three key components: - A byte array to store the actual bits - Tracking of the physical size(bytes) - Tracking of the logical size(bits) -For testing the different implementations in this Learning Path, you will also need functions to generate and analyze the bitmaps. +For testing the different implementations in this Learning Path, you also need functions to generate and analyze the bitmaps. Use a file editor of your choice and the copy the code below into `bitvector_scan_benchmark.c`: @@ -131,7 +131,7 @@ size_t bitvector_count_scalar(bitvector_t* bv) { } ``` -## Bitmap Scanning Implementations +## Bitmap scanning implementations Now, let's implement four versions of a bitmap scanning operation that finds all positions where a bit is set: From f0d80a5101f46a515f26e7027e9206ad4495391d Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 12 Jun 2025 11:50:10 +0000 Subject: [PATCH 02/11] Created new sections and divided content appropriately. 
--- .../bitmap_scan_sve2/01-introduction.md | 4 +- .../03-scalar-implementations.md | 62 +- .../04-vector-implementations.md | 146 ++++- .../05-benchmarking-and-results.md | 198 +++++- .../06-application-and-best-practices.md | 42 +- .../bitmap_scan_sve2/bitmap-scan-sve.md | 570 ------------------ 6 files changed, 445 insertions(+), 577 deletions(-) delete mode 100644 content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md index cd454a709e..b5c5cae5e9 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -1,6 +1,6 @@ --- # User change -title: "Introduction to Bitmap Scanning and Vectorization on Arm" +title: "Bitmap Scanning and Vectorization on Arm" weight: 2 @@ -51,3 +51,5 @@ Create a directory for your implementations: mkdir -p bitmap_scan cd bitmap_scan ``` +## Next Steps +In the next section, you’ll define the core bitmap data structure and utility functions for setting, clearing, and inspecting bits. diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md index 5f56fdc479..653dc8112e 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md @@ -7,4 +7,64 @@ weight: 4 layout: "learningpathall" ---- \ No newline at end of file +--- +## Bitmap scanning implementations + +Now, let's implement four versions of a bitmap scanning operation that finds all positions where a bit is set: + +### 1. 
Generic Scalar Implementation + +This is the most straightforward implementation, checking each bit individually. It serves as our baseline for comparison against the other implementations to follow. Copy the code below into the same file: + +```c +// Generic scalar implementation of bit vector scanning (bit-by-bit) +size_t scan_bitvector_scalar_generic(bitvector_t* bv, uint32_t* result_positions) { + size_t result_count = 0; + + for (size_t i = 0; i < bv->size_bits; i++) { + if (bitvector_get_bit(bv, i)) { + result_positions[result_count++] = i; + } + } + + return result_count; +} +``` + +You will notice this generic C implementation processes every bit, even when most bits are not set. It has high function call overhead and does not take advantage of vector instructions. + +In the following implementations, you will address these inefficiencies with more optimized techniques. + +### 2. Optimized Scalar Implementation + +This implementation adds byte-level skipping to avoid processing empty bytes. Copy this optimized C scalar implementation code into the same file: + +```c +// Optimized scalar implementation of bit vector scanning (byte-level) +size_t scan_bitvector_scalar(bitvector_t* bv, uint32_t* result_positions) { + size_t result_count = 0; + + for (size_t byte_idx = 0; byte_idx < bv->size_bytes; byte_idx++) { + uint8_t byte = bv->data[byte_idx]; + + // Skip empty bytes + if (byte == 0) { + continue; + } + + // Process each bit in the byte + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte & (1 << bit_pos)) { + size_t global_pos = byte_idx * 8 + bit_pos; + if (global_pos < bv->size_bits) { + result_positions[result_count++] = global_pos; + } + } + } + } + + return result_count; +} +``` +Instead of iterating through each bit, this implementation processes one byte (8 bits) at a time.
The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely. For sparse bitmaps, this can dramatically reduce the number of bit checks. + diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md index 051c30f5c3..e1279aa0a8 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md @@ -1,10 +1,152 @@ --- # User change -title: "Accelerating Scans with NEON and SVE" +title: "Vectorized Bitmap Scanning with NEON and SVE" weight: 5 layout: "learningpathall" ---- \ No newline at end of file +--- +### 3. NEON Implementation + +This implementation uses NEON SIMD (Single Instruction, Multiple Data) instructions to process 16 bytes (128 bits) at a time, significantly accelerating the scanning process.
Copy the NEON implementation shown below into the same file: +```c +// NEON implementation of bit vector scanning +size_t scan_bitvector_neon(bitvector_t* bv, uint32_t* result_positions) { + size_t result_count = 0; + + // Process 16 bytes at a time using NEON + size_t i = 0; + for (; i + 16 <= bv->size_bytes; i += 16) { + uint8x16_t data = vld1q_u8(&bv->data[i]); + + // Quick check if all bytes are zero + uint8x16_t zero = vdupq_n_u8(0); + uint8x16_t cmp = vceqq_u8(data, zero); + uint64x2_t cmp64 = vreinterpretq_u64_u8(cmp); + + // If all bytes are zero (all comparisons are true/0xFF), skip this chunk + if (vgetq_lane_u64(cmp64, 0) == UINT64_MAX && + vgetq_lane_u64(cmp64, 1) == UINT64_MAX) { + continue; + } + + // Process each byte + uint8_t bytes[16]; + vst1q_u8(bytes, data); + + for (int j = 0; j < 16; j++) { + uint8_t byte = bytes[j]; + + // Skip empty bytes + if (byte == 0) { + continue; + } + + // Process each bit in the byte + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte & (1 << bit_pos)) { + size_t global_pos = (i + j) * 8 + bit_pos; + if (global_pos < bv->size_bits) { + result_positions[result_count++] = global_pos; + } + } + } + } + } + + // Handle remaining bytes with scalar code + for (; i < bv->size_bytes; i++) { + uint8_t byte = bv->data[i]; + + // Skip empty bytes + if (byte == 0) { + continue; + } + + // Process each bit in the byte + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte & (1 << bit_pos)) { + size_t global_pos = i * 8 + bit_pos; + if (global_pos < bv->size_bits) { + result_positions[result_count++] = global_pos; + } + } + } + } + + return result_count; +} +``` +This NEON implementation processes 16 bytes at a time with vector instructions. For sparse bitmaps, entire 16-byte chunks can be skipped at once, providing a significant speedup over byte-level skipping. After vector processing, it falls back to scalar code for any remaining bytes that don't fill a complete 16-byte chunk. + +### 4. 
SVE Implementation + +This implementation uses SVE instructions, which are available on the Arm Neoverse V2 based AWS Graviton4 processor. Copy this SVE implementation into the same file: + +```c +// SVE implementation using svcmpne_u8, PNEXT, and LASTB +size_t scan_bitvector_sve2_pnext(bitvector_t* bv, uint32_t* result_positions) { + size_t result_count = 0; + size_t sve_len = svcntb(); + svuint8_t zero = svdup_n_u8(0); + + // Process the bitvector to find all set bits + for (size_t offset = 0; offset < bv->size_bytes; offset += sve_len) { + svbool_t pg = svwhilelt_b8((uint64_t)offset, (uint64_t)bv->size_bytes); + svuint8_t data = svld1_u8(pg, bv->data + offset); + + // Prefetch next chunk + if (offset + sve_len < bv->size_bytes) { + __builtin_prefetch(bv->data + offset + sve_len, 0, 0); + } + + // Find non-zero bytes + svbool_t non_zero = svcmpne_u8(pg, data, zero); + + // Skip if all bytes are zero + if (!svptest_any(pg, non_zero)) { + continue; + } + + // Create an index vector for byte positions + svuint8_t indexes = svindex_u8(0, 1); // 0, 1, 2, 3, ...
+ + // Initialize next with false predicate + svbool_t next = svpfalse_b(); + + // Find the first non-zero byte + next = svpnext_b8(non_zero, next); + + // Process each non-zero byte using PNEXT + while (svptest_any(pg, next)) { + // Get the index of this byte + uint8_t byte_idx = svlastb_u8(next, indexes); + + // Get the actual byte value + uint8_t byte_value = svlastb_u8(next, data); + + // Calculate the global byte position + size_t global_byte_pos = offset + byte_idx; + + // Process each bit in the byte using scalar code + for (int bit_pos = 0; bit_pos < 8; bit_pos++) { + if (byte_value & (1 << bit_pos)) { + size_t global_bit_pos = global_byte_pos * 8 + bit_pos; + if (global_bit_pos < bv->size_bits) { + result_positions[result_count++] = global_bit_pos; + } + } + } + + // Find the next non-zero byte + next = svpnext_b8(non_zero, next); + } + } + + return result_count; +} +``` +The SVE implementation efficiently scans bitmaps by using `svcmpne_u8` to identify non-zero bytes and `svpnext_b8` to iterate through them sequentially. It extracts byte indices and values with `svlastb_u8`, then processes individual bits using scalar code. This hybrid vector-scalar approach maintains great performance across various bitmap densities. On Graviton4, SVE vectors are 128 bits (16 bytes), allowing processing of 16 bytes at once. 
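
The scalar inner loop in the listings above tests all eight bit positions of every non-zero byte. One possible variation, sketched here as an assumption rather than as part of the benchmarked code, visits only the set bits by using the GCC/Clang builtin `__builtin_ctz` and clearing the lowest set bit on each step. The helper name `emit_set_bits` is hypothetical, and its effect on the timings in this Learning Path has not been measured.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not part of the benchmark code: writes the global
// positions of the set bits in `byte` to `out`, where `base_bit` is the
// global position of bit 0 of that byte. Returns the number of positions
// written (at most 8).
static inline size_t emit_set_bits(uint8_t byte, size_t base_bit, uint32_t* out) {
    size_t n = 0;
    unsigned int bits = byte;
    while (bits != 0) {
        int bit_pos = __builtin_ctz(bits);   // index of the lowest set bit
        out[n++] = (uint32_t)(base_bit + (size_t)bit_pos);
        bits &= bits - 1;                    // clear the lowest set bit
    }
    return n;
}
```

Inside any of the scan functions, a call such as `result_count += emit_set_bits(byte_value, global_byte_pos * 8, result_positions + result_count);` could stand in for the eight-iteration loop. The sketch omits the `global_pos < bv->size_bits` check, so it relies on padding bits in the last byte never being set, which holds when bits are only set through `bitvector_set_bit`.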
+ diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md index 628a490f6e..119d2d941c 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md @@ -7,4 +7,200 @@ weight: 6 layout: "learningpathall" ---- \ No newline at end of file +--- +## Benchmarking Code + +Now that you have created four different implementations of a bitmap scanning algorithm, let's create a benchmarking framework to compare the performance of our implementations. Copy the code shown below into `bitvector_scan_benchmark.c`: + +```c +// Timing function for bit vector scanning +double benchmark_scan(size_t (*scan_func)(bitvector_t*, uint32_t*), + bitvector_t* bv, uint32_t* result_positions, + int iterations, size_t* found_count) { + struct timespec start, end; + *found_count = 0; + + clock_gettime(CLOCK_MONOTONIC, &start); + + for (int iter = 0; iter < iterations; iter++) { + size_t count = scan_func(bv, result_positions); + if (iter == 0) { + *found_count = count; + } + } + + clock_gettime(CLOCK_MONOTONIC, &end); + + double elapsed = (end.tv_sec - start.tv_sec) * 1000.0 + + (end.tv_nsec - start.tv_nsec) / 1000000.0; + return elapsed / iterations; +} +``` + +## Main Function +The main function of your program is responsible for setting up the test environment, running the benchmarking code for the four different implementations across various bit densities, and reporting the results. In the context of bitmap scanning, bit density refers to the percentage or proportion of bits that are set (have a value of 1) in the bitmap.
Copy the main function code below into `bitvector_scan_benchmark.c`: + +```C +int main() { + srand(time(NULL)); + + printf("Bit Vector Scanning Performance Benchmark\n"); + printf("========================================\n\n"); + + // Parameters + size_t bitvector_size = 10000000; // 10 million bits + int iterations = 10; // 10 iterations for timing + + // Test different densities + double densities[] = {0.0, 0.0001, 0.001, 0.01, 0.1}; + int num_densities = sizeof(densities) / sizeof(densities[0]); + + printf("Bit vector size: %zu bits\n", bitvector_size); + printf("Iterations: %d\n\n", iterations); + + // Allocate result array + uint32_t* result_positions = (uint32_t*)malloc(bitvector_size * sizeof(uint32_t)); + + printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", + "Density", "Set Bits", "Scalar Gen (ms)", "Scalar Opt (ms)", "NEON (ms)", "SVE (ms)"); + printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", + "-------", "--------", "--------------", "--------------", "--------", "---------------"); + + for (int d = 0; d < num_densities; d++) { + double density = densities[d]; + + // Generate bit vector with specified density + bitvector_t* bv = generate_bitvector(bitvector_size, density); + + // Count actual set bits + size_t actual_set_bits = bitvector_count_scalar(bv); + + // Benchmark implementations + size_t found_scalar_gen, found_scalar, found_neon, found_sve2; + + double scalar_gen_time = benchmark_scan(scan_bitvector_scalar_generic, bv, result_positions, + iterations, &found_scalar_gen); + + double scalar_time = benchmark_scan(scan_bitvector_scalar, bv, result_positions, + iterations, &found_scalar); + + double neon_time = benchmark_scan(scan_bitvector_neon, bv, result_positions, + iterations, &found_neon); + + double sve2_time = benchmark_scan(scan_bitvector_sve2_pnext, bv, result_positions, + iterations, &found_sve2); + + // Print results + printf("%-10.4f %-15zu %-15.3f %-15.3f %-15.3f %-15.3f\n", + density, actual_set_bits, scalar_gen_time, scalar_time, 
neon_time, sve2_time); + + // Print speedups for this density + printf("Speedups at %.4f density:\n", density); + printf(" Scalar Opt vs Scalar Gen: %.2fx\n", scalar_gen_time / scalar_time); + printf(" NEON vs Scalar Gen: %.2fx\n", scalar_gen_time / neon_time); + printf(" SVE vs Scalar Gen: %.2fx\n", scalar_gen_time / sve2_time); + printf(" NEON vs Scalar Opt: %.2fx\n", scalar_time / neon_time); + printf(" SVE vs Scalar Opt: %.2fx\n", scalar_time / sve2_time); + printf(" SVE vs NEON: %.2fx\n\n", neon_time / sve2_time); + + // Verify results match + if (found_scalar_gen != found_scalar || found_scalar_gen != found_neon || found_scalar_gen != found_sve2) { + printf("WARNING: Result mismatch at %.4f density!\n", density); + printf(" Scalar Gen found %zu bits\n", found_scalar_gen); + printf(" Scalar Opt found %zu bits\n", found_scalar); + printf(" NEON found %zu bits\n", found_neon); + printf(" SVE found %zu bits\n\n", found_sve2); + } + + // Clean up + bitvector_free(bv); + } + + free(result_positions); + + return 0; +} +``` + +## Compiling and Running + +You are now ready to compile and run your bitmap scanning implementations. 
+ +To compile our bitmap scanning implementations with the appropriate flags, run: + +```bash +gcc -O3 -march=armv9-a+sve2 -o bitvector_scan_benchmark bitvector_scan_benchmark.c -lm +``` + +## Performance Results + +When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results should look similar to: + +### Execution Time (ms) + +| Density | Set Bits | Scalar Generic | Scalar Optimized | NEON | SVE | +|---------|----------|----------------|------------------|-------|------------| +| 0.0000 | 0 | 7.169 | 0.456 | 0.056 | 0.093 | +| 0.0001 | 1,000 | 7.176 | 0.477 | 0.090 | 0.109 | +| 0.0010 | 9,996 | 7.236 | 0.591 | 0.377 | 0.249 | +| 0.0100 | 99,511 | 7.821 | 1.570 | 2.252 | 1.353 | +| 0.1000 | 951,491 | 12.817 | 8.336 | 9.106 | 6.770 | + +### Speedup vs Generic Scalar + +| Density | Scalar Optimized | NEON | SVE | +|---------|------------------|---------|------------| +| 0.0000 | 15.72x | 127.41x | 77.70x | +| 0.0001 | 15.05x | 80.12x | 65.86x | +| 0.0010 | 12.26x | 19.35x | 29.07x | +| 0.0100 | 5.02x | 3.49x | 5.78x | +| 0.1000 | 1.54x | 1.40x | 1.90x | + +## Understanding the Performance Results + +### Generic Scalar vs Optimized Scalar + +The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: + +1. **Byte-level Skipping**: Avoiding processing empty bytes +2. **Reduced Function Calls**: Accessing bits directly rather than through function calls +3. **Better Cache Utilization**: More sequential memory access patterns + +### Optimized Scalar vs NEON + +The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: + +1. **Chunk-level Skipping**: Quickly skipping 16 empty bytes at once +2. **Vectorized Comparison**: Checking multiple bytes in parallel +3. **Early Termination**: Quickly determining if a chunk contains any set bits + +### NEON vs SVE + +The performance comparison between NEON and SVE depends on the bit density: + +1. 
**Very Sparse Bit Vectors (0% - 0.01% density)**: + - NEON performs better for empty and near-empty bitvectors due to lower overhead + - NEON achieves up to 127.41x speedup over generic scalar + +2. **Higher Density Bit Vectors (0.1% - 10% density)**: + - SVE consistently outperforms NEON + - SVE achieves up to 29.07x speedup over generic scalar at 0.1% density + - SVE achieves up to 1.66x speedup over NEON at 1% density + +## Key Optimizations in SVE Implementation + +The SVE implementation includes several key optimizations: + +1. **Efficient Non-Zero Byte Detection**: Using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. + +2. **Byte-Level Processing**: Using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. + +3. **Value Extraction**: Using `svlastb_u8` to extract both the index and value of non-zero bytes. + +4. **Hybrid Vector-Scalar Approach**: Combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. + +5. **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. + + + diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md index 304c2a73a3..4fdd71b2ae 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md @@ -1,10 +1,48 @@ --- # User change -title: "Introduction" +title: "Applications and Optimization Best Practices" weight: 7 layout: "learningpathall" +--- +## Application to Database Systems + +These bitmap scanning optimizations can be applied to various database operations: + +### 1.
Bitmap Index Scans + +Bitmap indexes are commonly used in analytical databases to accelerate queries with multiple filter conditions. The NEON and SVE implementations can significantly speed up the scanning of these bitmap indexes, especially for queries with low selectivity. + +### 2. Bloom Filter Checks + +Bloom filters are probabilistic data structures used to test set membership. They are often used in database systems to quickly filter out rows that don't match certain conditions. The NEON and SVE implementations can accelerate these bloom filter checks. + +### 3. Column Filtering + +In column-oriented databases, bitmap filters are often used to represent which rows match certain predicates. The NEON and SVE implementations can speed up the scanning of these bitmap filters, improving query performance. + +## Best Practices + +Based on our benchmark results, here are some best practices for optimizing bitmap scanning operations: + +1. **Choose the Right Implementation**: Select the appropriate implementation based on the expected bit density: + - For empty or near-empty bit vectors (up to 0.01% density): NEON is optimal + - For sparse bit vectors (around 0.1% density): SVE is optimal + - For higher densities (1% - 10% density): SVE still outperforms NEON + +2. **Implement Early Termination**: Always include a fast path for the no-hits case, as this can provide dramatic performance improvements. + +3. **Use Byte-level Skipping**: Even in scalar implementations, skipping empty bytes can provide significant performance improvements. + +4. **Consider Memory Access Patterns**: Optimize memory access patterns to improve cache utilization. + +5. **Leverage Vector Instructions**: Use NEON or SVE/SVE2 instructions to process multiple bytes in parallel. + +## Conclusion + +SVE instructions provide a powerful way to accelerate bitmap scanning operations in database systems. 
By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your database workloads. +The SVE implementation shows particularly impressive performance for sparse bitvectors (around 0.1% density), where it outperforms both scalar and NEON implementations. For higher densities, it continues to provide substantial speedups over traditional approaches. ---- \ No newline at end of file +These performance improvements can translate directly to faster query execution times, especially for analytical workloads that involve multiple bitmap operations. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md deleted file mode 100644 index 60053458d0..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/bitmap-scan-sve.md +++ /dev/null @@ -1,570 +0,0 @@ ---- -# User change -title: "Accelerate Bitmap Scanning on Arm Servers with NEON and SVE" - -weight: 0 - -layout: "learningpathall" - - ---- - -## Introduction - -Bitmap scanning is a fundamental operation in database systems, particularly for analytical workloads. It's used in bitmap indexes, bloom filters, and column filtering operations. The performance of bitmap scanning can significantly affect query execution times, especially for large datasets. - -In this Learning Path, you will explore how to use SVE instructions available on Arm Neoverse V2 based servers like AWS Graviton4 to optimize bitmap scanning operations. You will compare the performance of scalar, NEON, and SVE implementations to demonstrate the significant performance benefits of using specialized vector instructions. - -## What is Bitmap Scanning? - -Bitmap scanning involves searching through a bit vector to find positions where bits are set (1) or unset (0). 
In database systems, bitmaps are commonly used to represent: - -1. **Bitmap Indexes**: Each bit represents whether a row satisfies a particular condition -2. **Bloom Filters**: Probabilistic data structures used to test set membership -3. **Column Filters**: Bit vectors indicating which rows match certain predicates - -The operation of scanning a bitmap to find set bits is often in the critical path of query execution, making it a prime candidate for optimization. - -## The Evolution of Vector Processing for Bitmap Scanning - -Let's look at how vector processing has evolved for bitmap scanning: - -1. **Generic Scalar Processing**: Traditional bit-by-bit processing with conditional branches -2. **Optimized Scalar Processing**: Byte-level skipping to avoid processing empty bytes -3. **NEON**: Fixed-length 128-bit SIMD processing with vector operations -4. **SVE**: Scalable vector processing with predication and specialized instructions - -## Set up your environment - -To follow this learning path, you will need: - -1. An AWS Graviton4 instance running `Ubuntu 24.04`. -2. GCC compiler with SVE support - -Let's start by setting up our environment: - -```bash -sudo apt-get update -sudo apt-get install -y build-essential gcc g++ -``` -An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. This Learning path was tested with GCC 13 which is the default version on `Ubuntu 24.04` but you can run it with newer versions of GCC as well. - -Create a directory for your implementations: -```bash -mkdir -p bitmap_scan -cd bitmap_scan -``` - -## Bitmap Data Structure - -Now let's define a simple bitmap data structure that serves as the foundation for the different implementations. 
The bitmap implementation uses a simple structure with three key components: - - A byte array to store the actual bits - - Tracking of the physical size(bytes) - - Tracking of the logical size(bits) - -For testing the different implementations in this Learning Path, you also need functions to generate and analyze the bitmaps. - -Use a file editor of your choice and the copy the code below into `bitvector_scan_benchmark.c`: - -```c -// Define a simple bit vector structure -typedef struct { - uint8_t* data; - size_t size_bytes; - size_t size_bits; -} bitvector_t; - -// Create a new bit vector -bitvector_t* bitvector_create(size_t size_bits) { - bitvector_t* bv = (bitvector_t*)malloc(sizeof(bitvector_t)); - bv->size_bits = size_bits; - bv->size_bytes = (size_bits + 7) / 8; - bv->data = (uint8_t*)calloc(bv->size_bytes, 1); - return bv; -} - -// Free bit vector resources -void bitvector_free(bitvector_t* bv) { - free(bv->data); - free(bv); -} - -// Set a bit in the bit vector -void bitvector_set_bit(bitvector_t* bv, size_t pos) { - if (pos < bv->size_bits) { - bv->data[pos / 8] |= (1 << (pos % 8)); - } -} - -// Get a bit from the bit vector -bool bitvector_get_bit(bitvector_t* bv, size_t pos) { - if (pos < bv->size_bits) { - return (bv->data[pos / 8] & (1 << (pos % 8))) != 0; - } - return false; -} - -// Generate a bit vector with specified density -bitvector_t* generate_bitvector(size_t size_bits, double density) { - bitvector_t* bv = bitvector_create(size_bits); - - // Set bits according to density - size_t num_bits_to_set = (size_t)(size_bits * density); - - for (size_t i = 0; i < num_bits_to_set; i++) { - size_t pos = rand() % size_bits; - bitvector_set_bit(bv, pos); - } - - return bv; -} - -// Count set bits in the bit vector -size_t bitvector_count_scalar(bitvector_t* bv) { - size_t count = 0; - for (size_t i = 0; i < bv->size_bits; i++) { - if (bitvector_get_bit(bv, i)) { - count++; - } - } - return count; -} -``` - -## Bitmap scanning implementations - -Now, 
let's implement four versions of a bitmap scanning operation that finds all positions where a bit is set: - -### 1. Generic Scalar Implementation - -This is the most straightforward implementation, checking each bit individually. It serves as our baseline for comparison against the other implementations to follow. Copy the code below into the same file: - -```c -// Generic scalar implementation of bit vector scanning (bit-by-bit) -size_t scan_bitvector_scalar_generic(bitvector_t* bv, uint32_t* result_positions) { - size_t result_count = 0; - - for (size_t i = 0; i < bv->size_bits; i++) { - if (bitvector_get_bit(bv, i)) { - result_positions[result_count++] = i; - } - } - - return result_count; -} -``` - -You will notice this generic C implementation processes every bit, even when most bits are not set. It has high function call overhead and does not advantage of vector instructions. - -In the following implementations, you will address these inefficiencies with more optimized techniques. - -### 2. Optimized Scalar Implementation - -This implementation adds byte-level skipping to avoid processing empty bytes. Copy this optimized C scalar implementation code into the same file: - -```c -// Optimized scalar implementation of bit vector scanning (byte-level) -size_t scan_bitvector_scalar(bitvector_t* bv, uint32_t* result_positions) { -size_t result_count = 0; - - for (size_t byte_idx = 0; byte_idx < bv->size_bytes; byte_idx++) { - uint8_t byte = bv->data[byte_idx]; - - // Skip empty bytes - if (byte == 0) { - continue; - } - - // Process each bit in the byte - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte & (1 << bit_pos)) { - size_t global_pos = byte_idx * 8 + bit_pos; - if (global_pos < bv->size_bits) { - result_positions[result_count++] = global_pos; - } - } - } - } - - return result_count; -} -``` -Instead of iterating through each bit, this implementation processes one byte(8 bits) at a time. 
The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely, For sparse bitmaps, this can dramatically reduce the number of bit checks. - -### 3. NEON Implementation - -This implementation uses NEON SIMD (Single Instruction, Multiple Data) instructions to process 16 bytes (128 bits) at a time, significantly accelerating the scanning process. Copy the NEON implementation shown below into the same file: -```c -// NEON implementation of bit vector scanning -size_t scan_bitvector_neon(bitvector_t* bv, uint32_t* result_positions) { - size_t result_count = 0; - - // Process 16 bytes at a time using NEON - size_t i = 0; - for (; i + 16 <= bv->size_bytes; i += 16) { - uint8x16_t data = vld1q_u8(&bv->data[i]); - - // Quick check if all bytes are zero - uint8x16_t zero = vdupq_n_u8(0); - uint8x16_t cmp = vceqq_u8(data, zero); - uint64x2_t cmp64 = vreinterpretq_u64_u8(cmp); - - // If all bytes are zero (all comparisons are true/0xFF), skip this chunk - if (vgetq_lane_u64(cmp64, 0) == UINT64_MAX && - vgetq_lane_u64(cmp64, 1) == UINT64_MAX) { - continue; - } - - // Process each byte - uint8_t bytes[16]; - vst1q_u8(bytes, data); - - for (int j = 0; j < 16; j++) { - uint8_t byte = bytes[j]; - - // Skip empty bytes - if (byte == 0) { - continue; - } - - // Process each bit in the byte - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte & (1 << bit_pos)) { - size_t global_pos = (i + j) * 8 + bit_pos; - if (global_pos < bv->size_bits) { - result_positions[result_count++] = global_pos; - } - } - } - } - } - - // Handle remaining bytes with scalar code - for (; i < bv->size_bytes; i++) { - uint8_t byte = bv->data[i]; - - // Skip empty bytes - if (byte == 0) { - continue; - } - - // Process each bit in the byte - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte & (1 << bit_pos)) { - size_t global_pos = i * 8 + bit_pos; - if (global_pos < bv->size_bits) { - result_positions[result_count++] = global_pos; 
- } - } - } - } - - return result_count; -} -``` -This NEON implementation processes 16 bytes at a time with vector instructions. For sparse bitmaps, entire 16-byte chunks can be skipped at once, providing a significant speedup over byte-level skipping. After vector processing, it falls back to scalar code for any remaining bytes that don't fill a complete 16-byte chunk. - -### 4. SVE Implementation - -This implementation uses SVE instructions which are available in the Arm Neoverse V2 based AWS Graviton 4 processor. Copy this SVE implementation into the same file: - -```c -// SVE implementation using svcmp_u8, PNEXT, and LASTB -size_t scan_bitvector_sve2_pnext(bitvector_t* bv, uint32_t* result_positions) { - size_t result_count = 0; - size_t sve_len = svcntb(); - svuint8_t zero = svdup_n_u8(0); - - // Process the bitvector to find all set bits - for (size_t offset = 0; offset < bv->size_bytes; offset += sve_len) { - svbool_t pg = svwhilelt_b8((uint64_t)offset, (uint64_t)bv->size_bytes); - svuint8_t data = svld1_u8(pg, bv->data + offset); - - // Prefetch next chunk - if (offset + sve_len < bv->size_bytes) { - __builtin_prefetch(bv->data + offset + sve_len, 0, 0); - } - - // Find non-zero bytes - svbool_t non_zero = svcmpne_u8(pg, data, zero); - - // Skip if all bytes are zero - if (!svptest_any(pg, non_zero)) { - continue; - } - - // Create an index vector for byte positions - svuint8_t indexes = svindex_u8(0, 1); // 0, 1, 2, 3, ... 
- - // Initialize next with false predicate - svbool_t next = svpfalse_b(); - - // Find the first non-zero byte - next = svpnext_b8(non_zero, next); - - // Process each non-zero byte using PNEXT - while (svptest_any(pg, next)) { - // Get the index of this byte - uint8_t byte_idx = svlastb_u8(next, indexes); - - // Get the actual byte value - uint8_t byte_value = svlastb_u8(next, data); - - // Calculate the global byte position - size_t global_byte_pos = offset + byte_idx; - - // Process each bit in the byte using scalar code - for (int bit_pos = 0; bit_pos < 8; bit_pos++) { - if (byte_value & (1 << bit_pos)) { - size_t global_bit_pos = global_byte_pos * 8 + bit_pos; - if (global_bit_pos < bv->size_bits) { - result_positions[result_count++] = global_bit_pos; - } - } - } - - // Find the next non-zero byte - next = svpnext_b8(non_zero, next); - } - } - - return result_count; -} -``` -The SVE implementation efficiently scans bitmaps by using `svcmpne_u8` to identify non-zero bytes and `svpnext_b8` to iterate through them sequentially. It extracts byte indices and values with `svlastb_u8`, then processes individual bits using scalar code. This hybrid vector-scalar approach maintains great performance across various bitmap densities. On Graviton4, SVE vectors are 128 bits (16 bytes), allowing processing of 16 bytes at once. - -## Benchmarking Code - -Now, that you have created four different implementations of a bitmap scanning algorithm, let's create a benchmarking framework to compare the performance of our implementations. 
Copy the code shown below into `bitvector_scan_benchmark.c` : - -```c -// Timing function for bit vector scanning -double benchmark_scan(size_t (*scan_func)(bitvector_t*, uint32_t*), - bitvector_t* bv, uint32_t* result_positions, - int iterations, size_t* found_count) { - struct timespec start, end; - *found_count = 0; - - clock_gettime(CLOCK_MONOTONIC, &start); - - for (int iter = 0; iter < iterations; iter++) { - size_t count = scan_func(bv, result_positions); - if (iter == 0) { - *found_count = count; - } - } - - clock_gettime(CLOCK_MONOTONIC, &end); - - double elapsed = (end.tv_sec - start.tv_sec) * 1000.0 + - (end.tv_nsec - start.tv_nsec) / 1000000.0; - return elapsed / iterations; -} -``` - -## Main Function -The main function of your program is responsible for setting up the test environment, running the benchmarking code for the four different implementations across various bit densities, and reporting the results. In the context of bitmap scanning, bit density refers to the percentage or proportion of bits that are set (have a value of 1) in the bitmap. 
Copy the main function code below into `bitvector_scan_benchmark.c`: - -```C -int main() { - srand(time(NULL)); - - printf("Bit Vector Scanning Performance Benchmark\n"); - printf("========================================\n\n"); - - // Parameters - size_t bitvector_size = 10000000; // 10 million bits - int iterations = 10; // 10 iterations for timing - - // Test different densities - double densities[] = {0.0, 0.0001, 0.001, 0.01, 0.1}; - int num_densities = sizeof(densities) / sizeof(densities[0]); - - printf("Bit vector size: %zu bits\n", bitvector_size); - printf("Iterations: %d\n\n", iterations); - - // Allocate result array - uint32_t* result_positions = (uint32_t*)malloc(bitvector_size * sizeof(uint32_t)); - - printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", - "Density", "Set Bits", "Scalar Gen (ms)", "Scalar Opt (ms)", "NEON (ms)", "SVE (ms)"); - printf("%-10s %-15s %-15s %-15s %-15s %-15s\n", - "-------", "--------", "--------------", "--------------", "--------", "---------------"); - - for (int d = 0; d < num_densities; d++) { - double density = densities[d]; - - // Generate bit vector with specified density - bitvector_t* bv = generate_bitvector(bitvector_size, density); - - // Count actual set bits - size_t actual_set_bits = bitvector_count_scalar(bv); - - // Benchmark implementations - size_t found_scalar_gen, found_scalar, found_neon, found_sve2; - - double scalar_gen_time = benchmark_scan(scan_bitvector_scalar_generic, bv, result_positions, - iterations, &found_scalar_gen); - - double scalar_time = benchmark_scan(scan_bitvector_scalar, bv, result_positions, - iterations, &found_scalar); - - double neon_time = benchmark_scan(scan_bitvector_neon, bv, result_positions, - iterations, &found_neon); - - double sve2_time = benchmark_scan(scan_bitvector_sve2_pnext, bv, result_positions, - iterations, &found_sve2); - - // Print results - printf("%-10.4f %-15zu %-15.3f %-15.3f %-15.3f %-15.3f\n", - density, actual_set_bits, scalar_gen_time, scalar_time, 
neon_time, sve2_time); - - // Print speedups for this density - printf("Speedups at %.4f density:\n", density); - printf(" Scalar Opt vs Scalar Gen: %.2fx\n", scalar_gen_time / scalar_time); - printf(" NEON vs Scalar Gen: %.2fx\n", scalar_gen_time / neon_time); - printf(" SVE vs Scalar Gen: %.2fx\n", scalar_gen_time / sve2_time); - printf(" NEON vs Scalar Opt: %.2fx\n", scalar_time / neon_time); - printf(" SVE vs Scalar Opt: %.2fx\n", scalar_time / sve2_time); - printf(" SVE vs NEON: %.2fx\n\n", neon_time / sve2_time); - - // Verify results match - if (found_scalar_gen != found_scalar || found_scalar_gen != found_neon || found_scalar_gen != found_sve2) { - printf("WARNING: Result mismatch at %.4f density!\n", density); - printf(" Scalar Gen found %zu bits\n", found_scalar_gen); - printf(" Scalar Opt found %zu bits\n", found_scalar); - printf(" NEON found %zu bits\n", found_neon); - printf(" SVE found %zu bits\n\n", found_sve2); - } - - // Clean up - bitvector_free(bv); - } - - free(result_positions); - - return 0; -} -``` - -## Compiling and Running - -You are now ready to compile and run your bitmap scanning implementations. 
- -To compile our bitmap scanning implementations with the appropriate flags, run: - -```bash -gcc -O3 -march=armv9-a+sve2 -o bitvector_scan_benchmark bitvector_scan_benchmark.c -lm -``` - -## Performance Results - -When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results should look similar to: - -### Execution Time (ms) - -| Density | Set Bits | Scalar Generic | Scalar Optimized | NEON | SVE | -|---------|----------|----------------|------------------|-------|------------| -| 0.0000 | 0 | 7.169 | 0.456 | 0.056 | 0.093 | -| 0.0001 | 1,000 | 7.176 | 0.477 | 0.090 | 0.109 | -| 0.0010 | 9,996 | 7.236 | 0.591 | 0.377 | 0.249 | -| 0.0100 | 99,511 | 7.821 | 1.570 | 2.252 | 1.353 | -| 0.1000 | 951,491 | 12.817 | 8.336 | 9.106 | 6.770 | - -### Speedup vs Generic Scalar - -| Density | Scalar Optimized | NEON | SVE | -|---------|------------------|---------|------------| -| 0.0000 | 15.72x | 127.41x | 77.70x | -| 0.0001 | 15.05x | 80.12x | 65.86x | -| 0.0010 | 12.26x | 19.35x | 29.07x | -| 0.0100 | 5.02x | 3.49x | 5.78x | -| 0.1000 | 1.54x | 1.40x | 1.90x | - -## Understanding the Performance Results - -### Generic Scalar vs Optimized Scalar - -The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: - -1. **Byte-level Skipping**: Avoiding processing empty bytes -2. **Reduced Function Calls**: Accessing bits directly rather than through function calls -3. **Better Cache Utilization**: More sequential memory access patterns - -### Optimized Scalar vs NEON - -The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: - -1. **Chunk-level Skipping**: Quickly skipping 16 empty bytes at once -2. **Vectorized Comparison**: Checking multiple bytes in parallel -3. **Early Termination**: Quickly determining if a chunk contains any set bits - -### NEON vs SVE - -The performance comparison between NEON and SVE depends on the bit density: - -1. 
**Very Sparse Bit Vectors (0% - 0.01% density)**: - - NEON performs better for empty bitvectors due to lower overhead - - NEON achieves up to 127.41x speedup over generic scalar - - SVE performs better for very sparse bitvectors (0.001% density) - - SVE achieves up to 29.07x speedup over generic scalar at 0.001% density - -2. **Higher Density Bit Vectors (0.1% - 10% density)**: - - SVE consistently outperforms NEON - - SVE achieves up to 1.66x speedup over NEON at 0.01% density - -# Key Optimizations in SVE Implementation - -The SVE implementation includes several key optimizations: - -1. **Efficient Non-Zero Byte Detection**: Using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. - -2. **Byte-Level Processing**: Using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. - -3. **Value Extraction**: Using `svlastb_u8` to extract both the index and value of non-zero bytes. - -4. **Hybrid Vector-Scalar Approach**: Combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. - -5. **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. - - -## Application to Database Systems - -These bitmap scanning optimizations can be applied to various database operations: - -### 1. Bitmap Index Scans - -Bitmap indexes are commonly used in analytical databases to accelerate queries with multiple filter conditions. The NEON and SVE implementations can significantly speed up the scanning of these bitmap indexes, especially for queries with low selectivity. - -### 2. Bloom Filter Checks - -Bloom filters are probabilistic data structures used to test set membership. They are often used in database systems to quickly filter out rows that don't match certain conditions. The NEON and SVE implementations can accelerate these bloom filter checks. - -### 3. 
Column Filtering - -In column-oriented databases, bitmap filters are often used to represent which rows match certain predicates. The NEON and SVE implementation can speed up the scanning of these bitmap filters, improving query performance. - -## Best Practices - -Based on our benchmark results, here are some best practices for optimizing bitmap scanning operations: - -1. **Choose the Right Implementation**: Select the appropriate implementation based on the expected bit density: - - For empty bit vectors: NEON is optimal - - For very sparse bit vectors (0.001% - 0.1% density): SVE is optimal - - For higher densities (> 0.1% density): SVE still outperforms NEON - -2. **Implement Early Termination**: Always include a fast path for the no-hits case, as this can provide dramatic performance improvements. - -3. **Use Byte-level Skipping**: Even in scalar implementations, skipping empty bytes can provide significant performance improvements. - -4. **Consider Memory Access Patterns**: Optimize memory access patterns to improve cache utilization. - -5. **Leverage Vector Instructions**: Use NEON or SVE/SVE2 instructions to process multiple bytes in parallel. - -## Conclusion - -The SVE instructions provides a powerful way to accelerate bitmap scanning operations in database systems. By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your database workloads. - -The SVE implementation shows particularly impressive performance for sparse bitvectors (0.001% - 0.1% density), where it outperforms both scalar and NEON implementations. For higher densities, it continues to provide substantial speedups over traditional approaches. - -These performance improvements can translate directly to faster query execution times, especially for analytical workloads that involve multiple bitmap operations. 
From 0abbbb618bd9df0f4c0c88cb4254c44f6fda378f Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 12 Jun 2025 12:26:54 +0000 Subject: [PATCH 03/11] Updates --- .../bitmap_scan_sve2/01-introduction.md | 41 +++++++++++-------- 1 file changed, 24 insertions(+), 17 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md index b5c5cae5e9..87127e4a65 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -1,6 +1,6 @@ --- # User change -title: "Bitmap Scanning and Vectorization on Arm" +title: "Optimizing Bitmap Scanning with SVE and NEON on Arm Servers" weight: 2 @@ -10,41 +10,48 @@ layout: "learningpathall" Bitmap scanning is a fundamental operation in database systems, particularly for analytical workloads. It's used in bitmap indexes, bloom filters, and column filtering operations. The performance of bitmap scanning can significantly affect query execution times, especially for large datasets. -In this Learning Path, you will explore how to use SVE instructions available on Arm Neoverse V2 based servers like AWS Graviton4 to optimize bitmap scanning operations. You will compare the performance of scalar, NEON, and SVE implementations to demonstrate the significant performance benefits of using specialized vector instructions. +In this Learning Path, you will: -## What is Bitmap Scanning? +* Explore how to use SVE instructions on Arm Neoverse V2–based servers like AWS Graviton4 to optimize bitmap scanning +* Compare scalar, NEON, and SVE implementations to demonstrate the performance benefits of specialized vector instructions + +## What is bitmap scanning? Bitmap scanning involves searching through a bit vector to find positions where bits are set (1) or unset (0). 
In database systems, bitmaps are commonly used to represent: -1. **Bitmap Indexes**: Each bit represents whether a row satisfies a particular condition -2. **Bloom Filters**: Probabilistic data structures used to test set membership -3. **Column Filters**: Bit vectors indicating which rows match certain predicates +* **Bitmap Indexes**: each bit represents whether a row satisfies a particular condition +* **Bloom Filters**: probabilistic data structures used to test set membership +* **Column Filters**: bit vectors indicating which rows match certain predicates The operation of scanning a bitmap to find set bits is often in the critical path of query execution, making it a prime candidate for optimization. -## The Evolution of Vector Processing for Bitmap Scanning +## The evolution of vector processing for bitmap scanning -Let's look at how vector processing has evolved for bitmap scanning: +Here's how vector processing has evolved to improve bitmap scanning performance: -1. **Generic Scalar Processing**: Traditional bit-by-bit processing with conditional branches -2. **Optimized Scalar Processing**: Byte-level skipping to avoid processing empty bytes -3. **NEON**: Fixed-length 128-bit SIMD processing with vector operations -4. **SVE**: Scalable vector processing with predication and specialized instructions +* **Generic Scalar Processing**: traditional bit-by-bit processing with conditional branches +* **Optimized Scalar Processing**: byte-level skipping to avoid processing empty bytes +* **NEON**: fixed-length 128-bit SIMD processing with vector operations +* **SVE**: scalable vector processing with predication and specialized instructions like MATCH ## Set up your environment -To follow this learning path, you will need: +To follow this Learning Path, you will need: -1. An AWS Graviton4 instance running `Ubuntu 24.04`. -2. GCC compiler with SVE support +* An AWS Graviton4 instance running `Ubuntu 24.04`. 
+* A GCC compiler with SVE support -Let's start by setting up our environment: +First, install the required development tools: ```bash sudo apt-get update sudo apt-get install -y build-essential gcc g++ ``` -An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. This Learning path was tested with GCC 13 which is the default version on `Ubuntu 24.04` but you can run it with newer versions of GCC as well. +{{% notice Tip %}} +An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. For best performance, use the latest available GCC version with SVE support. This Learning Path was tested with GCC 13, the default on Ubuntu 24.04. Newer versions should also work. +{{% /notice %}} + + Create a directory for your implementations: ```bash From 5282f698abcb5e8bc8acc4b9b4f760174ada96a3 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 12 Jun 2025 16:21:50 +0000 Subject: [PATCH 04/11] updates --- .../bitmap_scan_sve2/01-introduction.md | 8 ++++---- .../bitmap_scan_sve2/02-bitmap-data-structure.md | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md index 87127e4a65..2e6666b88b 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -1,6 +1,6 @@ --- # User change -title: "Optimizing Bitmap Scanning with SVE and NEON on Arm Servers" +title: "Optimize Bitmap Scanning with SVE and NEON on Arm Servers" weight: 2 @@ -8,14 +8,14 @@ layout: "learningpathall" --- ## Introduction -Bitmap scanning is a fundamental operation in database systems, particularly for analytical 
workloads. It's used in bitmap indexes, bloom filters, and column filtering operations. The performance of bitmap scanning can significantly affect query execution times, especially for large datasets. +Bitmap scanning is a fundamental operation in database systems — used in bitmap indexes, bloom filters, and column filters — but it can bottleneck complex analytical queries. In this Learning Path, you'll learn how to speed up these operations using Arm's SVE and NEON vector instructions, especially on Neoverse V2–based servers like AWS Graviton4. In this Learning Path, you will: * Explore how to use SVE instructions on Arm Neoverse V2–based servers like AWS Graviton4 to optimize bitmap scanning * Compare scalar, NEON, and SVE implementations to demonstrate the performance benefits of specialized vector instructions -## What is bitmap scanning? +## What is bitmap scanning in databases? Bitmap scanning involves searching through a bit vector to find positions where bits are set (1) or unset (0). 
In database systems, bitmaps are commonly used to represent: @@ -34,7 +34,7 @@ Here's how vector processing has evolved to improve bitmap scanning performance: * **NEON**: fixed-length 128-bit SIMD processing with vector operations * **SVE**: scalable vector processing with predication and specialized instructions like MATCH -## Set up your environment +## Set up your Arm development environment To follow this Learning Path, you will need: diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md index 75d6f7c029..3d10c7fb07 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md @@ -1,6 +1,6 @@ --- # User change -title: "Building and Managing the Bit Vector Structure" +title: "Build and manage the bit vector Structure" weight: 3 From 77b1fbfecddf2ac23bd190258253a79d841c702f Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Thu, 12 Jun 2025 21:57:18 +0000 Subject: [PATCH 05/11] Further updates --- .../bitmap_scan_sve2/01-introduction.md | 5 ++-- .../02-bitmap-data-structure.md | 14 +++++---- .../03-scalar-implementations.md | 30 +++++++++++++------ .../04-vector-implementations.md | 6 ++-- .../05-benchmarking-and-results.md | 28 +++++++++-------- .../06-application-and-best-practices.md | 11 ++++--- 6 files changed, 56 insertions(+), 38 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md index 2e6666b88b..627e3c1715 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -1,6 
+1,6 @@ --- # User change -title: "Optimize Bitmap Scanning with SVE and NEON on Arm Servers" +title: "Optimize bitmap scanning with SVE and NEON on Arm servers" weight: 2 @@ -58,5 +58,6 @@ Create a directory for your implementations: mkdir -p bitmap_scan cd bitmap_scan ``` -## Next Steps +## Next steps + In the next section, you’ll define the core bitmap data structure and utility functions for setting, clearing, and inspecting bits. diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md index 3d10c7fb07..d847e62027 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md @@ -1,22 +1,22 @@ --- # User change -title: "Build and manage the bit vector Structure" +title: "Build and manage a bit vector in C" weight: 3 layout: "learningpathall" --- -## Bitmap Data Structure +## Bitmap data structure Now let's define a simple bitmap data structure that serves as the foundation for the different implementations. The bitmap implementation uses a simple structure with three key components: - A byte array to store the actual bits - - Tracking of the physical size(bytes) - - Tracking of the logical size(bits) + - Tracking of the physical size (bytes) + - Tracking of the logical size (bits) For testing the different implementations in this Learning Path, you also need functions to generate and analyze the bitmaps. 
-Use a file editor of your choice and the copy the code below into `bitvector_scan_benchmark.c`: +Use a file editor of your choice and then copy the code below into `bitvector_scan_benchmark.c`: ```c // Define a simple bit vector structure @@ -81,4 +81,6 @@ size_t bitvector_count_scalar(bitvector_t* bv) { } return count; } -``` \ No newline at end of file +``` + +You now have a functional, compact bit vector in C for testing bitmap scanning performance. Next, you'll implement scalar, NEON, and SVE-based scanning routines that operate on this structure. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md index 653dc8112e..638b111f0c 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md @@ -1,6 +1,6 @@ --- # User change -title: "Scalar Implementations of Bitmap Scanning" +title: "Implement Scalar Bitmap Scanning in C" weight: 4 @@ -10,11 +10,17 @@ layout: "learningpathall" --- ## Bitmap scanning implementations -Now, let's implement four versions of a bitmap scanning operation that finds all positions where a bit is set: +Bitmap scanning is a fundamental operation in performance-critical systems such as databases, search engines, and filtering pipelines. It involves identifying the positions of set bits (`1`s) in a bit vector, which is often used to represent filtered rows, bitmap indexes, or membership flags. -### 1. Generic Scalar Implementation +In this section, you'll implement multiple scalar approaches to bitmap scanning in C, starting with a simple per-bit baseline, followed by an optimized version that reduces overhead for sparse data. -This is the most straightforward implementation, checking each bit individually. 
It serves as our baseline for comparison against the other implementations to follow. Copy the code below into the same file: +Now, let’s walk through the scalar versions of this operation that locate all set bit positions. + +### Generic scalar implementation + +This is the most straightforward implementation, checking each bit individually. It serves as the baseline for comparison against the other implementations to follow. + +Copy the code below into the same file: ```c // Generic scalar implementation of bit vector scanning (bit-by-bit) @@ -31,13 +37,15 @@ size_t scan_bitvector_scalar_generic(bitvector_t* bv, uint32_t* result_positions } ``` -You will notice this generic C implementation processes every bit, even when most bits are not set. It has high function call overhead and does not advantage of vector instructions. +You might notice that this generic C implementation processes every bit, even when most bits are not set. It has high per-bit function call overhead and does not take advantage of any vector instructions. -In the following implementations, you will address these inefficiencies with more optimized techniques. +In the following implementations, you can address these inefficiencies with more optimized techniques. -### 2. Optimized Scalar Implementation +### Optimized scalar implementation -This implementation adds byte-level skipping to avoid processing empty bytes. Copy this optimized C scalar implementation code into the same file: +This implementation adds byte-level skipping to avoid processing empty bytes. + +Copy this optimized C scalar implementation code into the same file: ```c // Optimized scalar implementation of bit vector scanning (byte-level) @@ -66,5 +74,9 @@ size_t result_count = 0; return result_count; } ``` -Instead of iterating through each bit, this implementation processes one byte(8 bits) at a time. 
The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely, For sparse bitmaps, this can dramatically reduce the number of bit checks. +Instead of iterating through each bit individually, this implementation processes one byte (8 bits) at a time. The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely. For sparse bitmaps, this can dramatically reduce the number of bit checks. + +## Next steps + +Next, you’ll explore how to accelerate bitmap scanning using NEON and SVE SIMD instructions on Arm Neoverse platforms like Graviton4, comparing them directly to these scalar baselines. diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md index e1279aa0a8..f4825326e9 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md @@ -8,7 +8,9 @@ layout: "learningpathall" --- -### 3. NEON Implementation +Modern Arm CPUs like Neoverse V2 support SIMD (Single Instruction, Multiple Data) extensions that allow processing multiple bytes in parallel. In this section, you'll explore how NEON and SVE vector instructions can dramatically accelerate bitmap scanning by skipping over large regions of unset data and reducing per-bit processing overhead. + +## NEON Implementation This implementation uses NEON SIMD (Single Instruction, Multiple Data) instructions to process 16 bytes (128 bits) at a time, significantly accelerating the scanning process. 
Copy the NEON implementation shown below into the same file: ```c @@ -81,7 +83,7 @@ size_t scan_bitvector_neon(bitvector_t* bv, uint32_t* result_positions) { ``` This NEON implementation processes 16 bytes at a time with vector instructions. For sparse bitmaps, entire 16-byte chunks can be skipped at once, providing a significant speedup over byte-level skipping. After vector processing, it falls back to scalar code for any remaining bytes that don't fill a complete 16-byte chunk. -### 4. SVE Implementation +## SVE Implementation This implementation uses SVE instructions which are available in the Arm Neoverse V2 based AWS Graviton 4 processor. Copy this SVE implementation into the same file: diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md index 119d2d941c..3e5c389df1 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md @@ -1,6 +1,6 @@ --- # User change -title: "Benchmarking Bitmap Scanning Across Implementations" +title: "Benchmarking bitmap scanning across implementations" weight: 6 @@ -8,9 +8,11 @@ layout: "learningpathall" --- -## Benchmarking Code +## Benchmarking code -Now, that you have created four different implementations of a bitmap scanning algorithm, let's create a benchmarking framework to compare the performance of our implementations. Copy the code shown below into `bitvector_scan_benchmark.c` : +Now that you've created four different bitmap scanning implementations, let’s build a benchmarking framework to compare their performance. 
+ +Copy the code shown below into `bitvector_scan_benchmark.c` : ```c // Timing function for bit vector scanning @@ -37,7 +39,7 @@ double benchmark_scan(size_t (*scan_func)(bitvector_t*, uint32_t*), } ``` -## Main Function +## Main function The main function of your program is responsible for setting up the test environment, running the benchmarking code for the four different implementations across various bit densities, and reporting the results. In the context of bitmap scanning, bit density refers to the percentage or proportion of bits that are set (have a value of 1) in the bitmap. Copy the main function code below into `bitvector_scan_benchmark.c`: ```C @@ -122,7 +124,7 @@ int main() { } ``` -## Compiling and Running +## Compiling and running You are now ready to compile and run your bitmap scanning implementations. @@ -132,11 +134,11 @@ To compile our bitmap scanning implementations with the appropriate flags, run: gcc -O3 -march=armv9-a+sve2 -o bitvector_scan_benchmark bitvector_scan_benchmark.c -lm ``` -## Performance Results +## Performance results When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results should look similar to: -### Execution Time (ms) +### Execution time (ms) | Density | Set Bits | Scalar Generic | Scalar Optimized | NEON | SVE | |---------|----------|----------------|------------------|-------|------------| @@ -146,7 +148,7 @@ When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results sh | 0.0100 | 99,511 | 7.821 | 1.570 | 2.252 | 1.353 | | 0.1000 | 951,491 | 12.817 | 8.336 | 9.106 | 6.770 | -### Speedup vs Generic Scalar +### Speed-up vs generic Scalar | Density | Scalar Optimized | NEON | SVE | |---------|------------------|---------|------------| @@ -156,9 +158,9 @@ When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results sh | 0.0100 | 5.02x | 3.49x | 5.78x | | 0.1000 | 1.54x | 1.40x | 1.90x | -## Understanding the Performance Results +## Understanding the results -### 
Generic Scalar vs Optimized Scalar +## Generic Scalar vs Optimized Scalar The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: @@ -166,7 +168,7 @@ The optimized scalar implementation shows significant improvements over the gene 2. **Reduced Function Calls**: Accessing bits directly rather than through function calls 3. **Better Cache Utilization**: More sequential memory access patterns -### Optimized Scalar vs NEON +## Optimized Scalar vs NEON The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: @@ -174,7 +176,7 @@ The NEON implementation shows further improvements over the optimized scalar imp 2. **Vectorized Comparison**: Checking multiple bytes in parallel 3. **Early Termination**: Quickly determining if a chunk contains any set bits -### NEON vs SVE +## NEON vs SVE The performance comparison between NEON and SVE depends on the bit density: @@ -188,7 +190,7 @@ The performance comparison between NEON and SVE depends on the bit density: - SVE consistently outperforms NEON - SVE achieves up to 1.66x speedup over NEON at 0.01% density -# Key Optimizations in SVE Implementation +## Key Optimizations in SVE Implementation The SVE implementation includes several key optimizations: diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md index 4fdd71b2ae..34105b1da3 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md @@ -1,6 +1,6 @@ --- # User change -title: "Applications and Optimization Best Practices" +title: "Applications and optimization best practices" weight: 7 @@ -10,21 +10,20 @@ layout: 
"learningpathall" These bitmap scanning optimizations can be applied to various database operations: -### 1. Bitmap Index Scans - +### Bitmap Index Scans Bitmap indexes are commonly used in analytical databases to accelerate queries with multiple filter conditions. The NEON and SVE implementations can significantly speed up the scanning of these bitmap indexes, especially for queries with low selectivity. -### 2. Bloom Filter Checks +### Bloom Filter Checks Bloom filters are probabilistic data structures used to test set membership. They are often used in database systems to quickly filter out rows that don't match certain conditions. The NEON and SVE implementations can accelerate these bloom filter checks. -### 3. Column Filtering +### Column Filtering In column-oriented databases, bitmap filters are often used to represent which rows match certain predicates. The NEON and SVE implementation can speed up the scanning of these bitmap filters, improving query performance. ## Best Practices -Based on our benchmark results, here are some best practices for optimizing bitmap scanning operations: +Based on the benchmark results, here are some best practices for optimizing bitmap scanning operations: 1. 
**Choose the Right Implementation**: Select the appropriate implementation based on the expected bit density: - For empty bit vectors: NEON is optimal From be737ddcd3ffe9bb4b698dcb9c13b9209faff3a1 Mon Sep 17 00:00:00 2001 From: Maddy Underwood Date: Fri, 13 Jun 2025 09:14:44 +0000 Subject: [PATCH 06/11] Tweaks --- .../bitmap_scan_sve2/01-introduction.md | 17 ++++++++++------- .../bitmap_scan_sve2/_index.md | 2 +- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md index 627e3c1715..8aff4d1487 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -1,23 +1,27 @@ --- # User change -title: "Optimize bitmap scanning with SVE and NEON on Arm servers" +title: "Optimize bitmap scanning in databases with SVE and NEON on Arm servers" weight: 2 layout: "learningpathall" --- -## Introduction +## Overview -Bitmap scanning is a fundamental operation in database systems — used in bitmap indexes, bloom filters, and column filters — but it can bottleneck complex analytical queries. In this Learning Path, you'll learn how to speed up these operations using Arm's SVE and NEON vector instructions, especially on Neoverse V2–based servers like AWS Graviton4. +Bitmap scanning is a core operation in many database systems. It's essential for powering fast filtering in bitmap indexes, Bloom filters, and column filters. However, these scans can become performance bottlenecks in complex analytical queries. -In this Learning Path, you will: +In this Learning Path, you’ll learn how to accelerate bitmap scanning using Arm’s vector processing technologies - NEON and SVE - on Neoverse V2–based servers like AWS Graviton4. 
+ +Specifically, you will: * Explore how to use SVE instructions on Arm Neoverse V2–based servers like AWS Graviton4 to optimize bitmap scanning * Compare scalar, NEON, and SVE implementations to demonstrate the performance benefits of specialized vector instructions ## What is bitmap scanning in databases? -Bitmap scanning involves searching through a bit vector to find positions where bits are set (1) or unset (0). In database systems, bitmaps are commonly used to represent: +Bitmap scanning involves searching through a bit vector to find positions where bits are set (`1`) or unset (`0`). + +In database systems, bitmaps are commonly used to represent: * **Bitmap Indexes**: each bit represents whether a row satisfies a particular condition * **Bloom Filters**: probabilistic data structures used to test set membership @@ -31,7 +35,7 @@ Here's how vector processing has evolved to improve bitmap scanning performance: * **Generic Scalar Processing**: traditional bit-by-bit processing with conditional branches * **Optimized Scalar Processing**: byte-level skipping to avoid processing empty bytes -* **NEON**: fixed-length 128-bit SIMD processing with vector operations +* **NEON**: fixed-width 128-bit SIMD processing with vector operations * **SVE**: scalable vector processing with predication and specialized instructions like MATCH ## Set up your Arm development environment @@ -52,7 +56,6 @@ An effective way to achieve optimal performance on Arm is not only through optim {{% /notice %}} - Create a directory for your implementations: ```bash mkdir -p bitmap_scan diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md index 53c4d0bb83..9da8cbe6b4 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/_index.md @@ -3,7 +3,7 @@ title: Accelerate Bitmap 
Scanning with NEON and SVE Instructions on Arm servers minutes_to_complete: 20 -who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances. +who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone interested in optimizing data processing workloads on Arm-based cloud instances. learning_objectives: From 4330f449a48fd1fe2f8e5da119bd08f75b137500 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Fri, 13 Jun 2025 12:51:08 +0000 Subject: [PATCH 07/11] updates --- .../06-application-and-best-practices.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md index 34105b1da3..49f3a6e19b 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md @@ -6,29 +6,29 @@ weight: 7 layout: "learningpathall" --- -## Application to Database Systems +## Applications to database systems -These bitmap scanning optimizations can be applied to various database operations: +Optimized bitmap scanning can accelerate several core operations in modern database engines, particularly those used for analytical and vectorized workloads. -### Bitmap Index Scans -Bitmap indexes are commonly used in analytical databases to accelerate queries with multiple filter conditions. The NEON and SVE implementations can significantly speed up the scanning of these bitmap indexes, especially for queries with low selectivity. 
+### Bitmap index scans +Bitmap indexes are widely used in analytical databases to accelerate queries with multiple filter predicates across large datasets. The NEON and SVE implementations can significantly speed up the scanning of these bitmap indexes, especially for queries with low selectivity. -### Bloom Filter Checks +### Bloom filter checks -Bloom filters are probabilistic data structures used to test set membership. They are often used in database systems to quickly filter out rows that don't match certain conditions. The NEON and SVE implementations can accelerate these bloom filter checks. +Bloom filters are probabilistic structures used to test set membership, commonly employed in join filters or subquery elimination. Vectorized scanning via NEON or SVE accelerates these checks by quickly rejecting rows that don’t match, reducing the workload on subsequent stages of the query. -### Column Filtering +### Column filtering -In column-oriented databases, bitmap filters are often used to represent which rows match certain predicates. The NEON and SVE implementation can speed up the scanning of these bitmap filters, improving query performance. +Columnar databases frequently use bitmap filters to track which rows satisfy filter conditions. These bitmaps can be scanned in a vectorized fashion using NEON or SVE instructions, substantially speeding up predicate evaluation and minimizing CPU cycles spent on row selection. -## Best Practices +## Best practices Based on the benchmark results, here are some best practices for optimizing bitmap scanning operations: -1. **Choose the Right Implementation**: Select the appropriate implementation based on the expected bit density: +1. 
**Choose the right implementation based on the expected bit density**: - For empty bit vectors: NEON is optimal - - For very sparse bit vectors (0.001% - 0.1% density): SVE is optimal - - For higher densities (> 0.1% density): SVE still outperforms NEON + - For very sparse bit vectors (0.001% - 0.1% set bits): SVE is optimal due to efficient skipping + - For medium to high densities (> 0.1% density): SVE still outperforms NEON 2. **Implement Early Termination**: Always include a fast path for the no-hits case, as this can provide dramatic performance improvements. From 24f393961159acd5e11c7a82de8a89e92623d49c Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Fri, 13 Jun 2025 14:04:46 +0000 Subject: [PATCH 08/11] updates --- .../bitmap_scan_sve2/01-introduction.md | 14 +++---- .../02-bitmap-data-structure.md | 4 +- .../03-scalar-implementations.md | 11 +++++- .../04-vector-implementations.md | 5 +++ .../05-benchmarking-and-results.md | 39 ++++++++++++------- .../06-application-and-best-practices.md | 19 +++++---- 6 files changed, 61 insertions(+), 31 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md index 8aff4d1487..e5bbe5ecbb 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/01-introduction.md @@ -23,9 +23,9 @@ Bitmap scanning involves searching through a bit vector to find positions where In database systems, bitmaps are commonly used to represent: -* **Bitmap Indexes**: each bit represents whether a row satisfies a particular condition -* **Bloom Filters**: probabilistic data structures used to test set membership -* **Column Filters**: bit vectors indicating which rows match certain predicates +* **Bitmap indexes**: each bit 
represents whether a row satisfies a particular condition +* **Bloom filters**: probabilistic data structures used to test set membership +* **Column filters**: bit vectors indicating which rows match certain predicates The operation of scanning a bitmap to find set bits is often in the critical path of query execution, making it a prime candidate for optimization. @@ -33,8 +33,8 @@ The operation of scanning a bitmap to find set bits is often in the critical pat Here's how vector processing has evolved to improve bitmap scanning performance: -* **Generic Scalar Processing**: traditional bit-by-bit processing with conditional branches -* **Optimized Scalar Processing**: byte-level skipping to avoid processing empty bytes +* **Generic scalar processing**: traditional bit-by-bit processing with conditional branches +* **Optimized scalar processing**: byte-level skipping to avoid processing empty bytes * **NEON**: fixed-width 128-bit SIMD processing with vector operations * **SVE**: scalable vector processing with predication and specialized instructions like MATCH @@ -61,6 +61,6 @@ Create a directory for your implementations: mkdir -p bitmap_scan cd bitmap_scan ``` -## Next steps -In the next section, you’ll define the core bitmap data structure and utility functions for setting, clearing, and inspecting bits. +## Next up: build the bitmap scanning foundation +With your development environment set up, you're ready to dive into the core of bitmap scanning. In the next section, you’ll define a minimal bitmap data structure and implement utility functions to set, clear, and inspect individual bits. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md index d847e62027..9fed9d6262 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md @@ -83,4 +83,6 @@ size_t bitvector_count_scalar(bitvector_t* bv) { } ``` -You now have a functional, compact bit vector in C for testing bitmap scanning performance. Next, you'll implement scalar, NEON, and SVE-based scanning routines that operate on this structure. \ No newline at end of file +## Next up: Implement and benchmark your first scalar bitmap scan + +With your bit vector infrastructure in place, you're now ready to scan it for set bits—the core operation that underpins all bitmap-based filters in database systems. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md index 638b111f0c..f2debab7ed 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md @@ -76,7 +76,14 @@ size_t result_count = 0; ``` Instead of iterating through each bit individually, this implementation processes one byte (8 bits) at a time. The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely. For sparse bitmaps, this can dramatically reduce the number of bit checks. 
-## Next steps +## Next up: Accelerate bitmap scanning with NEON and SVE -Next, you’ll explore how to accelerate bitmap scanning using NEON and SVE SIMD instructions on Arm Neoverse platforms like Graviton4, comparing them directly to these scalar baselines. +You’ve now implemented two scalar scanning routines: +* A generic per-bit loop for correctness and simplicity + +* An optimized scalar version that improves performance using byte-level skipping + +These provide a solid foundation and performance baseline—but scalar methods can only take you so far. To unlock real throughput gains, it’s time to leverage SIMD (Single Instruction, Multiple Data) execution. + +In the next section, you’ll explore how to use Arm NEON and SVE vector instructions to accelerate bitmap scanning. These approaches will process multiple bytes at once and significantly outperform scalar loops—especially on modern Arm-based CPUs like AWS Graviton4. diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md index f4825326e9..37807a6c0e 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md @@ -152,3 +152,8 @@ size_t scan_bitvector_sve2_pnext(bitvector_t* bv, uint32_t* result_positions) { ``` The SVE implementation efficiently scans bitmaps by using `svcmpne_u8` to identify non-zero bytes and `svpnext_b8` to iterate through them sequentially. It extracts byte indices and values with `svlastb_u8`, then processes individual bits using scalar code. This hybrid vector-scalar approach maintains great performance across various bitmap densities. On Graviton4, SVE vectors are 128 bits (16 bytes), allowing processing of 16 bytes at once. 
+## Next up: Apply vectorized scanning to database workloads + +With both NEON and SVE implementations in place, you’ve now unlocked the full power of Arm’s vector processing capabilities for bitmap scanning. These SIMD techniques allow you to process large bitvectors more efficiently—especially when filtering sparse datasets or skipping over large blocks of empty rows. + +In the next section, you’ll learn how to apply these optimizations in the context of real database operations like bitmap index scans, Bloom filter probes, and column filtering. You’ll also explore best practices for selecting the right implementation based on bit density, and tuning for maximum performance on AWS Graviton4. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md index 3e5c389df1..1efa29c9fd 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md @@ -164,29 +164,29 @@ When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results sh The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: -1. **Byte-level Skipping**: Avoiding processing empty bytes -2. **Reduced Function Calls**: Accessing bits directly rather than through function calls -3. 
**Better Cache Utilization**: More sequential memory access patterns +* **Byte-level Skipping**: Avoiding processing empty bytes +* **Reduced Function Calls**: Accessing bits directly rather than through function calls +* **Better Cache Utilization**: More sequential memory access patterns ## Optimized Scalar vs NEON The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: -1. **Chunk-level Skipping**: Quickly skipping 16 empty bytes at once -2. **Vectorized Comparison**: Checking multiple bytes in parallel -3. **Early Termination**: Quickly determining if a chunk contains any set bits +* **Chunk-level Skipping**: Quickly skipping 16 empty bytes at once +* **Vectorized Comparison**: Checking multiple bytes in parallel +* **Early Termination**: Quickly determining if a chunk contains any set bits ## NEON vs SVE The performance comparison between NEON and SVE depends on the bit density: -1. **Very Sparse Bit Vectors (0% - 0.01% density)**: +* **Very Sparse Bit Vectors (0% - 0.01% density)**: - NEON performs better for empty bitvectors due to lower overhead - NEON achieves up to 127.41x speedup over generic scalar - SVE performs better for very sparse bitvectors (0.001% density) - SVE achieves up to 29.07x speedup over generic scalar at 0.001% density -2. **Higher Density Bit Vectors (0.1% - 10% density)**: +* **Higher Density Bit Vectors (0.1% - 10% density)**: - SVE consistently outperforms NEON - SVE achieves up to 1.66x speedup over NEON at 0.01% density @@ -194,15 +194,28 @@ The performance comparison between NEON and SVE depends on the bit density: The SVE implementation includes several key optimizations: -1. **Efficient Non-Zero Byte Detection**: Using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. +* **Efficient Non-Zero Byte Detection**: Using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. -2. 
**Byte-Level Processing**: Using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. +* **Byte-Level Processing**: Using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. -3. **Value Extraction**: Using `svlastb_u8` to extract both the index and value of non-zero bytes. +* **Value Extraction**: Using `svlastb_u8` to extract both the index and value of non-zero bytes. -4. **Hybrid Vector-Scalar Approach**: Combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. +* **Hybrid Vector-Scalar Approach**: Combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. -5. **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. +* **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. +## Next up: Apply what you’ve learned to real-world workloads + +Now that you’ve benchmarked all four bitmap scanning implementations—scalar (generic and optimized), NEON, and SVE—you have a data-driven understanding of how vectorization impacts performance across different bitmap densities. + +In the next section, you’ll explore how to apply these techniques in real-world database workloads, including: + +* Bitmap index scans + +* Bloom filter checks + +* Column-level filtering in analytical queries + +You’ll also learn practical guidelines for choosing the right implementation based on bit density, and discover optimization tips that go beyond the code to help you get the most out of Arm-based systems like Graviton4. 
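Note on the hunk above: the hybrid vector-scalar bullet can be illustrated with a minimal scalar sketch. This is not the Learning Path's code. The names `emit_set_bits` and `scan_bytes` are hypothetical, the bit order (bit 0 is the least significant bit of byte 0) is an assumption, and `__builtin_ctz` assumes GCC or Clang. The zero-byte test in `scan_bytes` only stands in for the step that the SVE version performs with `svcmpne_u8` and `svpnext_b8`.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not from the Learning Path): append the positions of
// all set bits in one byte to `out`, assuming bit 0 is the least significant
// bit of byte 0. Returns the number of positions written.
static size_t emit_set_bits(uint8_t byte, size_t byte_index, uint32_t *out) {
    size_t n = 0;
    unsigned v = byte;
    while (v != 0) {
        unsigned bit = (unsigned)__builtin_ctz(v);  // index of the lowest set bit
        out[n++] = (uint32_t)(byte_index * 8 + bit);
        v &= v - 1;  // clear the lowest set bit
    }
    return n;
}

// Scalar stand-in for the vector step: skip zero bytes, then hand each
// non-zero byte to emit_set_bits.
static size_t scan_bytes(const uint8_t *data, size_t len, uint32_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (data[i] != 0) {
            count += emit_set_bits(data[i], i, out + count);
        }
    }
    return count;
}
```

Clearing the lowest set bit with `v &= v - 1` makes the inner loop run once per set bit rather than once per bit position, which is likely where sparse bitmaps benefit most.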
diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md index 49f3a6e19b..1f5bdc05dd 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/06-application-and-best-practices.md @@ -25,23 +25,26 @@ Columnar databases frequently use bitmap filters to track which rows satisfy fil Based on the benchmark results, here are some best practices for optimizing bitmap scanning operations: -1. **Choose the right implementation based on the expected bit density**: +* **Choose the right implementation based on the expected bit density**: - For empty bit vectors: NEON is optimal - For very sparse bit vectors (0.001% - 0.1% set bits): SVE is optimal due to efficient skipping - For medium to high densities (> 0.1% density): SVE still outperforms NEON -2. **Implement Early Termination**: Always include a fast path for the no-hits case, as this can provide dramatic performance improvements. +* **Implement Early Termination**: Always include a fast path for the no-hits case, as this can provide dramatic performance improvements. -3. **Use Byte-level Skipping**: Even in scalar implementations, skipping empty bytes can provide significant performance improvements. +* **Use Byte-level Skipping**: Even in scalar implementations, skipping empty bytes can provide significant performance improvements. -4. **Consider Memory Access Patterns**: Optimize memory access patterns to improve cache utilization. +* **Consider Memory Access Patterns**: Optimize memory access patterns to improve cache utilization. -5. **Leverage Vector Instructions**: Use NEON or SVE/SVE2 instructions to process multiple bytes in parallel. 
+* **Leverage Vector Instructions**: Use NEON or SVE/SVE2 instructions to process multiple bytes in parallel. ## Conclusion -The SVE instructions provides a powerful way to accelerate bitmap scanning operations in database systems. By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your database workloads. +Scalable Vector Extension (SVE) instructions provide a powerful and portable way to accelerate bitmap scanning in modern database systems. When implemented on Arm Neoverse V2–based servers like AWS Graviton4, they deliver substantial performance improvements across a wide range of bit densities. + +The SVE implementation shows particularly impressive performance for sparse bitvectors (0.001% - 0.1% density), where it outperforms both scalar and NEON implementations. For higher densities, it maintains a performance advantage by amortizing scan costs across wider vectors. + +These performance improvements can translate directly to faster query execution times, especially for analytical workloads that involve multiple bitmap operations. + -The SVE implementation shows particularly impressive performance for sparse bitvectors (0.001% - 0.1% density), where it outperforms both scalar and NEON implementations. For higher densities, it continues to provide substantial speedups over traditional approaches. -These performance improvements can translate directly to faster query execution times, especially for analytical workloads that involve multiple bitmap operations. 
\ No newline at end of file From c5e0ad344cb9f561387abc11fa52299b7aa7fbb4 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Fri, 13 Jun 2025 14:12:52 +0000 Subject: [PATCH 09/11] Final tweaks --- .../02-bitmap-data-structure.md | 2 +- .../03-scalar-implementations.md | 2 +- .../04-vector-implementations.md | 15 ++++++++++----- .../05-benchmarking-and-results.md | 18 +++++++++++------- 4 files changed, 23 insertions(+), 14 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md index 9fed9d6262..9d3d1d4ed2 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/02-bitmap-data-structure.md @@ -83,6 +83,6 @@ size_t bitvector_count_scalar(bitvector_t* bv) { } ``` -## Next up: Implement and benchmark your first scalar bitmap scan +## Next up: implement and benchmark your first scalar bitmap scan With your bit vector infrastructure in place, you're now ready to scan it for set bits—the core operation that underpins all bitmap-based filters in database systems. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md index f2debab7ed..8ab496a15e 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md @@ -76,7 +76,7 @@ size_t result_count = 0; ``` Instead of iterating through each bit individually, this implementation processes one byte (8 bits) at a time. 
The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely. For sparse bitmaps, this can dramatically reduce the number of bit checks. -## Next up: Accelerate bitmap scanning with NEON and SVE +## Next up: accelerate bitmap scanning with NEON and SVE You’ve now implemented two scalar scanning routines: diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md index 37807a6c0e..be32df49a1 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md @@ -10,9 +10,12 @@ layout: "learningpathall" --- Modern Arm CPUs like Neoverse V2 support SIMD (Single Instruction, Multiple Data) extensions that allow processing multiple bytes in parallel. In this section, you'll explore how NEON and SVE vector instructions can dramatically accelerate bitmap scanning by skipping over large regions of unset data and reducing per-bit processing overhead. -## NEON Implementation +## NEON implementation + +This implementation uses NEON SIMD (Single Instruction, Multiple Data) instructions to process 16 bytes (128 bits) at a time, significantly accelerating the scanning process. + +Copy the NEON implementation shown below into the same file: -This implementation uses NEON SIMD (Single Instruction, Multiple Data) instructions to process 16 bytes (128 bits) at a time, significantly accelerating the scanning process. 
Copy the NEON implementation shown below into the same file: ```c // NEON implementation of bit vector scanning size_t scan_bitvector_neon(bitvector_t* bv, uint32_t* result_positions) { @@ -83,9 +86,11 @@ size_t scan_bitvector_neon(bitvector_t* bv, uint32_t* result_positions) { ``` This NEON implementation processes 16 bytes at a time with vector instructions. For sparse bitmaps, entire 16-byte chunks can be skipped at once, providing a significant speedup over byte-level skipping. After vector processing, it falls back to scalar code for any remaining bytes that don't fill a complete 16-byte chunk. -## SVE Implementation +## SVE implementation + +This implementation uses SVE instructions which are available in the Arm Neoverse V2 based AWS Graviton 4 processor. -This implementation uses SVE instructions which are available in the Arm Neoverse V2 based AWS Graviton 4 processor. Copy this SVE implementation into the same file: +Copy this SVE implementation into the same file: ```c // SVE implementation using svcmp_u8, PNEXT, and LASTB @@ -152,7 +157,7 @@ size_t scan_bitvector_sve2_pnext(bitvector_t* bv, uint32_t* result_positions) { ``` The SVE implementation efficiently scans bitmaps by using `svcmpne_u8` to identify non-zero bytes and `svpnext_b8` to iterate through them sequentially. It extracts byte indices and values with `svlastb_u8`, then processes individual bits using scalar code. This hybrid vector-scalar approach maintains great performance across various bitmap densities. On Graviton4, SVE vectors are 128 bits (16 bytes), allowing processing of 16 bytes at once. -## Next up: Apply vectorized scanning to database workloads +## Next up: apply vectorized scanning to database workloads With both NEON and SVE implementations in place, you’ve now unlocked the full power of Arm’s vector processing capabilities for bitmap scanning. 
These SIMD techniques allow you to process large bitvectors more efficiently—especially when filtering sparse datasets or skipping over large blocks of empty rows. diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md index 1efa29c9fd..f6ac4ff8e7 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md @@ -40,7 +40,9 @@ double benchmark_scan(size_t (*scan_func)(bitvector_t*, uint32_t*), ``` ## Main function -The main function of your program is responsible for setting up the test environment, running the benchmarking code for the four different implementations across various bit densities, and reporting the results. In the context of bitmap scanning, bit density refers to the percentage or proportion of bits that are set (have a value of 1) in the bitmap. Copy the main function code below into `bitvector_scan_benchmark.c`: +The main function of your program is responsible for setting up the test environment, running the benchmarking code for the four different implementations across various bit densities, and reporting the results. In the context of bitmap scanning, bit density refers to the percentage or proportion of bits that are set (have a value of 1) in the bitmap. 
+ +Copy the main function code below into `bitvector_scan_benchmark.c`: ```C int main() { @@ -148,7 +150,7 @@ When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results sh | 0.0100 | 99,511 | 7.821 | 1.570 | 2.252 | 1.353 | | 0.1000 | 951,491 | 12.817 | 8.336 | 9.106 | 6.770 | -### Speed-up vs generic Scalar +### Speed-up vs generic scalar | Density | Scalar Optimized | NEON | SVE | |---------|------------------|---------|------------| @@ -160,7 +162,9 @@ When running on a Graviton4 c8g.large instance with Ubuntu 24.04, the results sh ## Understanding the results -## Generic Scalar vs Optimized Scalar +The benchmarking results reveal how different bitmap scanning implementations perform across a range of bit densities—from completely empty vectors to those with millions of set bits. Understanding these trends is key to selecting the most effective approach for your specific use case. + +### Generic scalar vs optimized scalar The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: @@ -168,7 +172,7 @@ The optimized scalar implementation shows significant improvements over the gene * **Reduced Function Calls**: Accessing bits directly rather than through function calls * **Better Cache Utilization**: More sequential memory access patterns -## Optimized Scalar vs NEON +### Optimized scalar vs NEON The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: @@ -176,7 +180,7 @@ The NEON implementation shows further improvements over the optimized scalar imp * **Vectorized Comparison**: Checking multiple bytes in parallel * **Early Termination**: Quickly determining if a chunk contains any set bits -## NEON vs SVE +### NEON vs SVE The performance comparison between NEON and SVE depends on the bit density: @@ -190,7 +194,7 @@ The performance comparison between NEON and SVE depends on the bit density: - SVE consistently outperforms NEON 
- SVE achieves up to 1.66x speedup over NEON at 0.01% density -## Key Optimizations in SVE Implementation +### Key optimizations in SVE implementation The SVE implementation includes several key optimizations: @@ -204,7 +208,7 @@ The SVE implementation includes several key optimizations: * **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. -## Next up: Apply what you’ve learned to real-world workloads +## Next up: apply what you’ve learned to real-world workloads Now that you’ve benchmarked all four bitmap scanning implementations—scalar (generic and optimized), NEON, and SVE—you have a data-driven understanding of how vectorization impacts performance across different bitmap densities. From 5587496a755de12dd62f3d2fc769a6ab2f597cac Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Fri, 13 Jun 2025 14:14:25 +0000 Subject: [PATCH 10/11] tweaks --- .../05-benchmarking-and-results.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md index f6ac4ff8e7..8cf17e8f03 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/05-benchmarking-and-results.md @@ -168,17 +168,17 @@ The benchmarking results reveal how different bitmap scanning implementations pe The optimized scalar implementation shows significant improvements over the generic scalar implementation due to: -* **Byte-level Skipping**: Avoiding processing empty bytes -* **Reduced Function Calls**: Accessing bits directly rather than through function calls -* **Better Cache Utilization**: More sequential memory access patterns +* **Byte-level 
Skipping**: avoiding processing empty bytes +* **Reduced Function Calls**: accessing bits directly rather than through function calls +* **Better Cache Utilization**: more sequential memory access patterns ### Optimized scalar vs NEON The NEON implementation shows further improvements over the optimized scalar implementation for sparse bit vectors due to: -* **Chunk-level Skipping**: Quickly skipping 16 empty bytes at once -* **Vectorized Comparison**: Checking multiple bytes in parallel -* **Early Termination**: Quickly determining if a chunk contains any set bits +* **Chunk-level Skipping**: quickly skipping 16 empty bytes at once +* **Vectorized Comparison**: checking multiple bytes in parallel +* **Early Termination**: quickly determining if a chunk contains any set bits ### NEON vs SVE @@ -198,13 +198,13 @@ The performance comparison between NEON and SVE depends on the bit density: The SVE implementation includes several key optimizations: -* **Efficient Non-Zero Byte Detection**: Using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. +* **Efficient Non-Zero Byte Detection**: using `svcmpne_u8` to quickly identify non-zero bytes in the bitvector. -* **Byte-Level Processing**: Using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. +* **Byte-Level Processing**: using `svpnext_b8` to efficiently find the next non-zero byte without processing zero bytes. -* **Value Extraction**: Using `svlastb_u8` to extract both the index and value of non-zero bytes. +* **Value Extraction**: using `svlastb_u8` to extract both the index and value of non-zero bytes. -* **Hybrid Vector-Scalar Approach**: Combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. +* **Hybrid Vector-Scalar Approach**: combining vector operations for finding non-zero bytes with scalar operations for processing individual bits. 
* **Prefetching**: Using `__builtin_prefetch` to reduce memory latency by prefetching the next chunk of data. From 909d5a0f38ccaa035a0626589c0929ab94ea324a Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Fri, 13 Jun 2025 14:16:24 +0000 Subject: [PATCH 11/11] Enough is enough --- .../bitmap_scan_sve2/03-scalar-implementations.md | 2 +- .../bitmap_scan_sve2/04-vector-implementations.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md index 8ab496a15e..320549b0f6 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/03-scalar-implementations.md @@ -1,6 +1,6 @@ --- # User change -title: "Implement Scalar Bitmap Scanning in C" +title: "Implement scalar bitmap scanning in C" weight: 4 diff --git a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md index be32df49a1..3de8fba739 100644 --- a/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md +++ b/content/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/04-vector-implementations.md @@ -1,6 +1,6 @@ --- # User change -title: "Vectorized Bitmap Scanning with NEON and SVE" +title: "Vectorized bitmap scanning with NEON and SVE" weight: 5