diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/_index.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_index.md
new file mode 100644
index 0000000000..616f37d088
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_index.md
@@ -0,0 +1,55 @@
+---
+title: Optimizing Arm binaries and libraries with LLVM-BOLT and profile merging
+
+draft: true
+cascade:
+  draft: true
+
+minutes_to_complete: 30
+
+who_is_this_for: Performance engineers and software developers working on Arm platforms who want to optimize both application binaries and shared libraries using LLVM-BOLT.
+
+learning_objectives:
+    - Instrument and optimize binaries for individual workload features using LLVM-BOLT.
+    - Collect separate BOLT profiles and merge them for comprehensive code coverage.
+    - Optimize shared libraries independently.
+    - Integrate optimized shared libraries into applications.
+    - Evaluate and compare application and library performance across baseline, isolated, and merged optimization scenarios.
+
+prerequisites:
+    - An Arm-based system running Linux with BOLT and Linux Perf installed. The Linux kernel should be version 5.15 or later.
+    - (Optional) A second, more powerful Linux system to build the software executable and run BOLT.
+
+author: Gayathri Narayana Yegna Narayanan
+
+### Tags
+skilllevels: Introductory
+subjects: Performance and Architecture
+armips:
+    - Neoverse
+    - Cortex-A
+tools_software_languages:
+    - BOLT
+    - perf
+    - Runbook
+operatingsystems:
+    - Linux
+
+further_reading:
+    - resource:
+        title: BOLT README
+        link: https://github.com/llvm/llvm-project/tree/main/bolt
+        type: documentation
+    - resource:
+        title: BOLT - A Practical Binary Optimizer for Data Centers and Beyond
+        link: https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/
+        type: website
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1 # _index.md always has weight of 1 to order correctly
+layout: "learningpathall" # All files under learning paths have this same wrapper
+learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
+
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_next-steps.md
new file mode 100644
index 0000000000..c3db0de5a2
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/example-picture.png b/content/learning-paths/servers-and-cloud-computing/bolt-merge/example-picture.png
new file mode 100644
index 0000000000..c69844bed4
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/bolt-merge/example-picture.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-1.md
new file mode 100644
index 0000000000..1d80f6e6e7
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-1.md
@@ -0,0 +1,27 @@
+---
+title: Overview of BOLT Merge
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+[BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/README.md) is a post-link binary optimizer that uses profile data, collected with Linux Perf or with BOLT's own instrumentation, to reorder the code layout of an executable. A layout that matches the observed execution reduces instruction cache and iTLB misses and improves performance.
+
+In this Learning Path, you'll learn how to:
+- Collect and merge BOLT profiles from multiple workload features (e.g., read-only and write-only)
+- Independently optimize application binaries and external user-space libraries (e.g., `libssl.so`, `libcrypto.so`)
+- Link the final optimized binary with the separately bolted libraries to deploy a fully optimized runtime stack
+
+While MySQL and sysbench are used as examples, this method applies to **any feature-rich application** that:
+- Exhibits multiple runtime paths
+- Uses dynamic libraries
+- Requires full-stack binary optimization for performance-critical deployment
+
+The workflow includes:
+1. Profiling each workload feature separately
+2. Profiling external libraries independently
+3. Merging profiles for broader code coverage
+4. Applying BOLT to each binary and library
+5. Linking bolted libraries with the merged-profile binary
+
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md
new file mode 100644
index 0000000000..c67ed17850
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-2.md
@@ -0,0 +1,89 @@
+---
+title: BOLT Optimization - First feature
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+In this step, you will instrument an application binary (such as `mysqld`) with BOLT to collect runtime profile data for a specific feature, for example a **read-only workload**.
+
+The collected profile will later be merged with others and used to optimize the application's code layout.
+
+### Step 1: Build or obtain the uninstrumented binary
+
+Make sure your application binary is:
+
+- Built from source (e.g., `mysqld`)
+- Unstripped, with symbol information available
+- Compiled with frame pointers enabled (`-fno-omit-frame-pointer`)
+
+You can check that the symbol table is present with:
+
+```bash
+readelf -s /path/to/mysqld | grep main
+```
+
+If the symbols are missing, rebuild the binary with debug info and no stripping.
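The check in Step 1 can also be done against the ELF section headers. The following is a minimal sketch, not part of BOLT or binutils: `has_section` and `report_binary` are hypothetical helper names, and the sketch assumes that a binary linked with `-Wl,--emit-relocs` carries a `.rela.text` section. BOLT needs an unstripped symbol table (`.symtab`) at minimum; relocations are what allow it to reorder functions.

```bash
#!/usr/bin/env bash
# has_section NAME: succeed if NAME appears as a whole field in the
# `readelf -S --wide` output read from standard input.
# (has_section is a hypothetical helper, not a BOLT or binutils command.)
has_section() {
  awk -v name="$1" '
    { for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
    END { exit !found }
  '
}

# report_binary PATH: print whether PATH has a symbol table and relocations.
report_binary() {
  local headers section
  headers="$(readelf -S --wide "$1")" || return 1
  for section in .symtab .rela.text; do
    if printf '%s\n' "$headers" | has_section "$section"; then
      echo "$section: present"
    else
      echo "$section: missing"
    fi
  done
}
```

Calling `report_binary /path/to/mysqld` prints one line per section. A missing `.symtab` means the binary was stripped; a missing `.rela.text` suggests it was linked without `--emit-relocs`.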
+
+---
+
+### Step 2: Instrument the binary with BOLT
+
+Use `llvm-bolt` to create an instrumented version of the binary:
+
+```bash
+llvm-bolt /path/to/mysqld \
+  -instrument \
+  -o /path/to/mysqld.instrumented \
+  --instrumentation-file=/path/to/profile-readonly.fdata \
+  --instrumentation-sleep-time=5 \
+  --instrumentation-no-counters-clear \
+  --instrumentation-wait-forks
+```
+
+### Explanation of key options
+
+- `-instrument`: Enables profile generation instrumentation
+- `--instrumentation-file`: Path where the profile output will be saved
+- `--instrumentation-sleep-time=5`: Writes the profile every 5 seconds while the process runs, which suits long-running servers that may not exit cleanly
+- `--instrumentation-no-counters-clear`: Keeps the counters accumulating across those periodic writes
+- `--instrumentation-wait-forks`: Keeps profile collection running until forked child processes finish (important for daemon processes)
+
+---
+
+### Step 3: Run the instrumented binary under a feature-specific workload
+
+Start the instrumented binary, then use a workload generator to stress it in a feature-specific way. For example, to simulate **read-only traffic** with sysbench:
+
+```bash
+taskset -c 9 ./src/sysbench \
+  --db-driver=mysql \
+  --mysql-host=127.0.0.1 \
+  --mysql-db=bench \
+  --mysql-user=bench \
+  --mysql-password=bench \
+  --mysql-port=3306 \
+  --tables=8 \
+  --table-size=10000 \
+  --threads=1 \
+  src/lua/oltp_read_only.lua run
+```
+
+> Adjust this command as needed for your workload and CPU/core binding.
+
+The `.fdata` file defined in `--instrumentation-file` will be populated with runtime execution data.
+
+---
+
+### Step 4: Verify the profile was created
+
+After running the workload:
+
+```bash
+ls -lh /path/to/profile-readonly.fdata
+```
+
+You should see a non-empty file. This file will later be merged with other profiles (e.g., for write-only traffic) to generate a complete merged profile.
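The verification in Step 4 can be scripted so that a missing or empty profile stops a longer pipeline early. This is a minimal sketch under stated assumptions: `profile_status` is a hypothetical helper name, and the line count is only a rough size indicator for the text-based `.fdata` format, not a validity check.

```bash
#!/usr/bin/env bash
# profile_status FILE: report whether a BOLT profile exists and has content.
# Returns non-zero when the file is missing or empty.
# (profile_status is a hypothetical helper name.)
profile_status() {
  local file="$1" lines
  if [ -s "$file" ]; then
    lines="$(wc -l < "$file" | tr -d ' ')"
    echo "ok: $file ($lines lines)"
  else
    echo "missing or empty: $file"
    return 1
  fi
}
```

For example, `profile_status /path/to/profile-readonly.fdata || exit 1` could guard the merge step that follows later.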
+
+---
+
+
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md
new file mode 100644
index 0000000000..f1ea41f09c
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-3.md
@@ -0,0 +1,100 @@
+---
+title: BOLT Optimization - Second Feature & BOLT Merge to combine
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+In this step, you'll collect profile data for a **write-only** workload, merge it with the read-only profile, and optimize the application binary with the merged profile. External libraries such as `libcrypto.so` and `libssl.so` used by the application (e.g., MySQL) are instrumented in the next step.
+
+
+### Step 1: Run Write-Only Workload for Application Binary
+
+Start the BOLT-instrumented MySQL binary again and drive it with a write-only workload to capture `profile-writeonly.fdata`:
+
+```bash
+taskset -c 9 ./src/sysbench \
+  --db-driver=mysql \
+  --mysql-host=127.0.0.1 \
+  --mysql-db=bench \
+  --mysql-user=bench \
+  --mysql-password=bench \
+  --mysql-port=3306 \
+  --tables=8 \
+  --table-size=10000 \
+  --threads=1 \
+  src/lua/oltp_write_only.lua run
+```
+
+The profile path is fixed when the binary is instrumented. Either re-run `llvm-bolt -instrument` with `--instrumentation-file=/path/to/profile-writeonly.fdata`, or move `profile-readonly.fdata` aside before this run and rename the new output to `profile-writeonly.fdata` afterwards, so that the two profiles stay separate.
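Because `--instrumentation-file` is an `llvm-bolt` option rather than a runtime setting, one way to keep per-feature profiles apart is to produce one instrumented binary per feature. The sketch below only prints the command (a dry run) so it can be reviewed before use. The function name `bolt_instrument_cmd` and the `<binary>.instrumented-<feature>` naming scheme are hypothetical; the flags are the ones already used in this Learning Path.

```bash
#!/usr/bin/env bash
# bolt_instrument_cmd BINARY FEATURE PROFILE_DIR
# Print, without running it, an llvm-bolt instrumentation command whose
# profile path is specific to FEATURE.
# (Function name and output naming scheme are hypothetical.)
bolt_instrument_cmd() {
  local binary="$1" feature="$2" profile_dir="$3"
  printf 'llvm-bolt %s -instrument -o %s.instrumented-%s --instrumentation-file=%s/profile-%s.fdata --instrumentation-sleep-time=5 --instrumentation-no-counters-clear --instrumentation-wait-forks\n' \
    "$binary" "$binary" "$feature" "$profile_dir" "$feature"
}

# Example: one command per workload feature
for feature in readonly writeonly; do
  bolt_instrument_cmd /path/to/mysqld "$feature" /path/to
done
```

Printing the command first keeps the sketch safe to run on a machine without BOLT installed; pipe the output to `sh` only after checking the paths.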
+---
+### Step 2: Verify the Second Profile Was Generated
+
+```bash
+ls -lh /path/to/profile-writeonly.fdata
+```
+
+Both `.fdata` files should now exist and contain valid data:
+
+- `profile-readonly.fdata`
+- `profile-writeonly.fdata`
+
+---
+
+### Step 3: Merge the Feature Profiles
+
+Use `merge-fdata` to combine the feature-specific profiles into one comprehensive `.fdata` file:
+
+```bash
+merge-fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata \
+  -o /path/to/profile-merged.fdata
+```
+
+**Example command from an actual setup:**
+
+```bash
+/home/ubuntu/llvm-latest/build/bin/merge-fdata prof-instrumentation-readonly.fdata prof-instrumentation-writeonly.fdata \
+  -o prof-instrumentation-readwritemerged.fdata
+```
+
+Output:
+
+```
+Using legacy profile format.
+Profile from 2 files merged.
+```
+
+This creates a single merged profile (`profile-merged.fdata`) covering both read-only and write-only workload behaviors.
+
+---
+
+### Step 4: Verify the Merged Profile
+
+Check the merged `.fdata` file:
+
+```bash
+ls -lh /path/to/profile-merged.fdata
+```
+
+---
+### Step 5: Generate the Final Binary with the Merged Profile
+
+Run LLVM-BOLT on the original, uninstrumented `mysqld` with the merged `.fdata` file to generate the final optimized binary:
+
+```bash
+llvm-bolt build/bin/mysqld \
+  -o build/bin/mysqldreadwrite_merged.bolt_instrumentation \
+  -data=/home/ubuntu/mysql-server-8.0.33/sysbench/prof-instrumentation-readwritemerged.fdata \
+  -reorder-blocks=ext-tsp \
+  -reorder-functions=hfsort \
+  -split-functions \
+  -split-all-cold \
+  -split-eh \
+  -dyno-stats \
+  --print-profile-stats 2>&1 | tee bolt_orig.log
+```
+
+This command optimizes the binary layout based on the merged workload profile, creating a single binary (`mysqldreadwrite_merged.bolt_instrumentation`) that is optimized across both features. Despite its file name, this output is an optimized binary, not an instrumented one.
+
+
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md
new file mode 100644
index 0000000000..376c249164
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-4.md
@@ -0,0 +1,154 @@
+---
+title: BOLT the Libraries separately
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+### Step 1: Build Shared Libraries for BOLT (e.g., libcrypto, libssl)
+
+If system libraries like `/usr/lib/libssl.so` are stripped, rebuild OpenSSL from source with relocations:
+
+```bash
+git clone https://github.com/openssl/openssl.git
+cd openssl
+./config -O2 -Wl,--emit-relocs --prefix=$HOME/bolt-libs/openssl
+make -j$(nproc)
+make install
+```
+
+---
+
+### Step 2: BOLT-Instrument libssl.so.3
+
+Use `llvm-bolt` to instrument `libssl.so.3`:
+
+```bash
+llvm-bolt $HOME/bolt-libs/openssl/lib/libssl.so.3 \
+  -instrument \
+  -o $HOME/bolt-libs/openssl/lib/libssl.so.3.instrumented \
+  --instrumentation-file=libssl-readwrite.fdata \
+  --instrumentation-sleep-time=5 \
+  --instrumentation-no-counters-clear \
+  --instrumentation-wait-forks
+```
+
+Then launch MySQL using the **instrumented shared library** and run a **read+write** sysbench test to populate the profile. The dynamic loader finds the library by its soname, so place the instrumented file in a separate directory under the name `libssl.so.3` and add that directory to `LD_LIBRARY_PATH` before starting `mysqld`.
+
+---
+
+### Step 3: Optimize 'libssl.so' Using Its Profile
+
+After running the read+write test, ensure `libssl-readwrite.fdata` is populated.
+
+
+Run BOLT on the uninstrumented `libssl.so` with the collected read-write profile:
+
+```bash
+llvm-bolt /path/to/libssl.so.3 \
+  -o /path/to/libssl.so.optimized \
+  -data=/path/to/libssl-readwrite.fdata \
+  -reorder-blocks=ext-tsp \
+  -reorder-functions=hfsort \
+  -split-functions \
+  -split-all-cold \
+  -split-eh \
+  -dyno-stats \
+  --print-profile-stats
+```
+
+---
+
+### Step 4: Replace the Library at Runtime
+
+Copy the optimized version over the original and export the path in the shell that starts `mysqld`:
+
+```bash
+cp /path/to/libssl.so.optimized /path/to/libssl.so.3
+export LD_LIBRARY_PATH=/path/to/
+```
+
+This ensures MySQL will dynamically load the optimized `libssl.so`.
+
+---
+
+### Step 5: Run Final Workload and Validate Performance
+
+Start the BOLT-optimized MySQL binary so that it loads the optimized `libssl.so`, then run the combined workload:
+
+```bash
+taskset -c 9 ./src/sysbench \
+  --db-driver=mysql \
+  --mysql-host=127.0.0.1 \
+  --mysql-db=bench \
+  --mysql-user=bench \
+  --mysql-password=bench \
+  --mysql-port=3306 \
+  --tables=8 \
+  --table-size=10000 \
+  --threads=1 \
+  src/lua/oltp_read_write.lua run
+```
+
+---
+
+In the next step, you'll optimize an additional critical external library (`libcrypto.so`) using BOLT, following the same process as for `libssl.so`. Afterward, you'll interpret performance results to validate and compare optimizations across baseline and merged scenarios.
+
+### Step 6: BOLT optimization for 'libcrypto.so'
+
+Follow these steps to instrument and optimize `libcrypto.so`:
+
+#### Instrument `libcrypto.so`:
+
+```bash
+llvm-bolt /path/to/libcrypto.so.3 \
+  -instrument \
+  -o /path/to/libcrypto.so.3.instrumented \
+  --instrumentation-file=libcrypto-readwrite.fdata \
+  --instrumentation-sleep-time=5 \
+  --instrumentation-no-counters-clear \
+  --instrumentation-wait-forks
+```
+
+Run MySQL under the read-write workload to populate `libcrypto-readwrite.fdata`:
+
+```bash
+# Set this in the shell that starts mysqld so the server loads the instrumented library
+export LD_LIBRARY_PATH=/path/to/libcrypto-instrumented
+taskset -c 9 ./src/sysbench \
+  --db-driver=mysql \
+  --mysql-host=127.0.0.1 \
+  --mysql-db=bench \
+  --mysql-user=bench \
+  --mysql-password=bench \
+  --mysql-port=3306 \
+  --tables=8 \
+  --table-size=10000 \
+  --threads=1 \
+  src/lua/oltp_read_write.lua run
+```
+
+#### Optimize the `libcrypto.so` library:
+
+```bash
+llvm-bolt /path/to/original/libcrypto.so.3 \
+  -o /path/to/libcrypto.so.optimized \
+  -data=libcrypto-readwrite.fdata \
+  -reorder-blocks=ext-tsp \
+  -reorder-functions=hfsort \
+  -split-functions \
+  -split-all-cold \
+  -split-eh \
+  -dyno-stats \
+  --print-profile-stats
+```
+
+Replace the original at runtime:
+
+```bash
+cp /path/to/libcrypto.so.optimized /path/to/libcrypto.so.3
+export LD_LIBRARY_PATH=/path/to/
+```
+
+Run a final validation workload to ensure functionality and measure performance improvements.
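Before measuring, it helps to confirm that the running server really picked up the replaced libraries rather than the system copies. The following is a minimal sketch, assuming a Linux `/proc/<pid>/maps` file in its usual six-column layout and library paths without spaces; `loaded_libs` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# loaded_libs MAPS_FILE PATTERN: list the unique file paths in a
# /proc/<pid>/maps-style file whose path matches the regex PATTERN.
# (loaded_libs is a hypothetical helper name.)
loaded_libs() {
  awk -v pat="$2" '$6 ~ pat { print $6 }' "$1" | sort -u
}
```

With a single `mysqld` process running, `loaded_libs "/proc/$(pidof mysqld)/maps" 'libssl|libcrypto'` prints the paths the server has mapped. If they point at the system library directory instead of the directory in `LD_LIBRARY_PATH`, the optimized copies are not in use.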
+
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md
new file mode 100644
index 0000000000..07cd298c5f
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/bolt-merge/how-to-5.md
@@ -0,0 +1,68 @@
+---
+title: Performance Results - Baseline, BOLT Merge, and Full Optimization
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+This step presents the performance comparisons across various BOLT optimization scenarios. You'll see how baseline performance compares with BOLT-optimized binaries using merged profiles and bolted external libraries.
+
+### 1. Baseline Performance (No BOLT)
+
+| Metric                 | Read-Only (Baseline) | Write-Only (Baseline) | Read+Write (Baseline) |
+|------------------------|----------------------|-----------------------|-----------------------|
+| Transactions/sec (TPS) | 1006.33              | 2113.03               | 649.15                |
+| Queries/sec (QPS)      | 16,101.24            | 12,678.18             | 12,983.09             |
+| Latency avg (ms)       | 0.99                 | 0.47                  | 1.54                  |
+| Latency 95th % (ms)    | 1.04                 | 0.83                  | 1.79                  |
+| Total time (s)         | 9.93                 | 4.73                  | 15.40                 |
+
+---
+
+### 2. Performance Comparison: Merged vs Non-Merged Instrumentation
+
+| Metric                 | Regular BOLT R+W (No Merge, system libssl) | Merged BOLT (BOLTed Read+Write + BOLTed libssl) |
+|------------------------|--------------------------------------------|-------------------------------------------------|
+| Transactions/sec (TPS) | 850.32                                     | 879.18                                          |
+| Queries/sec (QPS)      | 17,006.35                                  | 17,583.60                                       |
+| Latency avg (ms)       | 1.18                                       | 1.14                                            |
+| Latency 95th % (ms)    | 1.52                                       | 1.39                                            |
+| Total time (s)         | 11.76                                      | 11.37                                           |
+
+Second run:
+
+| Metric                 | Regular BOLT R+W (No Merge, system libssl) | Merged BOLT (BOLTed Read+Write + BOLTed libssl) |
+|------------------------|--------------------------------------------|-------------------------------------------------|
+| Transactions/sec (TPS) | 853.16                                     | 887.14                                          |
+| Queries/sec (QPS)      | 17,063.22                                  | 17,742.89                                       |
+| Latency avg (ms)       | 1.17                                       | 1.13                                            |
+| Latency 95th % (ms)    | 1.39                                       | 1.37                                            |
+| Total time (s)         | 239.9                                      | 239.9                                           |
+
+---
+
+### 3. BOLTed READ, BOLTed WRITE, MERGED BOLT (Read+Write+BOLTed Libraries)
+
+| Metric                 | Bolted Read-Only | Bolted Write-Only | Merged BOLT (Read+Write+libssl) | Merged BOLT (Read+Write+libcrypto) | Merged BOLT (Read+Write+libssl+libcrypto) |
+|------------------------|------------------|-------------------|---------------------------------|------------------------------------|-------------------------------------------|
+| Transactions/sec (TPS) | 1348.47          | 3170.92           | 887.14                          | 896.58                             | 902.98                                    |
+| Queries/sec (QPS)      | 21,575.45        | 19,025.52         | 17,742.89                       | 17,931.57                          | 18,059.52                                 |
+| Latency avg (ms)       | 0.74             | 0.32              | 1.13                            | 1.11                               | 1.11                                      |
+| Latency 95th % (ms)    | 0.77             | 0.55              | 1.37                            | 1.34                               | 1.34                                      |
+| Total time (s)         | 239.8            | 239.72            | 239.9                           | 239.9                              | 239.9                                     |
+
+---
+
+### Key Metrics to Analyze
+
+- **TPS (Transactions Per Second)**: Higher is better.
+- **QPS (Queries Per Second)**: Higher is better.
+- **Latency (Average and 95th Percentile)**: Lower is better.
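To compare two scenarios from the tables, the relative change in a higher-is-better metric is `(new - base) / base * 100`. The sketch below does that arithmetic with `awk`; `pct_gain` is a hypothetical helper name, and the example values are the read+write TPS figures reported above (baseline and merged BOLT with both libraries). The comparison is only meaningful when both runs used the same sysbench configuration, which the differing total times suggest should be checked.

```bash
#!/usr/bin/env bash
# pct_gain BASE NEW: print the percentage change from BASE to NEW,
# rounded to one decimal place. (pct_gain is a hypothetical helper name.)
pct_gain() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Read+write TPS: baseline (649.15) vs merged BOLT with libssl and libcrypto (902.98)
pct_gain 649.15 902.98   # prints 39.1
```

For lower-is-better metrics such as latency, a negative result indicates an improvement.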
+
+---
+
+### Conclusion
+- BOLT substantially improves performance over non-optimized binaries through better instruction cache utilization. In the tables above, read+write throughput rises from 649.15 TPS at baseline to 850 to 853 TPS with a regular BOLT build.
+- Merging feature-specific profiles does not negatively affect performance; instead, it captures a broader set of runtime behaviors, making the binary better tuned for varied real-world workloads. In these runs, the merged-profile binary with a BOLTed `libssl` reached 879 to 887 TPS on the read+write workload.
+- Separately optimizing external user-space libraries provides smaller incremental gains (902.98 TPS with both `libssl` and `libcrypto` BOLTed), but it complements the application optimization and completes the optimized runtime stack.