Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
---
title: Optimizing Arm binaries and libraries with LLVM-BOLT and profile merging

draft: true
cascade:
draft: true

minutes_to_complete: 30

who_is_this_for: Performance engineers and software developers working on Arm platforms who want to optimize both application binaries and shared libraries using LLVM-BOLT.

learning_objectives:
- Instrument and optimize binaries for individual workload features using LLVM-BOLT.
- Collect separate BOLT profiles and merge them for comprehensive code coverage.
- Optimize shared libraries independently.
- Integrate optimized shared libraries into applications.
- Evaluate and compare application and library performance across baseline, isolated, and merged optimization scenarios.

prerequisites:
- An Arm-based system running Linux with BOLT and Linux Perf installed. The Linux kernel should be version 5.15 or later.
- (Optional) A second, more powerful Linux system to build the software executable and run BOLT.

author: Gayathri Narayana Yegna Narayanan

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
- Cortex-A
tools_software_languages:
- BOLT
- perf
- Runbook
operatingsystems:
- Linux

further_reading:
- resource:
title: BOLT README
link: https://github.com/llvm/llvm-project/tree/main/bolt
type: documentation
- resource:
title: BOLT - A Practical Binary Optimizer for Data Centers and Beyond
link: https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/
type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---

Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
---
title: Overview of BOLT Merge
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

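[BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/README.md) is a post-link binary optimizer that uses profile data, collected with Linux Perf or with BOLT's own instrumentation, to reorder the code layout of an executable. Placing frequently executed code together improves instruction cache and TLB locality, which in turn improves performance.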

In this Learning Path, you'll learn how to:
- Collect and merge BOLT profiles from multiple workload features (e.g., read-only and write-only)
- Independently optimize application binaries and external user-space libraries (e.g., `libssl.so`, `libcrypto.so`)
- Link the final optimized binary with the separately bolted libraries to deploy a fully optimized runtime stack

While MySQL and sysbench are used as examples, this method applies to **any feature-rich application** that:
- Exhibits multiple runtime paths
- Uses dynamic libraries
- Requires full-stack binary optimization for performance-critical deployment

The workflow includes:
1. Profiling each workload feature separately
2. Profiling external libraries independently
3. Merging profiles for broader code coverage
4. Applying BOLT to each binary and library
5. Linking bolted libraries with the merged-profile binary

Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
---
title: BOLT Optimization - First feature
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this step, you will instrument an application binary (such as `mysqld`) with BOLT to collect runtime profile data for a specific feature, for example a **read-only workload**.

The collected profile will later be merged with others and used to optimize the application's code layout.

### Step 1: Build or obtain the uninstrumented binary

Make sure your application binary is:

- Built from source (e.g., `mysqld`)
- Unstripped, with symbol information available
- Compiled with frame pointers enabled (`-fno-omit-frame-pointer`)

You can verify this with:

```bash
readelf -s /path/to/mysqld | grep main
```

If the symbols are missing, rebuild the binary with debug info and no stripping.

---

### Step 2: Instrument the binary with BOLT

Use `llvm-bolt` to create an instrumented version of the binary:

```bash
llvm-bolt /path/to/mysqld \
-instrument \
-o /path/to/mysqld.instrumented \
--instrumentation-file=/path/to/profile-readonly.fdata \
--instrumentation-sleep-time=5 \
--instrumentation-no-counters-clear \
--instrumentation-wait-forks
```

### Explanation of key options

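- `-instrument`: Enables profile generation instrumentation
- `--instrumentation-file`: Path where the profile output will be saved
- `--instrumentation-sleep-time=5`: Writes the profile to the file every 5 seconds instead of only at program exit (useful for long-running services that are stopped rather than exiting normally)
- `--instrumentation-no-counters-clear`: Keeps the counters across periodic writes, so each write contains the accumulated profile
- `--instrumentation-wait-forks`: Waits until all forks of the instrumented process have finished before profile collection stops (important for daemon processes)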
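
Because each feature needs its own profile file, it can help to wrap the instrumentation command in a small function. The following is a minimal sketch, not part of the original workflow: the function name `instrument_for_feature`, its argument order, the `.instrumented-<feature>` output suffix, and the `DRY_RUN` variable are all illustrative assumptions. The `llvm-bolt` options are the ones shown above.

```bash
# Sketch: build the llvm-bolt instrumentation command for one workload feature.
# instrument_for_feature and DRY_RUN are hypothetical names used for illustration.
instrument_for_feature() {
    local binary="$1"       # uninstrumented binary, e.g. /path/to/mysqld
    local feature="$2"      # label for the workload, e.g. readonly or writeonly
    local profile_dir="$3"  # directory that will hold the .fdata files

    # Reuse the positional parameters as the command line to run.
    set -- llvm-bolt "$binary" \
        -instrument \
        -o "${binary}.instrumented-${feature}" \
        "--instrumentation-file=${profile_dir}/profile-${feature}.fdata" \
        --instrumentation-sleep-time=5 \
        --instrumentation-no-counters-clear \
        --instrumentation-wait-forks

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Setting `DRY_RUN=1` and calling `instrument_for_feature /path/to/mysqld readonly /path/to` prints the command for review. Without `DRY_RUN`, the same call runs `llvm-bolt` and produces a binary that writes to `profile-readonly.fdata`.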

---

### Step 3: Run the instrumented binary under a feature-specific workload

Use a workload generator to stress the binary in a feature-specific way. For example, to simulate **read-only traffic** with sysbench:

```bash
taskset -c 9 ./src/sysbench \
--db-driver=mysql \
--mysql-host=127.0.0.1 \
--mysql-db=bench \
--mysql-user=bench \
--mysql-password=bench \
--mysql-port=3306 \
--tables=8 \
--table-size=10000 \
--threads=1 \
src/lua/oltp_read_only.lua run
```

> Adjust this command as needed for your workload and CPU/core binding.

The `.fdata` file specified by `--instrumentation-file` is populated with runtime execution data while the workload runs.
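
The read-only and write-only runs in this Learning Path differ only in the sysbench Lua script, so the workload command can be parameterized. This is a hedged sketch: `run_feature_workload`, its arguments, and `DRY_RUN` are hypothetical names, and the connection settings are copied from the example above and will likely need adjusting for your setup.

```bash
# Sketch: run one sysbench OLTP workload against the instrumented server.
# run_feature_workload and DRY_RUN are hypothetical names used for illustration.
run_feature_workload() {
    local feature="$1"    # read_only or write_only (selects the Lua script)
    local cpu="${2:-9}"   # core to pin sysbench to; 9 matches the example above

    set -- taskset -c "$cpu" ./src/sysbench \
        --db-driver=mysql \
        --mysql-host=127.0.0.1 \
        --mysql-db=bench \
        --mysql-user=bench \
        --mysql-password=bench \
        --mysql-port=3306 \
        --tables=8 \
        --table-size=10000 \
        --threads=1 \
        "src/lua/oltp_${feature}.lua" run

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Calling `run_feature_workload read_only` from the sysbench source directory reproduces the command above. `run_feature_workload write_only` selects `oltp_write_only.lua` instead.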

---

### Step 4: Verify the profile was created

After running the workload:

```bash
ls -lh /path/to/profile-readonly.fdata
```

You should see a non-empty file. This file will later be merged with other profiles (e.g., for write-only traffic) to generate a complete merged profile.
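
If you script the workflow, the same check can stop the run early when a profile is missing or empty. This is a minimal sketch; `check_profile` is a hypothetical helper name, and it only tests that the file exists and has content, not that the content is a valid BOLT profile.

```bash
# Sketch: fail early if a BOLT profile is missing or empty.
# check_profile is a hypothetical helper name used for illustration.
check_profile() {
    local profile="$1"
    local bytes

    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$profile" ]; then
        echo "ERROR: $profile is missing or empty" >&2
        return 1
    fi

    # Strip padding that some wc implementations add around the count.
    bytes=$(wc -c < "$profile" | tr -d ' ')
    echo "OK: $profile ($bytes bytes)"
}
```

For example, `check_profile /path/to/profile-readonly.fdata || exit 1` aborts a script before any later step tries to use a profile that was never written.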

---


Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
---
title: BOLT Optimization - Second feature and profile merging
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this step, you'll collect profile data for a **write-heavy** workload and also **instrument external libraries** such as `libcrypto.so` and `libssl.so` used by the application (e.g., MySQL).


### Step 1: Run Write-Only Workload for Application Binary

Use the same BOLT-instrumented MySQL binary and drive it with a write-only workload to capture `profile-writeonly.fdata`:

```bash
taskset -c 9 ./src/sysbench \
--db-driver=mysql \
--mysql-host=127.0.0.1 \
--mysql-db=bench \
--mysql-user=bench \
--mysql-password=bench \
--mysql-port=3306 \
--tables=8 \
--table-size=10000 \
--threads=1 \
src/lua/oltp_write_only.lua run
```

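Make sure that `--instrumentation-file` is set to save `profile-writeonly.fdata`. This option is passed to `llvm-bolt` when the binary is instrumented, so if the instrumented binary still writes to `profile-readonly.fdata`, re-instrument it with the new path (or move the read-only profile aside) before running this workload.

---

### Step 2: Verify the Second Profile Was Generated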

```bash
ls -lh /path/to/profile-writeonly.fdata
```

Both `.fdata` files should now exist and contain valid data:

- `profile-readonly.fdata`
- `profile-writeonly.fdata`

---

### Step 3: Merge the Feature Profiles

Use `merge-fdata` to combine the feature-specific profiles into one comprehensive `.fdata` file:

```bash
merge-fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata \
-o /path/to/profile-merged.fdata
```

**Example command from an actual setup:**

```bash
/home/ubuntu/llvm-latest/build/bin/merge-fdata prof-instrumentation-readonly.fdata prof-instrumentation-writeonly.fdata \
-o prof-instrumentation-readwritemerged.fdata
```

Output:

```
Using legacy profile format.
Profile from 2 files merged.
```

This creates a single merged profile (`profile-merged.fdata`) covering both read-only and write-only workload behaviors.
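
When more features are profiled later, it can be convenient to validate every input before merging. The following is a sketch under stated assumptions: `merge_profiles`, its argument order (output first, then inputs), and `DRY_RUN` are hypothetical, and `merge-fdata` is assumed to be on your `PATH`. It uses the same invocation form as the command above.

```bash
# Sketch: check each input profile, then merge them with merge-fdata.
# merge_profiles and DRY_RUN are hypothetical names used for illustration.
merge_profiles() {
    local output="$1"   # path of the merged profile to create
    local profile
    shift               # the remaining arguments are the input profiles

    if [ "$#" -lt 2 ]; then
        echo "ERROR: need at least two input profiles" >&2
        return 1
    fi

    # Refuse to merge if any input is missing or empty.
    for profile in "$@"; do
        if [ ! -s "$profile" ]; then
            echo "ERROR: $profile is missing or empty" >&2
            return 1
        fi
    done

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        printf '%s\n' "merge-fdata $* -o $output"
    else
        merge-fdata "$@" -o "$output"
    fi
}
```

For example, `merge_profiles /path/to/profile-merged.fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata` performs the same merge as the command above, but stops first if either input profile was not generated.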

---

### Step 4: Verify the Merged Profile

Check the merged `.fdata` file:

```bash
ls -lh /path/to/profile-merged.fdata
```

---

### Step 5: Generate the Final Binary with the Merged Profile

Use LLVM-BOLT to generate the final optimized binary using the merged `.fdata` file:

```bash
llvm-bolt build/bin/mysqld \
-o build/bin/mysqldreadwrite_merged.bolt_instrumentation \
-data=/home/ubuntu/mysql-server-8.0.33/sysbench/prof-instrumentation-readwritemerged.fdata \
-reorder-blocks=ext-tsp \
-reorder-functions=hfsort \
-split-functions \
-split-all-cold \
-split-eh \
-dyno-stats \
--print-profile-stats 2>&1 | tee bolt_orig.log
```

This command optimizes the binary layout based on the merged workload profile, creating a single binary (`mysqldreadwrite_merged.bolt_instrumentation`) that is optimized across both features.

