
Profile-Guided Optimization (PGO) benchmark results #118

Closed
zamazan4ik opened this issue May 14, 2024 · 5 comments
@zamazan4ik

Hi!

I recently tested Profile-Guided Optimization (PGO) on projects across different software domains - all the results are available at https://github.com/zamazan4ik/awesome-pgo . Since PGO shows measurable improvements in many cases (especially for parsers - see the repo above for xml-rs, quick-xml, and other similar libraries), I decided to perform PGO benchmarks on this library too. Here are my results - I hope they will be helpful for someone.

Test environment

  • Fedora 39
  • Linux kernel 6.8.9
  • AMD Ryzen 9 5900x
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler - Rustc 1.78.0
  • roxmltree version: latest master at commit 0a6e7afce2e05d7bc4108c675b3f14f0f2640e45
  • Disabled Turbo boost

Benchmark

For benchmarking, I use the project's built-in benchmarks. For PGO optimization I use the cargo-pgo tool. The Release results were gathered with taskset -c 0 cargo bench. The PGO training phase was done with taskset -c 0 cargo pgo bench, and the PGO optimization phase with taskset -c 0 cargo pgo optimize bench. taskset -c 0 is used to reduce the influence of the OS scheduler on the results. All measurements were done on the same machine, with the same background "noise" (as far as I can guarantee).
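For readers who want to reproduce this, the full command sequence looks roughly like the sketch below. It assumes a rustup-managed toolchain; cargo-pgo is installed from crates.io and relies on the llvm-tools-preview component.

```shell
# One-time setup (assumption: rustup-managed Rust toolchain)
cargo install cargo-pgo
rustup component add llvm-tools-preview

# Baseline numbers, pinned to one core to reduce scheduler noise
taskset -c 0 cargo bench

# PGO training phase: builds instrumented benchmarks and runs them,
# collecting raw profile data under the target directory
taskset -c 0 cargo pgo bench

# PGO optimization phase: rebuilds the benchmarks using the
# collected profiles and runs them again
taskset -c 0 cargo pgo optimize bench
```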

Results

I got the following results:

Release:

test huge_quick_xml                         ... bench:   3,059,195 ns/iter (+/- 21,204)
test huge_roxmltree                         ... bench:   4,423,745 ns/iter (+/- 24,817)
test huge_sdx_document                      ... bench:   9,182,301 ns/iter (+/- 63,025)
test huge_xmlrs                             ... bench:  30,894,169 ns/iter (+/- 888,408)
test huge_xmltree                           ... bench:  36,754,811 ns/iter (+/- 956,986)
test large_quick_xml                        ... bench:   1,792,145 ns/iter (+/- 13,892)
test large_roxmltree                        ... bench:   1,770,705 ns/iter (+/- 7,241)
test large_sdx_document                     ... bench:   4,822,998 ns/iter (+/- 11,973)
test large_xmlrs                            ... bench:  10,636,001 ns/iter (+/- 27,776)
test large_xmltree                          ... bench:  12,569,757 ns/iter (+/- 58,468)
test medium_quick_xml                       ... bench:     200,368 ns/iter (+/- 2,999)
test medium_roxmltree                       ... bench:     530,254 ns/iter (+/- 934)
test medium_sdx_document                    ... bench:   1,571,581 ns/iter (+/- 5,588)
test medium_xmlrs                           ... bench:   4,452,170 ns/iter (+/- 50,976)
test medium_xmltree                         ... bench:   5,078,954 ns/iter (+/- 17,516)
test roxmltree_iter_children                ... bench:       2,279 ns/iter (+/- 4)
test roxmltree_iter_descendants_expensive   ... bench:      92,121 ns/iter (+/- 703)
test roxmltree_iter_descendants_inexpensive ... bench:      26,103 ns/iter (+/- 1,157)
test tiny_quick_xml                         ... bench:       3,085 ns/iter (+/- 33)
test tiny_roxmltree                         ... bench:       2,909 ns/iter (+/- 16)
test tiny_sdx_document                      ... bench:      10,546 ns/iter (+/- 29)
test tiny_xmlrs                             ... bench:      15,142 ns/iter (+/- 162)
test tiny_xmltree                           ... bench:      17,507 ns/iter (+/- 65)
test xmltree_iter_descendants_expensive     ... bench:     218,836 ns/iter (+/- 6,074)
test xmltree_iter_descendants_inexpensive   ... bench:     136,539 ns/iter (+/- 3,554)

PGO optimized compared to Release:

test huge_quick_xml                         ... bench:   2,278,564 ns/iter (+/- 19,007)
test huge_roxmltree                         ... bench:   4,094,643 ns/iter (+/- 10,736)
test huge_sdx_document                      ... bench:   6,433,602 ns/iter (+/- 109,513)
test huge_xmlrs                             ... bench:  19,116,647 ns/iter (+/- 425,167)
test huge_xmltree                           ... bench:  24,554,023 ns/iter (+/- 369,390)
test large_quick_xml                        ... bench:   1,428,947 ns/iter (+/- 19,617)
test large_roxmltree                        ... bench:   1,554,022 ns/iter (+/- 26,051)
test large_sdx_document                     ... bench:   3,276,772 ns/iter (+/- 16,009)
test large_xmlrs                            ... bench:   6,535,937 ns/iter (+/- 73,676)
test large_xmltree                          ... bench:   8,091,513 ns/iter (+/- 35,686)
test medium_quick_xml                       ... bench:     184,094 ns/iter (+/- 401)
test medium_roxmltree                       ... bench:     479,932 ns/iter (+/- 6,205)
test medium_sdx_document                    ... bench:   1,200,572 ns/iter (+/- 4,982)
test medium_xmlrs                           ... bench:   2,545,488 ns/iter (+/- 30,415)
test medium_xmltree                         ... bench:   3,159,610 ns/iter (+/- 40,624)
test roxmltree_iter_children                ... bench:       2,223 ns/iter (+/- 88)
test roxmltree_iter_descendants_expensive   ... bench:      92,571 ns/iter (+/- 1,401)
test roxmltree_iter_descendants_inexpensive ... bench:      26,624 ns/iter (+/- 1,048)
test tiny_quick_xml                         ... bench:       2,377 ns/iter (+/- 15)
test tiny_roxmltree                         ... bench:       2,552 ns/iter (+/- 97)
test tiny_sdx_document                      ... bench:       7,468 ns/iter (+/- 34)
test tiny_xmlrs                             ... bench:       9,581 ns/iter (+/- 76)
test tiny_xmltree                           ... bench:      11,509 ns/iter (+/- 194)
test xmltree_iter_descendants_expensive     ... bench:     165,526 ns/iter (+/- 6,612)
test xmltree_iter_descendants_inexpensive   ... bench:      94,825 ns/iter (+/- 6,145)

(just for reference) PGO instrumentation compared to Release:

test huge_quick_xml                         ... bench:   5,323,934 ns/iter (+/- 27,810)
test huge_roxmltree                         ... bench:  10,184,293 ns/iter (+/- 29,271)
test huge_sdx_document                      ... bench:  16,927,468 ns/iter (+/- 123,906)
test huge_xmlrs                             ... bench:  43,160,874 ns/iter (+/- 241,916)
test huge_xmltree                           ... bench:  49,210,192 ns/iter (+/- 724,862)
test large_quick_xml                        ... bench:   3,148,443 ns/iter (+/- 26,140)
test large_roxmltree                        ... bench:   4,470,726 ns/iter (+/- 15,379)
test large_sdx_document                     ... bench:   8,748,503 ns/iter (+/- 84,547)
test large_xmlrs                            ... bench:  15,045,592 ns/iter (+/- 83,405)
test large_xmltree                          ... bench:  18,391,408 ns/iter (+/- 103,853)
test medium_quick_xml                       ... bench:     374,479 ns/iter (+/- 1,019)
test medium_roxmltree                       ... bench:   1,448,390 ns/iter (+/- 2,073)
test medium_sdx_document                    ... bench:   3,135,006 ns/iter (+/- 9,571)
test medium_xmlrs                           ... bench:   5,416,875 ns/iter (+/- 51,238)
test medium_xmltree                         ... bench:   6,436,398 ns/iter (+/- 161,354)
test roxmltree_iter_children                ... bench:       2,326 ns/iter (+/- 8)
test roxmltree_iter_descendants_expensive   ... bench:     230,030 ns/iter (+/- 668)
test roxmltree_iter_descendants_inexpensive ... bench:      92,757 ns/iter (+/- 328)
test tiny_quick_xml                         ... bench:       5,316 ns/iter (+/- 42)
test tiny_roxmltree                         ... bench:       6,792 ns/iter (+/- 23)
test tiny_sdx_document                      ... bench:      21,065 ns/iter (+/- 102)
test tiny_xmlrs                             ... bench:      21,936 ns/iter (+/- 237)
test tiny_xmltree                           ... bench:      26,819 ns/iter (+/- 170)
test xmltree_iter_descendants_expensive     ... bench:     449,644 ns/iter (+/- 5,387)
test xmltree_iter_descendants_inexpensive   ... bench:     282,320 ns/iter (+/- 5,873)

According to the results, PGO measurably improves the library's performance in many cases. I expected such or similar results since PGO is especially good for things like parsers.
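For reference, the relative improvements for the roxmltree cases can be computed directly from the Release and PGO-optimized numbers above:

```python
# ns/iter values for the roxmltree benchmarks, taken from the tables above
release = {"huge": 4_423_745, "large": 1_770_705, "medium": 530_254, "tiny": 2_909}
pgo = {"huge": 4_094_643, "large": 1_554_022, "medium": 479_932, "tiny": 2_552}

for name in release:
    # Percentage of time saved by the PGO-optimized build
    speedup = (1 - pgo[name] / release[name]) * 100
    print(f"{name}_roxmltree: {speedup:.1f}% faster")
# → huge: 7.4%, large: 12.2%, medium: 9.5%, tiny: 12.3%
```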

Further steps

I can suggest the following action points:

  • Perform more PGO benchmarks with other datasets (if you are interested enough). If they show improvements, add a note to the documentation (the README file, I guess) about the possible performance improvements with PGO.
  • You could also try to get some insights into how the code can be optimized further, based on the changes the compiler made under PGO. This can be done by analyzing flamegraphs before and after applying PGO to understand the difference.

Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO). However, I recommend starting with regular PGO, since it's a much more mature technology with far fewer limitations.

I would be happy to answer your questions about PGO.

P.S. Please do not treat this issue as a bug report or anything like that - it's just a benchmark report. Since the "Discussions" functionality is disabled in this repo, I created an issue instead.

@RazrFalcon
Owner

Thanks for looking into this. It's interesting that roxmltree had the least improvement from PGO. I guess it's already well optimized then?

Suggesting that you can get a 5-10% improvement with PGO is probably unnecessary, given how troublesome the setup is.
But finding bottlenecks with PGO sounds like an interesting idea, if only there were a way to get meaningful data out of it. I don't think looking at flamegraphs would help me much; I've already run this library under various profilers.
But if you know how to get some specifics/suggestions from PGO - I would be interested.

Can the trained PGO data be preserved somehow, or does it have to be gathered each time? We could include it in the repo if that is possible.

@zamazan4ik
Author

Thanks for looking into this. It's interesting that roxmltree had the least improvement from PGO. I guess it's already well optimized then?

I would say "is better optimized than other XML libraries" ;) Great job!

Suggesting that you can get 5-10% improvement with PGO is probably unnecessary, because how troublesome the setup is.

I would politely argue with that. How important a "5-10% performance improvement" is, compared to the cost of enabling a PGO build for an application, depends on multiple things: how difficult the application is to build, how much its developers care about performance, the project roadmap, etc. Documenting such an improvement can be valuable for the library's users: if an application spends enough time (from its developers' perspective, of course) parsing XML documents, they can consider enabling PGO once they see the benchmark numbers. It's a good thing to have - you don't need to patch your dependency to improve its performance; you can do it with an additional set of compiler switches.

How troublesome PGO enablement is, is a good topic to discuss. However, in my experience, the Rust ecosystem has one of the most straightforward (if not the most straightforward!) and easy-to-use ways to enable a PGO build for an application - via cargo-pgo. So the difficulties of enabling PGO for an application may be a bit exaggerated.

But finding bottlenecks with PGO looks like an interesting idea, if only there were a way to get some meaningful data from it. I don't think that looking at flamegraphs would help me much. I've already run this library under various profilers.
But if you know how to get some specifics/suggestions from PGO - I would be interested.

If you have already run the library under various profilers, then I guess the only remaining option is comparing the "PGOed" vs "non-PGOed" assembly of different roxmltree routines and trying to understand why the PGO-enabled versions are faster (optionally cross-checking with Agner Fog's guides). You may also find some useful information about "hot" things in the PGO profiles with llvm-profdata show (docs).
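For inspecting the profiles, the workflow would look roughly like this. The profile directory is an assumption here - cargo-pgo stores raw profiles under the target directory, but the exact path may differ by setup:

```shell
# Merge raw profiles into a single .profdata file
# (adjust the glob to wherever your .profraw files actually land)
llvm-profdata merge -o merged.profdata target/pgo-profiles/*.profraw

# List the functions with the highest execution counts in the profile
llvm-profdata show -topn=20 merged.profdata
```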

Can the trained PGO data be preserved somehow or it has to be gathered each time? We could include it in the repo if it is possible.

Short answer: yes, they can be preserved but with some limitations.

Long answer: PGO profiles are compiler-dependent, since they have an internal format tied to the compiler. Your PGO profiles can become stale (read: no longer useful for performance optimization) for multiple reasons: a compiler update (you almost certainly need to regenerate PGO profiles after the compiler is updated), a change in compiler flags like going from opt-level = 2 to opt-level = 3, etc. In each of these cases, you need to somehow detect that the profiles must be regenerated, and then regenerate them.

Instead, in such situations, I suggest preparing a PGO training scenario (like a bash script plus sample XML files - you already have those in the benchmarks) or something similar. That way you don't have to think about staleness and can simply re-run the training scenario whenever you want, e.g. on each build. A nice side effect: the library's users will also be able to use these scripts to generate their own PGO profiles for the library if they have a different target workload.
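Such a training scenario could be as small as the sketch below. The example binary name, sample paths, and output locations are all hypothetical - they would need to match whatever harness the repository actually provides:

```shell
#!/usr/bin/env bash
# Hypothetical PGO training scenario; "parse-example" and "samples/"
# are placeholders, not actual roxmltree paths.
set -euo pipefail

# 1. Build an instrumented version (emits .profraw files when executed)
cargo pgo build

# 2. Exercise the instrumented build on representative sample documents
#    (cargo-pgo may place binaries under target/<triple>/release)
for xml in samples/*.xml; do
    ./target/release/parse-example "$xml" > /dev/null
done

# 3. Rebuild using the freshly collected profiles
cargo pgo optimize build
```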

However, roxmltree is not an application - it's a library, and it is not delivered to users in precompiled form, so PGO would need to be integrated into the consumer's build process. Perhaps that can be done with build.rs for Rust libraries, but I've never tried it.

@adamreichold
Contributor

adamreichold commented May 14, 2024

I don't think directly applying PGO to non-binary crates is useful beyond investigations towards optimization as you already discuss. For applications using the library, you will definitely need profiles collected by running that application and not profiles based on one of your dependency's micro-benchmarks.

Real applications usually do more than just parse XML and their performance profiles might therefore look remarkably different even in the XML parser code itself, for example due to contention for shared resources like caches.

Sadly, this stubborn complexity of the real world also makes PGO significantly harder than just applying cargo-pgo, as you'll need a representative workload which you can exercise reproducibly. And if that workload misses important parts of your application, PGO could actually end up pessimizing instead of optimizing.

This is further complicated by a lot of applications like network services or control systems not performing mostly pure computations like parsers, compilers, etc. but reacting to outside events which can be difficult to capture in synthetic benchmarks.

@zamazan4ik
Author

Actually, I agree with you that PGO pre-optimization for libraries (especially libraries distributed mostly as source - exactly the Rust ecosystem's case) is a bit of an overreach. I haven't yet found any library that does such a thing. So documenting/mentioning the performance benefits somewhere visible to users may be all that can be done for the library. I have several examples of how this can be done in practice - but they are all for applications, not libraries.

For applications using the library, you will definitely need profiles collected by running that application and not profiles based on one of your dependency's micro-benchmarks.

It depends. If the library's PGO profile is "real-world enough", it's actually fine to use it for that part of your application even if the application itself doesn't use PGO for the whole binary (there are many reasons not to enable PGO for an application - you already mentioned several of them). However, I agree that collecting PGO profiles at the application level should be the default strategy. Of course, you aren't required to use the microbenchmarks above to create library-specific PGO profiles; e.g. you can prepare a dedicated "PGO training scenario" that performs "typical" operations with the library - that works as well.

Real applications usually do more than just parse XML and their performance profiles might therefore look remarkably different even in the XML parser code itself, for example due to contention for shared resources like caches.

Yep, but how much time an application spends parsing XML, and how important that is, only the application's developers can know. At least in my experience, I've had multiple performance issues with parsing libraries (JSON, XML). The README file also shows some performance numbers for this library - so at least somewhere it's quite important ;) So optimizing only a part of the application - roxmltree in our case - can still be beneficial for users.

Sadly this insolent complexity of the real-world also makes PGO significantly harder than just applying cargo-pgo as you'll need a representative workload which you can exercise reproducibly.

It depends on the application. For some applications you are right - the PGO process can be complicated. For others, it really can be done just by calling the cargo-pgo routines with a fairly simple workload that is pretty static and doesn't require much maintenance. I have examples of both cases. It depends.

And if that workload misses important parts of your application, this could actually end up pessimizing instead of optimizing.

Yep, it can happen in some cases. Good compilers like GCC have -fprofile-partial-training option for mitigating it (docs). Unfortunately, LLVM doesn't have such a thing yet but there are mitigations for it as well.

This is further complicated by a lot of applications like network services or control systems not performing mostly pure computations like parsers, compilers, etc. but reacting to outside events which can be difficult to capture in synthetic benchmarks.

Yep, you are right. For such systems, PGO profiles are collected directly from a "production" environment and then used during the compilation phase. There are multiple issues with this approach, like PGO profile delivery, storage, and instrumentation overhead (that's why Sampling PGO was created - to allow collecting PGO profiles from production environments). Companies like Google, Facebook, and Datadog (AFAIK) do exactly this. However, these details are not so important for the roxmltree PGO case :)

@RazrFalcon
Owner

I will agree with @adamreichold here. PGO is an interesting tool for applications, but not for libraries. And since there is not much we can do to improve roxmltree for everyone we should probably ignore it.

Speaking of applications, you can try PGO on resvg. It has ~1600 test files to use for training.
And since we distribute resvg binaries we might integrate PGO into CI build process.
