
Parallelize perfparser #394

Open
the8472 opened this issue Jul 2, 2022 · 21 comments

the8472 commented Jul 2, 2022

Is your feature request related to a problem? Please describe.
Opening large perf recordings (e.g. of a test suite spawning many binaries) takes minutes; most of the time is spent in a single hotspot-perfparser thread.

Describe the solution you'd like
Anything that makes it significantly faster; parallelizing the work is one candidate.

Describe alternatives you've considered
I have tried lowering the sampling rate and profiling fewer tests at a time, but that merely reduced it from lunch-break to coffee-break time.

Additional context
Tested on hotspot 1.3.0

milianw (Member) commented Oct 7, 2022

hey @the8472 - this request is super unhelpful. I would also like to parallelize it, but the problem is simply not easily amenable to being parallelized: sampling events need to be handled in order. The best we could do is potentially parallelize the unwinding for separate processes. Patches welcome, I guess, but again - not an easy task.
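
To illustrate that idea (a purely hypothetical sketch - the types and functions below are invented and are not perfparser's actual code), unwinding could be dispatched per pid, keeping the sample order within each process intact and merging the results back afterwards:

// Hypothetical sketch only - the types and functions here are invented for illustration.
#include <future>
#include <map>
#include <vector>

struct Sample {};        // raw sample: registers, stack snapshot, timestamp
struct UnwoundSample {}; // resolved call chain

// Stand-in for the real per-process DWARF unwinding step.
UnwoundSample unwindOne(const Sample &) { return {}; }

std::vector<UnwoundSample> unwindAllByPid(const std::map<int, std::vector<Sample>> &samplesByPid)
{
    std::vector<std::future<std::vector<UnwoundSample>>> jobs;
    for (const auto &entry : samplesByPid) {
        const std::vector<Sample> &samples = entry.second;
        jobs.push_back(std::async(std::launch::async, [&samples] {
            std::vector<UnwoundSample> out;
            out.reserve(samples.size());
            for (const auto &sample : samples) // order within one pid stays intact
                out.push_back(unwindOne(sample));
            return out;
        }));
    }
    std::vector<UnwoundSample> result;
    for (auto &job : jobs) {
        auto part = job.get();
        result.insert(result.end(), part.begin(), part.end());
    }
    return result; // a real implementation would still merge/sort by timestamp before emitting events
}

Whether this helps in practice depends on how much shared state (address space layout, symbol caches) the per-process unwinders would still need to synchronize on.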

the8472 (Author) commented Oct 7, 2022

this request is super unhelpful.

I'll see if I can get a flamegraph of the parsing that's slow for me. Would that help?

sampling events need to be handled in order

I don't know anything about the perf data format. Is it not separable into large chunks that can be processed independently and then joined? I had that impression because it occasionally prints a message that some chunks were lost and is able to recover from that.

the best we could do is potentially parallelize the unwinding for separate processes

That may help in at least one of my cases where I was profiling a benchmark suite which spawned ~10k short-lived, separately compiled processes.

the8472 (Author) commented Oct 11, 2022

Top-down and bottom-up flamegraphs of hotspot+perfparser loading a 9 GB perf data file, which takes several minutes.

flamegraph1
flamegraph2

milianw (Member) commented Nov 13, 2022

hey @the8472, thanks for the flamegraphs, but as-is this is still totally unactionable for me. The flamegraphs only show that DWARF resolution of inline frames is slow, as well as repeated mmap/munmap on your system. The latter is surprising, but I'm unsure I can do anything about it. The former is less surprising, as DWARF inline frame resolution is generally slow - if it's needed a ton, then it's simply slow. Why it is so slow in your case compared to the other scenarios I have looked at so far, I cannot say. Without a way for me to reproduce this issue, I cannot look into it.

If you want me to do anything about this, you will need to either document how you record the perf file such that I can reproduce the issue, or upload the perf.data file as well as all the binaries, libraries and debug files it references - this can easily become a very large tarball requiring multiple gigabytes of storage space. Upload it to a file-sharing service of your choice and share a link here. Note that this is obviously only an option if the binaries included are open source; if that includes proprietary code, there's nothing I can do about this task.

the8472 (Author) commented Nov 13, 2022

This will require beefy hardware or patience:

  1. Clone https://github.com/rust-lang/rust
  2. Install the dependencies
  3. Set up a build config (config.toml); setting download-ci-llvm = true is recommended to reduce build times, but not necessary
  4. Additionally set debuginfo-level = 2 in the [rust] section to build with full debuginfo
  5. Run ./x test --stage 1 ui to get things built
  6. Run perf record --call-graph dwarf -F 97 -e instructions,cycles ./x test --stage 1 --force-rerun ui to profile the UI test suite
  7. Open the profile in hotspot
  8. Observe the slow parser

milianw (Member) commented Nov 29, 2022

thanks, I'll see when I can find the time to replicate that environment. But for now, note that you should change your perf invocation to leverage leader sampling. Right now, you program the PMU to sample both instructions and cycles independently at ~97 samples per second. This is probably not what you want - instead, write it like this:

perf record --call-graph dwarf -F 97 -e "{cycles,instructions}:S" ...

This will sample on cycles and, whenever that happens, also record the instruction count.
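
For what it's worth, here is a rough illustration of what leader sampling means at the perf_event_open() level (an illustrative sketch only, not hotspot's or perf's actual code; the real tool configures many more attributes):

// Simplified sketch of what "{cycles,instructions}:S" means at the
// perf_event_open() level; the real perf tool sets far more fields.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>

static int perfEventOpen(perf_event_attr *attr, pid_t pid, int cpu, int groupFd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, groupFd, flags);
}

int openLeaderGroup(pid_t pid)
{
    perf_event_attr leader;
    std::memset(&leader, 0, sizeof(leader));
    leader.size = sizeof(leader);
    leader.type = PERF_TYPE_HARDWARE;
    leader.config = PERF_COUNT_HW_CPU_CYCLES;
    leader.freq = 1;
    leader.sample_freq = 97;                // -F 97
    leader.sample_type = PERF_SAMPLE_READ;  // read the whole group when the leader samples
    leader.read_format = PERF_FORMAT_GROUP;
    const int leaderFd = perfEventOpen(&leader, pid, -1, -1, 0);

    perf_event_attr member;
    std::memset(&member, 0, sizeof(member));
    member.size = sizeof(member);
    member.type = PERF_TYPE_HARDWARE;
    member.config = PERF_COUNT_HW_INSTRUCTIONS;
    // no sampling frequency of its own: the member is only counted, and its
    // value is read out whenever the cycles leader takes a sample
    perfEventOpen(&member, pid, -1, leaderFd, 0);

    return leaderFd;
}

Sampling only on the leader is also why the resulting perf.data is much smaller than with two independently sampled events.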

milianw (Member) commented Nov 29, 2022

Hm, I cannot compile the Rust compiler; it seems to not honor PATH, but instead looks only in the first entry for ar rather than continuing onwards?

$ ./x test --stage 1 src/test/ui
Building rustbuild
    Finished dev [unoptimized] target(s) in 0.03s
thread 'main' panicked at '

couldn't find required command: "/home/milian/.bin/ar"

', sanity.rs:59:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Build completed unsuccessfully in 0:00:00
$ which ar
/usr/bin/ar
$ echo $PATH
/home/milian/.bin:/home/milian/.local/bin:/home/milian/projects/compiled/other/bin:/home/milian/.bin/kf5:/home/milian/projects/compiled/kf5-dbg/bin:/home/milian/projects/compiled/other/bin:/home/milian/projects/compiled/kf5/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl

the8472 (Author) commented Nov 29, 2022

I think the build tools try to infer the ar path based on the C compiler path. If the default logic doesn't work, you can explicitly set its path in config.toml.

GitMensch (Contributor) commented:

@the8472: Would you mind recording as per the suggestion above and rechecking the loading with a current AppImage?

the8472 (Author) commented Dec 21, 2023

Which suggestion are you referring to?

the8472 (Author) commented Dec 21, 2023

-e "{cycles,instructions}:S"

Using this leads to tiny profiles (less than a megabyte) instead of gigabytes, and opening such a profile crashes hotspot.

GitMensch (Contributor) commented:

I can't reproduce that with some simple examples:

perf record --call-graph dwarf -F 97 -e "{cycles,instructions}:S" true
perf record --call-graph dwarf -F 97 true

both open fine (same for a bigger application).

Can you please share the perf file that crashes hotspot for you (if not already tested, please check the latest AppImage) and describe how you recorded it?

the8472 (Author) commented Dec 21, 2023

I'm rerunning my steps from #394 (comment), only modifying the record command:

perf record --call-graph dwarf -F97 -e "{cycles,instructions}:S" ./x test ui --stage 1 --force-rerun

perf.data.gz

It crashes both hotspot-v1.4.1-263-ga8d1440-x86_64.AppImage and hotspot-v1.3.0-277-g2bcd36d-x86_64.AppImage with a segfault:

/tmp/.mount_hotspoWNLRPL/AppRun.wrapped: line 6: 2288130 Segmentation fault (core dumped) LD_LIBRARY_PATH="$d/usr/lib":$LD_LIBRARY_PATH "$bin/hotspot" "$@"

milianw (Member) commented Jan 13, 2024

I think the crashes should be resolved in newer hotspot. But the fact that the data files are tiny sounds like a kernel bug - is that still the case now?

GitMensch (Contributor) commented:

Rechecked: the AppImage from October has that fault as above after loading for a long time; with the AppImage from today, hotspot opens directly, no crash any longer.

For the original case we likely still need a more reproducible step :-/

the8472 (Author) commented Jan 13, 2024

For the original case we likely still need a more reproducible step :-/

Can you explain what is not reproducible about #394 (comment)?

milianw (Member) commented Jan 15, 2024

It's just hard to set up, but I think it should be enough. I just need to find the time to replicate your setup, which is a hurdle for me. Last time I tried, I failed, and now I just need to try again with your suggestion on how to work around that issue.

the8472 (Author) commented Jan 15, 2024

Would a Dockerfile containing all the steps to reproduce it help? It'll be a bit tricky to also get the GUI to run in Docker, but I think it should be possible.

milianw (Member) commented Jan 15, 2024

A Dockerfile would help; I can then just access the data from the outside and use hotspot to analyze it. I.e. the Docker image just needs rust + perf and no UI at all, I think.

the8472 (Author) commented Jan 28, 2024

FROM archlinux

RUN pacman -Sy && pacman -S --noconfirm git perf base-devel

WORKDIR /opt
RUN git clone --depth 1 https://github.com/rust-lang/rust --branch master --single-branch rust

WORKDIR /opt/rust
RUN cp src/bootstrap/defaults/config.compiler.toml config.toml

CMD ./x test ui --stage 1 -- "does not exist" && perf record -m 128 -F59 --call-graph dwarf -e cycles:u ./x test ui --stage 1

Build and run it with:

podman build -t perfparser-test .
podman run -it --privileged perfparser-test

The --privileged flag is required to run perf record inside the container.

milianw (Member) commented Feb 11, 2024

Alright, I looked at this a bit now. It might theoretically help for this specific problem if we could parallelize the perfparser analysis at the pid level, since there are so many processes in the above recording.

Generally though, there is a ton of inherent overhead that I would like to see removed or reduced, but I fear elfutils is not prepared for that - most notably, we load the same libraries over and over and seem to spend most of the unwinding time on finding CFI data and building some elfutils-specific cache, only to throw it away again and rebuild it for the following process in quick succession. I.e. the problem really is that there are thousands of short-lived processes here, which is apparently the worst case for the analysis.
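
As a purely illustrative sketch (the names below are invented; this is not elfutils' or perfparser's real API), the kind of improvement described here amounts to sharing the expensive per-binary data - symbols and CFI/unwind tables - in a cache keyed by build-id, so that thousands of short-lived processes mapping the same libraries do not each pay for loading them again:

// Hypothetical sketch - not elfutils' or perfparser's real API.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct CachedElf {}; // would hold parsed symbols, CFI/unwind tables, debug info

class ElfCache
{
public:
    // Load each binary only once, keyed by its build-id, and share it across
    // all processes that map it, instead of rebuilding the data per process.
    std::shared_ptr<CachedElf> get(const std::string &buildId)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto &entry = m_cache[buildId];
        if (!entry)
            entry = loadElf(buildId);
        return entry;
    }

private:
    // Stand-in for the expensive ELF/DWARF parsing that currently happens per process.
    std::shared_ptr<CachedElf> loadElf(const std::string &) { return std::make_shared<CachedElf>(); }

    std::mutex m_mutex;
    std::map<std::string, std::shared_ptr<CachedElf>> m_cache;
};

If this were combined with per-pid parallelization, such a shared cache would need to be thread-safe, hence the mutex in the sketch.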
