
Exceptions and stack trace overheads #45999

Open
kpamnany opened this issue Jul 11, 2022 · 8 comments

@kpamnany (Contributor) commented Jul 11, 2022

We observe two significant performance issues with exception handling:

  • With our sysimage (~3X larger than the stock Julia sysimage), gathering a displayable stack trace takes ~4 seconds. Profiling shows that this time is mostly spent in LLVM (SectionRef::containsSymbol and ELFObjectFile::getSymbolAddress). Any suggestions/ideas for how to improve this would be very helpful! (A minimal timing sketch that separates this lookup cost from the raw unwind cost follows this list.)
  • throw can take very slow paths. Here's one:
jl_throw
record_backtrace
rec_backtrace
jl_unw_stepn
jl_unw_step
unw_step
dwarf_step
find_reg_state
fetch_proc_info
tdep_find_proc_info # def'd to dwarf_find_proc_info
sigprocmask
dl_iterate_phdr
sigprocmask
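
For context (not from the original report), a minimal Julia sketch along these lines can separate the two costs above: the raw stack walk that record_backtrace pays on every throw, and the symbol/DWARF lookups needed to make the trace displayable (where the LLVM symbol search shows up). The numbers will of course depend on the sysimage and the environment.

```julia
# Minimal sketch: time the raw unwind separately from symbolication/rendering.
bt = backtrace()                                     # warm-up capture (raw instruction pointers)
t_unwind = @elapsed backtrace()                      # raw stack walk only
t_lookup = @elapsed stacktrace(bt)                   # symbol/DWARF lookups per frame
t_render = @elapsed sprint(Base.show_backtrace, bt)  # what an error display actually pays

println("unwind: $t_unwind s, lookup: $t_lookup s, render: $t_render s")
```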

dl_iterate_phdr walks through a list of shared objects; this can be a long list. We observe this call graph reasonably often; I'm not sure if that has anything to do with the environment (EC2 instance). Our large sysimage seems to exacerbate the performance problem here as well. I also found that exceptions are thrown not infrequently in type inference, which is another performance hit.
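
To make the "exceptions as control flow" cost concrete (illustrative only, not the actual inference code), a microbenchmark along these lines shows what each throw pays for jl_throw plus record_backtrace compared with an ordinary branch:

```julia
# Illustrative only: compare control flow via throw/catch, which pays
# jl_throw + record_backtrace on every throw, with a plain sentinel branch.

struct NotFound end

function parse_or_throw(s)
    x = tryparse(Int, s)
    x === nothing && throw(ArgumentError("not an Int: $s"))
    return x
end

function parse_or_sentinel(s)
    x = tryparse(Int, s)
    return x === nothing ? NotFound() : x
end

function via_exceptions(strs)
    failures = 0
    for s in strs
        try
            parse_or_throw(s)
        catch
            failures += 1
        end
    end
    return failures
end

function via_sentinel(strs)
    failures = 0
    for s in strs
        parse_or_sentinel(s) isa NotFound && (failures += 1)
    end
    return failures
end

strs = fill("not an int", 10_000)
via_exceptions(strs); via_sentinel(strs)   # warm up / compile first
@time via_exceptions(strs)                 # every iteration unwinds the stack
@time via_sentinel(strs)                   # plain branch, no unwinding
```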

I'm not sure what a good solution here would be. A couple of ideas:

  • At throw time, record the bare minimum and build out the backtrace only when/if it is needed. Implementing this might be gnarly. (A user-level sketch of this split follows the list below.)
  • Introduce a throw_light or similar when using exceptions for control flow. This is kinda ugly.
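
As a point of reference for the first idea: at the user level Julia already splits capture from symbolication (catch_backtrace returns raw instruction pointers; stacktrace/show_backtrace do the expensive lookups later). A rough sketch of leaning on that split is below; the proposal above would go further and defer work inside jl_throw/record_backtrace itself, which cannot be done from Julia code. DeferredTrace, capture_deferred, and render are made-up names for illustration.

```julia
# Sketch: capture only raw instruction pointers when an exception is caught,
# and defer the expensive symbol/DWARF lookups until the trace is displayed.

struct DeferredTrace
    exc::Any
    bt::Vector        # raw instruction pointers from catch_backtrace()
end

function capture_deferred(f)
    try
        return f()
    catch exc
        return DeferredTrace(exc, catch_backtrace())   # no symbol lookups yet
    end
end

# Only pay the symbolication cost when somebody actually looks at the trace.
function render(dt::DeferredTrace)
    io = IOBuffer()
    showerror(io, dt.exc)
    Base.show_backtrace(io, dt.bt)
    return String(take!(io))
end

result = capture_deferred(() -> error("boom"))
result isa DeferredTrace && println(render(result))
```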

Any other ideas? I do see references to caching certain information in libunwind code, but I'm not familiar enough with the codebase to understand if this is being done sufficiently.

Cc: @vchuravy, @JeffBezanson, @vtjnash

Edited to correct per the following two comments.

@kpamnany added the labels performance (Must go faster) and error handling (Handling of exceptions by Julia or the user) on Jul 11, 2022

@vchuravy (Sponsor Member)

dl_iterate_phdr is a libc call. Can you also post the screenshot you showed me that was 60% in a syscall, since we hit the slow path in libunwind?

@vchuravy (Sponsor Member)

To be precise, the remote sleuthing I did pointed towards https://github.com/libunwind/libunwind/blob/3be832395426b72248969247a4a66e3c3623578d/src/dwarf/Gfind_proc_info-lsb.c#L806-L808

The profile had ~60% of the time spent in the syscall sigprocmask, and the function it was called from was find_proc_info, as in the chain Kiran showed above:

  1. find_proc_info
  2. fetch_proc_info
  3. find_reg_state
  4. dwarf_step

In particular, there is a different path, at https://github.com/libunwind/libunwind/blob/1f79a05edbd5c06240f8a15187b106831076540e/src/dwarf/Gparser.c#L468, that we are not hitting and that might be faster.

But it is kinda ludicrous that each dwarf_step involves a walk of potentially all the object files loaded, and there should be some caching there?

@vtjnash (Sponsor Member) commented Jul 11, 2022

That syscall also seems unjustified: libunwind/libunwind@d3fad3a
We could disable it at build time to avoid the penalty for a feature we possibly don't need or want. However, I believe it is protecting us from the internal lock in dl_iterate_phdr causing deadlocks for the process (since libunwind promises to be async-signal-safe). On macOS we do something roughly similar by calling _dyld_atfork_prepare to get that same lock before pausing a task for profiling.

@kpamnany (Contributor, Author) commented Jul 11, 2022

[attached screenshot: profile]

This profile? Had no dl_iterate_phdr... but yes, sigprocmask.

Edit: ah, it was in the code we were looking at.

@kpamnany (Contributor, Author)

I believe it is protecting us against the internal lock in dl_iterate_phdr from causing deadlocks for the process (since libunwind promises to be async-signal-safe).

That sounds unavoidable then?

@vchuravy's suggestion:

each dwarf_step involves a walk of potentially all the object files loaded and there should be some caching there?

Sounds reasonable to me. For a particular backtrace exploration, will the set of object files loaded be static? Can a thread dlclose a shared object and invalidate the cache?
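
Purely as an illustration of the invalidation question (the real cache would live in libunwind or the runtime, not in Julia), one could key a lookup cache on the set of currently loaded shared objects, e.g. via Libdl.dllist(), and drop it whenever that set changes. LookupCache and lookup! below are hypothetical names; only Libdl.dllist, backtrace, and StackTraces.lookup are existing APIs.

```julia
# Toy cache keyed on the set of loaded shared objects: a dlopen/dlclose
# changes Libdl.dllist(), which invalidates all cached frame lookups.
using Libdl

mutable struct LookupCache
    loaded::Vector{String}   # snapshot of Libdl.dllist()
    frames::Dict{Ptr{Cvoid},Vector{Base.StackTraces.StackFrame}}
end

LookupCache() = LookupCache(Libdl.dllist(),
                            Dict{Ptr{Cvoid},Vector{Base.StackTraces.StackFrame}}())

function lookup!(cache::LookupCache, ip::Ptr{Cvoid})
    current = Libdl.dllist()
    if current != cache.loaded          # a dlopen/dlclose happened: invalidate
        empty!(cache.frames)
        cache.loaded = current
    end
    return get!(cache.frames, ip) do
        Base.StackTraces.lookup(ip)     # the expensive symbol/DWARF lookup
    end
end

cache = LookupCache()
for ip in backtrace()
    ip isa Ptr{Cvoid} && lookup!(cache, ip)
end
```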

@vtjnash (Sponsor Member) commented Jul 11, 2022

It might, and that might get us wedged into a really bad state if it happened anyway. In practice, never use dlclose if you care about this.

@kpamnany (Contributor, Author)

On reflection, I don't think that could happen. If we're walking the stack for an exception and an address from a shared object is on the stack, it should be impossible for the reference count of the shared object to drop to 0.

@vchuravy (Sponsor Member)

This profile? Had no dl_iterate_phdr... but yes, sigprocmask.

Yeah I think the frames were inlined.

@kpamnany self-assigned this on Sep 20, 2022