Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception stacks #28878

Merged
merged 16 commits into from Oct 23, 2018

Conversation

@c42f
Copy link
Contributor

commented Aug 24, 2018

An exception stack system for julia

Here's my attempt to fix #12485 / #19979 by introducing an exception stack
system. The basic problem is that you may throw a new exception while handling
an exception. What do we do then? The existing runtime simply overwrites the
first exception, loosing any information it might have had about the root
cause.

A distinct problem discussed in #12485 is that task switching looses all
exception and backtrace information. I've tackled this at the same time
though in hindsight it might have been better to do the exception stack
first, followed by fixing task switching.

Design

To solve the first problem, I've implemented an exception stack mechanism.
Conceptually:

  • Every throw() pushes an (exception,backtrace) pair onto a stack of current
    exceptions.
  • When you exit a catch block normally, any exception generated within the
    corresponding try is considered to have been handled. The stack is therefore
    popped back to the state at the entry of the corresponding try.
  • The full stack is available for root cause analysis via Base.catch_stack().

For example:

try
    error("A")
catch
    try
        error("B")
    catch
        for (e, bt) in Base.catch_stack()
            showerror(stdout, e, bt)
            println()
        end
    end
end

Produces

A
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at ./REPL[3]:2
 [3] eval(::Module, ::Any) at ./boot.jl:319
 [4] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85
 [5] macro expansion at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
 [6] (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})() at ./task.jl:259
B
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at ./REPL[3]:5
 [3] eval(::Module, ::Any) at ./boot.jl:319
 [4] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85
 [5] macro expansion at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
 [6] (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})() at ./task.jl:259

Implementation

Lowering

A new lowered expression head :pop_exc is introduced and emitted in any
location where a catch block exits normally (either by falling out the bottom,
or using return, break, goto). This pops the exception stack back to the state
of the associated enter. There needs to be some way to track the associated
enter in the IR, so I chose to emit an SSAValue token from enter which is
consumed by pop_exc. I tried various alternatives, but this was by far the
nicest in terms of non-disruptive integration into the SSAIR processing code,
and supporting both the interpreter and codegen.

finally lowering no longer includes a foreigncall, so expressions using
finally can now be interpreted. This includes the output of @test so may
give a nice speed boost.

Runtime

  • The exception in transit is no longer a single pointer in the thread local
    storage. It's stored in ptls->current_task->exc_stack instead, along with
    associated backtrace. Space for the exception and backtrace is allocated
    on demand inside throw_internal, after changing GC mode.

  • The exception stack is copied out onto the Task naturally preserved during task switching.
    I've been back and forth on several variations here, as the exact detail
    of how this is done depends on how the memory for the exception
    stack is allocated (GC vs plain malloc).

  • jl_current_exception() is now the method to retreive the current exception
    from within a JL_CATCH dynamic context.

  • _resetstkoflw() handling is consolidated into jl_eh_restore_state.
    Needs proper testing on windows.

  • Several places where manual restoration of the exception and backtrace was
    done are no longer necessary because the stack system handles these
    automatically.

  • An exception stack type, jl_exc_stack_t and associated manipulation
    functions has been added. Conceptually stores a stack of
    (exception,backtrace) pairs. It's stored in a shared buffer so we can avoid
    reallocating when throwing and catching the same exception or set of
    exceptions repeatedly.

  • jl_eh_restore_state has become non-inline. It seemed good to get this out
    of the public header.

  • jl_sig_throw is now used with jl_throw_in_ctx from signal handlers, to
    make the special circumstances clear and to avoid conflation with rethrow
    which now simply rethrows the existing exception stack.

Interpreter / Codegen

Some minor rearrangement in the interpreter to make the eh buffer management
and leave handling a little easier to understand as I found this surprisingly
confusing.

Mostly small changes in codegen. Representing the enter token as an SSAValue
really integrates well here when you need to retreive it for the associated
pop_exc.

GC

What? How did this rabbit hole lead down into the GC?

Unfortunately the raw backtrace buffers contain references to GC allocated
objects. Since these buffers can now live on a Task, their contents need to
be GC rooted. Unfortunately the raw buffers are also in a non-julia native
format, so I had to write specialized GC code for them.

I'm not sure this was a good idea, especially having extra code to maintain in
this rather sensitive part of the codebase. Perhaps it would be better to
modify our raw backtrace buffer format so that it can be understood by the GC
without major changes.

TODO

  • Check some potential GC issues I found in testing.
  • Decide on detail of Base.catch_stack() API and export it.
  • Double check consolidated _resetstkoflw handling on windows (checked on cross compiled version via wine on ubuntu)
  • Tests for exception preservation during task switching
  • Reinstate rooting of temporary bt_data
  • Check that access to backtraces for failed Tasks still works correctly and figure out how to access the full exception stack for those.

Optional

  • Reconsider best practices for rethrow(exc) vs just throwing a new exception. Can be done separately
  • jl_apply_with_saved_exception_state may be unnecessary now
  • jl_rethrow_other is probably unnecessary now is required for rethrow(ex)
  • Persist backtrace when task exits with an exception
@iamed2

This comment has been minimized.

Copy link
Contributor

commented Aug 24, 2018

I won't comment on the technical aspects of the solution but I like it conceptually! And I'm happy to see those issues resolved, they made error handling in Dispatcher and stacktrace generation in Memento tricky business.

@ararslan ararslan requested review from JeffBezanson, Keno and yuyichao Aug 24, 2018

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Aug 25, 2018

@iamed2 Hah yes I originally came across this when working on logging.

To add to the commentary above, One of my biggest difficulties with this patch was the need to GC root objects in the bt buffer, and also GC the buffer itself in the case that it's attached to a Task. Combine this with the need to actually fill the buffer from inside a signal handler (and potentially other restricted contexts) and you're in a tricky situation. There's probably several holes I've yet to plug there.

Actually I hope to delete most of the GC changes which were done here as they seem like overkill. My current best idea is to use the new efficient array-of-small-Union support to represent the raw backtrace buffer as a julia array of Union{Ptr{Nothing},UInt,CodeInfo,...} which could be filled without allocation and with only a little more book keeping than what we currently do. Need to sort out whether we can call jl_gc_wb in the contexts where we fill the buffer.

Is this a sound idea? Advice on how to improve the memory management in this patch would be particularly welcome.

@c42f c42f force-pushed the cjf/exception-stack branch 2 times, most recently from af0afe3 to f76eb32 Aug 25, 2018

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Aug 25, 2018

Ok, CI failures are because I've got something wrong in GC land and the debug build is failing on an assertion. My best guess is that I've broken whatever invariants jl_gc_alloc_buf are meant to obey.

Anyway, the release build does "work" so the branch is in a state where the curious can play with it until the real problem is fixed.

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Aug 27, 2018

Looking into alternative storage for backtrace buffers, the Union isn't going to work because we really need to store pointers in there. So perhaps we're stuck with two parallel arrays instead like the current bt, bt2 which appears in some places. Or just put up with having some special purpose code in the gc.

@JeffBezanson

This comment has been minimized.

Copy link
Member

commented Sep 6, 2018

Thanks for tackling this; great work! I think this is basically the right thing. I'll review in detail at some point. A bit of custom code for GC marking doesn't seem so bad to me.

@JeffBezanson JeffBezanson added this to the 1.1 milestone Sep 6, 2018

@c42f c42f force-pushed the cjf/exception-stack branch 2 times, most recently from aca0df5 to 9d51768 Sep 10, 2018

src/gc.c Outdated
@@ -1901,7 +1901,7 @@ excstack: {
size_t i = stackitr->i;
while (itr > 0) {
size_t bt_size = jl_exc_stack_bt_size(exc_stack, itr);
uint64_t *bt_data = jl_exc_stack_bt_data(exc_stack, itr);
uintptr_t *bt_data = jl_exc_stack_bt_data(exc_stack, itr);

This comment has been minimized.

Copy link
@c42f

c42f Sep 18, 2018

Author Contributor

Hah, this was a hilariously stupid bug which cost a lot of time due to randomly busted GC on 32 bit, I don't know what I was thinking using uint64_t here :-/

It would have been caught early if we were able to use -Werror on CI builds but I suspect that is quite disruptive in other ways.

This comment has been minimized.

Copy link
@StefanKarpinski

StefanKarpinski Oct 11, 2018

Member

It would have been caught early if we were able to use -Werror on CI builds but I suspect that is quite disruptive in other ways.

Please, we should absolutely do this. We've had a creeping increase of compiler warnings in the past month or so.

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Sep 18, 2018

Hmm, just encountered a spurious test failure on 32 bit due to #27788.

@c42f c42f force-pushed the cjf/exception-stack branch from b853caf to bcc7bab Sep 18, 2018

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Sep 18, 2018

Tests have finally passed at least once on all systems and I've done a fair amount of cleanup. Should be good for a closer look.

I've exported catch_stack() for now, it could do with a better name / better cohesion with the catch_backtrace() API.

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Sep 19, 2018

Ok, tests have passed after restarting the appveyor x86_64 build three times. There is an intermittent failure in a single one of the spawn tests on that platform related to deleting a directory, see https://ci.appveyor.com/project/JuliaLang/julia/build/1.0.30766/job/wnxk1pxa27xte6un for example.

This seems to be the test from #26687 coming back again after #28107 was merged.

The fact that it's recurring here more frequently could be related to the changes in this PR.

@c42f c42f force-pushed the cjf/exception-stack branch from ebf472a to 66e934c Oct 3, 2018

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Oct 3, 2018

I just rebased this again to clean up some recent conflicts, and to consolidate a lot of WIP commits into a small number of consistent changes.

This reminded me of an interesting side effect of this patch which I should point out — code generated by @test should now be able to run in the interpreter. This used not to work because @test emits a finally block, and the lowering of finally previously contained a :foreigncall to jl_rethrow_other which caused the interpreter to be bypassed.

c42f added some commits Jun 28, 2018

Lowering for exception stacks
* A new lowered expression head :pop_exc is introduced and emitted in any
  location where a catch block exits normally (either by stepping out,
  or using return, break, goto). The semantics of :pop_exc are to pop
  the exception stack back to the state of the associated enter.

* Make Expr(:enter) return a token which may be consumed by :pop_exc,
  thereby allowing the interpreter and codegen to know which :enter
  state should be used to pop the exception stack.  I tried various
  alternatives for this association, but this was by far the nicest in
  terms of non-disruptive integration into the SSAIR processing code,
  and supporting both the interpreter and codegen.
Consolidate _resetstkoflw handling
The logic for this was repeated in three quite different ways, in
JL_CATCH, interpreter and codegen but putting it into the runtime in
jl_eh_restore_state should be sufficient.

Also move jl_eh_restore_state out of main header - seems unnecessary to
expose such implementation detail, and inconsistent with other eh
functions.
Add exception stack system to runtime
Runtime
-------

* An exception stack type, `jl_exc_stack_t` and associated manipulation functions
  has been added. Conceptually this stores a stack of (exception,backtrace)
  pairs. It's stored in a contiguous buffer so we can avoid reallocating when
  throwing and catching the same exception or set of exceptions repeatedly. Space
  for the exception and backtrace is allocated on demand inside `throw_internal`,
  after changing GC mode. Several variations were tried for allocating this
  sgorage, including allocating up front with malloc on the thread local state
  and copying during task switching. Keeping everything on the task seemed the
  simplest as it involves the least copying and keeps track of the exception
  stack buffers in a unified way.

* The exception in transit is no longer a single pointer in the thread local
  storage. It's stored in `ptls->current_task->exc_stack` instead, along with
  associated backtrace.

* `jl_current_exception()` is now the method to retreive the current exception
  from within a `JL_CATCH` dynamic context.

* Several places where manual restoration of the exception and backtrace was
  done are no longer necessary because the stack system handles these
  automatically. This code is removed, including
  `jl_apply_with_saved_exception_state`.

* `jl_eh_restore_state` has become non-inline. It seemed good to get this out
  of the public header.

* `jl_sig_throw` is now used with `jl_throw_in_ctx` from signal handlers, to
  make the special circumstances clear and to avoid conflation with rethrow
  which now simply rethrows the existing exception stack.

* Make rethrow outside a catch block into an error.

* Use `ptls->previous_exception` to support `jl_exception_occurred` in
  embedding API.

* finally lowering no longer includes a `foreigncall`, so expressions
  using finally can now be interpreted.

Interpreter / Codegen
---------------------

Mostly small changes here, to track the `:enter` and `:pop_exc` association
with the SSAValue token. The token SSAValue slot is ideal here for storing the
state of the exception stack at the `:enter`.

GC
--

Integrate exception and raw backtrace scanning as a special case into GC mark
loop. This is necessary now that Task objects can refer to exception and
backtrace data in raw form.

@c42f c42f force-pushed the cjf/exception-stack branch from 66e934c to 5333b07 Oct 5, 2018

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Oct 22, 2018

Currently this isn't being used much, right

Yes the stack is used underneath but isn't generally exposed to the user unless they called catch_stack themselves. I've got some hacky WIP for exposing it but I'm not sure about what's best yet. We should definitely open an issue.

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Oct 23, 2018

@ararslan The freebsd buildbot (normally most reliable and awesome of all build bots!) is having a problem today and seems to reproducibly fail to download the pcre depedency. Restarting multiple times has not helped.

@ararslan

This comment has been minimized.

Copy link
Member

commented Oct 23, 2018

Should be fixed now. The currently running build looks poised to complete successfully.

@c42f

This comment has been minimized.

Copy link
Contributor Author

commented Oct 23, 2018

@ararslan Thanks so much!

Pending CI passing once more and the (github outage related) JuliaLang/julia merge lockout being removed, I will go ahead and merge this. Hopefully later today.

@c42f c42f merged commit 24f1316 into master Oct 23, 2018

3 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
julia freebsd ci Build done
Details

@c42f c42f deleted the cjf/exception-stack branch Oct 23, 2018

@iamed2

This comment has been minimized.

Copy link
Contributor

commented Oct 23, 2018

Congrats on this!! I'm excited to use these when I can!

c42f added a commit that referenced this pull request Oct 31, 2018

initialize previous_exception (#29861)
Fix uninitialized previous_exception bug introduced in #28878

fredrikekre added a commit that referenced this pull request Dec 1, 2018

fredrikekre added a commit that referenced this pull request Dec 1, 2018

fredrikekre added a commit that referenced this pull request Dec 3, 2018

fredrikekre added a commit that referenced this pull request Dec 4, 2018

fredrikekre added a commit that referenced this pull request Dec 4, 2018

fredrikekre added a commit that referenced this pull request Dec 5, 2018

Addition of NEWS and compat admonitions for important
changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.

Co-authored-by: Morten Piibeleht <morten.piibeleht@gmail.com>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>

fredrikekre added a commit that referenced this pull request Dec 5, 2018

Addition of NEWS and compat admonitions for important
changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.

Co-authored-by: Morten Piibeleht <morten.piibeleht@gmail.com>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>

fredrikekre added a commit that referenced this pull request Dec 5, 2018

Compat admonitions and NEWS for Julia 1.1 (#30230)
Addition of NEWS and compat admonitions for important changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.


Co-authored-by: Morten Piibeleht <morten.piibeleht@gmail.com>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.