Recovery kernel loop #122

mlange05 · 2016-11-13T09:08:19Z

This new feature enables user-level customisation of the error handling in the main kernel loop, in both JIT and SciPy mode. The core idea is that kernels can now throw different types of errors, e.g. ErrorCode.OutOfBounds, and that the user can specify custom kernels that override the default behaviour to create model-specific recovery behaviour. The recovery kernels themselves are always run in Python and can be injected via the recovery={ErrorCode.MyErrorCode: MyRecoveryKernel} argument.

Caveats:

Minor: OutOfBounds errors do not propagate the exact sampling location that caused the error, as SciPy mode does, but the location of the particle at the time. Adding that would require dynamically allocating memory for meta-information, which is left for another PR.
Major: Kernel errors do not trigger rollback! A kernel that updates particle information before throwing an error therefore leaves the particle data inconsistent: since time is not incremented, the kernel is applied again once error recovery has succeeded. This can be prevented by performing all field sampling before updating particle data, as most built-in kernels do.
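
The rollback caveat can be illustrated with a toy model that does not use the Parcels API. All names below (`Particle`, `sample`, `unsafe_kernel`, `safe_kernel`, `step_with_recovery`) are hypothetical; the sketch only assumes that a failed kernel is re-applied to the same particle after recovery, without any rollback of particle data.

```python
class OutOfBounds(Exception):
    """Stand-in for an out-of-bounds sampling error (hypothetical)."""


class Particle:
    def __init__(self, lon):
        self.lon = lon
        self.age = 0


def sample(lon):
    """Toy field sampling: only valid for 0 <= lon <= 10."""
    if not 0.0 <= lon <= 10.0:
        raise OutOfBounds(lon)
    return 1.0  # constant velocity


def unsafe_kernel(p):
    p.age += 1              # particle data is updated before sampling ...
    p.lon += sample(p.lon)  # ... so a failure here leaves age already bumped


def safe_kernel(p):
    u = sample(p.lon)       # all sampling first; a failure has no side effects
    p.age += 1
    p.lon += u


def step_with_recovery(kernel, p):
    """Apply the kernel once; on failure, recover and re-apply (no rollback)."""
    try:
        kernel(p)
    except OutOfBounds:
        p.lon = 5.0         # recovery: move the particle back into the domain
        kernel(p)           # time was not incremented, so the kernel runs again


a, b = Particle(lon=11.0), Particle(lon=11.0)
step_with_recovery(unsafe_kernel, a)
step_with_recovery(safe_kernel, b)
print(a.age, b.age)
```

Both particles end at the same position, but the one driven by `unsafe_kernel` has aged twice for a single timestep, which is the inconsistency described above.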

This now handles explicit out-of-bounds errors thrown by the field sampling routines and sets the corresponding error code. In the recovery part we then propagate the caught exceptions to the user via the default recovery kernels.

For this we separate the low-level execution routines for JIT and Python, check whether any kernels have failed after the first execution, and repeatedly apply recovery kernels before attempting the main timestepping loop again, all inside a while loop. Note that this can cause infinite looping!
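
The control flow described above can be sketched in plain Python. This is a toy model under stated assumptions, not the code from this PR: `ErrorCode.Success`, `Particle`, `advect`, `move_to_west_edge`, `run_steps`, and `execute` are hypothetical stand-ins, and the only behaviour taken from the description is "run, collect failing particles, apply the mapped recovery kernel, retry".

```python
from enum import Enum


class ErrorCode(Enum):
    Success = 0
    OutOfBounds = 1


class Particle:
    def __init__(self, lon):
        self.lon = lon
        self.step = 0
        self.state = ErrorCode.Success
        self.deleted = False

    def delete(self):
        self.deleted = True


def advect(p):
    """Main kernel: move east by 1; report an error outside [0, 10]."""
    new_lon = p.lon + 1.0
    if not 0.0 <= new_lon <= 10.0:
        return ErrorCode.OutOfBounds
    p.lon = new_lon
    return ErrorCode.Success


def move_to_west_edge(p):
    """User-supplied recovery kernel: put the particle back at lon = 0."""
    p.lon = 0.0


def run_steps(p, kernel, endstep):
    """Low-level loop: step until done or until the kernel reports an error."""
    while p.step < endstep:
        p.state = kernel(p)
        if p.state is not ErrorCode.Success:
            return  # step out of the time loop; p.step is not advanced
        p.step += 1


def execute(particles, kernel, endstep, recovery):
    for p in particles:
        run_steps(p, kernel, endstep)
    failing = [p for p in particles if p.state is not ErrorCode.Success]
    # No upper bound here: a recovery kernel that never fixes the particle
    # keeps this loop running forever.
    while failing:
        for p in failing:
            recovery[p.state](p)
            p.state = ErrorCode.Success
            if not p.deleted:
                run_steps(p, kernel, endstep)
        particles = [p for p in particles if not p.deleted]
        failing = [p for p in particles if p.state is not ErrorCode.Success]
    return particles


pset = execute([Particle(8.0), Particle(3.0)], advect, endstep=4,
               recovery={ErrorCode.OutOfBounds: move_to_west_edge})
print([p.lon for p in pset])
```

The first particle leaves the domain on its third step, is moved to the west edge by the recovery kernel, and then finishes its remaining two steps.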

erikvansebille · 2016-11-16T13:39:08Z

Code looks great @mlange05, glad you got it all to work after what looks like Herculean coding!

I'm really glad we now have this error framework sorted, and I like the syntax of recovery={ErrorCode.Error: CustomKernel}, it is very intuitive and still quite powerful.

A few general comments/caveats that we may (or may not) want to think about right now, or defer to later versions:

1. It is now very easy for the code to get into an infinite loop. For example, if the recovery function for ErrorOutOfBounds leaves the particle out of bounds, the recovery loop will never exit.
2. Is there a way to tell a particle to simply be deleted when it reaches the boundary? I tried this in test_execution_recover_out_of_bounds, but def MoveLeft(particle): particle.delete() made the code hang (see point above). Having a way to delete a particle when it reaches a boundary is important for non-global domains.
3. For periodic boundary conditions (global domains), it will be important to have separate ErrorOutOfBounds codes for the different sides of the domain. With periodic boundary conditions, we want particles that exit on the right (east) to enter on the left (west), while particles that exit on the left have to enter on the right. In fact, if the domain has periodic boundary conditions, could we ideally just use the modulo (%) operator to sample the fields?
4. We will also need to provide an 'OnLand' error, possibly via a grid.landmask field that the user can set. I'm happy to have a go at that, probably in a new branch once this Pull Request lands.
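
The modulo idea from the periodic-boundary comment above can be sketched as follows, assuming a longitude domain `[west, east)`. `wrap_longitude` is a hypothetical helper for illustration, not a Parcels function.

```python
def wrap_longitude(lon, west=0.0, east=360.0):
    """Map lon into [west, east) so either exit re-enters on the opposite side."""
    return west + (lon - west) % (east - west)


print(wrap_longitude(361.5))                        # exited east, re-enters in the west
print(wrap_longitude(-2.0))                         # exited west, re-enters in the east
print(wrap_longitude(190.0, west=-180.0, east=180.0))
```

Because Python's `%` returns a non-negative result for a positive divisor, a single expression covers both sides of the domain, so no per-side error code is needed in this simple case.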

mlange05 · 2016-12-01T06:56:37Z

Ok, thanks for the review. Regarding the comments:

1. That is a general problem and I'm not sure there is an easy answer to it. We can probably add some kind of fail-safe mechanism at some point, but I'd like to wait and see how often people get stuck on this before we spend time building it.
2. That was a real bug, and I believe it is now fixed in the latest commit.
3. Yes, and I think this will be covered in a follow-on PR. My intuition is that it requires the additional meta-data mentioned under caveats in the PR description, since we can determine the direction from the coordinate of the OutOfBoundsError, if that is provided.
4. Agreed. But like the above, this is a follow-on feature that should have its own PR.
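
One possible shape for the fail-safe mechanism mentioned in the first reply is a cap on the number of recovery attempts per particle. Nothing like this is part of the PR; `recover_with_limit`, `max_attempts`, and `RecoveryLimitExceeded` are hypothetical names, and the kernel is assumed to return `True` on success.

```python
class RecoveryLimitExceeded(RuntimeError):
    """Raised when recovery keeps failing (hypothetical fail-safe)."""


def recover_with_limit(particle, kernel, recovery_kernel, max_attempts=10):
    """Alternate kernel and recovery, giving up after max_attempts failures.

    kernel(particle) must return True on success and False on failure.
    Returns the number of recoveries that were needed.
    """
    for attempt in range(max_attempts):
        if kernel(particle):
            return attempt
        recovery_kernel(particle)
    raise RecoveryLimitExceeded(
        f"particle still failing after {max_attempts} recovery attempts")


def in_bounds(p):
    return 0.0 <= p["lon"] <= 10.0


def reset_to_west(p):
    p["lon"] = 0.0


def do_nothing(p):
    pass  # a recovery kernel that leaves the particle out of bounds


print(recover_with_limit({"lon": 12.0}, in_bounds, reset_to_west))
try:
    recover_with_limit({"lon": 12.0}, in_bounds, do_nothing, max_attempts=3)
except RecoveryLimitExceeded as exc:
    print("gave up:", exc)
```

A cap turns the silent hang into an explicit error, at the cost of having to choose a limit that does not cut off legitimate multi-step recoveries.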

Michael Lange added 25 commits November 10, 2016 11:56

Particle: Add test to verify timestepping and endtimes after execute

71723ae

Particle: Don't artificially invert runtime argument in execute

8b93266

Kernel: Add test for timed failure after 10 timesteps

afd5b2f

Kernel: Add more KernelOps, like Repeat and FailOutOfBounds

6c45c0c

Kernel: Remove forward/backward special-casing from timestepping loop

76c85f7

Kernel: Make particle deletion part of the KernelOp API

4daa322

Kernel: Removing deprecated adaptive keyword from LoopGenerator

e59b593

All kernel loops are adaptive these days.

Kernel: Move failing particles to temporary sets and deal with them after

58be153

Kernel: Ensure failing kernels step out of the time loop

fdaed33

KernelError: Renaming KernelOp to ErrorCode and adding recovery map

cc7f29a

The whole thing is now moved into its own sub-module `kernels.error`.

CodeGen: Re-write loop structure to have single, sign-aware time-loop

cd39f81

Kernel: Adjust JIT main loop to honour error clauses

bd0ed2a

Also slightly re-writes the special-casing in the Python kernel loop and enables the timed failure test for JIT.

Tests: Enforcing np.float32 data types to suppress warnings

dfffc89

Tests: Update pytest marker in setup.cfg

578782a

Rename ErrorCode.Fail => ErrorCode.Error

f75e713

Kernel: Add execution tests for Python and out-of-bounds errors

4a1be9e

We also now explicitly test the size of the final ParticleSet and that the right error has been thrown during the execution loop.

Kernel: Store return code on particle in JIT execution loop

4995e51

This is necessary to propagate the error to the Python recovery loop.

JIT: Include a return code in C grid sampling routines

107c793

CodeGen: Perform field eval into a temporary via a statement stack

5afcd48

Codegen: Use error codes and explicit conversion in field eval calls

f29338b

CodeGen: Generate and propagate out-of-bounds errors in JIT

75ecdb7

Recovery: Exposing recovery kernels through execute() and adding test

ba5944c

CodeGen: Small bug fix for explicit deletion in JIT mode

b8b29d5

Merge branch 'master' into recovery-kernel-loop

c4411dc

erikvansebille mentioned this pull request Nov 30, 2016

Alternative spatial interpolation in JIT #126

Merged

Recovery: Ensure deletion in recovery kernels is honoured and tested

7921357

Merge branch 'master' into recovery-kernel-loop

4f50ebb

erikvansebille merged commit f239a86 into master Dec 2, 2016

erikvansebille deleted the recovery-kernel-loop branch December 2, 2016 08:44

erikvansebille mentioned this pull request Dec 2, 2016

Improving out-of-bounds check: meta-data on location #127

Closed

erikvansebille mentioned this pull request Dec 10, 2016

Check that particles remain within grid domain #47

Closed
