Store task exceptions #68

RealOrangeOne · 2024-07-09T22:53:21Z

Closes #63

Tasks can now store the exception they raised, if they raised one.

The serialization process isn't perfect, since it must be JSON-compatible. Some information (such as traceback) is lost as a result. I suspect in most cases, an instance check or extracting the message, which is possible here. The format is internal, so can be expanded in future if needed.

This is less of a foot-gun than receiving a different exception than expected.

ksamuel · 2024-07-10T08:48:03Z

While storing the entire stack with full objects is not necessary, having at least the textual representation of the traceback would be fantastic for debugging. You can't look at an exception message and guess what went wrong most of the time. And debugging async task is very hard, so having a clue with file paths and lines would be super helpful.

Even if you have logging well setup (which a lot of people don't have), at some point people will build django admin plugins to monitor tasks. Being able to click on that and see a string representation of the traceback in there would be huge.

RealOrangeOne · 2024-07-10T09:16:10Z

The full exception is logged, so for debugging the traceback is still accessible. From a code perspective, it should only be the exception instance itself which is needed (at least in most cases).

ksamuel · 2024-07-10T12:55:55Z

True, but again:

People who will need it the most are typically those who don't know how to setup proper logging. Once it's not in the console (e.g: in production), they will be blind.
You will need to correlate the info in the logs with the exact job that triggered the error. It's not fun to day.
If you are in the shell / notebook / debugger, you can get the info from the task object by just looking at it, which is very handy.

Since I understand you might start to feel like I talk a lot while you do all the hard work, maybe I could contribute a PR?

If you are open to the concept.

RealOrangeOne · 2024-07-10T14:59:31Z

I think this PR is a good start, but I'd definitely welcome some improvements to it to include a traceback! I'd rather avoid external libraries or too much added complexity if at all possible.

Once it's not in the console

logging.exception is used to help surface the logs, but I agree it's not perfect.

correlate the info in the logs with the exact job

The logs show the job ID when showing the exception. The Job's ID is logged before execution, and on success/failure, so it shouldn't be too difficult to correlate.

you can get the info from the task object

That's one of the use cases I have in mind. The exception instance and its arguments might be enough for simple jobs, although more complex ones might benefit from at at least showing the text representation of the call stack (which I don't think you could add back into an exception very easily, but I'm happy to be proved wrong!).

ksamuel · 2024-07-11T07:31:10Z

The traceback is gathered from the context and not the exception itself (sys.exc_info() is even Thread specific), so you are right, we can't add it back to the exception. To make it appear in a natural way would mean monkey patch the whole system which wouldn't fit the goal of avoiding too much added complexity.

I can get the traceback lines from traceback.format_exception, so the question is what to do with it. I can see 3 ways that would stay simple enough:

Add that into the dict that exception_to_dict and when getting back the result, attach it to the exception as a "traceback" attribute.
Add it directly as a traceback attribute into the TaskResult object. Maybe even move the exception as an "error" attribute on TaskResult.
Change the design a little and make .result a method. On call, return the result or raise an exception if there is an error. The exception would be a TaskError (raised with from e) with the attributes original_exception and traceback. We can even add instructions in the exceptions message on what to do.

Which would you prefer I start working on?

I like the last one, it avoids cluttering the TastResult namespace, so we can extend the mechanism with more info on the TaskError later on if necessary. It lets you catch any TaskError, or the original error if you want because of task chaining. I can even provide a signature such as result(default=None, default_on_exception=None) to allow for result('my_default_value_if_value_error', ValueError) when one wants to skip the exception handling.

But I'll work on your selection.

RealOrangeOne · 2024-07-11T08:44:33Z

attach it to the exception as a "traceback" attribute.

Having the re-serialized exception be different to how it was before sounds like a real foot-gun. Unless we can replicate it entirely, I don't think it's worth doing.

Adding it to the bare exception dict would be nice, but there's no real interface for it. I guess we could add a traceback attribute to the TaskResult to expose it?

ksamuel · 2024-07-11T11:06:40Z

Ok. I'll look into that.

ksamuel · 2024-07-12T10:29:51Z

While working on the TaskResult.traceback solution this morning (good dev experience BTW, I was up and running in 5 minutes), I found libraries that managed to serialize full stack traces like python-tblib and I realized that in Python 3 we can reinject that into __traceback__. They do much more than what we need, but it's possible to use their method to create a minimalistic traceback serializer/deserializer that is JSON compatible.

So I have an additional, maybe more natural proposal.

There is a bit of black magic to it, but we can limit it because we support only 3.8+. Here is what it looks like.

Serializing part

We need one function to unravel the entire stack trace (otherwise, in the exception handling site, since we don't reach the top of the program, we only get the current frame):

def get_stacktrace():
    tb = None
    depth = 2
    while True:
        try:
            frame = sys._getframe(depth)
            depth += 1
        except ValueError as exc:
            break

        tb = types.TracebackType(tb, frame, frame.f_lasti, frame.f_lineno)

    return tb

Then, we dump it recursively into a dict:

def tb_to_dict(tb):
    tb_list = []
    while tb:
        frame = tb.tb_frame
        code = frame.f_code
        tb_list.append(
            {
                "filename": code.co_filename,
                "name": code.co_name,
                "lineno": tb.tb_lineno,
                "globals": {
                    k: str(v)
                    for k, v in frame.f_globals.items()
                    if k in ("__file__", "__name__")
                },
                "locals": {k: str(v) for k, v in frame.f_locals.items()},
                "code_info": {
                    "co_name": code.co_name,
                    "co_filename": code.co_filename,
                },
            }
        )
        tb = tb.tb_next
    return tb_list

Globals and locals are repr() into strings to give clues to what the values were without entering the rabbit hole of restoring them.

Deserialization part

Black magic is needed because Python won't let us create a Frame instance, so we create and compile a code type with the exception then execute it.

def rebuild_traceback(serialized_traceback):
    top_tb = None
    tb = None
   
    for frame in serialized_traceback:
        code_info = frame["code_info"]
        code = compile(
            "\n" * (frame["lineno"] - 1) + "raise __traceback_maker",
            frame["filename"],
            "exec",
        )
        # For our purpose, we can make it as minimal as we need
        code = code.replace(
            co_argcount=0,
            co_filename=code_info["co_filename"],
            co_name=code_info["co_name"],
            co_freevars=(),
            co_cellvars=(),
        )

        try:
            exec(code, frame["globals"], frame["locals"])
        except Exception:
            next_tb = sys.exc_info()[2].tb_next
            if top_tb is None:
                top_tb = next_tb
            if tb is not None:
                tb.tb_next = next_tb
            tb = next_tb
            del next_tb

    return top_tb

What it looks like in action

source_of_error.py:

import json
import pickle
from pathlib import Path
import pprint


...

def fail():
    raise ValueError("Example exception")

def one():
    try:
        fail()
    except Exception as e:
        tb_dump = tb_to_dict(get_traceback()) + tb_to_dict(e.__traceback__)
        pprint.pprint(tb_dump)
        Path("/tmp/tb").write_text(json.dumps(tb_dump))
        # This simulate your Exception serialization without recoding it
        Path("/tmp/ex").write_bytes(pickle.dumps(e))


def two():
    one()


def three():
    two()


three()

This saves the stack trace to a file, and prints the dump, which looks like this:

[{'code_info': {'co_filename': '/home/user/Bureau/source_of_error.py',
                'co_name': '<module>'},
  'filename': '/home/user/Bureau/source_of_error.py',
  'globals': {'__file__': '/home/user/Bureau/source_of_error.py', '__name__': '__main__'},
  'lineno': 72,
  'locals': {'Path': "<class 'pathlib.Path'>",
             '__annotations__': '{}',
             '__builtins__': "<module 'builtins' (built-in)>",
             '__cached__': 'None',
             '__doc__': 'None',
             '__file__': '/home/user/Bureau/source_of_error.py',
             '__loader__': '<_frozen_importlib_external.SourceFileLoader '
                           'object at 0x79f230121ae0>',
             '__name__': '__main__',
             '__package__': 'None',
             '__spec__': 'None',
             'get_traceback': '<function get_traceback at 0x79f2300ab880>',
             'json': "<module 'json' from "
                     "'/usr/lib/python3.10/json/__init__.py'>",
             'one': '<function one at 0x79f22ffa9a20>',
             'pickle': "<module 'pickle' from '/usr/lib/python3.10/pickle.py'>",
             'pprint': "<module 'pprint' from '/usr/lib/python3.10/pprint.py'>",
             'sys': "<module 'sys' (built-in)>",
             'tb_to_dict': '<function tb_to_dict at 0x79f23018f5b0>',
             'three': '<function three at 0x79f22ffa9b40>',
             'two': '<function two at 0x79f22ffa9ab0>',
             'types': "<module 'types' from '/usr/lib/python3.10/types.py'>"},
  'name': '<module>'},
 {'code_info': {'co_filename': '/home/user/Bureau/source_of_error.py', 'co_name': 'three'},
  'filename': '/home/user/Bureau/source_of_error.py',
  'globals': {'__file__': '/home/user/Bureau/source_of_error.py', '__name__': '__main__'},
  'lineno': 69,
  'locals': {},
  'name': 'three'},
 {'code_info': {'co_filename': '/home/user/Bureau/source_of_error.py', 'co_name': 'two'},
  'filename': '/home/user/Bureau/source_of_error.py',
  'globals': {'__file__': '/home/user/Bureau/source_of_error.py', '__name__': '__main__'},
  'lineno': 65,
  'locals': {},
  'name': 'two'}]

In a different file, we can rebuild the exception:

 a = pickle.loads(Path("/tmp/ex").read_bytes())
 stack_trace = json.loads(Path("/tmp/tb").read_text())
 a.__traceback__ = rebuild_traceback(stack_trace)
 traceback.print_exception(type(a), a, a.__traceback__)

Which prints the content as if it were from the original module:

Traceback (most recent call last):
  File "/home/user/Bureau/foo.py", line 73, in <module>
    three()
  File "/home/user/Bureau/foo.py", line 70, in three
    two()
  File "/home/user/Bureau/foo.py", line 66, in two
    one()
  File "/home/user/Bureau/foo.py", line 56, in one
    fail()
  File "/home/user/Bureau/foo.py", line 51, in fail
    raise ValueError("Example exception")
ValueError: Example exception

Pros and cons

Pros

We can get back the traceback, and attach it to the standard __traceback__ attribute. People can inspect that and use the usual traceback tool. traceback.print_exception(type(e), e, e.__traceback__) works exactly as expected.
No need to add a traceback attribute on the TaskResult. Things are where they are supposed to be in the natural format for a traceback.
No learning curve for the end user. At least no difference from learning the stdblib.
A lot of the stack trace is available in JSON for inspection in the store result in a human-readable format.
We have a nice locals() and globals() repr()-formatted dump for debugging
This should be stable across CPython versions.

Cons

It's not as simple as the other solutions.
compile/exec assumes the JSON is safe to execute. Given the result store and tasks are under your control, this is not an unfair assumption, but it should as least be documented.
Need to be tested on Pypy (if support is a goal, forgot to ask).
Slower than the alternative. Haven't measured it, and don't know if it matters since it's only on errors and we can restore it lazily on read, but still.
Harder to test.

There are some things to iron out but it's promising.

Should I try this out, or continue on the TaskResult.traceback attribute solution?

RealOrangeOne · 2024-07-12T12:06:55Z

Personally, I think the need for exec makes it a complete non-starter. There are so many ways that could go wrong. The traceback attribute is probably ample for the needs.

ksamuel · 2024-07-12T12:08:09Z

Fair enough. I'll continue in that direction.

ksamuel · 2024-07-12T13:18:48Z

Does complex_exception need a tweak?

    def test_complex_exception(self) -> None:
        result = default_task_backend.enqueue(test_tasks.complex_exception, [], {})
        ...
        # Is this meant as  "self.assertIsInstance" ? 
        self.assertIsNone(result.result, TypeError)

Currently, this passes because result.result is actually None.

If this is expected I can:

Fix it to remove the TypeError passed as the msg param.
Make traceback be None as well for complex exceptions.

RealOrangeOne added 3 commits July 9, 2024 23:13

Allow serializing exceptions to JSON

0fc2902

Store task exceptions

83575ae

Store None when exception serialization fails

2071bb8

This is less of a foot-gun than receiving a different exception than expected.

RealOrangeOne marked this pull request as ready for review July 10, 2024 14:59

Add tests to cover more potential exceptions

c35cefd

RealOrangeOne added 2 commits July 12, 2024 14:33

Log when an exception can't be saved

838047e

Fix assertion

1d41585

RealOrangeOne force-pushed the feature/exception-serialization branch from bc02c47 to 1d41585 Compare July 12, 2024 13:34

RealOrangeOne merged commit 8ce1fd8 into master Jul 12, 2024
22 checks passed

RealOrangeOne deleted the feature/exception-serialization branch July 12, 2024 13:42

ksamuel mentioned this pull request Jul 12, 2024

Feature/traceback serialization #75

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store task exceptions #68

Store task exceptions #68

RealOrangeOne commented Jul 9, 2024

ksamuel commented Jul 10, 2024

RealOrangeOne commented Jul 10, 2024

ksamuel commented Jul 10, 2024

RealOrangeOne commented Jul 10, 2024

ksamuel commented Jul 11, 2024

RealOrangeOne commented Jul 11, 2024

ksamuel commented Jul 11, 2024

ksamuel commented Jul 12, 2024 •

edited

Loading

RealOrangeOne commented Jul 12, 2024

ksamuel commented Jul 12, 2024

ksamuel commented Jul 12, 2024 •

edited

Loading

Store task exceptions #68

Store task exceptions #68

Conversation

RealOrangeOne commented Jul 9, 2024

ksamuel commented Jul 10, 2024

RealOrangeOne commented Jul 10, 2024

ksamuel commented Jul 10, 2024

RealOrangeOne commented Jul 10, 2024

ksamuel commented Jul 11, 2024

RealOrangeOne commented Jul 11, 2024

ksamuel commented Jul 11, 2024

ksamuel commented Jul 12, 2024 • edited Loading

Serializing part

Deserialization part

What it looks like in action

Pros and cons

Pros

Cons

RealOrangeOne commented Jul 12, 2024

ksamuel commented Jul 12, 2024

ksamuel commented Jul 12, 2024 • edited Loading

ksamuel commented Jul 12, 2024 •

edited

Loading

ksamuel commented Jul 12, 2024 •

edited

Loading