Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store task exceptions #68

Merged
merged 6 commits into from
Jul 12, 2024
Merged

Conversation

RealOrangeOne
Copy link
Owner

Closes #63

Tasks can now store the exception they raised, if they raised one.

The serialization process isn't perfect, since it must be JSON-compatible. Some information (such as traceback) is lost as a result. I suspect in most cases, an instance check or extracting the message, which is possible here. The format is internal, so can be expanded in future if needed.

This is less of a foot-gun than receiving a different exception than expected.
@ksamuel
Copy link
Contributor

ksamuel commented Jul 10, 2024

While storing the entire stack with full objects is not necessary, having at least the textual representation of the traceback would be fantastic for debugging. You can't look at an exception message and guess what went wrong most of the time. And debugging async task is very hard, so having a clue with file paths and lines would be super helpful.

Even if you have logging well setup (which a lot of people don't have), at some point people will build django admin plugins to monitor tasks. Being able to click on that and see a string representation of the traceback in there would be huge.

@RealOrangeOne
Copy link
Owner Author

The full exception is logged, so for debugging the traceback is still accessible. From a code perspective, it should only be the exception instance itself which is needed (at least in most cases).

@ksamuel
Copy link
Contributor

ksamuel commented Jul 10, 2024

True, but again:

  • People who will need it the most are typically those who don't know how to setup proper logging. Once it's not in the console (e.g: in production), they will be blind.
  • You will need to correlate the info in the logs with the exact job that triggered the error. It's not fun to day.
  • If you are in the shell / notebook / debugger, you can get the info from the task object by just looking at it, which is very handy.

Since I understand you might start to feel like I talk a lot while you do all the hard work, maybe I could contribute a PR?

If you are open to the concept.

@RealOrangeOne
Copy link
Owner Author

I think this PR is a good start, but I'd definitely welcome some improvements to it to include a traceback! I'd rather avoid external libraries or too much added complexity if at all possible.

Once it's not in the console

logging.exception is used to help surface the logs, but I agree it's not perfect.

correlate the info in the logs with the exact job

The logs show the job ID when showing the exception. The Job's ID is logged before execution, and on success/failure, so it shouldn't be too difficult to correlate.

you can get the info from the task object

That's one of the use cases I have in mind. The exception instance and its arguments might be enough for simple jobs, although more complex ones might benefit from at at least showing the text representation of the call stack (which I don't think you could add back into an exception very easily, but I'm happy to be proved wrong!).

@RealOrangeOne RealOrangeOne marked this pull request as ready for review July 10, 2024 14:59
@ksamuel
Copy link
Contributor

ksamuel commented Jul 11, 2024

The traceback is gathered from the context and not the exception itself (sys.exc_info() is even Thread specific), so you are right, we can't add it back to the exception. To make it appear in a natural way would mean monkey patch the whole system which wouldn't fit the goal of avoiding too much added complexity.

I can get the traceback lines from traceback.format_exception, so the question is what to do with it. I can see 3 ways that would stay simple enough:

  • Add that into the dict that exception_to_dict and when getting back the result, attach it to the exception as a "traceback" attribute.

  • Add it directly as a traceback attribute into the TaskResult object. Maybe even move the exception as an "error" attribute on TaskResult.

  • Change the design a little and make .result a method. On call, return the result or raise an exception if there is an error. The exception would be a TaskError (raised with from e) with the attributes original_exception and traceback. We can even add instructions in the exceptions message on what to do.

Which would you prefer I start working on?

I like the last one, it avoids cluttering the TastResult namespace, so we can extend the mechanism with more info on the TaskError later on if necessary. It lets you catch any TaskError, or the original error if you want because of task chaining. I can even provide a signature such as result(default=None, default_on_exception=None) to allow for result('my_default_value_if_value_error', ValueError) when one wants to skip the exception handling.

But I'll work on your selection.

@RealOrangeOne
Copy link
Owner Author

attach it to the exception as a "traceback" attribute.

Having the re-serialized exception be different to how it was before sounds like a real foot-gun. Unless we can replicate it entirely, I don't think it's worth doing.

Adding it to the bare exception dict would be nice, but there's no real interface for it. I guess we could add a traceback attribute to the TaskResult to expose it?

@ksamuel
Copy link
Contributor

ksamuel commented Jul 11, 2024

Ok. I'll look into that.

@ksamuel
Copy link
Contributor

ksamuel commented Jul 12, 2024

While working on the TaskResult.traceback solution this morning (good dev experience BTW, I was up and running in 5 minutes), I found libraries that managed to serialize full stack traces like python-tblib and I realized that in Python 3 we can reinject that into __traceback__. They do much more than what we need, but it's possible to use their method to create a minimalistic traceback serializer/deserializer that is JSON compatible.

So I have an additional, maybe more natural proposal.

There is a bit of black magic to it, but we can limit it because we support only 3.8+. Here is what it looks like.

Serializing part

We need one function to unravel the entire stack trace (otherwise, in the exception handling site, since we don't reach the top of the program, we only get the current frame):

def get_stacktrace():
    tb = None
    depth = 2
    while True:
        try:
            frame = sys._getframe(depth)
            depth += 1
        except ValueError as exc:
            break

        tb = types.TracebackType(tb, frame, frame.f_lasti, frame.f_lineno)

    return tb

Then, we dump it recursively into a dict:

def tb_to_dict(tb):
    tb_list = []
    while tb:
        frame = tb.tb_frame
        code = frame.f_code
        tb_list.append(
            {
                "filename": code.co_filename,
                "name": code.co_name,
                "lineno": tb.tb_lineno,
                "globals": {
                    k: str(v)
                    for k, v in frame.f_globals.items()
                    if k in ("__file__", "__name__")
                },
                "locals": {k: str(v) for k, v in frame.f_locals.items()},
                "code_info": {
                    "co_name": code.co_name,
                    "co_filename": code.co_filename,
                },
            }
        )
        tb = tb.tb_next
    return tb_list

Globals and locals are repr() into strings to give clues to what the values were without entering the rabbit hole of restoring them.

Deserialization part

Black magic is needed because Python won't let us create a Frame instance, so we create and compile a code type with the exception then execute it.

def rebuild_traceback(serialized_traceback):
    top_tb = None
    tb = None
   
    for frame in serialized_traceback:
        code_info = frame["code_info"]
        code = compile(
            "\n" * (frame["lineno"] - 1) + "raise __traceback_maker",
            frame["filename"],
            "exec",
        )
        # For our purpose, we can make it as minimal as we need
        code = code.replace(
            co_argcount=0,
            co_filename=code_info["co_filename"],
            co_name=code_info["co_name"],
            co_freevars=(),
            co_cellvars=(),
        )

        try:
            exec(code, frame["globals"], frame["locals"])
        except Exception:
            next_tb = sys.exc_info()[2].tb_next
            if top_tb is None:
                top_tb = next_tb
            if tb is not None:
                tb.tb_next = next_tb
            tb = next_tb
            del next_tb

    return top_tb

What it looks like in action

source_of_error.py:

import json
import pickle
from pathlib import Path
import pprint


...

def fail():
    raise ValueError("Example exception")

def one():
    try:
        fail()
    except Exception as e:
        tb_dump = tb_to_dict(get_traceback()) + tb_to_dict(e.__traceback__)
        pprint.pprint(tb_dump)
        Path("/tmp/tb").write_text(json.dumps(tb_dump))
        # This simulate your Exception serialization without recoding it
        Path("/tmp/ex").write_bytes(pickle.dumps(e))


def two():
    one()


def three():
    two()


three()

This saves the stack trace to a file, and prints the dump, which looks like this:

[{'code_info': {'co_filename': '/home/user/Bureau/source_of_error.py',
                'co_name': '<module>'},
  'filename': '/home/user/Bureau/source_of_error.py',
  'globals': {'__file__': '/home/user/Bureau/source_of_error.py', '__name__': '__main__'},
  'lineno': 72,
  'locals': {'Path': "<class 'pathlib.Path'>",
             '__annotations__': '{}',
             '__builtins__': "<module 'builtins' (built-in)>",
             '__cached__': 'None',
             '__doc__': 'None',
             '__file__': '/home/user/Bureau/source_of_error.py',
             '__loader__': '<_frozen_importlib_external.SourceFileLoader '
                           'object at 0x79f230121ae0>',
             '__name__': '__main__',
             '__package__': 'None',
             '__spec__': 'None',
             'get_traceback': '<function get_traceback at 0x79f2300ab880>',
             'json': "<module 'json' from "
                     "'/usr/lib/python3.10/json/__init__.py'>",
             'one': '<function one at 0x79f22ffa9a20>',
             'pickle': "<module 'pickle' from '/usr/lib/python3.10/pickle.py'>",
             'pprint': "<module 'pprint' from '/usr/lib/python3.10/pprint.py'>",
             'sys': "<module 'sys' (built-in)>",
             'tb_to_dict': '<function tb_to_dict at 0x79f23018f5b0>',
             'three': '<function three at 0x79f22ffa9b40>',
             'two': '<function two at 0x79f22ffa9ab0>',
             'types': "<module 'types' from '/usr/lib/python3.10/types.py'>"},
  'name': '<module>'},
 {'code_info': {'co_filename': '/home/user/Bureau/source_of_error.py', 'co_name': 'three'},
  'filename': '/home/user/Bureau/source_of_error.py',
  'globals': {'__file__': '/home/user/Bureau/source_of_error.py', '__name__': '__main__'},
  'lineno': 69,
  'locals': {},
  'name': 'three'},
 {'code_info': {'co_filename': '/home/user/Bureau/source_of_error.py', 'co_name': 'two'},
  'filename': '/home/user/Bureau/source_of_error.py',
  'globals': {'__file__': '/home/user/Bureau/source_of_error.py', '__name__': '__main__'},
  'lineno': 65,
  'locals': {},
  'name': 'two'}]

In a different file, we can rebuild the exception:

 a = pickle.loads(Path("/tmp/ex").read_bytes())
 stack_trace = json.loads(Path("/tmp/tb").read_text())
 a.__traceback__ = rebuild_traceback(stack_trace)
 traceback.print_exception(type(a), a, a.__traceback__)

Which prints the content as if it were from the original module:

Traceback (most recent call last):
  File "/home/user/Bureau/foo.py", line 73, in <module>
    three()
  File "/home/user/Bureau/foo.py", line 70, in three
    two()
  File "/home/user/Bureau/foo.py", line 66, in two
    one()
  File "/home/user/Bureau/foo.py", line 56, in one
    fail()
  File "/home/user/Bureau/foo.py", line 51, in fail
    raise ValueError("Example exception")
ValueError: Example exception

Pros and cons

Pros

  • We can get back the traceback, and attach it to the standard __traceback__ attribute. People can inspect that and use the usual traceback tool. traceback.print_exception(type(e), e, e.__traceback__) works exactly as expected.
  • No need to add a traceback attribute on the TaskResult. Things are where they are supposed to be in the natural format for a traceback.
  • No learning curve for the end user. At least no difference from learning the stdblib.
  • A lot of the stack trace is available in JSON for inspection in the store result in a human-readable format.
  • We have a nice locals() and globals() repr()-formatted dump for debugging
  • This should be stable across CPython versions.

Cons

  • It's not as simple as the other solutions.
  • compile/exec assumes the JSON is safe to execute. Given the result store and tasks are under your control, this is not an unfair assumption, but it should as least be documented.
  • Need to be tested on Pypy (if support is a goal, forgot to ask).
  • Slower than the alternative. Haven't measured it, and don't know if it matters since it's only on errors and we can restore it lazily on read, but still.
  • Harder to test.

There are some things to iron out but it's promising.

Should I try this out, or continue on the TaskResult.traceback attribute solution?

@RealOrangeOne
Copy link
Owner Author

Personally, I think the need for exec makes it a complete non-starter. There are so many ways that could go wrong. The traceback attribute is probably ample for the needs.

@ksamuel
Copy link
Contributor

ksamuel commented Jul 12, 2024

Fair enough. I'll continue in that direction.

@ksamuel
Copy link
Contributor

ksamuel commented Jul 12, 2024

Does complex_exception need a tweak?

    def test_complex_exception(self) -> None:
        result = default_task_backend.enqueue(test_tasks.complex_exception, [], {})
        ...
        # Is this meant as  "self.assertIsInstance" ? 
        self.assertIsNone(result.result, TypeError)

Currently, this passes because result.result is actually None.

If this is expected I can:

  • Fix it to remove the TypeError passed as the msg param.
  • Make traceback be None as well for complex exceptions.

@RealOrangeOne RealOrangeOne force-pushed the feature/exception-serialization branch from bc02c47 to 1d41585 Compare July 12, 2024 13:34
@RealOrangeOne RealOrangeOne merged commit 8ce1fd8 into master Jul 12, 2024
22 checks passed
@RealOrangeOne RealOrangeOne deleted the feature/exception-serialization branch July 12, 2024 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants