
Fix memory leak in dramatiq.Worker:process_message #351

Closed
wants to merge 1 commit into from

Conversation


@pchristos pchristos commented Oct 14, 2020

The memory leak takes place only when an exception is thrown.

The problem has been traced to the exception handler of
`process_message`, and more specifically to how the exception
object is handled and stored so that it can be referred to
later, e.g. for debugging purposes or by the results
middleware. When `message.stuff_exception(e)` is called, a
cyclic reference is created, which causes objects not to be
garbage-collected (at least not until the cyclic GC kicks in).

In general, a cyclic reference is created whenever the exception
object is stored in a variable `x` that is part of a stack
frame. The exception object references the stack frame, and thus
the variable `x`, via its `__traceback__` attribute. But `x`
references the exception object, too, and thus the stack frame.
So, a cyclic reference is created!

In this particular case, the reference cycle created is:

message -> message._exception -> e -> stackframe -> message
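A minimal sketch of the effect (illustrative names, not dramatiq's code): with the cyclic collector disabled, plain reference counting never frees the frame's locals, and only `gc.collect()` reclaims them.

```python
import gc
import weakref

class Payload:
    """Stand-in for a large object held as a local variable."""

def handler():
    payload = Payload()
    try:
        raise RuntimeError("boom")
    except RuntimeError as e:
        # `err` keeps the exception alive past the except block.
        # e.__traceback__ -> this stack frame -> err -> e: a cycle.
        err = e
    return weakref.ref(payload)

gc.disable()              # rule out the cyclic collector for the demo
ref = handler()
assert ref() is not None  # the frame (and payload) survive the call:
                          # refcounting alone cannot free the cycle
gc.collect()              # the cyclic GC detects the cycle and frees it
assert ref() is None
gc.enable()
```

Adding `err = None` (or `del err`) before `handler` returns breaks the cycle and lets refcounting free everything immediately.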

In general, it is not recommended to store exception objects
in local variables. However, if that's necessary, such variables
should always be (explicitly) cleared in order to break the
cyclic reference - usually in a try/finally statement.

In that sense, it comes in handy that a try/finally statement
is already being used. By setting `message._exception` to `None`
at the very end, the reference cycle is broken and objects are
garbage-collected soon enough.
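In outline, the fix amounts to something like the following sketch (`MessageProxy` and `process_message` here are heavily simplified stand-ins for the real dramatiq classes, not the actual implementation):

```python
class MessageProxy:
    """Simplified stand-in for dramatiq's MessageProxy (illustrative only)."""
    def __init__(self):
        self._exception = None

    def stuff_exception(self, exception):
        self._exception = exception

def process_message(message, actor_fn):
    """Sketch of the fix: clear the stored exception in `finally`."""
    try:
        actor_fn()
    except BaseException as e:
        message.stuff_exception(e)
        # ... middleware that needs the exception would run here,
        # while it is still available on the message ...
    finally:
        # Break the cycle: message -> _exception -> e -> frame -> message
        message._exception = None

message = MessageProxy()

def failing_actor():
    raise RuntimeError("boom")

process_message(message, failing_actor)
assert message._exception is None  # the cycle is broken before returning
```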

The following piece of code can be used to reproduce the problem:

```python
import dramatiq


@dramatiq.actor
def foo(*args, **kwargs):
    a = tuple(range(5000000))
    raise Exception('bar')
```

After running `for _ in range(5): foo.send()` the memory leak is evident:

[image: Figure_1]

When the fix is applied:

[image: Figure_1_no_leak]

pchristos commented Oct 14, 2020

I see that the tests fail due to the `StubBroker` trying to re-raise `message._exception`, which is now `None`.

I haven't fixed it yet in case you have a specific fix in mind.

A possible fix could be something like:

  1. Use `wrap_exception` to store a serialized version of the exception in the `MessageProxy`.
  2. The above still allows keeping a reference to the exception without retaining the initial exception object `e`.
  3. Use (2) to re-raise the exception in the `StubBroker`, e.g.:
```diff
diff --git a/dramatiq/brokers/stub.py b/dramatiq/brokers/stub.py
index 77317c9..8562998 100644
--- a/dramatiq/brokers/stub.py
+++ b/dramatiq/brokers/stub.py
@@ -170,11 +170,16 @@ class StubBroker(Broker):
             else:
                 if fail_fast:
                     for message in self.dead_letters_by_queue[queue_name]:
-                        raise message._exception from None
+                        raise StubBrokerException(message.serialized_exception['msg'])

                 return


+class StubBrokerException(Exception):
+    """An exception specifically raised in unit tests."""
+    pass
+
+
 class _StubConsumer(Consumer):
     def __init__(self, queue, dead_letters, timeout):
         self.queue = queue
```

This way you also avoid accessing the private `_exception` attribute.

Again, the above is merely a quick suggestion that came to mind :) I haven't really given much thought to it.


Bogdanp commented Oct 14, 2020

Thanks for digging into this and finding the problem! I'd like to preserve the current behavior, and I wonder if adding a `__del__` method like

```python
def __del__(self):
    self._exception = None
```

to `MessageProxy` in broker.py would resolve this without any other changes. That should permit the runtime to free the exception once it sees that it can GC each message. Would you mind giving that a try, since you're already set up to measure things? If not, I'll give it a go this weekend.


pchristos commented Oct 14, 2020

I'm afraid that's not going to work. The problem here is that `__del__` is invoked once an object's reference count reaches zero. However, due to the circular reference, the reference count never drops to zero, so the object is not garbage-collected, which is what would in turn trigger its `__del__` method. IMHO the recommended and safest approach is to explicitly break the circular reference.
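A small sketch of the timing point being debated here (hypothetical `Message` class; CPython behavior): dropping the last external reference to an object in a cycle does not run `__del__`, because the refcount never reaches zero; the finalizer only runs once the cyclic collector processes the cycle (per PEP 442).

```python
import gc

deleted = []

class Message:
    def __del__(self):
        deleted.append("message")

gc.disable()              # make the collection timing explicit for the demo
m = Message()
m.self_ref = m            # reference cycle: m -> m
del m                     # last external reference dropped...
assert deleted == []      # ...but __del__ did not run: refcount never hit 0
gc.collect()              # the cyclic collector finalizes the cycle (PEP 442)
assert deleted == ["message"]
gc.enable()
```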


Bogdanp commented Oct 14, 2020

The garbage collector detects cycles such as these and correctly calls `__del__` in those cases. Here's an example that shows this works as expected: https://gist.github.com/Bogdanp/cc59e34fdd60b9d163df0a43e4621dae

I do agree that it would be ideal to break the cycle if possible, but it might not be possible to do that and preserve the current functionality (I haven't had time to think things through yet), and my preference is to put in a hack rather than break backwards compatibility.


pchristos commented Oct 14, 2020

The cyclic garbage collector detects cycles such as these - yes. Yet, I'm not utterly familiar with its intricacies. The cyclic garbage collector runs periodically, attempts to detect cyclic references, and garbage-collects the respective objects. It works off object generations (a total of 3 pools): if it fails to garbage-collect an object due to a reference cycle, it advances the object to the next generation pool, and I believe the frequency with which the collector attempts to collect that object then decreases. The point here is that the cyclic garbage collector may still be able to release the objects in question at some point in the future without any changes.

By the way, by calling `gc.collect()` you invoke the cyclic garbage collector, I think, which is why the object is collected and its `__del__` method invoked. That makes total sense.

However, just adding a `__del__` method won't cut it. We might even be able to work around this whole issue without adding a custom destructor, if we were just to invoke `gc.collect()` somewhere in `process_message` or `run` to force-collect objects in all generation pools - but do not take my word for it.

So, to sum this up:

  1. Adding a `__del__` method won't solve the problem by itself.
  2. The `__del__` method won't be invoked until an object is garbage-collected.
  3. Calling `gc.collect()` breaks the reference cycle, force-collects the objects, and thus calls `__del__`.
  4. Calling `gc.collect()` inside `process_message` or `run` without any other changes (maybe?) might just be the hack you're looking for.

Even if (4) works, I'm not entirely sure it will work as well as `message._exception = None`, since the timing of breaking the reference cycle is going to differ. So, I'm afraid that (4) may only partly work.

On another note - how badly are we going to break compatibility? Is there any other place besides the `StubBroker` that's affected by this change that I'm not aware of? Additionally, the private `_exception` attribute shouldn't really affect the backwards compatibility of the public API of the class, should it?

P.S. I even tested just adding a `__del__` method to the `MessageProxy`, as you described, and the memory leak is still happening.


Bogdanp commented Oct 15, 2020

Thanks for trying out my suggestion. I'm surprised it didn't help and I'll try to find some time this weekend to investigate things myself.

The point of the `gc.collect` call in the example was to force a full collection at that point in the code, to demonstrate that when a `Message` gets collected, the finalizer does run despite multiple "layers" of circular references (exn -> stack -> {m -> exn, d -> m}). You can change the call to `gc.collect(generation=2)` to collect the oldest generation and it'll still work the same way, and you can remove it altogether and still see the finalizer run - but then you have no guarantee about when it runs, and it'll probably do so at exit unless you also modify the code to allocate a bunch of objects before then.

What generation the message makes it into isn't relevant, because the GC will collect older generations before requesting additional memory for its heap from the OS after it has freed young-generation objects. At any rate, it's not true that a set of objects that have cycles between each other but are otherwise unreachable gets moved to an older generation, because that would defeat the purpose of having a GC in the first place. Here's an example of all of this.

> On another note - how badly are we going to break compatibility? Is there any other place besides the StubBroker that's affected by this change that I'm not aware of? Additionally, the private _exception attribute shouldn't really affect the backwards compatibility of the public API of the class, should it?

The `StubBroker` and the Results middleware would be affected by any changes made to `message._exception`. In the case of `StubBroker`, the effect would be to reduce the utility of the exception re-raising functionality by dropping stack traces, which is something I'd like to avoid since those are really useful when testing code, and I would like to avoid having to serialize stack traces if possible.


dimrozakis commented Oct 15, 2020

@Bogdanp can you please explain why you think a `__del__` method that breaks the cycle would help? It's my understanding that for `__del__` to run would mean that the message is being garbage-collected, which would in turn mean that the GC located an unreachable cycle and is collecting the entire cycle. So by the time `__del__` attempts to break the cycle, there is no need to do so. What am I missing?


Bogdanp commented Oct 15, 2020

@dimrozakis take a look at PEP 442, in particular the section titled "Disposal of cyclic isolates".


pchristos commented Oct 15, 2020

> The point of the gc.collect call in the example was to force a full collection at that point in the code to demonstrate that when a Message gets collected, the finalizer does run despite multiple "layers" of circular references (exn -> stack -> {m -> exn, d -> m}). You can change the call to say gc.collect(generation=2) to collect the latest generation and it'll still work the same way and you can remove it altogether and still see the finalizer run, but then you have no guarantee about when it runs and it'll probably do so at exit unless you also modify the code to allocate a bunch of objects before then.

That makes sense. I agree.

> What generation the message makes it into isn't relevant because the GC will collect older generations before requesting additional memory for its heap from the OS after it's freed young generation objects.

I'm not utterly familiar with this TBH, yet I find it hard to believe that the generation pool an object is in isn't relevant. What's the guarantee that the GC will (successfully) collect older generations' objects, or even run for older (or more specifically the oldest) generation at all, before requesting extra memory from the OS? If there's no guarantee, couldn't one argue that that's what a memory leak is actually about?

> At any rate, it's not true that a set of objects that have cycles between each other but are otherwise unreachable get moved to an older generation, because that would defeat the purpose of having a GC in the first place.

Agreed. That makes sense based on PEP 442. I didn't mean to imply the opposite.

> The StubBroker and the Results middleware would be affected by any changes made to message._exception.

Regarding the Results middleware - aren't all middlewares run sequentially in the same thread after a task has finished running? If so, the `_exception` should still be available before it's cleared in the finally statement. Or have I got it wrong?

> In the case of StubBroker, the effect would be to reduce the utility of the exception re-raising functionality by dropping stack traces, which is something I'd like to avoid since those are really useful when testing code and I would like to avoid having to serialize stack traces if possible.

I see your point.

Now, regarding the PEP 442 you mentioned - I don't see how that's helpful, TBH. The PEP talks about object finalization in reference cycles. That implies that the reference cycle has already been identified, no? If so, it is garbage-collectable, isn't it? Additionally, in our case the message is part of that already-identified reference cycle, so how does clearing the `_exception` attribute help? The bottom line is I don't see how object finalization solves the problem, since it depends on finding the reference cycle first.


Bogdanp commented Oct 18, 2020

> I'm not utterly familiar with this TBH, yet I find it hard to believe that the generation pool an object is in isn't relevant. What's the guarantee that the GC will (successfully) collect older generations' objects, or even run for older (or more specifically the oldest) generation at all, before requesting extra memory from the OS? If there's no guarantee, couldn't one argue that that's what a memory leak is actually about?

I've refreshed my memory about this and I did have it wrong. It does seem like it would be easy to have memory bloat issues if you allocate very large objects infrequently due to the way the threshold stats work.

> Regarding the Results middleware - aren't all middlewares run sequentially in the same thread after a task has finished running? If so, the _exception should still be available, before it's cleared in the finally statement. Or have I got it wrong?

Yep, that's right so this is only a problem for the StubBroker after all.

> Now, regarding PEP442 you mentioned - I don't see how that's helpful TBH. The PEP talks about object finalization in reference cycles. That implies that the reference cycle has already been identified, no? If so, it is garbage-collect-able, isn't it? Additionally, in our case the message is part of that already identified reference cycle, so how does clearing the _exception attribute help? Bottom line is I don't see how object finalization solves the problem, since it depends on finding the reference cycle first.

In my rush to reply I kind of lost track of the original point of the discussion. I'd originally offered up the finalizer as a way to break the cycle, but that's not necessary since the GC will detect that the graph is unreachable and release those objects. Sorry for the confusion! I only have so much time to deal with these things during the week.

I'll push a fix for this today.


Bogdanp commented Oct 18, 2020

This should now be fixed on master. Thanks again for finding the issue and the discussion.

@Bogdanp Bogdanp closed this Oct 18, 2020