Deadlock when objects emit messages at garbage collection #712

Closed
jacksmith15 opened this issue Sep 26, 2022 · 8 comments · Fixed by #720
Labels
enhancement Improvement to an already existing feature

Comments

@jacksmith15
Contributor

jacksmith15 commented Sep 26, 2022

Hello 👋

I have found an issue whereby loguru can hit a deadlock due to garbage collection. Specifically, if there is an object in memory that has the following properties:

  1. Is part of a Cyclic Isolate (i.e. it will be garbage collected by the generational garbage collector, not the reference counter)
  2. Implements a __del__ finalizer which explicitly or implicitly triggers a log statement

Then any memory allocation during log handling has a chance to trigger a deadlock.

Reproducible test case

Python version: 3.8.12
Loguru version: 0.6.0

The following is a script which demonstrates the deadlock:

import sys
from typing import Optional

import loguru


class CyclicReference:
    """A minimal cyclic reference.

    Cyclical references are garbage collected using the generational collector rather than
    via reference counting. This is important here, because the generational collector runs
    periodically, meaning that it is hard to predict when the stack will be overtaken by a
    garbage collection process - but it will almost always be when allocating memory of some
    kind.

    When this object is garbage-collected, a log will be emitted.
    """
    def __init__(self, _other: Optional["CyclicReference"] = None):
        self.other = _other or CyclicReference(_other=self)

    def __del__(self):
        loguru.logger.info("Goodbye!")


class Sink:
    def write(self, message):
        # This simulates allocating some memory as part of log handling. Eventually this will
        # trigger the generational garbage collector to take over here.
        # You could replace it with `json.dumps(message.record)` to get the same effect.
        _ = [dict() for _ in range(20)]

        # Actually write the message somewhere
        sys.stderr.write(message)

    def flush(self):
        sys.stderr.flush()


def main():
    # Set up logger with sink:
    loguru.logger.add(Sink(), level="INFO")

    # Keep logging until a cyclic reference is garbage collected, at which point deadlock
    # should be reached:
    for index in range(10000):
        CyclicReference()  # Here we create cyclic isolates, simulating code which uses them
        loguru.logger.info("Hello", index=index)


if __name__ == "__main__":
    main()

Expected behaviour

A series of Hello logs are emitted, interspersed with Goodbye! logs.

Actual behaviour

The program reaches a deadlock, requiring a SIGTERM to exit.

Explanation of the issue

The emit method of loguru's handler uses a lock to avoid contention for the log handler. It is for this reason that emitting additional log messages in custom sinks is not supported and will result in deadlocks (see #390).

This seems a relatively sensible limitation; however, the nature of the generational garbage collector in CPython means this can happen inadvertently and unpredictably in code which has cyclic isolates and performs some level of processing during log handling.

Real world example

The reproducible example above may seem contrived, and one might suggest that the answer is simply to avoid reference cycles, so here is a real example of this that I've seen. The Python Elasticsearch client contains reference cycles as a means of sharing connections between interfaces.

As a transport it uses aiohttp, which implements a __del__ garbage collection finalizer. This in turn invokes the asyncio exception handler, which defaults to emitting a log statement. Since this is a stdlib logging call, it doesn't cause a problem on its own, but users of loguru are likely to set up some kind of intercept handler to forward third-party stdlib logs to loguru (a sketch of such a handler is shown below).
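For context, a forwarding handler of that kind typically looks something like the sketch below (roughly the recipe from loguru's documentation; the depth value is illustrative). Once such a handler is installed, any stdlib log emitted from a __del__ finalizer ends up going through loguru's handler as well.

import logging

import loguru


class InterceptHandler(logging.Handler):
    """Forward stdlib logging records to loguru."""

    def emit(self, record):
        # Map the stdlib level name onto a loguru level if one exists.
        try:
            level = loguru.logger.level(record.levelname).name
        except ValueError:
            level = record.levelno
        # Re-emit through loguru; this is the call that can land inside
        # loguru's handler while a __del__ finalizer is running.
        loguru.logger.opt(depth=6, exception=record.exc_info).log(
            level, record.getMessage()
        )


logging.basicConfig(handlers=[InterceptHandler()], level=0)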

Additional notes

It took me some time to figure out a reproducible example for this, as I was only familiar with the reference-counting part of the garbage collector. I found this article useful for understanding the generational collector, which is what led me to isolate the circumstances needed to reproduce the issue.

@jacksmith15
Contributor Author

jacksmith15 commented Sep 26, 2022

Just noting that a solution which would work for me would be the ability to skip emitting a message if the handler lock is not free (rather than waiting), as in the rough sketch below. I imagine that wouldn't cover everyone's use cases, however!
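For illustration, a minimal sketch of that idea using only the standard threading module (names are hypothetical, this is not loguru's API): the handler tries a non-blocking acquire and silently drops the message when the lock is already held. Note that this would also drop messages under ordinary cross-thread contention, not just for re-entrant calls from a __del__ finalizer.

import threading

_handler_lock = threading.Lock()


def emit_or_skip(write, message):
    # Try to take the lock without waiting; if it is already held (for
    # example by a frame on this same thread interrupted by garbage
    # collection), skip the message instead of deadlocking.
    if not _handler_lock.acquire(blocking=False):
        return  # message dropped
    try:
        write(message)
    finally:
        _handler_lock.release()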

Yet another solution might be to allow pre-processing messages before the lock is acquired (a la #521), perhaps by extending the dynamic format functionality? However, I don't think that solves the issue for more complicated output sinks, which may need to allocate some memory just for the purpose of sending the message as-is.

@jacksmith15
Contributor Author

I suppose another option would be to use a re-entrant lock, but I assume this was avoided for a reason?
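For reference, the behavioural difference is easy to demonstrate with the standard threading primitives; the nested acquire below is exactly what happens when a __del__ log interrupts the handler on the same thread:

import threading

rlock = threading.RLock()
with rlock:
    with rlock:  # fine: an RLock may be re-acquired by the thread that holds it
        pass

lock = threading.Lock()
with lock:
    # A plain Lock is not re-entrant: a blocking acquire here would hang this
    # thread forever, which is the deadlock described in this issue.
    assert lock.acquire(blocking=False) is False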

@Delgan
Owner

Delgan commented Sep 30, 2022

Hi @jacksmith15!

This is a bug report of very high quality, thanks a lot for that!
You did all the tedious investigation work; it was a tremendous help in understanding the problem. I can only imagine how hard this bug was to track down.

Your analysis is very interesting. I didn't realize the possibility of a deadlock if the logger was used in a __del__ and the garbage collector happened to run within a handler... 😬

I saw that you went even further and implemented a solution ahead of opening a PR, thanks once again for your work. 👍

Regarding the chosen solution of making handlers re-entrant: some time ago, I had to solve a bug causing a deadlock when a process was forked using the multiprocessing module (see #231 (comment) ). During my investigation, I came to the conclusion that using Lock instead of RLock greatly simplified the resolution of the problem and avoided other complications. For this reason, I was glad that Loguru was initially implemented with a simple Lock, and I felt I had good reasons to never switch to RLock.
I thought about it again today, and I'm not sure what I meant at the time. My analysis of the problem must have been incorrect, because I can't find a scenario in which using an RLock would cause subtle deadlocks. In the end, it seems acceptable.

However, what I don't like about using an RLock is that it makes it impossible to guarantee the atomicity of the operations performed by the handler. The implementation must suddenly take into account that the sink can be re-entered. Although it avoids deadlocks, it can cause other undesirable and even hazardous problems.

I thought for a long time that there was no way to cleanly prevent the user from facing a deadlock, but I finally realized: why not just check that the lock is not already in use by the current thread when calling the handler?

The implementation could roughly look like that (exception handling not addressed):

--- a/loguru/_handler.py
+++ b/loguru/_handler.py
@@ -2,7 +2,8 @@ import functools
 import json
 import multiprocessing
 import os
-from threading import Thread
+from threading import Thread, local

 from ._colorizer import Colorizer
 from ._locks_machinery import create_handler_lock
@@ -57,6 +58,8 @@ class Handler:
         self._exception_formatter = exception_formatter
         self._id = id_
         self._levels_ansi_codes = levels_ansi_codes  # Warning, reference shared among handlers
+        self._local = local()
+        self._local.locked = False

         self._decolorized_format = None
         self._precolorized_formats = {}
@@ -168,6 +171,12 @@ class Handler:
             str_record = Message(formatted)
             str_record.record = record

+            if self._local.locked:
+                raise RuntimeError("Deadlock avoided")
+
+            self._local.locked = True
+
             with self._lock:
                 if self._stopped:
                     return
@@ -176,6 +185,8 @@ class Handler:
                 else:
                     self._sink.write(str_record)

+            self._local.locked = False
+
         except Exception:
             if not self._error_interceptor.should_catch():
                 raise

If I'm not mistaken, I don't think there is any possible data race or deadlock with this solution.

Well, this implies that the message in __del__ won't be logged and will generate an error instead, but this is hard to solve without making the handler re-entrant (unless we implement a system to postpone the message?).

This also has the advantage of avoiding deadlocks or infinite loops in case the logger is used inside a sink. This is the most common problem, and it would allow warning the user with an exception explaining the problem explicitly.

That's basically the first solution you proposed. What do you think of it? It's not ideal as some messages might be discarded, but I basically consider logging inside __del__ or signal handlers to be very risky and best avoided.

@jacksmith15
Contributor Author

To verify I understand the error_interceptor logic correctly, I think your proposal would mean my test case above would instead log a series of "Hello" logs interspersed with error reports to stderr.

And if the test case was modified to use:

-    loguru.logger.add(Sink(), level="INFO")
+    loguru.logger.add(Sink(), level="INFO", catch=False)

Then the program would print a number of "Hello" logs before exiting with a RuntimeError.

Assuming I've got that correct, I think this change would certainly be a vast improvement on the current behaviour ✔️. The behaviour might still be surprising when users encounter it, but it's better than a deadlock. Possibly there is an argument for a custom exception type which is handled regardless of the catch option.

As an aside, if you have the time I'd be interested in understanding the following more clearly:

However, what I don't like about using an RLock is that it makes it impossible to guarantee the atomicity of the operations performed by the handler. The implementation must suddenly take into account that the sink can be re-entered. Although it avoids deadlocks, it can cause other undesirable and even hazardous problems.

@jacksmith15
Contributor Author

Also, I think it's worth moving the self._local.locked = False into a finally clause 📝

@Delgan
Owner

Delgan commented Oct 7, 2022

Yeah, I will probably use a context manager to make sure locked is set back to False. 👍
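For illustration, one way such a context manager could look, building on the diff above (a rough sketch with hypothetical names, not necessarily the final implementation):

from contextlib import contextmanager


@contextmanager
def protected_lock(lock, local_state):
    # Refuse re-entrant use from the same thread instead of deadlocking,
    # and always reset the thread-local flag, even if the sink raises.
    if getattr(local_state, "locked", False):
        raise RuntimeError("Deadlock avoided")
    local_state.locked = True
    try:
        with lock:
            yield
    finally:
        local_state.locked = False

The body of emit() would then simply wrap the sink call in with protected_lock(self._lock, self._local): instead of managing the flag by hand.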

Your interpretation of the proposed solution is perfectly correct. Although I'm not so sure about what will happen in practice, because the documentation of __del__ states that exceptions occurring when the method is called are ignored.
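For what it's worth, that behaviour is easy to see in isolation (standard CPython behaviour, independent of loguru): an exception raised in __del__ is reported to stderr via sys.unraisablehook and never propagates to the surrounding code.

class Doomed:
    def __del__(self):
        raise RuntimeError("boom")


Doomed()  # CPython prints "Exception ignored in: ..." to stderr and carries on
print("still running")  # reached normally; the RuntimeError never propagates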

Regarding the possible problems using an RLock, it's all about reentrancy. I expect some custom sinks not to be reentrant, and Loguru has no control over that, so I would rather explicitly make logging non-reentrant and prevent possible misuse. It feels safer to guarantee unique access to the sink's resources.

Just imagine sinks like these, for example:

import json
import threading

db_lock = threading.Lock()  # guards the shared `db` handle used in sink_example_2


def sink_example_1(record):
    with open("db.json") as file:
        db = json.load(file)
    # If the garbage collector triggers a nested log call here -> lost messages.
    db["messages"].append(record["message"])
    with open("db.json", "w") as file:
        json.dump(db, file)

def sink_example_2(record):
    with db_lock:
        # If the garbage collector triggers a nested log call here -> deadlock.
        db.write(record["message"])

def sink_example_3(record):
    with open("file.log", "a") as file:
        file.write("Message: ")
        data = do_some_computation()
        # If the garbage collector triggers a nested log call here -> interleaved messages.
        file.write(f"{data} {record['message']}\n")

@jacksmith15
Contributor Author

Although I'm not so sure about what will happen in practice, because the documentation of __del__ states that exceptions occurring when the method is called are ignored.

That is a very good point. I suppose it is likely better to always handle the error in the handler, because otherwise __del__ implementations might start failing without any easy way to identify the cause.

And thank you for the explanation of the issues with re-entrancy; the examples make a lot of sense!

@Delgan added the enhancement label on Oct 7, 2022
@jacksmith15
Contributor Author

I've raised a PR here: #720

Feel free to disregard if you already know how you want to go about implementing it 👍
