New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ProcessLocker to ensure that no more than one Tribler GUI/Core process runs locally at any moment #7212
Conversation
abc4b5c
to
7ae2a89
Compare
7ae2a89
to
b61c3c8
Compare
7439c85
to
0e4f031
Compare
0e4f031
to
8727410
Compare
@patch.object(logger, 'warning') | ||
def test_global_process_manager(warning: Mock, process_manager: ProcessManager): | ||
assert get_global_process_manager() is None | ||
|
||
set_global_process_manager(process_manager) | ||
assert get_global_process_manager() is process_manager | ||
|
||
set_global_process_manager(None) | ||
assert get_global_process_manager() is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test uses a singleton for storing the global process manager. This could lead to a flaky behavior for the test suite as we had before.
The test as well not tests the main functionality of setter (the locking).
That's why it is better to just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with removing the test, but I don't see the reason for it, so I suggest discussing it one more time. In my opinion, it is better to have this test for better coverage of the code that handles the global singleton. Without the coverage, it may be possible that the logic will be broken in some later refactoring.
The test does not contain any complex logic and returns the global singleton to a previous state right after the test.
The test as well not tests the main functionality of setter (the locking).
I don't think the lock is the main functionality of get_global_process_manager
/set_global_process_manager
. There are two types of locks - long and short. If the code contains a non-trivial code inside a long lock, it may spend too much time inside it, and another thread can freeze for too long to acquire it. An even worse scenario is a deadlock when two threads acquire two different locks in an improper order.
Neither of these potential problems is possible in the current test. The logic inside the lock is very simple and contains just a single Python statement:
with _lock:
global_process_manager = process_manager
and
with _lock:
return global_process_manager
It is an example of a short lock that only covers a simple instruction to protect access to a global variable to make read/write operation atomic. It is just not possible to stay inside the lock for too long. There is no non-trivial function call inside the lock that can delay the exit from the lock. So, in this case, the lock is not a "main" functionality of the function; it is a trivial and very short lock for protecting the atomic access to a global variable. The lock functionality is already tested in Python's test suite, and we can quickly check the code inside the lock to see that it does not have any potential to create a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I added a try/finally block inside the test to ensure it has no chance to affect the following tests in case of an exception. In any case, the test suite will fail in case of exception, so there is no chance that after this test, we can have a mysterious problem in a subsequent test, but this test passes without an error. So, even in the (highly improbable) scenario that this test will fail and affect the global state, it will also be marked as failed and should explain the actual reason for any other error.
def test_global_process_manager(process_manager: ProcessManager):
assert get_global_process_manager() is None
try:
set_global_process_manager(process_manager)
assert get_global_process_manager() is process_manager
finally:
set_global_process_manager(None)
assert get_global_process_manager() is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use simple singleton creation for the ProcessManager
as we have done for the ExceptionHandler
?
In this case, it is not necessary to test the setter and the getter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we pass the current_process
object to the ProcessManager
constructor, and it is different for GUI and Core processes. To create a singleton at the end of the tribler.core.utilities.process_manager.manager
module, I need to create TriblerProcess
before it, and I prefer to create the TriblerProcess
instance in tribler.gui.start_gui
/tribler.core.start_core
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, running all tests without a global singleton feels a bit safer to me regarding test isolation than testing with a globally-defined-during-the-module-import singleton.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the lock is the main functionality of get_global_process_manager/set_global_process_manager.
Why then do use locks? What corner case do you want to cover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to create TriblerProcess before it, and I prefer to create the TriblerProcess instance in tribler.gui.start_gui/tribler.core.start_core instead.
It is not necessary. You can create a global singleton before and set the current process to it after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why then do use locks? What corner case do you want to cover?
It is necessary to have a lock when it is potentially possible to work with the same global variable from two different threads
It is not necessary. You can create a global singleton before and set the current process to it after.
The the API will be more complicated, as it becomes necessary to make the current_process optional and check it in all places
Ok, I do not agree with this, but I removed the test to avoid long disputes
@@ -38,8 +38,12 @@ def run_gui(api_port, api_key, root_state_dir, parsed_args): | |||
logger.info('Enabling a workaround for Ubuntu 21.04+ wayland environment') | |||
os.environ["GDK_BACKEND"] = "x11" | |||
|
|||
# Set up logging | |||
load_logger_config('tribler-gui', root_state_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it turns out, the previous code contains a bug. Tribler Core/GUI processes should not load logger config before determining whether the current process is the primary or not. Otherwise, primary and secondary processes start writing to log files, and the logging module does not support writing to the same log file from multiple processes. It leads to random errors during the log rotation, like, using an incorrect file descriptor.
In this PR, I fix the problem by first checking whether the current process is primary and then initializing file-based logging for a primary process only.
load_logger_config('test', tmpdir) | ||
assert len(logging.root.manager.loggerDict) >= logger_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it turns out, the test that checks the loading of the logger configuration was incorrect for two reasons:
- It modifies the global configuration of the logger handlers, which can affect the subsequent tests (the logger configuration does not completely reset after each test)
- The
assert
check does not check anything relevant. Thelogger_count
value is the same before and after the test. What changed during the logger configuration is handlers and not loggers, so we should check handlers and not the loggers count.
The new version of the test mocks the dictConfig
function, so the logger configuration does not change, and the state of the subsequent tests is not affected.
10f048c
to
baa2822
Compare
…ffect the subsequent tests
baa2822
to
b944cd0
Compare
@@ -25,7 +27,7 @@ | |||
|
|||
|
|||
def run_gui(api_port, api_key, root_state_dir, parsed_args): | |||
logger.info('Running GUI' + ' in gui_test_mode' if parsed_args.gui_test_mode else '') | |||
logger.info(f"Running GUI in {'gui_test_mode' if parsed_args.gui_test_mode else 'normal mode'}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, because of the precedence rules, an empty string was logged in normal mode. Basically:
('Running GUI' + ' in gui_test_mode') if parsed_args.gui_test_mode else ('')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes from the previous request were not addressed.
ea010a7
to
90a8db4
Compare
I addressed the changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job. It can fix around 8 current issues :)
Tribler has mechanisms that should ensure that only a single instance of a GUI/Core process is running:
QtSingleApplication
is responsible for thatsingle_tribler_instance
context manager.Unfortunately, both of these mechanisms are vulnerable to race conditions, so it is possible to run two Tribler instances simultaneously, which leads to bizarre bugs like this. The error itself does not provide any clue that the actual reason for it is the simultaneous start of two Tribler instances.
I was able to reproduce the situation when two Tribler instances were started at the exact moment but got a different error. That means other bizarre errors we observe in Sentry may have the same origin.
Because of this, we need to implement a proper way to ensure the uniqueness of Tribler processes, and this PR does this (and provides some other nice features that I'll describe a bit later).
To check the process's uniqueness properly, we need to use some kind of lock. It is not easy to implement such a lock in a cross-platform way. In this PR, I use a dedicated SQLite database file to provide the atomicity of the check. SQLite already implements locks for all platforms supported by Tribler, so using SQLite, we avoid new possible sources of bugs when implementing the lock.
The database file introduced in this PR not only provides a lock when launching Core and GUI processes but also keeps the history of all Tribler processes that were launched on the machine, and it should allow providing an additional source of information when debugging the race conditions during the Tribler run. Previously we did not have information on what Tribler processes were started on the local machine, how long they were working, and whether several processes were launched at the same moment.
The database is placed in the root Tribler directory (alongside the currently using
triblerd.lock
file) and contains the following information:I implemented the logic using native SQLite and not any ORM to ensure that the code is simple and has no "surprises".