Add ProcessLocker to ensure that no more than one Tribler GUI/Core process runs locally at any moment #7212

kozlovsky · 2022-12-02T08:30:57Z

Tribler has mechanisms that should ensure that only a single instance of a GUI/Core process is running:

For the GUI process, QtSingleApplication is responsible for that.
For the Core process, uniqueness is controlled by ProcessChecker with the help of the single_tribler_instance context manager.

Unfortunately, both of these mechanisms are vulnerable to race conditions, so it is possible to run two Tribler instances simultaneously, which leads to bizarre bugs like this. The error itself does not provide any clue that the actual reason for it is the simultaneous start of two Tribler instances.

I was able to reproduce the situation where two Tribler instances were started at the same moment, but I got a different error. That means other bizarre errors we observe in Sentry may have the same origin.

Because of this, we need to implement a proper way to ensure the uniqueness of Tribler processes, and this PR does this (and provides some other nice features that I'll describe a bit later).

To check the process's uniqueness properly, we need to use some kind of lock. It is not easy to implement such a lock in a cross-platform way. In this PR, I use a dedicated SQLite database file to provide the atomicity of the check. SQLite already implements locks for all platforms supported by Tribler, so using SQLite, we avoid new possible sources of bugs when implementing the lock.
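The idea of an atomic check through SQLite can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code from this PR: the function name `try_become_primary`, the `processes` table, and its columns are hypothetical. The check and the registration happen inside one exclusive transaction, so two processes starting at the same moment cannot both see "no primary process yet".

```python
import sqlite3


def try_become_primary(db_path: str, kind: str, pid: int) -> bool:
    """Register the process and return True if it became the primary one of its kind.

    Hypothetical helper: the table layout and names are assumptions made for this sketch.
    A real implementation would also have to detect stale rows left by crashed processes.
    """
    # isolation_level=None disables implicit transactions, so BEGIN/COMMIT are issued explicitly.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # BEGIN EXCLUSIVE acquires SQLite's own file lock; a concurrent caller
        # waits here (up to the connection timeout) until the lock is released.
        conn.execute("BEGIN EXCLUSIVE")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS processes ("
            "id INTEGER PRIMARY KEY, kind TEXT NOT NULL, "
            "pid INTEGER NOT NULL, is_primary INTEGER NOT NULL)"
        )
        row = conn.execute(
            "SELECT pid FROM processes WHERE kind = ? AND is_primary = 1", (kind,)
        ).fetchone()
        is_primary = row is None
        # Every launch is recorded, whether or not it became the primary process.
        conn.execute(
            "INSERT INTO processes (kind, pid, is_primary) VALUES (?, ?, ?)",
            (kind, pid, int(is_primary)),
        )
        conn.execute("COMMIT")
        return is_primary
    except Exception:
        # Roll back only if a transaction was actually opened.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()
```

With this sketch, the first call for a given kind returns True, and every later call for the same kind returns False until the primary row is cleared.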

The database file introduced in this PR not only provides a lock when launching Core and GUI processes but also keeps the history of all Tribler processes launched on the machine, which should give us an additional source of information when debugging race conditions during a Tribler run. Previously, we had no information about which Tribler processes were started on the local machine, how long they ran, or whether several processes were launched at the same moment.

The database is placed in the root Tribler directory (alongside the currently used triblerd.lock file) and contains the following information:

What Tribler GUI & Core processes were ever launched on the local machine;
What processes are currently running;
Why each process finished;
How long each process ran.

I implemented the logic using native SQLite and not any ORM to ensure that the code is simple and has no "surprises".
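As an illustration of how the information listed above could be stored and queried with the plain sqlite3 module, here is a small sketch. The schema, column names, and helper functions are assumptions made for this example and may differ from the actual table in the PR.

```python
import sqlite3

# Hypothetical schema: one row per launched process.
SCHEMA = """
CREATE TABLE IF NOT EXISTS processes (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,        -- 'gui' or 'core'
    pid INTEGER NOT NULL,
    started_at REAL NOT NULL,  -- Unix timestamp of the launch
    finished_at REAL,          -- NULL while the process is still running
    exit_reason TEXT           -- why the process finished, if known
)
"""


def running_processes(conn: sqlite3.Connection) -> list:
    """Return (kind, pid) for processes that have not finished yet."""
    return conn.execute(
        "SELECT kind, pid FROM processes WHERE finished_at IS NULL ORDER BY id"
    ).fetchall()


def finished_processes(conn: sqlite3.Connection) -> list:
    """Return (kind, pid, duration_seconds, exit_reason) for finished processes."""
    return conn.execute(
        "SELECT kind, pid, finished_at - started_at, exit_reason "
        "FROM processes WHERE finished_at IS NOT NULL ORDER BY id"
    ).fetchall()
```

Plain SQL keeps every query visible in one place, which matches the goal of having no hidden behavior from an ORM layer.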

drew2a · 2022-12-14T13:29:10Z

Closes #7065, #7069, #6948, #7232, #7234

Could be related: #5252, #7222

src/run_tribler.py

src/tribler/core/components/reporter/exception_handler.py

src/tribler/core/exceptions.py

src/tribler/core/sentry_reporter/sentry_reporter.py

src/tribler/core/start_core.py

src/tribler/core/utilities/process_locker.py

src/tribler/gui/start_gui.py

drew2a · 2023-01-05T08:55:48Z

src/tribler/core/utilities/process_manager/tests/test_manager.py

+@patch.object(logger, 'warning')
+def test_global_process_manager(warning: Mock, process_manager: ProcessManager):
+    assert get_global_process_manager() is None
+
+    set_global_process_manager(process_manager)
+    assert get_global_process_manager() is process_manager
+
+    set_global_process_manager(None)
+    assert get_global_process_manager() is None


This test uses a singleton for storing the global process manager. This could lead to flaky behavior in the test suite, as we had before.

The test also does not test the main functionality of the setter (the locking).

That's why it is better to just remove it.

I'm ok with removing the test, but I don't see the reason for it, so I suggest discussing it one more time. In my opinion, it is better to have this test for better coverage of the code that handles the global singleton. Without the coverage, it may be possible that the logic will be broken in some later refactoring.

The test does not contain any complex logic and returns the global singleton to a previous state right after the test.

The test also does not test the main functionality of the setter (the locking).

I don't think the lock is the main functionality of get_global_process_manager/set_global_process_manager. There are two types of locks: long and short. If a long lock wraps non-trivial code, a thread may spend too much time inside it, and another thread can stall for too long while trying to acquire it. An even worse scenario is a deadlock, when two threads acquire two different locks in an improper order.

Neither of these potential problems is possible in the current test. The logic inside the lock is very simple and contains just a single Python statement:

with _lock:
    global_process_manager = process_manager

and

with _lock:
    return global_process_manager

It is an example of a short lock that covers only a simple instruction, protecting access to a global variable so that the read/write operation is atomic. It is just not possible to stay inside the lock for too long. There is no non-trivial function call inside the lock that can delay the exit from the lock. So, in this case, the lock is not the "main" functionality of the function; it is a trivial and very short lock for protecting atomic access to a global variable. The lock functionality is already tested in Python's test suite, and we can quickly check the code inside the lock to see that it does not have any potential to create a problem.
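For reference, the short-lock pattern discussed here looks roughly like this as a complete module. It is a minimal sketch: the function names follow the ones in this thread, but the `object` type annotation and the underscore-prefixed variable name are assumptions made to keep the example self-contained.

```python
import threading
from typing import Optional

_lock = threading.Lock()
_global_process_manager: Optional[object] = None  # stand-in type for ProcessManager


def set_global_process_manager(process_manager: Optional[object]) -> None:
    global _global_process_manager
    # The lock is held only for a single assignment, so it cannot be held for long.
    with _lock:
        _global_process_manager = process_manager


def get_global_process_manager() -> Optional[object]:
    # The lock is held only for a single read of the global variable.
    with _lock:
        return _global_process_manager
```

Neither function calls other code while holding the lock, so no second lock can be acquired inside it and a lock-ordering deadlock cannot arise from these two functions alone.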

Update: I added a try/finally block inside the test to ensure it has no chance to affect the following tests in case of an exception. In any case, the test suite fails if an exception occurs, so this test cannot pass without an error and still cause a mysterious problem in a subsequent test. So, even in the (highly improbable) scenario that this test fails and affects the global state, it will also be marked as failed, which should explain the actual reason for any other error.

def test_global_process_manager(process_manager: ProcessManager):
    assert get_global_process_manager() is None
    try:
        set_global_process_manager(process_manager)
        assert get_global_process_manager() is process_manager
    finally:
        set_global_process_manager(None)
        assert get_global_process_manager() is None

Why don't you use simple singleton creation for the ProcessManager as we have done for the ExceptionHandler?

https://github.com/Tribler/tribler/blob/main/src/tribler/core/components/reporter/exception_handler.py#L131

In this case, it is not necessary to test the setter and the getter.

Because we pass the current_process object to the ProcessManager constructor, and it is different for GUI and Core processes. To create a singleton at the end of the tribler.core.utilities.process_manager.manager module, I need to create TriblerProcess before it, and I prefer to create the TriblerProcess instance in tribler.gui.start_gui/tribler.core.start_core instead.

Also, running all tests without a global singleton feels a bit safer to me regarding test isolation than testing with a globally-defined-during-the-module-import singleton.

I don't think the lock is the main functionality of get_global_process_manager/set_global_process_manager.

Why then do you use locks? What corner case do you want to cover?

I need to create TriblerProcess before it, and I prefer to create the TriblerProcess instance in tribler.gui.start_gui/tribler.core.start_core instead.

It is not necessary. You can create a global singleton before and set the current process to it after.

Why then do you use locks? What corner case do you want to cover?

It is necessary to have a lock when it is potentially possible to work with the same global variable from two different threads

It is not necessary. You can create a global singleton before and set the current process to it after.

Then the API will be more complicated, as it becomes necessary to make current_process optional and check it in all places.

Ok, I do not agree with this, but I removed the test to avoid long disputes

kozlovsky · 2023-01-09T07:47:53Z

src/tribler/gui/start_gui.py

@@ -38,8 +38,12 @@ def run_gui(api_port, api_key, root_state_dir, parsed_args):
        logger.info('Enabling a workaround for Ubuntu 21.04+ wayland environment')
        os.environ["GDK_BACKEND"] = "x11"

-    # Set up logging
-    load_logger_config('tribler-gui', root_state_dir)


As it turns out, the previous code contains a bug. Tribler Core/GUI processes should not load the logger config before determining whether the current process is the primary one. Otherwise, both primary and secondary processes start writing to the same log files, and the logging module does not support writing to the same log file from multiple processes. This leads to random errors during log rotation, such as using an incorrect file descriptor.

In this PR, I fix the problem by first checking whether the current process is primary and then initializing file-based logging for a primary process only.
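The ordering can be illustrated with a small sketch: decide first whether the process is primary, and only then attach a file handler. The function name `setup_logging`, its parameters, and the choice to let a secondary process log to stderr are assumptions for this example; they are not taken from the PR.

```python
import logging
import logging.handlers
from pathlib import Path


def setup_logging(is_primary: bool, log_dir: Path, name: str) -> logging.Logger:
    """Attach a rotating file handler only when the process is the primary one."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if is_primary:
        # Only the primary process opens (and later rotates) the log file.
        handler = logging.handlers.RotatingFileHandler(
            log_dir / f"{name}.log", maxBytes=1_000_000, backupCount=3
        )
    else:
        # Assumption for this sketch: a secondary process writes to stderr only.
        handler = logging.StreamHandler()
    logger.addHandler(handler)
    return logger
```

Because a secondary process never opens the log file, rotation is performed by exactly one process.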

kozlovsky · 2023-01-09T07:54:36Z

src/tribler/core/utilities/tests/test_utilities.py

    load_logger_config('test', tmpdir)
-    assert len(logging.root.manager.loggerDict) >= logger_count


As it turns out, the test that checks the loading of the logger configuration was incorrect for two reasons:

It modifies the global configuration of the logger handlers, which can affect the subsequent tests (the logger configuration is not completely reset after each test).

The assert does not check anything relevant. The logger_count value is the same before and after the test. What changes during the logger configuration is the handlers, not the loggers, so we should check the handlers and not the logger count.

The new version of the test mocks the dictConfig function, so the logger configuration does not change, and the state of the subsequent tests is not affected.
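A test of this shape might look like the sketch below. The `load_logger_config` function here is a hypothetical stand-in with an invented config dictionary, since the real Tribler loader is not shown in this thread; only the mocking technique with `unittest.mock.patch` is the point.

```python
import logging
import logging.config
from unittest.mock import patch


def load_logger_config(app_mode: str, log_dir: str) -> None:
    """Hypothetical stand-in for the real loader: builds a config dict and applies it."""
    config = {
        "version": 1,
        "handlers": {
            "file": {
                "class": "logging.FileHandler",
                "filename": f"{log_dir}/{app_mode}.log",
            },
        },
        "root": {"handlers": ["file"], "level": "INFO"},
    }
    logging.config.dictConfig(config)


def test_load_logger_config(tmpdir: str = "some-log-dir") -> None:
    # dictConfig is replaced by a mock, so no handler is created and no file is opened.
    with patch("logging.config.dictConfig") as dict_config:
        load_logger_config("test", tmpdir)
    dict_config.assert_called_once()
    config = dict_config.call_args[0][0]
    # Check what the loader tried to configure: handlers, not the logger count.
    assert "file" in config["handlers"]
```

Since the real dictConfig is never called, the global logging state is the same before and after the test.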


kozlovsky · 2023-01-09T09:10:29Z

src/tribler/gui/start_gui.py

@@ -25,7 +27,7 @@


 def run_gui(api_port, api_key, root_state_dir, parsed_args):
-    logger.info('Running GUI' + ' in gui_test_mode' if parsed_args.gui_test_mode else '')
+    logger.info(f"Running GUI in {'gui_test_mode' if parsed_args.gui_test_mode else 'normal mode'}")


Previously, because the conditional expression has lower precedence than +, an empty string was logged in normal mode. The old expression was parsed as:

('Running GUI' + ' in gui_test_mode') if parsed_args.gui_test_mode else ('')
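The difference is easy to check in isolation. In the sketch below, the helper names `old_message` and `new_message` are introduced only for illustration; the expressions themselves are the ones from the diff.

```python
def old_message(gui_test_mode: bool) -> str:
    # Parsed as ('Running GUI' + ' in gui_test_mode') if gui_test_mode else ''
    return 'Running GUI' + ' in gui_test_mode' if gui_test_mode else ''


def new_message(gui_test_mode: bool) -> str:
    # The conditional now selects only the mode name inside the f-string.
    return f"Running GUI in {'gui_test_mode' if gui_test_mode else 'normal mode'}"


print(repr(old_message(False)))  # ''
print(repr(new_message(False)))  # 'Running GUI in normal mode'
```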

drew2a

The changes from the previous request were not addressed.

kozlovsky · 2023-01-09T14:04:41Z

I addressed the changes

drew2a

Great job. It can fix around 8 current issues :)

kozlovsky force-pushed the process_locker branch 6 times, most recently from abc4b5c to 7ae2a89 Compare December 9, 2022 08:31

kozlovsky force-pushed the process_locker branch from 7ae2a89 to b61c3c8 Compare December 12, 2022 12:00

kozlovsky force-pushed the process_locker branch 7 times, most recently from 7439c85 to 0e4f031 Compare December 21, 2022 13:40

kozlovsky changed the title ~~[WIP] Add process_locker~~ Ensure that no more than one Tribler GUI/Core process runs on a single machine at any moment Dec 21, 2022

kozlovsky changed the title ~~Ensure that no more than one Tribler GUI/Core process runs on a single machine at any moment~~ Add ProcessLocker to ensure that no more than one Tribler GUI/Core process runs locally at any moment Dec 21, 2022

kozlovsky marked this pull request as ready for review December 21, 2022 13:50

kozlovsky requested a review from a team as a code owner December 21, 2022 13:50

kozlovsky requested review from drew2a and removed request for a team December 21, 2022 13:50

kozlovsky force-pushed the process_locker branch from 0e4f031 to 8727410 Compare December 21, 2022 13:53