Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ProcessLocker to ensure that no more than one Tribler GUI/Core process runs locally at any moment #7212

Merged
merged 58 commits into from Jan 9, 2023

Conversation

kozlovsky
Copy link
Collaborator

@kozlovsky kozlovsky commented Dec 2, 2022

Tribler has mechanisms that should ensure that only a single instance of a GUI/Core process is running:

  • For GUI process QtSingleApplication is responsible for that
  • For the Core process, the uniqueness is controlled by 'ProcessChecker' with the help of a single_tribler_instance context manager.

Unfortunately, both of these mechanisms are vulnerable to race conditions, so it is possible to run two Tribler instances simultaneously, which leads to bizarre bugs like this. The error itself does not provide any clue that the actual reason for it is the simultaneous start of two Tribler instances.

I was able to reproduce the situation when two Tribler instances were started at the exact moment but got a different error. That means other bizarre errors we observe in Sentry may have the same origin.

Because of this, we need to implement a proper way to ensure the uniqueness of Tribler processes, and this PR does this (and provides some other nice features that I'll describe a bit later).

To check the process's uniqueness properly, we need to use some kind of lock. It is not easy to implement such a lock in a cross-platform way. In this PR, I use a dedicated SQLite database file to provide the atomicity of the check. SQLite already implements locks for all platforms supported by Tribler, so using SQLite, we avoid new possible sources of bugs when implementing the lock.

The database file introduced in this PR not only provides a lock when launching Core and GUI processes but also keeps the history of all Tribler processes that were launched on the machine, and it should allow providing an additional source of information when debugging the race conditions during the Tribler run. Previously we did not have information on what Tribler processes were started on the local machine, how long they were working, and whether several processes were launched at the same moment.

The database is placed in the root Tribler directory (alongside the currently using triblerd.lock file) and contains the following information:

  • What Tribler GUI & Core processes were ever launched on the local machine;
  • What processes are currently working;
  • What is the reason the process was finished;
  • How long each process worked;

I implemented the logic using native SQLite and not any ORM to ensure that the code is simple and has no "surprises".

@drew2a
Copy link
Collaborator

drew2a commented Dec 14, 2022

Closes #7065, #7069, #6948, #7232, #7234

Could be related: #5252, #7222

@kozlovsky kozlovsky force-pushed the process_locker branch 7 times, most recently from 7439c85 to 0e4f031 Compare December 21, 2022 13:40
@kozlovsky kozlovsky changed the title [WIP] Add process_locker Ensure that no more than one Tribler GUI/Core process runs on a single machine at any moment Dec 21, 2022
@kozlovsky kozlovsky changed the title Ensure that no more than one Tribler GUI/Core process runs on a single machine at any moment Add ProcessLocker to ensure that no more than one Tribler GUI/Core process runs locally at any moment Dec 21, 2022
@kozlovsky kozlovsky marked this pull request as ready for review December 21, 2022 13:50
@kozlovsky kozlovsky requested a review from a team as a code owner December 21, 2022 13:50
@kozlovsky kozlovsky requested review from drew2a and removed request for a team December 21, 2022 13:50
src/tribler/core/sentry_reporter/sentry_reporter.py Outdated Show resolved Hide resolved
src/tribler/core/start_core.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/process_locker.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/process_locker.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/process_locker.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/process_locker.py Outdated Show resolved Hide resolved
@kozlovsky kozlovsky requested a review from drew2a January 3, 2023 14:46
Comment on lines 86 to 95
@patch.object(logger, 'warning')
def test_global_process_manager(warning: Mock, process_manager: ProcessManager):
assert get_global_process_manager() is None

set_global_process_manager(process_manager)
assert get_global_process_manager() is process_manager

set_global_process_manager(None)
assert get_global_process_manager() is None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses a singleton for storing the global process manager. This could lead to a flaky behavior for the test suite as we had before.

The test as well not tests the main functionality of setter (the locking).

That's why it is better to just remove it.

Copy link
Collaborator Author

@kozlovsky kozlovsky Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with removing the test, but I don't see the reason for it, so I suggest discussing it one more time. In my opinion, it is better to have this test for better coverage of the code that handles the global singleton. Without the coverage, it may be possible that the logic will be broken in some later refactoring.

The test does not contain any complex logic and returns the global singleton to a previous state right after the test.

The test as well not tests the main functionality of setter (the locking).

I don't think the lock is the main functionality of get_global_process_manager/set_global_process_manager. There are two types of locks - long and short. If the code contains a non-trivial code inside a long lock, it may spend too much time inside it, and another thread can freeze for too long to acquire it. An even worse scenario is a deadlock when two threads acquire two different locks in an improper order.

Neither of these potential problems is possible in the current test. The logic inside the lock is very simple and contains just a single Python statement:

    with _lock:
        global_process_manager = process_manager

and

    with _lock:
        return global_process_manager

It is an example of a short lock that only covers a simple instruction to protect access to a global variable to make read/write operation atomic. It is just not possible to stay inside the lock for too long. There is no non-trivial function call inside the lock that can delay the exit from the lock. So, in this case, the lock is not a "main" functionality of the function; it is a trivial and very short lock for protecting the atomic access to a global variable. The lock functionality is already tested in Python's test suite, and we can quickly check the code inside the lock to see that it does not have any potential to create a problem.

Copy link
Collaborator Author

@kozlovsky kozlovsky Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I added a try/finally block inside the test to ensure it has no chance to affect the following tests in case of an exception. In any case, the test suite will fail in case of exception, so there is no chance that after this test, we can have a mysterious problem in a subsequent test, but this test passes without an error. So, even in the (highly improbable) scenario that this test will fail and affect the global state, it will also be marked as failed and should explain the actual reason for any other error.

def test_global_process_manager(process_manager: ProcessManager):
    assert get_global_process_manager() is None

    try:
        set_global_process_manager(process_manager)
        assert get_global_process_manager() is process_manager
    finally:
        set_global_process_manager(None)

    assert get_global_process_manager() is None

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you use simple singleton creation for the ProcessManager as we have done for the ExceptionHandler?

https://github.com/Tribler/tribler/blob/main/src/tribler/core/components/reporter/exception_handler.py#L131

In this case, it is not necessary to test the setter and the getter.

Copy link
Collaborator Author

@kozlovsky kozlovsky Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we pass the current_process object to the ProcessManager constructor, and it is different for GUI and Core processes. To create a singleton at the end of the tribler.core.utilities.process_manager.manager module, I need to create TriblerProcess before it, and I prefer to create the TriblerProcess instance in tribler.gui.start_gui/tribler.core.start_core instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, running all tests without a global singleton feels a bit safer to me regarding test isolation than testing with a globally-defined-during-the-module-import singleton.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the lock is the main functionality of get_global_process_manager/set_global_process_manager.

Why then do use locks? What corner case do you want to cover?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to create TriblerProcess before it, and I prefer to create the TriblerProcess instance in tribler.gui.start_gui/tribler.core.start_core instead.

It is not necessary. You can create a global singleton before and set the current process to it after.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why then do use locks? What corner case do you want to cover?

It is necessary to have a lock when it is potentially possible to work with the same global variable from two different threads

It is not necessary. You can create a global singleton before and set the current process to it after.

The the API will be more complicated, as it becomes necessary to make the current_process optional and check it in all places

Ok, I do not agree with this, but I removed the test to avoid long disputes

@@ -38,8 +38,12 @@ def run_gui(api_port, api_key, root_state_dir, parsed_args):
logger.info('Enabling a workaround for Ubuntu 21.04+ wayland environment')
os.environ["GDK_BACKEND"] = "x11"

# Set up logging
load_logger_config('tribler-gui', root_state_dir)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it turns out, the previous code contains a bug. Tribler Core/GUI processes should not load logger config before determining whether the current process is the primary or not. Otherwise, primary and secondary processes start writing to log files, and the logging module does not support writing to the same log file from multiple processes. It leads to random errors during the log rotation, like, using an incorrect file descriptor.

In this PR, I fix the problem by first checking whether the current process is primary and then initializing file-based logging for a primary process only.

load_logger_config('test', tmpdir)
assert len(logging.root.manager.loggerDict) >= logger_count
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it turns out, the test that checks the loading of the logger configuration was incorrect for two reasons:

  1. It modifies the global configuration of the logger handlers, which can affect the subsequent tests (the logger configuration does not completely reset after each test)
  2. The assert check does not check anything relevant. The logger_count value is the same before and after the test. What changed during the logger configuration is handlers and not loggers, so we should check handlers and not the loggers count.

The new version of the test mocks the dictConfig function, so the logger configuration does not change, and the state of the subsequent tests is not affected.

@kozlovsky kozlovsky requested a review from drew2a January 9, 2023 07:56
@kozlovsky kozlovsky force-pushed the process_locker branch 2 times, most recently from 10f048c to baa2822 Compare January 9, 2023 08:39
@@ -25,7 +27,7 @@


def run_gui(api_port, api_key, root_state_dir, parsed_args):
logger.info('Running GUI' + ' in gui_test_mode' if parsed_args.gui_test_mode else '')
logger.info(f"Running GUI in {'gui_test_mode' if parsed_args.gui_test_mode else 'normal mode'}")
Copy link
Collaborator Author

@kozlovsky kozlovsky Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, because of the precedence rules, an empty string was logged in normal mode. Basically:

('Running GUI' + ' in gui_test_mode') if parsed_args.gui_test_mode else ('')

Copy link
Collaborator

@drew2a drew2a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from the previous request were not addressed.

@kozlovsky
Copy link
Collaborator Author

I addressed the changes

Copy link
Collaborator

@drew2a drew2a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job. It can fix around 8 current issues :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants