
Improve Core Process Error Handling #7699

Merged (28 commits, Mar 4, 2024)

Conversation

@xoriole (Contributor) commented Nov 20, 2023

This PR addresses

Here is an example of the Sentry report that is sent:

The following additional entry is added to the report:

core_restart_logs: [
CoreRestartLog(pid=1232211, started_at='2023-11-27 14:36:14', uptime='0:00:11',  time_to_shutdown='0:00:04',  exit_code=1, exit_status='0'), 
CoreRestartLog(pid=1232278, started_at='2023-11-27 14:36:25', uptime='still running',  time_to_shutdown='unknown',  exit_code=None, exit_status='None')
],
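For readers unfamiliar with the report format, a minimal sketch of what one such restart-log entry could look like as a Python dataclass is shown below. The field names mirror the example above, but this is only an illustration, not necessarily the PR's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoreRestartLog:
    """One entry of the core_restart_logs list attached to a Sentry report (sketch)."""
    pid: int
    started_at: str           # e.g. '2023-11-27 14:36:14'
    uptime: str               # e.g. '0:00:11', or 'still running' for the live process
    time_to_shutdown: str     # e.g. '0:00:04', or 'unknown'
    exit_code: Optional[int]  # None while the process is still running
    exit_status: str          # exit status rendered as a string, e.g. '0' or 'None'
```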

@xoriole xoriole force-pushed the feature/restart branch 3 times, most recently from 7ba5afd to dd17989 Compare November 27, 2023 13:39
@xoriole xoriole marked this pull request as ready for review November 27, 2023 13:46
@xoriole xoriole requested a review from a team as a code owner November 27, 2023 13:46
@xoriole xoriole requested review from drew2a and removed request for a team November 27, 2023 13:46
.github/workflows/pytest.yml (outdated, resolved)
@xoriole xoriole changed the base branch from release/7.13 to main November 29, 2023 13:51
@xoriole xoriole marked this pull request as draft November 29, 2023 13:51
@xoriole xoriole marked this pull request as ready for review November 29, 2023 15:24
@drew2a (Collaborator) commented Nov 30, 2023

The field core_restart_logs was added to the tags:

[screenshot]

It feels that the more appropriate place for this field is additional_information or Reporter:

[screenshot]

@drew2a (Collaborator) commented Nov 30, 2023

@xoriole could you please rebase it to the main branch?

# If should_stop is True, the error is critical, so we first save it to the file to ensure
# that it will be reported to the GUI via events endpoint either immediately or later
# after core restart. This saved file will be deleted when the error is reported to the GUI.
if should_stop:
Collaborator:

From my understanding, if error reports are handled by the exception handler, then they will be sent to the GUI without any problem.

Could you please provide a way to reproduce the initial problem that this PR is aimed to solve?

Contributor Author:

To reproduce a part of the solution, CTRL + ALT + SHIFT + C will trigger a core error which is sent to the GUI as an error report and in turn restarts the core. The core_restart_logs should be visible on sentry then.

The other part, where the core exception is logged and sent to Sentry on the next restart, is difficult to reproduce. To test this, I manually placed a saved exception in the crash_logs directory inside the state directory.

Contributor Author:

Here is a sample exception file
1701359190810-TriblerCoreTestException.json
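For anyone repeating this manual test, a hedged sketch of producing such a file follows. The crash_logs location and the "&lt;epoch-ms&gt;-&lt;ExceptionName&gt;.json" naming are inferred from the comments and the sample file name above; the JSON fields are purely illustrative, not the real schema.

```python
import json
import time
from pathlib import Path


def write_test_crash_log(state_dir: Path) -> Path:
    """Drop an illustrative serialized exception into <state_dir>/crash_logs
    so it can be picked up on the next Core start (manual-testing sketch)."""
    crash_logs = state_dir / "crash_logs"
    crash_logs.mkdir(parents=True, exist_ok=True)

    file_name = f"{int(time.time() * 1000)}-TriblerCoreTestException.json"
    payload = {  # illustrative fields only
        "type": "TriblerCoreTestException",
        "text": "Tribler Core Test Exception",
        "should_stop": True,
    }
    path = crash_logs / file_name
    path.write_text(json.dumps(payload, indent=2))
    return path
```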

Collaborator:

> To reproduce a part of the solution, CTRL + ALT + SHIFT + C will trigger a core error which is sent to the GUI as an error report and in turn restarts the core. The core_restart_logs should be visible on sentry then.

Should I press it within the Tribler GUI? Is it Ctrl+Opt+Shift+C on macOS?

Contributor Author:

It is a keyboard shortcut to trigger a core exception.

self.tribler_gui_test_exception_shortcut = QShortcut(QKeySequence("Ctrl+Alt+Shift+G"), self)
connect(self.tribler_gui_test_exception_shortcut.activated, self.on_test_tribler_gui_exception)
self.tribler_core_test_exception_shortcut = QShortcut(QKeySequence("Ctrl+Alt+Shift+C"), self)
connect(self.tribler_core_test_exception_shortcut.activated, self.on_test_tribler_core_exception)

Collaborator:

Basically, this combination just calls core_exception_handler.unhandled_error_observer, which is not a way to reproduce the real use case of the improvements proposed in this PR.

    async def core_test_exception(self, _):
        self._logger.info('Test Tribler Core exception was triggered')
        if self.core_exception_handler:
            try:
                raise TriblerCoreTestException('Tribler Core Test Exception')
            except TriblerCoreTestException as e:
                context = dict(should_stop=True, message='Test message', exception=e)
                self.core_exception_handler.unhandled_error_observer(None, context)
        else:
            self._logger.info('Exception handler is not set in DebugEndpoint')

        return RESTResponse({"success": True})

So for me, it is still not clear which case the improvement covers. I understand the theoretical point from the original issue but do not understand their relation with the implementation.

In practice, which type of errors could trigger the new logic and become visible for us but not visible with the old logic? Where should an error be raised in the core to prove the new logic adds value to error handling?

Contributor Author:

> In practice, which type of errors could trigger the new logic and become visible for us but not visible with the old logic?

Here is the original issue this PR was developed to tackle:

And here is the linked Sentry issue for it:

In the case where the Core connection is lost, we do not know what caused the issue. From the uptime, we see that Tribler had been running for 7205 seconds when the core connection was suddenly lost. The original error in the core that broke the GUI connection is unknown.

With this PR, such a core error should be logged to a file (assuming the exception handler on the core catches the exception) and sent to the GUI when the core connection is re-established after the core restart. The error report sent by the user should then give us more information about the original cause of the previous exception.
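A rough sketch of the save-and-resend flow described here, reusing the crash_logs layout assumed in the earlier comments; the function names and the delivery callback are illustrative, not the PR's actual code.

```python
import json
from pathlib import Path


def save_crash_report(crash_logs_dir: Path, reported_error: dict) -> None:
    """Persist a critical core error so it survives a core crash and restart."""
    crash_logs_dir.mkdir(parents=True, exist_ok=True)
    name = f"{reported_error['timestamp_ms']}-{reported_error['type']}.json"
    (crash_logs_dir / name).write_text(json.dumps(reported_error))


def resend_saved_crash_reports(crash_logs_dir: Path, send_to_gui) -> None:
    """After the restarted core re-establishes the GUI connection, replay the
    saved errors and delete each file only once it has been delivered."""
    for path in sorted(crash_logs_dir.glob("*.json")):
        send_to_gui(json.loads(path.read_text()))  # e.g. via the events endpoint
        path.unlink()
```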

> Where should an error be raised in the core to prove the new logic adds value to error handling?

I'm not able to reproduce such a case practically under a normal Tribler run, which is why I suggested the shortcut-based approach above.

Collaborator:

I understand your intention and appreciate it. I'm just questioning the implementation, particularly whether the suggested change could reveal the real case of #7603.

I'm uncertain about the change because we actually cannot imagine the potential case for the error; therefore, we cannot verify the solution, even theoretically. The change you propose is significant (15 new commits, +700 LOC), and it alters the interaction between Core and GUI considerably, which could lead to unforeseen consequences. I'm not saying the PR is good or bad; I'm just highlighting the risks and questioning whether they are worth taking.

From my POV, if the error is successfully handled by the core's error handler, then it should be sent to the GUI without any problem, so the fix you suggested might be insufficient.

One of the explanations for #7603 could be the possibility that someone (maybe even a user) just killed the process following the observation that the process consumes a lot of CPU and/or memory resources.

Another explanation could be some weird bug in aiohttp that led to an HTTP server crash, which in turn led to the CoreConnectionError. With this explanation, the suggested solution could work, but we need to verify it somehow.

Collaborator:

I came up with two more probable scenarios that could lead to a CoreConnectionError but are not related to aiohttp. I'll explain them tomorrow.

Collaborator:

I've added scenarios here: #7647 (comment)

@xoriole (Contributor Author) commented Nov 30, 2023

> The field core_restart_logs was added to the tags: [screenshot]
>
> It feels that the more appropriate place for this field is `additional_information` or `Reporter`: [screenshot]

Thank you for pointing that out. It should be fixed now.
An example: here

@xoriole (Contributor Author) commented Dec 11, 2023

Here is a summary of the issue based on what I know so far:

| Platform | Error code | Uptime |
| --- | --- | --- |
| Windows 10 (Dec 11) | 2 | 37018 seconds |
| Linux; Flatpak | 1 | 163 seconds |
| Linux; Flatpak | 1 | 4336 seconds |
| ... | ... | ... |
| Linux; Flatpak | 1 | 24 seconds |
| Linux; Flatpak | 1 | 158 seconds |
| Windows | 2 | 6900 seconds |
| Windows | 2 | 75543 seconds |
| Linux; Flatpak | 1 | N/A |
| Windows | 2 | 329 seconds |
| Linux; Flatpak | 1 | 28205 seconds |
| Windows | 2 | 1793 seconds |
| Windows | 2 | 6957 seconds |
| Windows 7 | 2 | 805 seconds |
| Windows 10 | 2 | 2888 seconds |
| Linux; Flatpak | 1 | 5 seconds |
| Windows 10 | 2 | 2423 seconds |
| Windows 10 | 2 | 58490 seconds |
| Linux; Flatpak | 1 | 800 seconds |
| Linux; Flatpak | 1 | 3 seconds |
| Windows 10 | 2 | 13059 seconds |

The issue related to Linux; Flatpak mostly occurs at startup and was reproducible. This should be fixed with #7660. The Qt network error code 1 also confirms that it is Connection Refused.

The second issue, related to Windows, is mysterious, and there is no clear indication of its cause. Error code 2 indicates that the remote host has closed the connection. One likely cause is that the core has terminated for some reason, breaking the connection. This is reproducible with a simple example using an aiohttp server and QNetworkAccessManager on the Qt side (see the sketch below).
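A minimal sketch of such a reproduction is shown below. The port, the way the server kills itself, and the file split are arbitrary choices for illustration; QNetworkReply.RemoteHostClosedError is the error with enum value 2.

```python
# --- server.py: an aiohttp handler that kills its own process mid-request ---
import os

from aiohttp import web


async def die(request):
    # Terminate abruptly before any HTTP response is written, so the client
    # only ever sees the TCP connection being closed by the remote host.
    os._exit(1)


app = web.Application()
app.add_routes([web.get("/", die)])
web.run_app(app, port=8085)
```

```python
# --- client.py: QNetworkAccessManager observing the resulting error code ---
import sys

from PyQt5.QtCore import QUrl
from PyQt5.QtNetwork import QNetworkAccessManager, QNetworkReply, QNetworkRequest
from PyQt5.QtWidgets import QApplication

app = QApplication(sys.argv)
manager = QNetworkAccessManager()


def on_finished(reply: QNetworkReply):
    # Against the dying server above, this prints error code 2 (RemoteHostClosedError).
    print("error code:", int(reply.error()), "-", reply.errorString())
    app.quit()


manager.finished.connect(on_finished)
manager.get(QNetworkRequest(QUrl("http://127.0.0.1:8085/")))
sys.exit(app.exec_())
```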

Another possible scenario, suggested by @drew2a, is asyncio loop blockage (#7647 (comment)), but I'm not able to reproduce the issue. Even though the core is frozen, the Qt-side connection is not closed; rather, it can time out, which is a separate error code (5) instead of error code 2. So I believe a frozen or deadlocked core will not cause the CoreConnectionError with error code 2.

So I'm open to suggestions from the reviewers on how this PR should proceed.

@drew2a (Collaborator) commented Dec 11, 2023

@xoriole, nice summary!

Regarding the PR, my suggestion is to treat it as general core process error handling rather than particular handling of the original CoreConnectionError.

If it works with the deadlock case and with the frozen asyncio case, then let's merge it. If it doesn't work, let's improve it and then merge.

It looks like a nice improvement, but we have to have confirmation that it improves error handling (and we can prove it by executing scenarios described in #7647 (comment)). We can also extend this list, but I'm out of good ideas.

@kozlovsky (Collaborator) left a comment:

In my opinion, this PR adds valuable functionality.

As I see it, the PR implements two independent features:

  1. The first feature is restarting Core in case of a serious error without restarting GUI. It is beneficial for user experience. Not all Core problems can be fixed by restart, but some can.

  2. The second feature is storing serious Core errors on disk before reporting them to GUI and re-sending them to the GUI if delivering them successfully the first time was not possible.

Both parts of the current PR rely on the established connection between GUI and Core. For that reason, the solution implemented in this PR looks partial:

  1. The PR logic only restarts Core if an error was successfully sent to GUI. If the Core error happens before the connection to GUI is established, Core will not be restarted.

  2. If sending the error to the GUI immediately is impossible, the PR logic sends it again after the connection to the GUI is established after the restart. But if Core constantly crashes before the connection to GUI is established (the most interesting error case, actually!), GUI will never receive the errors stored on disk.

However, extending the logic in a subsequent PR is possible by changing how the serialized errors are handled. Instead of waiting until the restarted Core process reads errors from the disk and sends them to the GUI process via the newly established connection, the GUI process can directly read serialized errors from the disk even before the new Core process is started. It should not be hard to make this change in a subsequent PR to extend the range of the handled startup errors significantly.
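To make that idea concrete, here is a rough GUI-side illustration of reading the serialized errors straight from disk before a new Core process is launched. The directory name and JSON format reuse the assumptions from the earlier sketches; this is not existing Tribler code.

```python
import json
from pathlib import Path
from typing import List


def collect_core_crash_reports(state_dir: Path) -> List[dict]:
    """GUI-side sketch: read errors the previous Core process serialized before
    it died, without waiting for a new Core process and a fresh events connection."""
    crash_logs = state_dir / "crash_logs"  # assumed location, see earlier sketches
    if not crash_logs.exists():
        return []
    reports = []
    for path in sorted(crash_logs.glob("*.json")):
        reports.append(json.loads(path.read_text()))
        path.unlink()  # consumed by the GUI, so the restarted Core won't resend it
    return reports
```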

Also, GUI can restart the crashed Core process even before the connection to GUI is established. GUI already does it in case of database corruption, and we can extend the range of errors handled this way. For that, we can extend the subset of the Core process's exit codes for which the restart of the Core process should be performed.
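A sketch of that exit-code-based policy might look like the following; the constant names and values are hypothetical placeholders rather than Tribler's actual exit codes.

```python
# Hypothetical GUI-side restart policy keyed on the Core process exit code.
EXIT_CODE_DB_CORRUPTED = 99         # placeholder value
EXIT_CODE_CRITICAL_CORE_ERROR = 98  # placeholder value

RESTARTABLE_EXIT_CODES = {EXIT_CODE_DB_CORRUPTED, EXIT_CODE_CRITICAL_CORE_ERROR}


def should_restart_core(exit_code: int, shutting_down: bool) -> bool:
    """Decide whether the GUI should start a new Core process after the old one exited."""
    return not shutting_down and exit_code in RESTARTABLE_EXIT_CODES
```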

In conclusion, the implemented logic is useful and can be extended further in a subsequent PR to cover a broader range of possible errors. I have some minor comments about unused attributes and logical flags handling.

"""
return dataclasses.replace(self, **changes)

def serialized_copy(self) -> 'ReportedError':
Collaborator:

It looks like the serialized_copy method is only used in tests. If that is the intention, it is worth mentioning it in the docstring.

Contributor Author:

It is now used in EventsEndpoint.

self.core_running = False
self.core_connected = False
self.shutting_down = False
self.core_finished = False
self.is_restarting = False

self.core_manager_started_at = int(time.time())
Collaborator:

It looks like the core_manager_started_at attribute is unnecessary and can be removed

Contributor Author:

You're right, it is no longer used. I have removed it now.

def shutdown_request_processed(self, response):
self._logger.info(f"{SHUTDOWN_ENDPOINT} request was processed by Core. Response: {response}")

def send_shutdown_request(self, initial=False):
Collaborator:

The initial flag and the shutdown request re-sending logic are obsolete and can be deleted. We no longer re-send the shutdown request; instead, we set the request.cancellable = False flag for the request to avoid its accidental cancellation. It seems I forgot to remove it in #7205.

Contributor Author:

Removed initial flag


self.old_core_process_pid = self.core_process.pid()
self.should_quit_app_on_core_finished = False
self.shutting_down = False
Collaborator:

It looks incorrect to flip the shutting_down flag back and forth. If some code has already set it to True, that should be the final decision. The same applies to self.events_manager.shutting_down = False a few lines further down.

Both self.shutting_down and events_manager.shutting_down flags are set to True only in the CoreManager.stop() method, so it looks logical that after this method is called, no attempts to restart the core should be performed.

I'd rather say it should be the other way around: if the shutting_down flag is set to True, the restart_core method should do nothing.
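A guard along those lines could look like the sketch below (method body elided; this illustrates the suggested behavior, not the PR's code).

```python
def restart_core(self):
    # If a shutdown has already been requested, that decision is final:
    # do not attempt to bring the Core back up.
    if self.shutting_down:
        return
    # ... proceed with the actual restart logic ...
```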

Contributor Author:

I have removed this part of the code. I agree that shutting down should be the final call; however, I've opted not to do that refactoring as part of this PR. That could be a separate PR dedicated to simplifying the states in the CoreManager.

@@ -158,7 +189,7 @@ def check_core_api_port(self, *args):
return

core_process = self.process_manager.current_process.get_core_process()
if core_process is not None and core_process.api_port:
if core_process is not None and core_process.api_port and core_process.pid != self.old_core_process_pid:
Collaborator:

This new core_process.pid != self.old_core_process_pid looks suspicious to me. Can we 100% guarantee the PID of the new process can't be the same as the old process's PID in all exotic environments like Flatpak?

It looks like this additional check is unnecessary. We call the check_core_api_port event only after successfully connecting to the Core.

I think it is better to remove this new check.

Contributor Author:

I have removed this check; it was mostly there for Flatpak.

@@ -139,3 +150,10 @@ def _stop_tribler(self, text):

if self.tribler_window.debug_window:
self.tribler_window.debug_window.setHidden(True)

def _restart_tribler(self, text):
Collaborator:

We do not restart full Tribler here; only Core is restarted. I think we should rename the method to _restart_core or _restart_tribler_core.

Contributor Author:

Renamed to _restart_tribler_core

@@ -303,6 +373,9 @@ def on_core_finished(self, exit_code, exit_status):
if self.shutting_down:
if self.should_quit_app_on_core_finished:
self.app_manager.quit_application()
elif self.is_restarting:
self.is_restarting = False
Collaborator:

The is_restarting flag is set to True at the beginning of the start_tribler_core method, so setting it here looks unnecessary.

Contributor Author:

This part of the code was to check whether the finish was part of a restart. To make it more explicit, I have defined another variable, wait_for_finished_to_restart_core.

self.send_exception(reported_error)
else:
if reported_error.should_stop:
self.exception_handler.save_to_file(reported_error)
Collaborator:

The handling of the should_stop flag is spread across several files and is therefore not obvious. It may be better to handle it directly right here:

Suggested change:
- self.exception_handler.save_to_file(reported_error)
+ self.exception_handler.save_to_file(reported_error.copy(should_stop=False))
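For reference, the copy(...) call in the suggestion is just dataclasses.replace under the hood, as the diff context near serialized_copy shows. A standalone illustration with a hypothetical subset of ReportedError's fields:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ReportedErrorSketch:  # hypothetical subset of the real ReportedError fields
    type: str
    text: str
    should_stop: bool

    def copy(self, **changes) -> 'ReportedErrorSketch':
        return dataclasses.replace(self, **changes)


error = ReportedErrorSketch(type='TriblerCoreTestException', text='boom', should_stop=True)
saved = error.copy(should_stop=False)  # the persisted copy must not trigger another stop
assert error.should_stop is True and saved.should_stop is False
```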

Contributor Author:

Good point! Used serialized_copy()

@xoriole (Contributor Author) commented Feb 14, 2024

@kozlovsky I have addressed your review comments and would appreciate your re-review.

Add more tests

Fix issue with tmp directory clean up in tests
@xoriole xoriole merged commit d420a66 into Tribler:main Mar 4, 2024
20 checks passed