
Improve Core Process Error Handling #7699

Merged (28 commits, Mar 4, 2024)

Conversation

@xoriole (Contributor) commented Nov 20, 2023

This PR addresses

Here is an example of the Sentry report that is sent:

The following additional entry is added to the report:

core_restart_logs: [
CoreRestartLog(pid=1232211, started_at='2023-11-27 14:36:14', uptime='0:00:11',  time_to_shutdown='0:00:04',  exit_code=1, exit_status='0'), 
CoreRestartLog(pid=1232278, started_at='2023-11-27 14:36:25', uptime='still running',  time_to_shutdown='unknown',  exit_code=None, exit_status='None')
],
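For readers unfamiliar with the report format, a minimal sketch of what one such restart-log entry could look like as a Python dataclass is shown below. The field names mirror the example above, but this is only an illustration, not necessarily the PR's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoreRestartLog:
    """One entry of the core_restart_logs list attached to a Sentry report (sketch)."""
    pid: int
    started_at: str           # e.g. '2023-11-27 14:36:14'
    uptime: str               # e.g. '0:00:11', or 'still running' for the live process
    time_to_shutdown: str     # e.g. '0:00:04', or 'unknown'
    exit_code: Optional[int]  # None while the process is still running
    exit_status: str          # exit status rendered as a string, e.g. '0' or 'None'
```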

@xoriole xoriole force-pushed the feature/restart branch 3 times, most recently from 7ba5afd to dd17989 Compare November 27, 2023 13:39
@xoriole xoriole marked this pull request as ready for review November 27, 2023 13:46
@xoriole xoriole requested a review from a team as a code owner November 27, 2023 13:46
@xoriole xoriole requested review from drew2a and removed request for a team November 27, 2023 13:46
.github/workflows/pytest.yml (outdated, resolved)
@xoriole xoriole changed the base branch from release/7.13 to main November 29, 2023 13:51
@xoriole xoriole marked this pull request as draft November 29, 2023 13:51
@xoriole xoriole marked this pull request as ready for review November 29, 2023 15:24
@drew2a (Collaborator) commented Nov 30, 2023

The field core_restart_logs was added to the tags:

[screenshot]

It feels that the more appropriate place for this field is additional_information or Reporter:

[screenshot]

@drew2a (Collaborator) commented Nov 30, 2023

@xoriole could you please rebase it to the main branch?

# If should_stop is True, the error is critical, so we first save it to the file to ensure
# that it will be reported to the GUI via events endpoint either immediately or later
# after core restart. This saved file will be deleted when the error is reported to the GUI.
if should_stop:
Collaborator:

From my understanding, if error reports are handled by the exception handler, then they will be sent to the GUI without any problem.

Could you please provide a way to reproduce the initial problem that this PR is aimed to solve?

Contributor Author:

To reproduce a part of the solution, CTRL + ALT + SHIFT + C will trigger a core error which is sent to the GUI as an error report and in turn restarts the core. The core_restart_logs should be visible on sentry then.

The other part, where the core exception is logged and sent to Sentry on the next restart, is difficult to reproduce. To test this, I manually placed a saved exception in the crash_logs directory inside the state directory.

Contributor Author:

Here is a sample exception file
1701359190810-TriblerCoreTestException.json
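For anyone repeating this manual test, a hedged sketch of producing such a file follows. The crash_logs location and the "&lt;epoch-ms&gt;-&lt;ExceptionName&gt;.json" naming are inferred from the comments and the sample file name above; the JSON fields are purely illustrative, not the real schema.

```python
import json
import time
from pathlib import Path


def write_test_crash_log(state_dir: Path) -> Path:
    """Drop an illustrative serialized exception into <state_dir>/crash_logs
    so it can be picked up on the next Core start (manual-testing sketch)."""
    crash_logs = state_dir / "crash_logs"
    crash_logs.mkdir(parents=True, exist_ok=True)

    file_name = f"{int(time.time() * 1000)}-TriblerCoreTestException.json"
    payload = {  # illustrative fields only
        "type": "TriblerCoreTestException",
        "text": "Tribler Core Test Exception",
        "should_stop": True,
    }
    path = crash_logs / file_name
    path.write_text(json.dumps(payload, indent=2))
    return path
```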

Collaborator:

> To reproduce a part of the solution, CTRL + ALT + SHIFT + C will trigger a core error which is sent to the GUI as an error report and in turn restarts the core. The core_restart_logs should be visible on sentry then.

Should I press it within the Tribler GUI? Is it Ctrl+Opt+Shift+C on macOS?

Contributor Author:

It is a keyboard shortcut to trigger a core exception.

self.tribler_gui_test_exception_shortcut = QShortcut(QKeySequence("Ctrl+Alt+Shift+G"), self)
connect(self.tribler_gui_test_exception_shortcut.activated, self.on_test_tribler_gui_exception)
self.tribler_core_test_exception_shortcut = QShortcut(QKeySequence("Ctrl+Alt+Shift+C"), self)
connect(self.tribler_core_test_exception_shortcut.activated, self.on_test_tribler_core_exception)

Collaborator:

Basically, this combination just calls core_exception_handler.unhandled_error_observer, which is not a way to reproduce the real use case of the improvements proposed in this PR.

    async def core_test_exception(self, _):
        self._logger.info('Test Tribler Core exception was triggered')
        if self.core_exception_handler:
            try:
                raise TriblerCoreTestException('Tribler Core Test Exception')
            except TriblerCoreTestException as e:
                context = dict(should_stop=True, message='Test message', exception=e)
                self.core_exception_handler.unhandled_error_observer(None, context)
        else:
            self._logger.info('Exception handler is not set in DebugEndpoint')

        return RESTResponse({"success": True})

So for me, it is still not clear which case the improvement covers. I understand the theoretical point from the original issue but do not understand their relation with the implementation.

In practice, which type of errors could trigger the new logic and become visible for us but not visible with the old logic? Where should an error be raised in the core to prove the new logic adds value to error handling?

Contributor Author:

> In practice, which type of errors could trigger the new logic and become visible for us but not visible with the old logic?

Here is the original issue this PR was developed to tackle:

And here is the linked Sentry issue for it:

In the case where the Core connection is lost, we do not know what caused the issue. From the uptime, we see that Tribler had been running for 7205 seconds when the core connection was suddenly lost. The original error in the core that broke the GUI connection is unknown.

With this PR, such a core error should be logged to a file (assuming the exception handler on the core catches the exception) and sent to the GUI when the core connection is re-established after the core restart. The error report sent by the user should then give us more information about the original cause of the previous exception.
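A rough sketch of the save-and-resend flow described here, reusing the crash_logs layout assumed in the earlier comments; the function names and the delivery callback are illustrative, not the PR's actual code.

```python
import json
from pathlib import Path


def save_crash_report(crash_logs_dir: Path, reported_error: dict) -> None:
    """Persist a critical core error so it survives a core crash and restart."""
    crash_logs_dir.mkdir(parents=True, exist_ok=True)
    name = f"{reported_error['timestamp_ms']}-{reported_error['type']}.json"
    (crash_logs_dir / name).write_text(json.dumps(reported_error))


def resend_saved_crash_reports(crash_logs_dir: Path, send_to_gui) -> None:
    """After the restarted core re-establishes the GUI connection, replay the
    saved errors and delete each file only once it has been delivered."""
    for path in sorted(crash_logs_dir.glob("*.json")):
        send_to_gui(json.loads(path.read_text()))  # e.g. via the events endpoint
        path.unlink()
```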

> Where should an error be raised in the core to prove the new logic adds value to error handling?

I'm not able to reproduce such a case practically under a normal Tribler run, which is why I suggested the shortcut-based approach above.

Collaborator:

I understand your intention and appreciate it. I'm just questioning the implementation, particularly whether the suggested change could reveal the real case of #7603.

I'm uncertain about the change because we actually cannot imagine the potential case for the error; therefore, we cannot verify the solution, even theoretically. The change you propose is significant (15 new commits, +700 LOC), and it alters the interaction between Core and GUI considerably, which could lead to unforeseen consequences. I'm not saying the PR is good or bad; I'm just highlighting the risks and questioning whether they are worth taking.

From my POV, if the error is successfully handled by the core's error handler, then it should be sent to the GUI without any problem, so the fix you suggested might be insufficient.

One of the explanations for #7603 could be the possibility that someone (maybe even a user) just killed the process following the observation that the process consumes a lot of CPU and/or memory resources.

Another explanation could be some weird bug in aiohttp that led to an HTTP server crash, which in turn led to the CoreConnectionError. With this explanation, the suggested solution could work, but we need to verify it somehow.

Collaborator:

I came up with two more probable scenarios that could lead to a CoreConnectionError but are not related to aiohttp. I'll explain them tomorrow.

Collaborator:

I've added scenarios here: #7647 (comment)

@xoriole (Contributor Author) commented Nov 30, 2023

> The field core_restart_logs was added to the tags: [screenshot]
>
> It feels that the more appropriate place for this field is `additional_information` or `Reporter`: [screenshot]

Thank you for pointing that out. It should be fixed now.
An example: here

@xoriole (Contributor Author) commented Dec 11, 2023

Here is a summary of the issue based on what I know so far:

| Platform | Error code | Uptime |
| --- | --- | --- |
| Windows 10 (Dec 11) | 2 | 37018 seconds |
| Linux; Flatpak | 1 | 163 seconds |
| Linux; Flatpak | 1 | 4336 seconds |
| ... | ... | ... |
| Linux; Flatpak | 1 | 24 seconds |
| Linux; Flatpak | 1 | 158 seconds |
| Windows | 2 | 6900 seconds |
| Windows | 2 | 75543 seconds |
| Linux; Flatpak | 1 | N/A |
| Windows | 2 | 329 seconds |
| Linux; Flatpak | 1 | 28205 seconds |
| Windows | 2 | 1793 seconds |
| Windows | 2 | 6957 seconds |
| Windows 7 | 2 | 805 seconds |
| Windows 10 | 2 | 2888 seconds |
| Linux; Flatpak | 1 | 5 seconds |
| Windows 10 | 2 | 2423 seconds |
| Windows 10 | 2 | 58490 seconds |
| Linux; Flatpak | 1 | 800 seconds |
| Linux; Flatpak | 1 | 3 seconds |
| Windows 10 | 2 | 13059 seconds |

The issue related to Linux; Flatpak mostly occurs at startup and was reproducible. This should be fixed with #7660. The Qt network error code 1 also confirms that it is Connection Refused.

The second issue, related to Windows, is mysterious, and there is no clear indication of its cause. Error code 2 indicates that the remote host has closed the connection. One likely cause is that the core has terminated for some reason, breaking the connection. This is reproducible with a simple example using an aiohttp server and QNetworkAccessManager on the Qt side (see the sketch below).
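A minimal sketch of such a reproduction is shown below. The port, the way the server kills itself, and the file split are arbitrary choices for illustration; QNetworkReply.RemoteHostClosedError is the error with enum value 2.

```python
# --- server.py: an aiohttp handler that kills its own process mid-request ---
import os

from aiohttp import web


async def die(request):
    # Terminate abruptly before any HTTP response is written, so the client
    # only ever sees the TCP connection being closed by the remote host.
    os._exit(1)


app = web.Application()
app.add_routes([web.get("/", die)])
web.run_app(app, port=8085)
```

```python
# --- client.py: QNetworkAccessManager observing the resulting error code ---
import sys

from PyQt5.QtCore import QUrl
from PyQt5.QtNetwork import QNetworkAccessManager, QNetworkReply, QNetworkRequest
from PyQt5.QtWidgets import QApplication

app = QApplication(sys.argv)
manager = QNetworkAccessManager()


def on_finished(reply: QNetworkReply):
    # Against the dying server above, this prints error code 2 (RemoteHostClosedError).
    print("error code:", int(reply.error()), "-", reply.errorString())
    app.quit()


manager.finished.connect(on_finished)
manager.get(QNetworkRequest(QUrl("http://127.0.0.1:8085/")))
sys.exit(app.exec_())
```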

Another possible scenario, suggested by @drew2a, is asyncio loop blockage (#7647 (comment)), but I'm not able to reproduce the issue. Even though the core is frozen, the Qt-side connection is not closed; rather, it can time out, which is a separate error code (5) instead of error code 2. So I believe a frozen or deadlocked core will not cause the CoreConnectionError with error code 2.

So I'm open to suggestions from the reviewers on how this PR should proceed.

@drew2a (Collaborator) commented Dec 11, 2023

@xoriole, nice summary!

Regarding the PR, my suggestion is to treat it as general core process error handling rather than particular handling of the original CoreConnectionError.

If it works with the deadlock case and with the frozen asyncio case, then let's merge it. If it doesn't work, let's improve it and then merge.

It looks like a nice improvement, but we have to have confirmation that it improves error handling (and we can prove it by executing scenarios described in #7647 (comment)). We can also extend this list, but I'm out of good ideas.

@kozlovsky (Collaborator) left a comment:

In my opinion, this PR adds valuable functionality.

As I see it, the PR implements two independent features:

  1. The first feature is restarting Core in case of a serious error without restarting GUI. It is beneficial for user experience. Not all Core problems can be fixed by restart, but some can.

  2. The second feature is storing serious Core errors on disk before reporting them to GUI and re-sending them to the GUI if delivering them successfully the first time was not possible.

Both parts of the current PR rely on the established connection between GUI and Core. For that reason, the solution implemented in this PR looks partial:

  1. The PR logic only restarts Core if an error was successfully sent to GUI. If the Core error happens before the connection to GUI is established, Core will not be restarted.

  2. If sending the error to the GUI immediately is impossible, the PR logic sends it again after the connection to the GUI is established after the restart. But if Core constantly crashes before the connection to GUI is established (the most interesting error case, actually!), GUI will never receive the errors stored on disk.

However, extending the logic in a subsequent PR is possible by changing how the serialized errors are handled. Instead of waiting until the restarted Core process reads errors from the disk and sends them to the GUI process via the newly established connection, the GUI process can directly read serialized errors from the disk even before the new Core process is started. It should not be hard to make this change in a subsequent PR to extend the range of the handled startup errors significantly.
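To make that idea concrete, here is a rough GUI-side illustration of reading the serialized errors straight from disk before a new Core process is launched. The directory name and JSON format reuse the assumptions from the earlier sketches; this is not existing Tribler code.

```python
import json
from pathlib import Path
from typing import List


def collect_core_crash_reports(state_dir: Path) -> List[dict]:
    """GUI-side sketch: read errors the previous Core process serialized before
    it died, without waiting for a new Core process and a fresh events connection."""
    crash_logs = state_dir / "crash_logs"  # assumed location, see earlier sketches
    if not crash_logs.exists():
        return []
    reports = []
    for path in sorted(crash_logs.glob("*.json")):
        reports.append(json.loads(path.read_text()))
        path.unlink()  # consumed by the GUI, so the restarted Core won't resend it
    return reports
```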

Also, GUI can restart the crashed Core process even before the connection to GUI is established. GUI already does it in case of database corruption, and we can extend the range of errors handled this way. For that, we can extend the subset of the Core process's exit codes for which the restart of the Core process should be performed.
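A sketch of that exit-code-based policy might look like the following; the constant names and values are hypothetical placeholders rather than Tribler's actual exit codes.

```python
# Hypothetical GUI-side restart policy keyed on the Core process exit code.
EXIT_CODE_DB_CORRUPTED = 99         # placeholder value
EXIT_CODE_CRITICAL_CORE_ERROR = 98  # placeholder value

RESTARTABLE_EXIT_CODES = {EXIT_CODE_DB_CORRUPTED, EXIT_CODE_CRITICAL_CORE_ERROR}


def should_restart_core(exit_code: int, shutting_down: bool) -> bool:
    """Decide whether the GUI should start a new Core process after the old one exited."""
    return not shutting_down and exit_code in RESTARTABLE_EXIT_CODES
```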

In conclusion, the implemented logic is useful and can be extended further in a subsequent PR to cover a broader range of possible errors. I have some minor comments about unused attributes and logical flags handling.

"""
return dataclasses.replace(self, **changes)

def serialized_copy(self) -> 'ReportedError':
Collaborator:

It looks like the serialized_copy method is only used in tests. If that is the intention, it is worth mentioning it in the docstring.

Contributor Author:

It is now used in EventsEndpoint.

self.core_running = False
self.core_connected = False
self.shutting_down = False
self.core_finished = False
self.is_restarting = False

self.core_manager_started_at = int(time.time())
Collaborator:

It looks like the core_manager_started_at attribute is unnecessary and can be removed

Contributor Author:

You're right, it is no longer used. I have removed it now.

def shutdown_request_processed(self, response):
self._logger.info(f"{SHUTDOWN_ENDPOINT} request was processed by Core. Response: {response}")

def send_shutdown_request(self, initial=False):
Collaborator:

The initial flag and the shutdown request re-sending logic are obsolete and can be deleted. We no longer re-send the shutdown request; instead, we set the request.cancellable = False flag for the request to avoid its accidental cancellation. It seems I forgot to remove it in #7205.

Contributor Author:

Removed initial flag


self.old_core_process_pid = self.core_process.pid()
self.should_quit_app_on_core_finished = False
self.shutting_down = False
Collaborator:

It looks incorrect to flip the shutting_down flag back and forth. If some code has already set it to True, that should be the final decision. The same applies to self.events_manager.shutting_down = False a few lines further down.

Both self.shutting_down and events_manager.shutting_down flags are set to True only in the CoreManager.stop() method, so it looks logical that after this method is called, no attempts to restart the core should be performed.

I'd rather say it should be the other way around: if the shutting_down flag is set to True, the restart_core method should do nothing.
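A guard along those lines could look like the sketch below (method body elided; this illustrates the suggested behavior, not the PR's code).

```python
def restart_core(self):
    # If a shutdown has already been requested, that decision is final:
    # do not attempt to bring the Core back up.
    if self.shutting_down:
        return
    # ... proceed with the actual restart logic ...
```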

Contributor Author:

I have removed this part of the code. I agree that shutting down should be the final call; however, I've opted not to do that refactoring as part of this PR. That could be a separate PR dedicated to simplifying the states in the CoreManager.

@@ -158,7 +189,7 @@ def check_core_api_port(self, *args):
return

core_process = self.process_manager.current_process.get_core_process()
if core_process is not None and core_process.api_port:
if core_process is not None and core_process.api_port and core_process.pid != self.old_core_process_pid:
Collaborator:

This new core_process.pid != self.old_core_process_pid looks suspicious to me. Can we 100% guarantee the PID of the new process can't be the same as the old process's PID in all exotic environments like Flatpak?

It looks like this additional check is unnecessary. We call the check_core_api_port event only after successfully connecting to the Core.

I think it is better to remove this new check.

Contributor Author:

I have removed this check; it was mostly there for Flatpak.

@@ -139,3 +150,10 @@ def _stop_tribler(self, text):

if self.tribler_window.debug_window:
self.tribler_window.debug_window.setHidden(True)

def _restart_tribler(self, text):
Collaborator:

We do not restart full Tribler here; only Core is restarted. I think we should rename the method to _restart_core or _restart_tribler_core.

Contributor Author:

Renamed to _restart_tribler_core

@@ -303,6 +373,9 @@ def on_core_finished(self, exit_code, exit_status):
if self.shutting_down:
if self.should_quit_app_on_core_finished:
self.app_manager.quit_application()
elif self.is_restarting:
self.is_restarting = False
Collaborator:

The is_restarting flag is set to True at the beginning of the start_tribler_core method, so setting it here looks unnecessary.

Contributor Author:

This part of the code was to check whether the finish was part of a restart. To make it more explicit, I have defined another variable, wait_for_finished_to_restart_core.

self.send_exception(reported_error)
else:
if reported_error.should_stop:
self.exception_handler.save_to_file(reported_error)
Collaborator:

The handling of the should_stop flag is spread across several files and is therefore not obvious. It may be better to handle it directly right here:

Suggested change:
- self.exception_handler.save_to_file(reported_error)
+ self.exception_handler.save_to_file(reported_error.copy(should_stop=False))
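For reference, the copy(...) call in the suggestion is just dataclasses.replace under the hood, as the diff context near serialized_copy shows. A standalone illustration with a hypothetical subset of ReportedError's fields:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ReportedErrorSketch:  # hypothetical subset of the real ReportedError fields
    type: str
    text: str
    should_stop: bool

    def copy(self, **changes) -> 'ReportedErrorSketch':
        return dataclasses.replace(self, **changes)


error = ReportedErrorSketch(type='TriblerCoreTestException', text='boom', should_stop=True)
saved = error.copy(should_stop=False)  # the persisted copy must not trigger another stop
assert error.should_stop is True and saved.should_stop is False
```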

Contributor Author:

Good point! Used serialized_copy()

@xoriole (Contributor Author) commented Feb 14, 2024

@kozlovsky I have addressed your review comments and would appreciate your re-review.

Add more tests

Fix issue with tmp directory clean up in tests
@xoriole xoriole merged commit d420a66 into Tribler:main Mar 4, 2024
20 checks passed