Improve Core Process Error Handling #7699
Conversation
Force-pushed from 7ba5afd to dd17989
Force-pushed from dd17989 to ae5d6f2
@xoriole could you please rebase it to the main branch?
# If should_stop is True, the error is critical, so we first save it to the file to ensure
# that it will be reported to the GUI via events endpoint either immediately or later
# after core restart. This saved file will be deleted when the error is reported to the GUI.
if should_stop:
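The save-then-delete flow described in the comment above can be sketched roughly as follows. This is an illustrative sketch, not Tribler's actual implementation: the directory name, file-name scheme, and helper names are assumptions.

```python
import json
import time
from pathlib import Path

# Hypothetical crash-log location; Tribler keeps its own directory inside
# the state directory.
CRASH_LOGS_DIR = Path("crash_logs")


def save_critical_error(error_name: str, payload: dict) -> Path:
    """Persist a critical (should_stop=True) error so it survives a core restart."""
    CRASH_LOGS_DIR.mkdir(parents=True, exist_ok=True)
    # Millisecond timestamp prefix keeps files sortable by occurrence time.
    path = CRASH_LOGS_DIR / f"{int(time.time() * 1000)}-{error_name}.json"
    path.write_text(json.dumps(payload))
    return path


def mark_reported(path: Path) -> None:
    """Delete the saved file once the GUI has confirmed receipt of the error."""
    path.unlink(missing_ok=True)
```

The point of the pattern is that the write happens before any restart is attempted, so the report cannot be lost even if the process dies immediately afterwards.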
From my understanding, if error reports are handled by the exception handler, then they will be sent to the GUI without any problem.
Could you please provide a way to reproduce the initial problem that this PR is aimed to solve?
To reproduce a part of the solution, CTRL + ALT + SHIFT + C will trigger a core error which is sent to the GUI as an error report and in turn restarts the core. The core_restart_logs should then be visible on Sentry.
The other part, where the core exception is logged and sent to Sentry on the next restart, is difficult to reproduce. To test it, I manually placed a saved exception in the crash_logs directory inside the state directory.
Here is a sample exception file
1701359190810-TriblerCoreTestException.json
To reproduce a part of the solution, CTRL + ALT + SHIFT + C will trigger a core error which is sent to the GUI as an error report and in turn restarts the core. The core_restart_logs should be visible on sentry then.
Should I press it within the Tribler GUI? Is it Ctrl+Opt+Shift+C on macOS?
It is a keyboard shortcut to trigger a core exception.
tribler/src/tribler/gui/tribler_window.py
Lines 237 to 241 in 8bb85b5
self.tribler_gui_test_exception_shortcut = QShortcut(QKeySequence("Ctrl+Alt+Shift+G"), self)
connect(self.tribler_gui_test_exception_shortcut.activated, self.on_test_tribler_gui_exception)
self.tribler_core_test_exception_shortcut = QShortcut(QKeySequence("Ctrl+Alt+Shift+C"), self)
connect(self.tribler_core_test_exception_shortcut.activated, self.on_test_tribler_core_exception)
Basically, this combination just calls core_exception_handler.unhandled_error_observer, which is not a way to reproduce the real use case of the improvements proposed in this PR.
async def core_test_exception(self, _):
    self._logger.info('Test Tribler Core exception was triggered')
    if self.core_exception_handler:
        try:
            raise TriblerCoreTestException('Tribler Core Test Exception')
        except TriblerCoreTestException as e:
            context = dict(should_stop=True, message='Test message', exception=e)
            self.core_exception_handler.unhandled_error_observer(None, context)
    else:
        self._logger.info('Exception handler is not set in DebugEndpoint')
    return RESTResponse({"success": True})
So for me, it is still not clear which case the improvement covers. I understand the theoretical point from the original issue but do not understand its relation to the implementation.
In practice, which type of errors could trigger the new logic and become visible for us but not visible with the old logic? Where should an error be raised in the core to prove the new logic adds value to error handling?
In practice, which type of errors could trigger the new logic and become visible for us but not visible with the old logic?
Here is the original issue this PR is developed to tackle:
And, here is the linked sentry issue for it:
In the case where the Core connection is lost, we do not know what caused the issue. From the uptime, we see that Tribler was running for 7205 seconds. Then suddenly the core connection was lost and we have no idea what caused it. The original error in the core that broke the GUI connection is unknown.
With this PR, such a core error should be logged as a file (assuming the exception handler on the core catches such an exception) and sent to the GUI when the core connection is re-established after the core restart. Then the error report sent by the user should give us more information about the original cause of the previous exception.
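The re-delivery side described above can be sketched like this. The function name, directory layout, and `send_to_gui` callback are assumptions for illustration; they mirror the discussion, not Tribler's exact code.

```python
import json
from pathlib import Path


def resend_saved_errors(crash_logs_dir: Path, send_to_gui) -> int:
    """On the next core start, read back any persisted errors, hand them to
    the GUI (e.g. via the events endpoint), and delete each file once sent."""
    sent = 0
    # Sorted so errors are re-delivered in the order they occurred
    # (the timestamp prefix in the file name makes this the occurrence order).
    for path in sorted(crash_logs_dir.glob("*.json")):
        report = json.loads(path.read_text())
        send_to_gui(report)
        path.unlink()  # the saved file is deleted once reported to the GUI
        sent += 1
    return sent
```

With this in place, an error report written just before a crash reaches the GUI as soon as the connection is re-established after the restart.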
Where should an error be raised in the core to prove the new logic adds value to error handling?
Regarding reproducing such a case practically under a normal Tribler run, I'm not able to do so, which is why the above suggestion was mentioned.
I understand your intention and appreciate it. I'm just questioning the implementation, particularly whether the suggested change could reveal the real case of #7603.
I'm uncertain about the change because we actually cannot imagine the potential case for the error; therefore, we cannot verify the solution, even theoretically. It is such a significant change you propose (15 new commits, +700 LOC) and it changes the interaction between Core and GUI pretty significantly, which could lead to unforeseen circumstances. I'm not saying the PR is good or bad; I'm just highlighting the risks and questioning the worthiness of taking them.
From my POV, if the error is successfully handled by the core's error handler, then it should be sent to the GUI without any problem, so the fix you suggested might be insufficient.
One of the explanations for #7603 could be the possibility that someone (maybe even a user) just killed the process following the observation that the process consumes a lot of CPU and/or memory resources.
Another explanation could be some weird bug in aiohttp that led to an HTTP server crash, which in turn led to the CoreConnectionError. With this explanation, the suggested solution could work, but we need to verify it somehow.
I came up with two more probable scenarios that could lead to a CoreConnectionError but are not related to aiohttp. I'll explain them tomorrow.
I've added scenarios here: #7647 (comment)
Force-pushed from 67124a5 to 78eacd8
Thank you for pointing that out. It should be fixed now.
Here is a summary of the issue based on what I know so far:
-
The first issue, related to Linux/Flatpak, occurs mostly at startup and was reproducible. It should be fixed with #7660; the Qt network error code 1 also confirms this.
-
The second issue, related to Windows, is mysterious, and there is no clear indication of its cause. Error code 2 indicates that the remote host has closed the connection. One likely cause of the connection being closed is that the core terminated for some reason, breaking the connection. This is reproducible with a simple example using an aiohttp server and QNetworkAccessManager on the Qt side. Another possible scenario suggested by @drew2a, asyncio loop blockage (#7647 (comment)), could be a cause, but I'm not able to reproduce the issue that way. Even though the core is frozen, the Qt-side connection is not closed; rather, it can time out, which is a separate error (code 5) instead of error code 2. So I believe a frozen or deadlocked core will not cause the CoreConnectionError with error code 2.
So, I'm open to suggestions from the reviewers on how to proceed with this PR.
@xoriole, nice summary! Regarding the PR, my suggestion is to treat it as general core process error handling rather than as particular handling of the original issue. If it works with the deadlock case and with the frozen asyncio case, then let's merge it. If it doesn't work, let's improve it and then merge. It looks like a nice improvement, but we have to have confirmation that it improves error handling (and we can prove it by executing the scenarios described in #7647 (comment)). We can also extend this list, but I'm out of good ideas.
In my opinion, this PR adds valuable functionality.
As I see it, the PR implements two independent features:
-
The first feature is restarting Core in case of a serious error without restarting GUI. It is beneficial for user experience. Not all Core problems can be fixed by restart, but some can.
-
The second feature is storing serious Core errors on disk before reporting them to GUI, and re-sending errors to GUI in case they could not be sent successfully the first time.
Both parts of the current PR rely on the established connection between GUI and Core. For that reason, the solution implemented in this PR looks partial:
-
The PR logic only restarts Core if an error was successfully sent to GUI. If the Core error happens before the connection to GUI is established, Core will not be restarted.
-
If sending the error to the GUI immediately is impossible, the PR logic sends it again after the connection to the GUI is established after the restart. But if Core constantly crashes before the connection to GUI is established (the most interesting error case, actually!), GUI will never receive the errors stored on disk.
However, extending the logic in a subsequent PR is possible by changing how the serialized errors are handled. Instead of waiting until the restarted Core process reads errors from the disk and sends them to the GUI process via the newly established connection, the GUI process can directly read serialized errors from the disk even before the new Core process is started. It should not be hard to make this change in a subsequent PR to extend the range of the handled startup errors significantly.
Also, GUI can restart the crashed Core process even before the connection to GUI is established. GUI already does it in case of database corruption, and we can extend the range of errors handled this way. For that, we can extend the subset of the Core process's exit codes for which the restart of the Core process should be performed.
In conclusion, the implemented logic is useful and can be extended further in a subsequent PR to cover a broader range of possible errors. I have some minor comments about unused attributes and logical flags handling.
""" | ||
return dataclasses.replace(self, **changes) | ||
|
||
def serialized_copy(self) -> 'ReportedError': |
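The copy pattern shown in the diff above can be sketched as a minimal, self-contained example. The field names are illustrative, not Tribler's exact ReportedError definition; the point is that a frozen dataclass returns a modified clone via dataclasses.replace instead of mutating itself.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ReportedError:
    type: str
    text: str
    should_stop: bool = False

    def copy(self, **changes) -> 'ReportedError':
        # dataclasses.replace builds a new instance with the given fields
        # overridden; the original instance is left untouched.
        return dataclasses.replace(self, **changes)
```

For example, `err.copy(should_stop=False)` yields a clone that will not trigger another restart, while the original report stays intact.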
It looks like the serialized_copy method is only used in tests. If that is the intention, it is worth mentioning in the docstring.
It is now used in EventsEndpoint.
src/tribler/gui/core_manager.py
Outdated
        self.core_running = False
        self.core_connected = False
        self.shutting_down = False
        self.core_finished = False
        self.is_restarting = False

        self.core_manager_started_at = int(time.time())
It looks like the core_manager_started_at attribute is unnecessary and can be removed.
You're right, it is no longer used. I have removed it now.
src/tribler/gui/core_manager.py
Outdated
    def shutdown_request_processed(self, response):
        self._logger.info(f"{SHUTDOWN_ENDPOINT} request was processed by Core. Response: {response}")

    def send_shutdown_request(self, initial=False):
The initial flag and the shutdown request re-sending logic are obsolete and can be deleted. We no longer re-send the shutdown request; instead, we set the request.cancellable = False flag for the request to avoid accidental cancellation of it. It seems I forgot to remove it in #7205.
Removed the initial flag.
src/tribler/gui/core_manager.py
Outdated
        self.old_core_process_pid = self.core_process.pid()
        self.should_quit_app_on_core_finished = False
        self.shutting_down = False
It looks incorrect to flip the shutting_down flag back and forth. If some code has already set it to True, that should be the final decision. The same applies to self.events_manager.shutting_down = False a few lines further.
Both the self.shutting_down and events_manager.shutting_down flags are set to True only in the CoreManager.stop() method, so it looks logical that after this method is called, no attempts to restart the core should be performed.
I'd rather say it should be the other way around: if the shutting_down flag is set to True, the restart_core method should do nothing.
I have removed this part of the code. I agree with shutting_down being the final call; however, I've opted not to do the refactoring as part of this PR. That could be a separate PR dedicated to simplifying the states in the CoreManager.
src/tribler/gui/core_manager.py
Outdated
@@ -158,7 +189,7 @@ def check_core_api_port(self, *args):
            return

        core_process = self.process_manager.current_process.get_core_process()
-       if core_process is not None and core_process.api_port:
+       if core_process is not None and core_process.api_port and core_process.pid != self.old_core_process_pid:
This new core_process.pid != self.old_core_process_pid check looks suspicious to me. Can we 100% guarantee the PID of the new process can't be the same as the old process's PID in all exotic environments like Flatpak?
It looks like this additional check is unnecessary: we call the check_core_api_port event only after successfully connecting to the Core.
I think it is better to remove this new check.
I have removed this check. It was mostly for Flatpak, but it is now gone.
src/tribler/gui/error_handler.py
Outdated
@@ -139,3 +150,10 @@ def _stop_tribler(self, text):

        if self.tribler_window.debug_window:
            self.tribler_window.debug_window.setHidden(True)

    def _restart_tribler(self, text):
We do not restart full Tribler here; only Core is restarted. I think we should rename the method to _restart_core or _restart_tribler_core.
Renamed to _restart_tribler_core
src/tribler/gui/core_manager.py
Outdated
@@ -303,6 +373,9 @@ def on_core_finished(self, exit_code, exit_status):
        if self.shutting_down:
            if self.should_quit_app_on_core_finished:
                self.app_manager.quit_application()
+       elif self.is_restarting:
+           self.is_restarting = False
The is_restarting flag is set to True at the beginning of the start_tribler_core method, so setting it here looks unnecessary.
This part of the code was to check whether this "finished" event was part of a restart. To make it more explicit, I have defined another variable, wait_for_finished_to_restart_core.
        self.send_exception(reported_error)
    else:
        if reported_error.should_stop:
            self.exception_handler.save_to_file(reported_error)
The handling of the should_stop flag is spread across several files and is therefore not obvious. It may be better to handle it directly right here:
- self.exception_handler.save_to_file(reported_error)
+ self.exception_handler.save_to_file(reported_error.copy(should_stop=False))
Good point! Used serialized_copy()
Force-pushed from d37ca35 to aeb230d
Force-pushed from aeb230d to 84045ef
@kozlovsky I have addressed your review comments and would appreciate your re-review.
Add more tests
Fix issue with tmp directory clean-up in tests
Fix codacy issues
This PR addresses
Here is an example of a Sentry report that was sent:
The following additional entry is added to the report: