-
Notifications
You must be signed in to change notification settings - Fork 174
Failure handling #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failure handling #72
Conversation
|
Hm, still discussing this on the mailing list. I think the patches are an improvement over status quo, but you might want to wait a bit to see if the semantics change. |
|
Both patches seem sane to me, but please keep me in the loop on the mailing list discussion. FWIW, fwupd also uses the offline update mechanism too. |
|
On Mon, Jul 27, 2015 at 08:55:23AM -0700, Richard Hughes wrote:
There hasn't been much discussion on systemd-devel yet. @poettering
It most likely only works as a race condition then :) |
|
@keszybz Is this still something that needs to be dealt with? |
|
Yeah, those two patches seem still relevant. |
|
@keszybz Could you rebase the commits against master? And what happened with the mailing list discussion? |
|
@keszybz Do you have a link to the mailing list discussion that I could read? |
cae5c4e to
5690867
Compare
|
Rebased and updated with a commit that should fix https://bugzilla.redhat.com/show_bug.cgi?id=1430920. Please have a look and test.
There were a few discussions. Anyway, the "official" recommendations are all gathered in https://www.freedesktop.org/software/systemd/man/systemd.offline-updates.html. If anything is missing, issues/PRs against that page in the systemd bug tracker will be the most effective. |
|
Awesome, thanks for updating these, Zbigniew! I'll go and apply the 3rd patch downstream in Fedora to get the blocker fix in. I'm a bit unsure about the first patch, "pk-offline-update: only try to shutdown or reboot on success". If the user explicitly wanted to shut down how is it a good idea to then reboot on failure? I think I'd prefer to shut down in that case as well. Let's say a user has hacked the whole day on systemd at home and her eyes are bleeding and she wants to go to bed. She unplugs the computer and goes to turn it off. When doing that she notices that GNOME Shell offers updates as well. She chooses "Install pending software updates" and clicks Power Off and walks away. Next morning, she puts the computer in her bag, knowing it's fully updated and nearly fully charged as well, and heads off to give a presentation. When there, the computer doesn't turn on ... and she realizes that it must have rebooted instead of shutting down and it was actually powered on all nighty and the battery is fully drained now. I'd like to avoid that if possible :) |
True, if you requested a shutdown, you probably want a shutdown even if the upgrade fails. Existing code was subject to a race: a reboot or a shutdown was scheduled, but the unit has OnFailure=reboot.target, so it might be possible for the shutdown to be overridden by a reboot anyway. So it seems the right solution is to issue a big warning on failure, and schedule a shutdown or reboot, and exit with success, to avoid the OnFailure handling. If this approach is acceptable, I'll rework the patch. |
|
That sounds great to me! Thanks for working on this. |
The unit file contains OnFailure=reboot.target, so if we return non-zero for any reason, a reboot will be scheduled. This creates a race condition where our request to shutdown might be overridden by the request to reboot. To avoid the issue, always return "success" after we have scheduled the shutdown or reboot ourselves. However, if we fail to we fail to communicate with systemd, return error so that systemd's failure handling is triggered. Should not happen to often, but this makes things marginally more robust, since a reboot is better than a hang. Double request example: Jul 20 21:55:00 fedora22 pk-offline-update[657]: failed to update system: cannot download Packages/g/gstreamer1-1.4.5-1.fc2 ... Jul 20 21:55:16 fedora22 systemd[1]: Trying to enqueue job reboot.target/start/replace Jul 20 21:55:16 fedora22 systemd[1]: Enqueued job reboot.target/start as 658 Jul 20 21:55:16 fedora22 systemd[1]: packagekit-offline-update.service failed. ... Jul 20 21:55:11 fedora22 systemd[1]: packagekit-offline-update.service: main process exited, code=exited, status=1/FAILURE Jul 20 21:55:11 fedora22 systemd[1]: packagekit-offline-update.service changed running -> failed Jul 20 21:55:11 fedora22 systemd[1]: Unit packagekit-offline-update.service entered failed state. Jul 20 21:55:11 fedora22 systemd[1]: Triggering OnFailure= dependencies of packagekit-offline-update.service. Jul 20 21:55:16 fedora22 systemd[1]: Job system-update.target/stop finished, result=done
- Change description to "Update ...", because systemd uses the Description when starting and stopping units, and "Starting Update the OS…" better than "Starting Updates the OS…". - Add After=dbus.socket. a54c5ae added Requires=, but Requires= without After= is useless for socket units. - Add DefaultDependencies=no to drop the dependency on basic.target, and add Requires=sysinit.target, After=sysinit.target systemd-journald.socket, Before=shutdown.target system-update.target to re-add the dependencies that are necessary. Because of the dependency on dbus.socket, which starts dbus.service, we still have a dependency on basic.target. (As soon as a connection to the socket is made, dbus.service will be started, and since dbus.service has DefaultDependencies=yes, we'll wait for basic.target to be reached.) But we might be started without our symlink being in place and exit without doing anything. In that case basic.target might not be needed. This allows other upgrade mechanism which do not require dbus to proceed. - Change OnFailure to FailureAction — more modern and nicer syntax, and the target is started irreplaceably, which should be more robust. - Change to Type=oneshot. The default is Type=simple, which means that systemd considers the unit started immediately, which means that it races with the system-update-cleanup.service which is started after system-update.target is reached. https://bugzilla.redhat.com/show_bug.cgi?id=1430920
5690867 to
1afd636
Compare
|
Updated. I squashed the first two commits together, they make more sense as one step now. |
While testing offline updates I noticed that reboot was scheduled twice when the packagekit offline update failed (see patch one for logs). Two fixes to make sure that either failure is returned or a reboot/shutdown in scheduled.