Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't cause system to hang on halt/reboot when net-booting #1776

Merged
merged 2 commits into from May 3, 2018

Conversation

chrisnovakovic
Copy link
Contributor

I ran into another problem while net-booting LibreELEC: with /storage mounted via NFS, halting or rebooting causes the system to hang. This is due to ConnMan exiting (and therefore bringing down the network) while processes still have open handles to files under /storage (including ConnMan itself: this can be verified with lsof | grep /storage, which shows that connmand has an open state file at /storage/.cache/connman/*/data).

This can be solved by lazy-unmounting /storage when LibreELEC is being net-booted (along with /flash, if that's also mounted remotely, since there's no reason not to) before ConnMan is stopped. This means that ConnMan doesn't have a chance to update its state file before it exits, but from a brief look at the ConnMan source code, it appears not to do that anyway.

This PR doesn't address the follow-up problem of /flash and /storage having disappeared after the system resumes from standby. I think one way to fix that would be to create two new systemd targets, one for /flash being mounted and one for /storage being mounted, and do the net-boot mounting operations currently performed in the initramfs init script in systemd unit files after switching to the new root filesystem - that would require substantial work from someone who understands both LibreELEC and systemd better than I do, I'm afraid, but at least this PR fixes halting and rebooting.

Tested and working fine on a RPi3 using the RPi2 build.

@MilhouseVH
Copy link
Contributor

Many thanks for this - yes this has been a known issue for a while. For instance, see this forum thread

The first two commits look OK, but I'll have to take your word on the lazy umount. I can't easily test this myself at the moment, but can include this in RPi/RPi2/Generic test builds (from Tuesday night, #711) if nothing else than to ensure there are no unwanted side effects.

Would be nice to sort out resume from standby too! :)

@MilhouseVH
Copy link
Contributor

MilhouseVH commented Jul 11, 2017

As it's kind of related, if you haven't already you might want to review the hack I added in #546 which prevents systemd from unmounting /flash and /storage, just in case it's relevant of interest.

@chrisnovakovic
Copy link
Contributor Author

Yeah, the lazy-unmount is ugly: it relies on the assumption that all other processes using /storage have stopped by the time ConnMan stops (which I think ought to be true, given my understanding of LibreELEC's systemd unit files). More testing of this would definitely be a good thing. Making systemd aware of the role of /flash and /storage and unmounting them cleanly after there are no more open handles would be the proper way to do it, and would also solve the resume-from-standby problem at the same time - I'll see if I can take a look at it, but no promises :)

@lrusak
Copy link
Member

lrusak commented Jul 15, 2017

Thanks for the contribution but I don't really like this. A proper solution should be found. It most likely needs some systemd reworking.

@chewitt
Copy link
Member

chewitt commented Sep 13, 2017

@chrisnovakovic have you done any further investigation? - systemd is likely the problem and the solution here. Thanks.

@chrisnovakovic chrisnovakovic force-pushed the nfs-hang-shutdown branch 2 times, most recently from b9f12fc to 77ca9ed Compare February 3, 2018 01:02
@chrisnovakovic
Copy link
Contributor Author

I've found some time to look into doing this in a way that doesn't rely on lazy-unmounting /storage in the ConnMan systemd unit file.

The good news is that shutting down and rebooting can be made to work by eliminating the file descriptors that ConnMan keeps open for its interface stats and history files, which end up getting written to /storage/.cache/connman/*/{data,history}: those files aren't particularly useful (LibreELEC doesn't even ship the tools necessary to parse them), so a small patch to ConnMan to disable the generation of those files is enough to break that cyclic dependency and allow the shutdown/reboot to complete.

The bad news is that suspending still doesn't work, even with ConnMan patched, because other processes that are stopped after ConnMan also have open file descriptors on /storage. This varies by build target - for instance, on PROJECT=Generic running in a VMware VM:

libreelec-pxe:~ # lsof | grep /storage
185     /usr/lib/systemd/systemd-journald       /storage/log/journal/12bc3734adcf2e4f272964cfd17f72c1/system@0a41cdd7006c4af98168f77016281c99-0000000000000001-00056444267ac87c.journal
185     /usr/lib/systemd/systemd-journald       /storage/log/journal/12bc3734adcf2e4f272964cfd17f72c1/system.journal
234     /usr/bin/vmtoolsd       /storage/log/vmware-vmsvc.log
295     /usr/bin/Xorg   /storage/log/Xorg.0.log
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/temp/kodi.log
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/userdata/Database/Addons27.db
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/userdata/Database/Textures13.db

Kodi's not a problem here (it's stopped before ConnMan when the host is suspended) and I don't think X is either, but the others might be. The fact that systemd itself writes its journal to /storage makes fixing this a massive (and perhaps intractable) undertaking, and I'd recommend not letting the perfect be the enemy of the good: as it stands, this PR at least fixes PXE-based shutdowns and reboots, and might help someone who understands systemd better than I do to come along later and figure out how to fix the rest.

@chewitt
Copy link
Member

chewitt commented Feb 10, 2018

@chrisnovakovic could you submit a description of the problem and your patch to the connman mailing list? .. several folks there are somewhat familiar with LE and should be receptive to the addition; worst case they might refuse your patch but come up with their own alternative. Our preference is not to add patches unless we have visibility on being able to drop them in a future package update, i.e. connman version bump. Nice investigation work btw :)

@chrisnovakovic
Copy link
Contributor Author

@chewitt Actually I originally found that patch on the ConnMan mailing list, but there didn't seem to be any appetite for merging it. I'll post it again and emphasise the benefits of being able to choose not to repeatedly write a file that's not particularly useful to flash-backed storage :)

@chrisnovakovic
Copy link
Contributor Author

@rudy1981 Can you confirm that this patch worked for you? It worked for me in a VMware VM and on an RPi3, but wider testing would be a good thing, especially from those who have more services running in LibreELEC (and therefore a higher probability of having processes running that have open file descriptors on /storage) than I do.

@chrisnovakovic
Copy link
Contributor Author

@chrisnovakovic
Copy link
Contributor Author

Patch merged upstream. Slightly different to the one here, but the semantics are the same: --disable-stats at compile time to disable stats/history files. connman-01-configure-stats-option.patch can be deleted when ConnMan 1.36 is released.

@chewitt
Copy link
Member

chewitt commented Mar 10, 2018

@chrisnovakovic thanks for the update, i'll go nag wagi for a connman release :)

@chrisnovakovic
Copy link
Contributor Author

While we're waiting for ConnMan 1.36, I've updated the second commit (my cherry-picked upstream patch doesn't apply cleanly against ConnMan 1.35). Tested on RPi3.

It should be safe to merge this into master now, if you want to: whenever you update to 1.36, just delete packages/network/connman/patches/connman-01-build-add-disable-stats-option.patch. Minor changes are needed if you want to backport it to 8.2, but that also works fine (again, tested on RPi3). I can open a separate PR for that if necessary.

@chewitt
Copy link
Member

chewitt commented Apr 20, 2018

@chrisnovakovic it's not necessary to comment the package.mk - can you drop that and we're happy to merge with the patch - we'll spot the connection when testing the next connman release. NB: I've tried to encourage connman devs to roll something out recently as there are some NTP fixes we'd like to see in addition to this, but nothing has been forthcoming.

@chewitt
Copy link
Member

chewitt commented Apr 26, 2018

@chrisnovakovic ping! ^^

@chrisnovakovic
Copy link
Contributor Author

Sorry, Christian - I'll get this done today if I can.

The init script currently touches a file at /dev/.storage_netboot if
/storage is a remote filesystem, so that scripts that run after the root
filesystem has been switched can behave differently depending on whether
/storage is mounted locally or remotely. Add similar functionality for
/flash by touching /dev/.flash_netboot if it is a remote filesystem.
ConnMan writes stats and history files for each configured interface to
/storage/.cache/connman/*/{data,history}. These files remain open while
ConnMan is running, and prevent the system from halting or rebooting
when /storage is an NFS mount (because ConnMan brings down the interface
through which the NFS mount is accessed and then tries to update the
stats and/or history file for that interface, but the file descriptors
are no longer valid, so the system hangs).

The stats and history files are superfluous, especially since the means
of viewing them isn't included in LibreELEC (the stats tool is missing
because ConnMan is compiled with --disable-tools), so there's no harm in
not generating them on systems that don't mount /storage over NFS
either; in fact, it benefits LibreELEC installations where /storage is
mounted on a flash device by reducing unnecessary flash writes.
@chewitt
Copy link
Member

chewitt commented May 3, 2018

@chrisnovakovic Looks good. All done?

@chrisnovakovic
Copy link
Contributor Author

chrisnovakovic commented May 3, 2018

Yep, this is good to go on master. I can do a separate PR against libreelec-8.2 later (I've actually been using this on 8.2 rather than master since February, and everything's been fine, so it should be a safe backport).

@chrisnovakovic
Copy link
Contributor Author

Actually, it seems that with the latest modifications, this applies cleanly against the libreelec-8.2 branch, so there's no need for a separate PR after all.

@chewitt
Copy link
Member

chewitt commented May 3, 2018

There are no current plans for more 8.2 releases, we're overdue on 9.0 already. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants