New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't cause system to hang on halt/reboot when net-booting #1776

Merged
merged 2 commits into from May 3, 2018

Conversation

Projects
None yet
6 participants
@chrisnovakovic
Contributor

chrisnovakovic commented Jul 11, 2017

I ran into another problem while net-booting LibreELEC: with /storage mounted via NFS, halting or rebooting causes the system to hang. This is due to ConnMan exiting (and therefore bringing down the network) while processes still have open handles to files under /storage (including ConnMan itself: this can be verified with lsof | grep /storage, which shows that connmand has an open state file at /storage/.cache/connman/*/data).

This can be solved by lazy-unmounting /storage when LibreELEC is being net-booted (along with /flash, if that's also mounted remotely, since there's no reason not to) before ConnMan is stopped. This means that ConnMan doesn't have a chance to update its state file before it exits, but from a brief look at the ConnMan source code, it appears not to do that anyway.

This PR doesn't address the follow-up problem of /flash and /storage having disappeared after the system resumes from standby. I think one way to fix that would be to create two new systemd targets, one for /flash being mounted and one for /storage being mounted, and do the net-boot mounting operations currently performed in the initramfs init script in systemd unit files after switching to the new root filesystem - that would require substantial work from someone who understands both LibreELEC and systemd better than I do, I'm afraid, but at least this PR fixes halting and rebooting.

Tested and working fine on a RPi3 using the RPi2 build.

@MilhouseVH

This comment has been minimized.

Show comment
Hide comment
@MilhouseVH

MilhouseVH Jul 11, 2017

Contributor

Many thanks for this - yes this has been a known issue for a while. For instance, see this forum thread

The first two commits look OK, but I'll have to take your word on the lazy umount. I can't easily test this myself at the moment, but can include this in RPi/RPi2/Generic test builds (from Tuesday night, #711) if nothing else than to ensure there are no unwanted side effects.

Would be nice to sort out resume from standby too! :)

Contributor

MilhouseVH commented Jul 11, 2017

Many thanks for this - yes this has been a known issue for a while. For instance, see this forum thread

The first two commits look OK, but I'll have to take your word on the lazy umount. I can't easily test this myself at the moment, but can include this in RPi/RPi2/Generic test builds (from Tuesday night, #711) if nothing else than to ensure there are no unwanted side effects.

Would be nice to sort out resume from standby too! :)

@MilhouseVH

This comment has been minimized.

Show comment
Hide comment
@MilhouseVH

MilhouseVH Jul 11, 2017

Contributor

As it's kind of related, if you haven't already you might want to review the hack I added in #546 which prevents systemd from unmounting /flash and /storage, just in case it's relevant of interest.

Contributor

MilhouseVH commented Jul 11, 2017

As it's kind of related, if you haven't already you might want to review the hack I added in #546 which prevents systemd from unmounting /flash and /storage, just in case it's relevant of interest.

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic Jul 11, 2017

Contributor

Yeah, the lazy-unmount is ugly: it relies on the assumption that all other processes using /storage have stopped by the time ConnMan stops (which I think ought to be true, given my understanding of LibreELEC's systemd unit files). More testing of this would definitely be a good thing. Making systemd aware of the role of /flash and /storage and unmounting them cleanly after there are no more open handles would be the proper way to do it, and would also solve the resume-from-standby problem at the same time - I'll see if I can take a look at it, but no promises :)

Contributor

chrisnovakovic commented Jul 11, 2017

Yeah, the lazy-unmount is ugly: it relies on the assumption that all other processes using /storage have stopped by the time ConnMan stops (which I think ought to be true, given my understanding of LibreELEC's systemd unit files). More testing of this would definitely be a good thing. Making systemd aware of the role of /flash and /storage and unmounting them cleanly after there are no more open handles would be the proper way to do it, and would also solve the resume-from-standby problem at the same time - I'll see if I can take a look at it, but no promises :)

@lrusak

This comment has been minimized.

Show comment
Hide comment
@lrusak

lrusak Jul 15, 2017

Member

Thanks for the contribution but I don't really like this. A proper solution should be found. It most likely needs some systemd reworking.

Member

lrusak commented Jul 15, 2017

Thanks for the contribution but I don't really like this. A proper solution should be found. It most likely needs some systemd reworking.

@chewitt

This comment has been minimized.

Show comment
Hide comment
@chewitt

chewitt Sep 13, 2017

Member

@chrisnovakovic have you done any further investigation? - systemd is likely the problem and the solution here. Thanks.

Member

chewitt commented Sep 13, 2017

@chrisnovakovic have you done any further investigation? - systemd is likely the problem and the solution here. Thanks.

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic Feb 3, 2018

Contributor

I've found some time to look into doing this in a way that doesn't rely on lazy-unmounting /storage in the ConnMan systemd unit file.

The good news is that shutting down and rebooting can be made to work by eliminating the file descriptors that ConnMan keeps open for its interface stats and history files, which end up getting written to /storage/.cache/connman/*/{data,history}: those files aren't particularly useful (LibreELEC doesn't even ship the tools necessary to parse them), so a small patch to ConnMan to disable the generation of those files is enough to break that cyclic dependency and allow the shutdown/reboot to complete.

The bad news is that suspending still doesn't work, even with ConnMan patched, because other processes that are stopped after ConnMan also have open file descriptors on /storage. This varies by build target - for instance, on PROJECT=Generic running in a VMware VM:

libreelec-pxe:~ # lsof | grep /storage
185     /usr/lib/systemd/systemd-journald       /storage/log/journal/12bc3734adcf2e4f272964cfd17f72c1/system@0a41cdd7006c4af98168f77016281c99-0000000000000001-00056444267ac87c.journal
185     /usr/lib/systemd/systemd-journald       /storage/log/journal/12bc3734adcf2e4f272964cfd17f72c1/system.journal
234     /usr/bin/vmtoolsd       /storage/log/vmware-vmsvc.log
295     /usr/bin/Xorg   /storage/log/Xorg.0.log
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/temp/kodi.log
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/userdata/Database/Addons27.db
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/userdata/Database/Textures13.db

Kodi's not a problem here (it's stopped before ConnMan when the host is suspended) and I don't think X is either, but the others might be. The fact that systemd itself writes its journal to /storage makes fixing this a massive (and perhaps intractable) undertaking, and I'd recommend not letting the perfect be the enemy of the good: as it stands, this PR at least fixes PXE-based shutdowns and reboots, and might help someone who understands systemd better than I do to come along later and figure out how to fix the rest.

Contributor

chrisnovakovic commented Feb 3, 2018

I've found some time to look into doing this in a way that doesn't rely on lazy-unmounting /storage in the ConnMan systemd unit file.

The good news is that shutting down and rebooting can be made to work by eliminating the file descriptors that ConnMan keeps open for its interface stats and history files, which end up getting written to /storage/.cache/connman/*/{data,history}: those files aren't particularly useful (LibreELEC doesn't even ship the tools necessary to parse them), so a small patch to ConnMan to disable the generation of those files is enough to break that cyclic dependency and allow the shutdown/reboot to complete.

The bad news is that suspending still doesn't work, even with ConnMan patched, because other processes that are stopped after ConnMan also have open file descriptors on /storage. This varies by build target - for instance, on PROJECT=Generic running in a VMware VM:

libreelec-pxe:~ # lsof | grep /storage
185     /usr/lib/systemd/systemd-journald       /storage/log/journal/12bc3734adcf2e4f272964cfd17f72c1/system@0a41cdd7006c4af98168f77016281c99-0000000000000001-00056444267ac87c.journal
185     /usr/lib/systemd/systemd-journald       /storage/log/journal/12bc3734adcf2e4f272964cfd17f72c1/system.journal
234     /usr/bin/vmtoolsd       /storage/log/vmware-vmsvc.log
295     /usr/bin/Xorg   /storage/log/Xorg.0.log
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/temp/kodi.log
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/userdata/Database/Addons27.db
322     /usr/lib/kodi/kodi.bin  /storage/.kodi/userdata/Database/Textures13.db

Kodi's not a problem here (it's stopped before ConnMan when the host is suspended) and I don't think X is either, but the others might be. The fact that systemd itself writes its journal to /storage makes fixing this a massive (and perhaps intractable) undertaking, and I'd recommend not letting the perfect be the enemy of the good: as it stands, this PR at least fixes PXE-based shutdowns and reboots, and might help someone who understands systemd better than I do to come along later and figure out how to fix the rest.

@chewitt

This comment has been minimized.

Show comment
Hide comment
@chewitt

chewitt Feb 10, 2018

Member

@chrisnovakovic could you submit a description of the problem and your patch to the connman mailing list? .. several folks there are somewhat familiar with LE and should be receptive to the addition; worst case they might refuse your patch but come up with their own alternative. Our preference is not to add patches unless we have visibility on being able to drop them in a future package update, i.e. connman version bump. Nice investigation work btw :)

Member

chewitt commented Feb 10, 2018

@chrisnovakovic could you submit a description of the problem and your patch to the connman mailing list? .. several folks there are somewhat familiar with LE and should be receptive to the addition; worst case they might refuse your patch but come up with their own alternative. Our preference is not to add patches unless we have visibility on being able to drop them in a future package update, i.e. connman version bump. Nice investigation work btw :)

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic Feb 13, 2018

Contributor

@chewitt Actually I originally found that patch on the ConnMan mailing list, but there didn't seem to be any appetite for merging it. I'll post it again and emphasise the benefits of being able to choose not to repeatedly write a file that's not particularly useful to flash-backed storage :)

Contributor

chrisnovakovic commented Feb 13, 2018

@chewitt Actually I originally found that patch on the ConnMan mailing list, but there didn't seem to be any appetite for merging it. I'll post it again and emphasise the benefits of being able to choose not to repeatedly write a file that's not particularly useful to flash-backed storage :)

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic Feb 13, 2018

Contributor

@rudy1981 Can you confirm that this patch worked for you? It worked for me in a VMware VM and on an RPi3, but wider testing would be a good thing, especially from those who have more services running in LibreELEC (and therefore a higher probability of having processes running that have open file descriptors on /storage) than I do.

Contributor

chrisnovakovic commented Feb 13, 2018

@rudy1981 Can you confirm that this patch worked for you? It worked for me in a VMware VM and on an RPi3, but wider testing would be a good thing, especially from those who have more services running in LibreELEC (and therefore a higher probability of having processes running that have open file descriptors on /storage) than I do.

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic
Contributor

chrisnovakovic commented Feb 16, 2018

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic Mar 8, 2018

Contributor

Patch merged upstream. Slightly different to the one here, but the semantics are the same: --disable-stats at compile time to disable stats/history files. connman-01-configure-stats-option.patch can be deleted when ConnMan 1.36 is released.

Contributor

chrisnovakovic commented Mar 8, 2018

Patch merged upstream. Slightly different to the one here, but the semantics are the same: --disable-stats at compile time to disable stats/history files. connman-01-configure-stats-option.patch can be deleted when ConnMan 1.36 is released.

@chewitt

This comment has been minimized.

Show comment
Hide comment
@chewitt

chewitt Mar 10, 2018

Member

@chrisnovakovic thanks for the update, i'll go nag wagi for a connman release :)

Member

chewitt commented Mar 10, 2018

@chrisnovakovic thanks for the update, i'll go nag wagi for a connman release :)

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic Apr 19, 2018

Contributor

While we're waiting for ConnMan 1.36, I've updated the second commit (my cherry-picked upstream patch doesn't apply cleanly against ConnMan 1.35). Tested on RPi3.

It should be safe to merge this into master now, if you want to: whenever you update to 1.36, just delete packages/network/connman/patches/connman-01-build-add-disable-stats-option.patch. Minor changes are needed if you want to backport it to 8.2, but that also works fine (again, tested on RPi3). I can open a separate PR for that if necessary.

Contributor

chrisnovakovic commented Apr 19, 2018

While we're waiting for ConnMan 1.36, I've updated the second commit (my cherry-picked upstream patch doesn't apply cleanly against ConnMan 1.35). Tested on RPi3.

It should be safe to merge this into master now, if you want to: whenever you update to 1.36, just delete packages/network/connman/patches/connman-01-build-add-disable-stats-option.patch. Minor changes are needed if you want to backport it to 8.2, but that also works fine (again, tested on RPi3). I can open a separate PR for that if necessary.

@chewitt

This comment has been minimized.

Show comment
Hide comment
@chewitt

chewitt Apr 20, 2018

Member

@chrisnovakovic it's not necessary to comment the package.mk - can you drop that and we're happy to merge with the patch - we'll spot the connection when testing the next connman release. NB: I've tried to encourage connman devs to roll something out recently as there are some NTP fixes we'd like to see in addition to this, but nothing has been forthcoming.

Member

chewitt commented Apr 20, 2018

@chrisnovakovic it's not necessary to comment the package.mk - can you drop that and we're happy to merge with the patch - we'll spot the connection when testing the next connman release. NB: I've tried to encourage connman devs to roll something out recently as there are some NTP fixes we'd like to see in addition to this, but nothing has been forthcoming.

@chewitt

This comment has been minimized.

Show comment
Hide comment
@chewitt
Member

chewitt commented Apr 26, 2018

@chrisnovakovic ping! ^^

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic May 3, 2018

Contributor

Sorry, Christian - I'll get this done today if I can.

Contributor

chrisnovakovic commented May 3, 2018

Sorry, Christian - I'll get this done today if I can.

chrisnovakovic added some commits Jul 10, 2017

initramfs: write /dev/.flash_netboot if /flash is a remote filesystem
The init script currently touches a file at /dev/.storage_netboot if
/storage is a remote filesystem, so that scripts that run after the root
filesystem has been switched can behave differently depending on whether
/storage is mounted locally or remotely. Add similar functionality for
/flash by touching /dev/.flash_netboot if it is a remote filesystem.
ConnMan: disable interface stats and history files
ConnMan writes stats and history files for each configured interface to
/storage/.cache/connman/*/{data,history}. These files remain open while
ConnMan is running, and prevent the system from halting or rebooting
when /storage is an NFS mount (because ConnMan brings down the interface
through which the NFS mount is accessed and then tries to update the
stats and/or history file for that interface, but the file descriptors
are no longer valid, so the system hangs).

The stats and history files are superfluous, especially since the means
of viewing them isn't included in LibreELEC (the stats tool is missing
because ConnMan is compiled with --disable-tools), so there's no harm in
not generating them on systems that don't mount /storage over NFS
either; in fact, it benefits LibreELEC installations where /storage is
mounted on a flash device by reducing unnecessary flash writes.
@chewitt

This comment has been minimized.

Show comment
Hide comment
@chewitt

chewitt May 3, 2018

Member

@chrisnovakovic Looks good. All done?

Member

chewitt commented May 3, 2018

@chrisnovakovic Looks good. All done?

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic May 3, 2018

Contributor

Yep, this is good to go on master. I can do a separate PR against libreelec-8.2 later (I've actually been using this on 8.2 rather than master since February, and everything's been fine, so it should be a safe backport).

Contributor

chrisnovakovic commented May 3, 2018

Yep, this is good to go on master. I can do a separate PR against libreelec-8.2 later (I've actually been using this on 8.2 rather than master since February, and everything's been fine, so it should be a safe backport).

@chrisnovakovic

This comment has been minimized.

Show comment
Hide comment
@chrisnovakovic

chrisnovakovic May 3, 2018

Contributor

Actually, it seems that with the latest modifications, this applies cleanly against the libreelec-8.2 branch, so there's no need for a separate PR after all.

Contributor

chrisnovakovic commented May 3, 2018

Actually, it seems that with the latest modifications, this applies cleanly against the libreelec-8.2 branch, so there's no need for a separate PR after all.

@chewitt

This comment has been minimized.

Show comment
Hide comment
@chewitt

chewitt May 3, 2018

Member

There are no current plans for more 8.2 releases, we're overdue on 9.0 already. Thanks!

Member

chewitt commented May 3, 2018

There are no current plans for more 8.2 releases, we're overdue on 9.0 already. Thanks!

@chewitt

chewitt approved these changes May 3, 2018

@chewitt chewitt merged commit a4ef7a8 into LibreELEC:master May 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment