Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

busybox fsck breaks NixOS's fsck handling logic #40174

Closed
nh2 opened this issue May 8, 2018 · 9 comments
Closed

busybox fsck breaks NixOS's fsck handling logic #40174

nh2 opened this issue May 8, 2018 · 9 comments

Comments

@nh2
Copy link
Contributor

nh2 commented May 8, 2018

This code in stage-1-init.sh runs fsck $fsckFlags "$device":

fsck $fsckFlags "$device"
fsckResult=$?
if test $(($fsckResult | 2)) = $fsckResult; then
echo "fsck finished, rebooting..."
sleep 3
reboot -f
fi
if test $(($fsckResult | 4)) = $fsckResult; then
echo "$device has unrepaired errors, please fix them manually."
fail
fi
if test $fsckResult -ge 8; then
echo "fsck on $device failed."
fail
fi

This fsck seems to be the generic wrapper that calls the corresponding fsck for the corresponding filesystem type, like fsck.ext4.

Note in the above code how NixOS has (correct) code to handle the exit code, e.g. when the bits of decimal 2 or 4 are set in the exit code (see man fsck).

But the generic fsck wrapper seems to not correctly pass through the exit code.

On my server I get this boot failure:

Checking /dev/disk/by-label/root...
fsck (busybox 1.27.2)
[fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/disk/by-label/root
root: Superblock hint for external superblock should be 0x97f.  FIXED.
root: 73146/122093568 files (0.5% non-contiguous), 8267968/488345344 blocks
Warning: fsck.ext4 /dev/disk/by-label/root terminated by signal 1
fsck on /dev/disk/by-label/root failed.

An error occurred in stage 1 of the boot process, which must mount the
root filesystem on '/mnt-root' and then start stage 2. Press one
of the following keys:

  r) to reboot immediately
  *) to ignore the error and continue

Here fsck.ext4 found a fixable problem and fixed it, then exiting with exit code 1 as expected.
But the generic fsck wrapper tuns that into exit code 8.

As a result, in stage-1-init.sh the wrong if-branch is taken,
and my server is stuck waiting for a key to be pressed instead of booting through.

@nh2
Copy link
Contributor Author

nh2 commented May 8, 2018

A workaround is to replace the line

fsck $fsckFlags "$device" 

by

fsck.ext4 $fsckFlags "$device" 

to directly invoke the fsck for your file system type (in my case ext4) without the generic fsck wrapper.


But we should really find out what the fsck wrapper is doing wrong. What code is running here, is this wrapper from busybox as suggested by the error message?

@dtzWill
Copy link
Member

dtzWill commented May 8, 2018

FWIW I had issues with busybox utils in stage1 that were "resolved" by using utillinux instead for things like mount, etc. This was w/musl (and shouldn't happen, busybox+musl should work great!) but it would be interesting to see if the same workaround worked here as well.

Is there a NixOS test we can put together to explore this?
Also it says "terminated by signal 1" which seems curious-- it often means exit code lower bits will be 1 but doesn't seem normal behavior. Who's sending SIGHUP to fsck?

Also, FWIW:
https://git.busybox.net/busybox/tree/e2fsprogs/fsck.c?id=72089cf6b4a77214ec4fd21d5ee5bf56958781cb#n70

Which seems to suggest '8' means fsck exit error.

@nh2
Copy link
Contributor Author

nh2 commented May 8, 2018

Looks like it's a bug in busybox.

I have found the source code in busybox that emits the message

Warning: fsck.ext4 /dev/disk/by-label/root terminated by signal 1

It is https://git.busybox.net/busybox/tree/e2fsprogs/fsck.c?h=1_28_3#n456

 child_died:

	status = WEXITSTATUS(status);
	if (WIFSIGNALED(status)) {
		sig = WTERMSIG(status);
		status = EXIT_UNCORRECTED;
		if (sig != SIGINT) {
			printf("Warning: %s %s terminated "
				"by signal %d\n",
				inst->prog, inst->device, sig);
			status = EXIT_ERROR;
		}
	}

Luckily it is easy to distinguish this from the upstream, non-busybox e2progs fsck wrapper, because that one has the following source code instead: https://github.com/tytso/e2fsprogs/blob/d5bd126cd6a06dd2a8652557776ea298fb442912/misc/fsck.c#L619-L620

The message is different (with signal vs by signal).

Suspiciously, this code is handles the case where fsck was terminated by a signal (WIFSIGNALED), but in my case, fsck.ext4 simply exits with code 1, without any signal being involved. So I suspect that somehow the wrong code path is taken.


I have not confirmed whether the upstream fsck wrapper, too, suffers from the bug of not forwarding the exit code of fsck.ext4 correctly. And I cannot easily try it, because I can't run it from the busybox shell.

But I am reasonably sure that this issue only exists in busybox's fsck wrapper, and that it was introduced in this commit:

https://git.busybox.net/busybox/commit/?id=c4fb8c6ad52e8007c6fa07e40f043bb2e0c043d1

@@ -439,9 +446,8 @@ static int wait_one(int flags)
 	}
  child_died:
 
-	if (WIFEXITED(status))
-		status = WEXITSTATUS(status);
-	else if (WIFSIGNALED(status)) {
+	status = WEXITSTATUS(status);
+	if (WIFSIGNALED(status)) {
 		sig = WTERMSIG(status);
 		status = EXIT_UNCORRECTED;
 		if (sig != SIGINT) {

This change busybox changed the semantics of the original fsck code.

You can only reasonably call WEXITSTATUS() on status if you checked WIFEXITED() before, because as documented in waitstatus.h:

/* If WIFEXITED(STATUS), the low-order 8 bits of the status.  */
#define	__WEXITSTATUS(status)	(((status) & 0xff00) >> 8)

Doing status = WEXITSTATUS(status); unconditionally will mask away the parts of status that indicate whether a signal was raised, so that afterwards WIFSIGNALED() is true even if no signal was raised.

So that's why busybox's fsck doesn't forward fsck.ext4's status code correctly.

nh2 added a commit to nh2/busybox that referenced this issue May 8, 2018
In commit

  c4fb8c6 - fsck: do not use statics

not only statics were changed but also a couple of
statics-unrelated changes were made.

This included the handling of the child termination status
as follows:

    - if (WIFEXITED(status))
    -   status = WEXITSTATUS(status);
    - else if (WIFSIGNALED(status)) {
    + status = WEXITSTATUS(status);
    + if (WIFSIGNALED(status)) {

Change to unconditionally call WEXITSTATUS() was not semantics-preserving,
since you can only reasonably call WEXITSTATUS() on status if
you checked WIFEXITED() before; see `man 2 waitpid`:

    WEXITSTATUS(status)
      [...]
      This macro should be employed only if WIFEXITED
      returned true.

`status = WEXITSTATUS(status)` unconditionally masks away the
parts of status that indicate whether a signal was raised,
so that afterwards WIFSIGNALED() is true even if no signal
was raised.

As a result, busybox's `fsck` wrapper set `status = EXIT_ERROR = 8`,
thus not forwarding the exit code of the filesystem-specific
fsck utility (such as fsck.ext4) to the caller and returning
exit code 8 instead.

The exit codes of fsck have well-specified meanings (see `man fsck`)
that operating systems use in order to decide whether they should
prompt the user due to unrecoverable errors, or continue booting
after errors were successfully fixed automatically.

Consequently, this regression in busybox's fsck stopped my server
from booting though, and manual intervention via a keyboard was
needed.

Remark: Tracking down this issue would have been significantly
less effort if unrelated code changes were not snuck into a
commit labelled "fsck: do not use statics".

This issue was found as part of the NixOS project
(NixOS/nixpkgs#40174 (comment))
and this fix has been tested on it.

Signed-off-by: Niklas Hambüchen <mail@nh2.me>
@nh2
Copy link
Contributor Author

nh2 commented May 8, 2018

I have written a patch for busybox that fixes the issue (nh2/busybox@b9dbf69), tested that it fixes the reboot hang for my NixOS server, and submitted it to the upstream mailing list.

On NixOS I'm using

{
  busybox = super.busybox.overrideAttrs (oldAttrs: rec {
    patches = oldAttrs.patches ++ [
      ./fsck-Fix-incorrect-handling-of-child-exit-code.patch
    ];
  });
}

@nh2
Copy link
Contributor Author

nh2 commented May 8, 2018

Upstream patch is at http://lists.busybox.net/pipermail/busybox/2018-May/086417.html

dtzWill added a commit to dtzWill/nixpkgs that referenced this issue May 22, 2018
dtzWill added a commit to dtzWill/nixpkgs that referenced this issue May 24, 2018
Fixes NixOS#40174

Patch taken from upstream, added local copy
because fetchpatch doesn't work with fetchurlBoot.
@stale
Copy link

stale bot commented Jun 4, 2020

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 4, 2020
@nh2
Copy link
Contributor Author

nh2 commented Jun 4, 2020

still important to me

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 4, 2020
@TethysSvensson
Copy link
Contributor

@nh2 Is this still an issue, or has it been fixed upstream?

@nh2
Copy link
Contributor Author

nh2 commented Aug 7, 2020

Ah yes, you are right. In the linked #41024 (comment) I wrote

So I'm OK with having this in non-staging only with the next NixOS release.

The upstream patch is:

https://git.busybox.net/busybox/commit/?id=ccb8e4bc4fb74701d0d323d61e9359d8597a4272

which according to

mirror/busybox@ccb8e4b

is in busybox releases > 1.29.0, and it's fixed in at least NixOS 20.03.

Closing.

@nh2 nh2 closed this as completed Aug 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants