busybox fsck breaks NixOS's fsck handling logic #40174

nh2 · 2018-05-08T12:23:09Z

This code in stage-1-init.sh runs fsck $fsckFlags "$device":

nixpkgs/nixos/modules/system/boot/stage-1-init.sh

Lines 260 to 277 in c2b668e

    
           fsck $fsckFlags "$device" 
        
           fsckResult=$? 
        
           if test $(($fsckResult | 2)) = $fsckResult; then 
        
               echo "fsck finished, rebooting..." 
        
               sleep 3 
        
               reboot -f 
        
           fi 
        
           if test $(($fsckResult | 4)) = $fsckResult; then 
        
               echo "$device has unrepaired errors, please fix them manually." 
        
               fail 
        
           fi 
        
           if test $fsckResult -ge 8; then 
        
               echo "fsck on $device failed." 
        
               fail 
        
           fi

This fsck seems to be the generic wrapper that calls the corresponding fsck for the corresponding filesystem type, like fsck.ext4.

Note in the above code how NixOS has (correct) code to handle the exit code, e.g. when the bits of decimal 2 or 4 are set in the exit code (see man fsck).

But the generic fsck wrapper seems to not correctly pass through the exit code.

On my server I get this boot failure:

Checking /dev/disk/by-label/root...
fsck (busybox 1.27.2)
[fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/disk/by-label/root
root: Superblock hint for external superblock should be 0x97f.  FIXED.
root: 73146/122093568 files (0.5% non-contiguous), 8267968/488345344 blocks
Warning: fsck.ext4 /dev/disk/by-label/root terminated by signal 1
fsck on /dev/disk/by-label/root failed.

An error occurred in stage 1 of the boot process, which must mount the
root filesystem on '/mnt-root' and then start stage 2. Press one
of the following keys:

  r) to reboot immediately
  *) to ignore the error and continue

Here fsck.ext4 found a fixable problem and fixed it, then exiting with exit code 1 as expected.
But the generic fsck wrapper tuns that into exit code 8.

As a result, in stage-1-init.sh the wrong if-branch is taken,
and my server is stuck waiting for a key to be pressed instead of booting through.

The text was updated successfully, but these errors were encountered:

nh2 · 2018-05-08T12:25:28Z

A workaround is to replace the line

fsck $fsckFlags "$device"

by

fsck.ext4 $fsckFlags "$device"

to directly invoke the fsck for your file system type (in my case ext4) without the generic fsck wrapper.

But we should really find out what the fsck wrapper is doing wrong. What code is running here, is this wrapper from busybox as suggested by the error message?

dtzWill · 2018-05-08T12:56:35Z

FWIW I had issues with busybox utils in stage1 that were "resolved" by using utillinux instead for things like mount, etc. This was w/musl (and shouldn't happen, busybox+musl should work great!) but it would be interesting to see if the same workaround worked here as well.

Is there a NixOS test we can put together to explore this?
Also it says "terminated by signal 1" which seems curious-- it often means exit code lower bits will be 1 but doesn't seem normal behavior. Who's sending SIGHUP to fsck?

Also, FWIW:
https://git.busybox.net/busybox/tree/e2fsprogs/fsck.c?id=72089cf6b4a77214ec4fd21d5ee5bf56958781cb#n70

Which seems to suggest '8' means fsck exit error.

nh2 · 2018-05-08T13:40:01Z

Looks like it's a bug in busybox.

I have found the source code in busybox that emits the message

Warning: fsck.ext4 /dev/disk/by-label/root terminated by signal 1

It is https://git.busybox.net/busybox/tree/e2fsprogs/fsck.c?h=1_28_3#n456

 child_died:

	status = WEXITSTATUS(status);
	if (WIFSIGNALED(status)) {
		sig = WTERMSIG(status);
		status = EXIT_UNCORRECTED;
		if (sig != SIGINT) {
			printf("Warning: %s %s terminated "
				"by signal %d\n",
				inst->prog, inst->device, sig);
			status = EXIT_ERROR;
		}
	}

Luckily it is easy to distinguish this from the upstream, non-busybox e2progs fsck wrapper, because that one has the following source code instead: https://github.com/tytso/e2fsprogs/blob/d5bd126cd6a06dd2a8652557776ea298fb442912/misc/fsck.c#L619-L620

The message is different (with signal vs by signal).

Suspiciously, this code is handles the case where fsck was terminated by a signal (WIFSIGNALED), but in my case, fsck.ext4 simply exits with code 1, without any signal being involved. So I suspect that somehow the wrong code path is taken.

I have not confirmed whether the upstream fsck wrapper, too, suffers from the bug of not forwarding the exit code of fsck.ext4 correctly. And I cannot easily try it, because I can't run it from the busybox shell.

But I am reasonably sure that this issue only exists in busybox's fsck wrapper, and that it was introduced in this commit:

https://git.busybox.net/busybox/commit/?id=c4fb8c6ad52e8007c6fa07e40f043bb2e0c043d1

@@ -439,9 +446,8 @@ static int wait_one(int flags)
 	}
  child_died:
 
-	if (WIFEXITED(status))
-		status = WEXITSTATUS(status);
-	else if (WIFSIGNALED(status)) {
+	status = WEXITSTATUS(status);
+	if (WIFSIGNALED(status)) {
 		sig = WTERMSIG(status);
 		status = EXIT_UNCORRECTED;
 		if (sig != SIGINT) {

This change busybox changed the semantics of the original fsck code.

You can only reasonably call WEXITSTATUS() on status if you checked WIFEXITED() before, because as documented in waitstatus.h:

/* If WIFEXITED(STATUS), the low-order 8 bits of the status.  */
#define	__WEXITSTATUS(status)	(((status) & 0xff00) >> 8)

Doing status = WEXITSTATUS(status); unconditionally will mask away the parts of status that indicate whether a signal was raised, so that afterwards WIFSIGNALED() is true even if no signal was raised.

So that's why busybox's fsck doesn't forward fsck.ext4's status code correctly.

In commit c4fb8c6 - fsck: do not use statics not only statics were changed but also a couple of statics-unrelated changes were made. This included the handling of the child termination status as follows: - if (WIFEXITED(status)) - status = WEXITSTATUS(status); - else if (WIFSIGNALED(status)) { + status = WEXITSTATUS(status); + if (WIFSIGNALED(status)) { Change to unconditionally call WEXITSTATUS() was not semantics-preserving, since you can only reasonably call WEXITSTATUS() on status if you checked WIFEXITED() before; see `man 2 waitpid`: WEXITSTATUS(status) [...] This macro should be employed only if WIFEXITED returned true. `status = WEXITSTATUS(status)` unconditionally masks away the parts of status that indicate whether a signal was raised, so that afterwards WIFSIGNALED() is true even if no signal was raised. As a result, busybox's `fsck` wrapper set `status = EXIT_ERROR = 8`, thus not forwarding the exit code of the filesystem-specific fsck utility (such as fsck.ext4) to the caller and returning exit code 8 instead. The exit codes of fsck have well-specified meanings (see `man fsck`) that operating systems use in order to decide whether they should prompt the user due to unrecoverable errors, or continue booting after errors were successfully fixed automatically. Consequently, this regression in busybox's fsck stopped my server from booting though, and manual intervention via a keyboard was needed. Remark: Tracking down this issue would have been significantly less effort if unrelated code changes were not snuck into a commit labelled "fsck: do not use statics". This issue was found as part of the NixOS project (NixOS/nixpkgs#40174 (comment)) and this fix has been tested on it. Signed-off-by: Niklas Hambüchen <mail@nh2.me>

nh2 · 2018-05-08T14:48:43Z

I have written a patch for busybox that fixes the issue (nh2/busybox@b9dbf69), tested that it fixes the reboot hang for my NixOS server, and submitted it to the upstream mailing list.

On NixOS I'm using

{
  busybox = super.busybox.overrideAttrs (oldAttrs: rec {
    patches = oldAttrs.patches ++ [
      ./fsck-Fix-incorrect-handling-of-child-exit-code.patch
    ];
  });
}

nh2 · 2018-05-08T15:09:26Z

Upstream patch is at http://lists.busybox.net/pipermail/busybox/2018-May/086417.html

Fixes NixOS#40174

Fixes NixOS#40174 Patch taken from upstream, added local copy because fetchpatch doesn't work with fetchurlBoot.

stale · 2020-06-04T17:28:10Z

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

Search for maintainers and people that previously touched the related code and @ mention them in a comment.
Ask on the NixOS Discourse.
Ask on the #nixos channel on irc.freenode.net.

nh2 · 2020-06-04T20:16:24Z

still important to me

TethysSvensson · 2020-08-07T14:32:04Z

@nh2 Is this still an issue, or has it been fixed upstream?

nh2 · 2020-08-07T16:01:55Z

Ah yes, you are right. In the linked #41024 (comment) I wrote

So I'm OK with having this in non-staging only with the next NixOS release.

The upstream patch is:

https://git.busybox.net/busybox/commit/?id=ccb8e4bc4fb74701d0d323d61e9359d8597a4272

which according to

mirror/busybox@ccb8e4b

is in busybox releases > 1.29.0, and it's fixed in at least NixOS 20.03.

Closing.

dtzWill added a commit to dtzWill/nixpkgs that referenced this issue May 22, 2018

busybox: patch to fix fsck exit code

4f8078b

Fixes NixOS#40174

dtzWill added a commit to dtzWill/nixpkgs that referenced this issue May 24, 2018

busybox: patch to fix fsck exit code

3e9c74b

Fixes NixOS#40174 Patch taken from upstream, added local copy because fetchpatch doesn't work with fetchurlBoot.

dtzWill mentioned this issue May 24, 2018

busybox: 1.28.3 -> 1.28.4, fix fsck exit code #41024

Closed

8 tasks

stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 4, 2020

stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 4, 2020

nh2 closed this as completed Aug 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

busybox fsck breaks NixOS's fsck handling logic #40174

busybox fsck breaks NixOS's fsck handling logic #40174

nh2 commented May 8, 2018 •

edited

nh2 commented May 8, 2018

dtzWill commented May 8, 2018

nh2 commented May 8, 2018 •

edited

nh2 commented May 8, 2018 •

edited

nh2 commented May 8, 2018

stale bot commented Jun 4, 2020

nh2 commented Jun 4, 2020

TethysSvensson commented Aug 7, 2020

nh2 commented Aug 7, 2020

busybox fsck breaks NixOS's fsck handling logic #40174

busybox fsck breaks NixOS's fsck handling logic #40174

Comments

nh2 commented May 8, 2018 • edited

nh2 commented May 8, 2018

dtzWill commented May 8, 2018

nh2 commented May 8, 2018 • edited

nh2 commented May 8, 2018 • edited

nh2 commented May 8, 2018

stale bot commented Jun 4, 2020

nh2 commented Jun 4, 2020

TethysSvensson commented Aug 7, 2020

nh2 commented Aug 7, 2020

nh2 commented May 8, 2018 •

edited

nh2 commented May 8, 2018 •

edited

nh2 commented May 8, 2018 •

edited