busybox fsck breaks NixOS's fsck handling logic #40174
A workaround is to replace the line fsck $fsckFlags "$device" by fsck.ext4 $fsckFlags "$device" to directly invoke the fsck for your file system type (in my case But we should really find out what the fsck wrapper is doing wrong. What code is running here, is this wrapper from |
FWIW I had issues with busybox utils in stage1 that were "resolved" by using utillinux instead for things like mount, etc. This was w/musl (and shouldn't happen, busybox+musl should work great!) but it would be interesting to see if the same workaround worked here as well. Is there a NixOS test we can put together to explore this? Also, FWIW: Which seems to suggest '8' means fsck exit error. |
Looks like it's a bug in busybox. I have found the source code in busybox that emits the message
It is https://git.busybox.net/busybox/tree/e2fsprogs/fsck.c?h=1_28_3#n456

     child_died:
        status = WEXITSTATUS(status);
        if (WIFSIGNALED(status)) {
            sig = WTERMSIG(status);
            status = EXIT_UNCORRECTED;
            if (sig != SIGINT) {
                printf("Warning: %s %s terminated "
                    "by signal %d\n",
                    inst->prog, inst->device, sig);
                status = EXIT_ERROR;
            }
        }

Luckily it is easy to distinguish this from the upstream, non-busybox
The message is different (
Suspiciously, this code handles the case where fsck was terminated by a signal (
I have not confirmed whether the upstream fsck wrapper, too, suffers from the bug of not forwarding the exit code of
But I am reasonably sure that this issue only exists in busybox's fsck wrapper, and that it was introduced in this commit: https://git.busybox.net/busybox/commit/?id=c4fb8c6ad52e8007c6fa07e40f043bb2e0c043d1

    @@ -439,9 +446,8 @@ static int wait_one(int flags)
        }
      child_died:
    -   if (WIFEXITED(status))
    -       status = WEXITSTATUS(status);
    -   else if (WIFSIGNALED(status)) {
    +   status = WEXITSTATUS(status);
    +   if (WIFSIGNALED(status)) {
            sig = WTERMSIG(status);
            status = EXIT_UNCORRECTED;
            if (sig != SIGINT) {

This busybox change altered the semantics of the original fsck code. You can only reasonably call

    /* If WIFEXITED(STATUS), the low-order 8 bits of the status. */
    #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8)

Doing
So that's why busybox's |
In commit c4fb8c6 - fsck: do not use statics

not only statics were changed but also a couple of statics-unrelated changes were made. This included the handling of the child termination status as follows:

    -   if (WIFEXITED(status))
    -       status = WEXITSTATUS(status);
    -   else if (WIFSIGNALED(status)) {
    +   status = WEXITSTATUS(status);
    +   if (WIFSIGNALED(status)) {

Change to unconditionally call WEXITSTATUS() was not semantics-preserving, since you can only reasonably call WEXITSTATUS() on status if you checked WIFEXITED() before; see `man 2 waitpid`:

    WEXITSTATUS(status)
        [...] This macro should be employed only if WIFEXITED returned true.

`status = WEXITSTATUS(status)` unconditionally masks away the parts of status that indicate whether a signal was raised, so that afterwards WIFSIGNALED() is true even if no signal was raised.

As a result, busybox's `fsck` wrapper set `status = EXIT_ERROR = 8`, thus not forwarding the exit code of the filesystem-specific fsck utility (such as fsck.ext4) to the caller and returning exit code 8 instead.

The exit codes of fsck have well-specified meanings (see `man fsck`) that operating systems use in order to decide whether they should prompt the user due to unrecoverable errors, or continue booting after errors were successfully fixed automatically.

Consequently, this regression in busybox's fsck stopped my server from booting through, and manual intervention via a keyboard was needed.

Remark: Tracking down this issue would have been significantly less effort if unrelated code changes were not snuck into a commit labelled "fsck: do not use statics".

This issue was found as part of the NixOS project (NixOS/nixpkgs#40174 (comment)) and this fix has been tested on it.

Signed-off-by: Niklas Hambüchen <mail@nh2.me>
I have written a patch for busybox that fixes the issue (nh2/busybox@b9dbf69), tested that it fixes the reboot hang for my NixOS server, and submitted it to the upstream mailing list. On NixOS I'm using

    {
      busybox = super.busybox.overrideAttrs (oldAttrs: rec {
        patches = oldAttrs.patches ++ [
          ./fsck-Fix-incorrect-handling-of-child-exit-code.patch
        ];
      });
    }
|
Upstream patch is at http://lists.busybox.net/pipermail/busybox/2018-May/086417.html |
Fixes NixOS#40174 Patch taken from upstream, added local copy because fetchpatch doesn't work with fetchurlBoot.
Thank you for your contributions. This has been automatically marked as stale because it has had no activity for 180 days. If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.
|
still important to me |
@nh2 Is this still an issue, or has it been fixed upstream? |
Ah yes, you are right. In the linked #41024 (comment) I wrote
The upstream patch is: https://git.busybox.net/busybox/commit/?id=ccb8e4bc4fb74701d0d323d61e9359d8597a4272 which according to is in busybox releases > 1.29.0, and it's fixed in at least NixOS 20.03. Closing. |
This code in `stage-1-init.sh` runs `fsck $fsckFlags "$device"`:
nixpkgs/nixos/modules/system/boot/stage-1-init.sh
Lines 260 to 277 in c2b668e
This `fsck` seems to be the generic wrapper that calls the fsck for the corresponding filesystem type, like `fsck.ext4`.
Note in the above code how NixOS has (correct) code to handle the exit code, e.g. when the bits of decimal `2` or `4` are set in the exit code (see `man fsck`).
But the generic fsck wrapper seems to not correctly pass through the exit code.
On my server I get this boot failure:
Here `fsck.ext4` found a fixable problem and fixed it, then exited with exit code 1 as expected.
But the generic `fsck` wrapper turns that into exit code 8.
As a result, in `stage-1-init.sh` the wrong if-branch is taken, and my server is stuck waiting for a key to be pressed instead of booting through.