Random test failures in nixos.tests.boot.biosUsb #15690
I'm not sure what to do about this. I can imagine syslinux being pickier about read times, so it might time out earlier than Linux itself. Maybe setting I/O limits on the host system can reproduce the problem; I'll give that a try later. However, this doesn't help the build much. If an I/O timeout in syslinux is actually the problem, patching syslinux might solve it, but that seems like something you'd want to avoid for the boot test. Is there some way to keep this boot test from running in parallel with other tests? |
For now the test has been removed from release blockers: a78ecb0 |
@bobvanderlinden we could set |
Just FYI, this is still happening on 16.09 |
Would removing/disabling this test be the better option? I'm not sure what "big-parallel" does for this test, or whether it would help in this case. |
This still occurs quite often. I've gathered logs in #104642. I don't watch hydra closely but in my experience this is the most frequent flaky "blocker". |
Should we just disable it? I'm not sure how to solve this other than not using the usb emulation of qemu that we're using right now. |
Your earlier comment may be worth considering
This means we don't test one failure mode, but we still test many other usb-specific failure modes. Are you familiar with |
Not really. Only a bit as a user when I initially worked on USB boot support of NixOS. That said, I did do a bit of spelunking in the syslinux code and found that it tries to do this a number of times (retry count): There it calls interrupt 13, which is used to read from disks. The error that @edolstra mentioned in the description is elsewhere: It seems that error is also triggered when interrupt 13 is tried too many times. It has its own int13 call, so it also handles its own retry count. The retry count of this function is here: There doesn't seem to be a way to change this through configuration, so we would have to patch the code to increase the timeout.
This is all because interrupt 13 (I/O) on the build server inside qemu is failing 6 times... or so it seems from what I'm reading in the syslinux code. I'm not sure we should be fixing syslinux, as there might be many other low-level tests that fail this way. Maybe there is some configuration within qemu to increase the disk timeout? |
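To make the retry behaviour described above concrete, here is a minimal Python sketch of a bounded retry loop around a disk read. This is not the syslinux code: `read_with_retries`, `flaky_reader` and `DiskReadError` are hypothetical names, and the limit of 6 attempts is only an assumption taken from the "failing 6 times" remark.

```python
from typing import Callable, Optional

# Assumption: 6 attempts, taken from the "failing 6 times" remark in the thread.
MAX_ATTEMPTS = 6


class DiskReadError(Exception):
    """Raised when every attempt to read a sector has failed."""


def read_with_retries(read_sector: Callable[[], bytes],
                      max_attempts: int = MAX_ATTEMPTS) -> bytes:
    """Call read_sector until it succeeds or the attempts are used up."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return read_sector()
        except OSError as err:
            # Remember the failure and try again, like a bootloader
            # re-issuing the same disk read.
            last_error = err
    raise DiskReadError(f"read failed after {max_attempts} attempts") from last_error


def flaky_reader(failures: int) -> Callable[[], bytes]:
    """Hypothetical stand-in for a slow disk: fails `failures` times, then works."""
    remaining = [failures]

    def read_sector() -> bytes:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OSError("simulated disk read timeout")
        return b"\x00" * 512  # one 512-byte sector

    return read_sector
```

With a fixed retry count, a host that is slow for only five consecutive reads still boots, while a sixth consecutive failure aborts. That would be consistent with the test passing when retried on a less loaded machine.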
This might be failing inside SeaBIOS; my hypothesis is that we're hitting SeaBIOS' timeout on USB I/O (which is ~5s). Given that this seems to succeed when it's retried, presumably because of lower overhead on whatever machine's getting used for the build: can we do something to force the USB image to remain resident in the disk cache throughout the test, e.g. by using vmtouch? |
@lukegb I tried implementing what you suggested:
testScript =
  ''
    ${optionalString (extraConfig ? usb) ''
      import subprocess
      subprocess.run("${pkgs.vmtouch}/bin/vmtouch -lfvd -m 1G ${extraConfig.usb}", shell=True, check=True)
    ''}
    machine = create_machine(${machineConfig})
    ...
But that fails because vmtouch can't lock that much memory. Is there a way to raise the ulimits for a test? |
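One way to see whether the locked-memory limit is what stops vmtouch is to compare the image size against `RLIMIT_MEMLOCK` before trying to lock it. The sketch below uses Python's standard `resource` module (Unix only); `memlock_allows` and `can_lock_file` are hypothetical helper names, and it only inspects the limit, it does not raise it.

```python
import os
import resource


def memlock_allows(size_bytes: int, soft_limit: int) -> bool:
    """Return True if size_bytes fits under the given RLIMIT_MEMLOCK soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured
    return size_bytes <= soft_limit


def can_lock_file(path: str) -> bool:
    """Check whether the file at path could be locked under the current limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_allows(os.path.getsize(path), soft)
```

This is only a rough check: it ignores memory the process has already locked, and on Linux a process with `CAP_IPC_LOCK` is not bound by the limit at all. If `can_lock_file` returns False for the USB image inside the build sandbox, that would explain the vmtouch failure.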
Some new developments in #170803 |
Proposed fix: #172059 |
This patch fixes a problem that caused the NixOS tests that tested booting from USB to fail periodically. Fixes NixOS#15690, fixes NixOS#104642, fixes NixOS#170803. Also submitted upstream at https://lists.nongnu.org/archive/html/qemu-devel/2022-05/msg01484.html
http://hydra.nixos.org/build/36146783:
Note: error 0c00 = unsupported track or invalid media (http://www.ctyme.com/intr/rb-0606.htm#Table234)
http://hydra.nixos.org/build/36146731 reaches the syslinux menu, then hangs loading the initrd:
http://hydra.nixos.org/build/35988622:
http://hydra.nixos.org/build/36146446:
Haven't been able to reproduce these locally, so it might be triggered by high load on the host.
This may be a QEMU USB emulation issue. It doesn't appear to affect the USB UEFI boot test though.