Random test failures in nixos.tests.boot.biosUsb #15690

Closed
edolstra opened this issue May 25, 2016 · 14 comments · Fixed by #172070
Labels
0.kind: bug Something is broken 6.topic: testing Tooling for automated testing of packages and modules

Comments

@edolstra
Member

http://hydra.nixos.org/build/36146783:

machine# ISOLINUX 6.04   Copyright (C) 1994-2015 H. Peter Anvin et al
machine# CHS: Error 0c00 reading sector 4220 (4/2/63)
machine# EDD: Error 0c00 reading sector 4220
error: timed out waiting for the VM to connect

Note: error 0c00 = unsupported track or invalid media (http://www.ctyme.com/intr/rb-0606.htm#Table234)
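The sector numbers and the parenthesised triples in these log lines appear to be related by the usual CHS-to-LBA formula. The sketch below assumes the triple is cylinder/head/sector and that the BIOS reports a geometry of 16 heads and 63 sectors per track; that geometry is inferred from the two log lines in this report, not read from SYSLINUX, SeaBIOS, or QEMU. It also assumes the INT 13h status code is the high byte of the printed error word.

```python
# Geometry is an assumption inferred from the log lines in this issue,
# not taken from SYSLINUX, SeaBIOS, or QEMU.
HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK = 63

# Only the one status code mentioned in this issue is listed here.
INT13_STATUS = {0x0C: "unsupported track or invalid media"}


def chs_to_lba(cylinder, head, sector,
               heads=HEADS_PER_CYLINDER, spt=SECTORS_PER_TRACK):
    """Convert a CHS address (sector is 1-based) to a 0-based LBA."""
    if not 0 <= head < heads:
        raise ValueError("head out of range")
    if not 1 <= sector <= spt:
        raise ValueError("sector out of range")
    return (cylinder * heads + head) * spt + (sector - 1)


def lba_to_chs(lba, heads=HEADS_PER_CYLINDER, spt=SECTORS_PER_TRACK):
    """Convert a 0-based LBA back to a (cylinder, head, sector) triple."""
    cylinder, remainder = divmod(lba, heads * spt)
    head, sector_index = divmod(remainder, spt)
    return cylinder, head, sector_index + 1


def int13_status(error_word):
    """Return the high byte of the 16-bit error word (assumed to be AH)."""
    return (error_word >> 8) & 0xFF


if __name__ == "__main__":
    for lba in (4220, 11339):
        print(lba, lba_to_chs(lba))
    status = int13_status(0x0C00)
    print(hex(status), INT13_STATUS.get(status, "unknown"))
```

Under these assumptions both failing reads land on the last sector of a track (sector 63), which may or may not be significant.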

http://hydra.nixos.org/build/36146731 reaches the syslinux menu, then hangs loading the initrd:

Loading /boot/bzImage... ok
error: timed out waiting for the VM to connect

http://hydra.nixos.org/build/35988622:

machine# ISOLINUX 6.04   Copyright (C) 1994-2015 H. Peter Anvin et al
error: timed out waiting for the VM to connect

http://hydra.nixos.org/build/36146446:

Loading /boot/bzImage... CHS: Error 0c00 reading sector 11339 (11/3/63)
machine# EDD: Error 0c00 reading sector 11339
error: timed out waiting for the VM to connect

Haven't been able to reproduce these locally, so it might be triggered by high load on the host.

This may be a QEMU USB emulation issue. It doesn't appear to affect the USB UEFI boot test though.

@edolstra edolstra added the 0.kind: bug Something is broken label May 25, 2016
@domenkozar
Member

@bobvanderlinden

@bobvanderlinden
Member

I'm not sure what to do about this. I can imagine syslinux being pickier about read times, so it might time out earlier than Linux itself. Setting I/O limits on the host system might reproduce the problem; I'll give that a try later. However, reproducing it doesn't help the build much. If an IO timeout in syslinux is actually the problem, patching syslinux might solve the problem, but that seems like something you'd want to avoid for the boot test.

Is there some way to run this boot test without running other tests in parallel?

@domenkozar
Member

For now the test has been removed from release blockers: a78ecb0

@domenkozar domenkozar added the 1.severity: blocker This is preventing another PR or issue from being completed label Jul 21, 2016
@domenkozar domenkozar added this to the 16.09 milestone Jul 21, 2016
@domenkozar
Member

@bobvanderlinden we could set requiredSystemFeatures = [ "big-parallel" ];. @edolstra opinions?

@domenkozar domenkozar removed the 1.severity: blocker This is preventing another PR or issue from being completed label Sep 20, 2016
@domenkozar
Member

Just FYI, this is still happening on 16.09

@domenkozar domenkozar modified the milestones: 17.03, 16.09 Sep 20, 2016
@bobvanderlinden
Member

Would removing/disabling this test be the better option? I'm not sure what "big-parallel" does for this test, nor whether it will help in this case.

@fpletz fpletz modified the milestones: 17.09, 17.03 Jul 25, 2017
@fpletz fpletz removed this from the 17.09 milestone Mar 4, 2018
@c0bw3b c0bw3b added the 6.topic: testing Tooling for automated testing of packages and modules label Apr 28, 2019
@vcunat vcunat closed this as completed in ea79a83 Feb 22, 2020
@roberth roberth reopened this Feb 9, 2021
@roberth
Member

roberth commented Feb 9, 2021

This still occurs quite often. I've gathered logs in #104642. I don't watch hydra closely but in my experience this is the most frequent flaky "blocker".

@bobvanderlinden
Member

Should we just disable it? I'm not sure how to solve this other than not using the usb emulation of qemu that we're using right now.

@roberth
Member

roberth commented Feb 9, 2021

Your earlier comment may be worth considering:

> If an IO timeout in syslinux is actually the problem, patching syslinux might solve the problem, but that seems like something you'd want to avoid for the boot test.

This means we don't test one failure mode, but we still test many other usb-specific failure modes.
I don't think this timeout is a problem in practice anyway, or we could just ship a syslinux with an increased timeout.

Are you familiar with syslinux?

@bobvanderlinden
Member

bobvanderlinden commented Feb 10, 2021

> Are you familiar with syslinux?

Not really. Only a bit as a user when I initially worked on USB boot support of NixOS.

That said, I did a bit of spelunking in the syslinux code and found that it attempts the disk access a number of times (retry count):

https://github.com/geneC/syslinux/blob/5e426532210bb830d2d7426eb8d8c154d9dfcba6/com32/lib/syslinux/disk.c#L52

There it calls interrupt 13, which is used to read from disks.

The error that @edolstra mentioned in the description is elsewhere:

https://github.com/geneC/syslinux/blob/5e426532210bb830d2d7426eb8d8c154d9dfcba6/core/fs/diskio_bios.c#L139-L142

It seems that error is also triggered when interrupt 13 is tried too many times. It has its own int13 call, so it also handles its own retry-count. The retry-count of this function is here:

https://github.com/geneC/syslinux/blob/5e426532210bb830d2d7426eb8d8c154d9dfcba6/core/fs/diskio_bios.c#L6

There doesn't seem to be a way to change this using configuration, so we would have to patch the code to increase the timeout.

This is all because interrupt 13 (IO) on the build server inside qemu is failing 6 times... or so it seems from what I'm reading in the syslinux code. I'm not sure we should be fixing syslinux, as there might be many other low-level tests that might be failing this way. Maybe there is some configuration within qemu to increase disk timeout?
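The failure mode described above, a bounded number of attempts after which the read is reported as an error, can be sketched as follows. This is not the syslinux code: `read_with_retries` and `FlakyDisk` are hypothetical names for illustration, and the default of 6 attempts is taken from the comment above rather than verified against the source.

```python
SECTOR_SIZE = 512


class FlakyDisk:
    """Hypothetical disk whose first `failures` reads fail, as a slow host might cause."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read_sector(self, lba):
        self.calls += 1
        if self.calls <= self.failures:
            raise IOError(f"simulated read error on sector {lba}")
        return b"\x00" * SECTOR_SIZE


def read_with_retries(read_sector, lba, retries=6):
    """Call read_sector(lba) up to `retries` times, then give up with an error."""
    last_error = None
    for _attempt in range(retries):
        try:
            return read_sector(lba)
        except IOError as err:
            last_error = err
    raise IOError(f"giving up on sector {lba} after {retries} attempts") from last_error


if __name__ == "__main__":
    # Five transient failures are absorbed by the retry loop.
    disk = FlakyDisk(failures=5)
    print(len(read_with_retries(disk.read_sector, 4220)), "bytes after", disk.calls, "calls")

    # Six failures in a row exhaust the retries and surface as a hard error.
    try:
        read_with_retries(FlakyDisk(failures=6).read_sector, 4220)
    except IOError as err:
        print("error:", err)
```

With a fixed retry count, whether the boot succeeds depends only on how many consecutive reads fail, which would be consistent with failures that show up only under host load.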

@lukegb
Contributor

lukegb commented Apr 6, 2021

This might be failing inside SeaBIOS; my hypothesis is that we're hitting SeaBIOS' timeout on USB I/O (which is ~5s)

Given that this seems to succeed when it's retried, presumably because of lower overhead on whatever machine's getting used for the build: can we do something to force the USB image to remain resident in the disk cache throughout the test, e.g. by using vmtouch?

@ncfavier
Member

@lukegb I tried implementing what you suggested:

        testScript =
          ''
            ${optionalString (extraConfig ? usb) ''
            import subprocess
            subprocess.run("${pkgs.vmtouch}/bin/vmtouch -lfvd -m 1G ${extraConfig.usb}", shell=True, check=True)
            ''}
            machine = create_machine(${machineConfig})
            ...

But that fails because vmtouch can't lock that much memory. Is there a way to raise the ulimits for a test?
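One way to confirm that the locked-memory limit is what stops vmtouch is to compare the image size with `RLIMIT_MEMLOCK` from inside the Python test script before calling it. This is a minimal sketch using Python's standard `resource` module on Linux; `can_lock` and `memlock_headroom` are hypothetical helper names, and the sketch only reports the limit, it does not raise it.

```python
import os
import resource


def can_lock(nbytes, soft_limit):
    """Return True if `nbytes` fits within the given soft memlock limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return nbytes <= soft_limit


def memlock_headroom(path):
    """Return (file size, soft RLIMIT_MEMLOCK, whether the file could be locked)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    size = os.path.getsize(path)
    return size, soft, can_lock(size, soft)


if __name__ == "__main__":
    import sys

    # Pass the path of the USB image as the first argument.
    size, soft, ok = memlock_headroom(sys.argv[1])
    print(f"size={size} soft_limit={soft} lockable={ok}")
```

If the soft limit turns out to be smaller than the image, locking the whole file (`vmtouch -l`) cannot succeed without a higher limit for the process that runs the test.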

@ncfavier
Member

Some new developments in #170803

@ncfavier
Member

ncfavier commented May 8, 2022

Proposed fix: #172059

raboof added a commit to raboof/nixpkgs that referenced this issue May 8, 2022
This patch fixes a problem that caused the NixOS tests that tested booting
from USB to fail periodically.

Fixes NixOS#15690, fixes NixOS#104642, fixes NixOS#170803

Also submitted upstream at https://lists.nongnu.org/archive/html/qemu-devel/2022-05/msg01484.html
@raboof raboof mentioned this issue May 8, 2022
8 participants