nixos.tests.boot.biosUsb.x86_64-linux fails on hydra: "cpage out of range (5)" #170803

raboof · 2022-04-28T18:51:19Z

https://hydra.nixos.org/build/174964149

machine # Booting from Hard Disk...
machine # 
machine # ISOLINUX 6.04  EHDD Copyright (C) 1994-2015 H. Peter Anvin et a
machine # ISOLINUX 6.04   Copyright (C) 1994-2015 H. Peter Anvin et al
machine # l
machine # e%@)0(B�lqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqkx�                                             NixOS                                              �xtqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqu��x� NixOS 22.05pre-git Installer                                                                   �x��x� NixOS 22.05pre-git Installer (nomodeset)                                                       �x��x� NixOS 22.05pre-git Installer (copytoram)                                                       �x��x� NixOS 22.05pre-git Installer (debug)                                                           �x��x� NixOS 22.05pre-git Installer (serial console=ttyS0,115200n8)                                   �x��x� Memtest86+                                                                                     �x��mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqj�Press [Tab] to edit optionsAutomatic boot in 10 seconds... Automatic boot in 9 seconds... Automatic boot in 8 seconds...Automatic boot in 7 seconds...Automatic boot in 6 seconds...Automatic boot in 5 seconds...Automatic boot in 4 seconds...Automatic boot in 3 seconds...Automatic boot in 2 seconds... Automatic boot in 1 second... 
�lqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqkx�                                             NixOS                                              �xtqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqu��x� NixOS 22.05pre-git Installer                                                                   �x��x� NixOS 22.05pre-git Installer (nomodeset)                                                       �x��x� NixOS 22.05pre-git Installer (copytoram)                                                       �x��x� NixOS 22.05pre-git Installer (debug)                                                           �x��x� NixOS 22.05pre-git Installer (serial console=ttyS0,115200n8)                                   �x��x� Memtest86+                                                                                     �x��mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqj�Press [Tab] to edit optionse%@)0(BLoading /boot/bzImage... cpage out of range (5)
machine # processing error - resetting ehci HC
machine # CHS: Error 0c00 reading sector 3058 (1/16/35)
machine # EDD: Error 0c00 reading sector 3058
machine # CHS: Error 0c00 reading sector 15543 (7/22/46)
machine # EDD: Error 0c00 reading sector 15543
machine # ok
machine # Loading /boot/initrd...CHS: Error 0c00 reading sector 70844 (35/4/33)
machine # EDD: Error 0c00 reading sector 70844
machine # CHS: Error 0c00 reading sector 96311 (47/24/48)
machine # EDD: Error 0c00 reading sector 96311
machine # ok
machine: connected to guest root shell
machine: (connecting took 1017.66 seconds)
(finished: waiting for the VM to finish booting, in 1017.66 seconds)
cleanup
(finished: cleanup, in 0.00 seconds)
Traceback (most recent call last):
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/bin/.nixos-test-driver-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/__init__.py", line 114, in main
    driver.run_tests()
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 146, in run_tests
    self.test_script()
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 142, in test_script
    exec(self.tests, symbols, None)
  File "<string>", line 9, in <module>
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 459, in wait_for_unit
    retry(check_active)
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 126, in retry
    if fn(False):
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 436, in check_active
    info = self.get_unit_info(unit, user)
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 462, in get_unit_info
    status, lines = self.systemctl('--no-pager show "{}"'.format(unit), user)
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 493, in systemctl
    return self.execute("systemctl {}".format(q))
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 541, in execute
    self.shell.send(out_command.encode())
BrokenPipeError: [Errno 32] Broken pipe
machine # cProbing EDD (edd=off to disable)... oc

Specifically:

Loading /boot/bzImage... cpage out of range (5)

For comparison, a successful run loads bzImage successfully:

machine # ISOLINUX 6.04  EHDD Copyright (C) 1994-2015 H. Peter Anvin et a
machine # ISOLINUX 6.04   Copyright (C) 1994-2015 H. Peter Anvin et al
machine # l
machine # e%@)0(B�lqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqkx�                                             NixOS                                              �xtqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqu��x� NixOS 22.05pre-git Installer                                                                   �x��x� NixOS 22.05pre-git Installer (nomodeset)                                                       �x��x� NixOS 22.05pre-git Installer (copytoram)                                                       �x��x� NixOS 22.05pre-git Installer (debug)                                                           �x��x� NixOS 22.05pre-git Installer (serial console=ttyS0,115200n8)                                   �x��x� Memtest86+                                                                                     �x��mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqj�Press [Tab] to edit optionsAutomatic boot in 10 seconds... Automatic boot in 9 seconds... Automatic boot in 8 seconds...Automatic boot in 7 seconds...Automatic boot in 6 seconds...Automatic boot in 5 seconds...Automatic boot in 4 seconds...Automatic boot in 3 seconds...Automatic boot in 2 seconds... Automatic boot in 1 second... 
�lqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqkx�                                             NixOS                                              �xtqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqu��x� NixOS 22.05pre-git Installer                                                                   �x��x� NixOS 22.05pre-git Installer (nomodeset)                                                       �x��x� NixOS 22.05pre-git Installer (copytoram)                                                       �x��x� NixOS 22.05pre-git Installer (debug)                                                           �x��x� NixOS 22.05pre-git Installer (serial console=ttyS0,115200n8)                                   �x��x� Memtest86+                                                                                     �x��mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqj�Press [Tab] to edit optionse%@)0(BLoading /boot/bzImage... ok
machine # Loading /boot/initrd...ok


ncfavier · 2022-04-29T14:51:44Z

Might this be related to #15690 ?

raboof · 2022-04-29T15:02:03Z

Possibly, the Error 0c00 reading sector certainly overlaps, and it seems cpage out of range (5) is also USB-related. I guess the question now is which is the cause and which is the effect?

vcunat · 2022-04-29T20:49:23Z

Even in the past weeks the job has been failing occasionally. I usually don't even look into the log anymore unless it failed twice. Locally I don't get an error on the same commit. Once Hydra's queue runner gets fixed, we'll see if it succeeds.

raboof · 2022-04-30T08:22:18Z

> Even in the past weeks the job has been failing occasionally. I usually don't even look into the log anymore unless it failed twice. Locally I don't get an error on the same commit. Once Hydra's queue runner gets fixed, we'll see if it succeeds.

Yes, I completely agree. Let's keep this ticket to track the instability, though.

> Possibly, the Error 0c00 reading sector certainly overlaps, and it seems cpage out of range (5) is also USB-related. I guess the question now is which is the cause and which is the effect?

Ok, I now think the cpage out of range (5) is the cause and the Error 0c00 reading sector is the effect: AFAICT what happens is:

1. ISOLINUX asks the BIOS to read bzImage (presumably with int 0x13 or somesuch)
2. the BIOS (SeaBIOS) uses USB to get this file from qemu
3. qemu produces the cpage out of range (5) error and aborts the USB transfer
4. the BIOS presumably picks up the error and reports it as 0c00

So it seems like there is something wonky in the USB communication between SeaBIOS and qemu. I had a bit of a look but the implementations look reasonable on both sides at first glance. I wonder if we could run this test with SeaBIOS debugging enabled, but using a different SeaBIOS was not as simple as adding bios = "${pkgs.seabios}/Csm16.bin"; to the biosUsb test ;). Anyone more well-versed in qemu?

ncfavier · 2022-04-30T10:16:55Z

Thanks for looking into this! I found this bug report https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3

raboof · 2022-05-02T21:18:52Z

> Thanks for looking into this! I found this bug report https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3

Great find! Indeed with that ISO it seems easier to reproduce (while loading the kernel - if you pass '-nographic' you can hit 'enter' and select '2' to load from the console) - perhaps it's just bigger? I also found it sometimes just hangs instead of hitting the cpage error, which might be what causes #15690.

Adding some additional diagnostics shows:

reading 20480 bytes from offset 1652 at page 0

Indeed that's an invalid USB packet: you can fit 20480 bytes in there, but only if you start at offset 0, since there are only 5 pages and each page is 4096 bytes.
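A quick sanity check of that arithmetic, as a minimal sketch. It assumes the EHCI qTD layout of five buffer pointers (page indices 0 to 4) of 4096 bytes each; the function names are made up for illustration and are not taken from qemu or SeaBIOS.

```python
PAGE_SIZE = 4096  # bytes addressed by one qTD buffer pointer
NUM_PAGES = 5     # assumption: EHCI qTD buffer pointers 0..4


def max_transfer(offset: int) -> int:
    """Largest transfer that fits when starting `offset` bytes into page 0."""
    return NUM_PAGES * PAGE_SIZE - offset


def last_page_index(offset: int, length: int) -> int:
    """Index of the page that holds the final byte of the transfer."""
    return (offset + length - 1) // PAGE_SIZE


print(max_transfer(0))               # 20480
print(max_transfer(1652))            # 18828
print(last_page_index(0, 20480))     # 4
print(last_page_index(1652, 20480))  # 5
```

With offset 1652 the last byte would land in a page with index 5, which does not exist. That would be consistent with the "(5)" in the error message, though I have not confirmed that this is exactly how qemu arrives at the number.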

After adding a bunch of logging to qemu, I see the following pattern:

 44%requested 31 at offset 785
requested 20480 at offset 0
requested 20480 at offset 0
requested 20480 at offset 0
requested 3584 at offset 0
requested 13 at offset 772
requested 31 at offset 785
requested 512 at offset 0
requested 13 at offset 772
 44%requested 31 at offset 785
requested 20480 at offset 0
requested 20480 at offset 0
requested 20480 at offset 0
requested 3584 at offset 0
requested 13 at offset 772
requested 31 at offset 785
requested 512 at offset 0
requested 13 at offset 772
 44%requested 31 at offset 785
requested 20480 at offset 816

It seems that qemu(?) 'writes back' the new offset (785+31=816), and somehow that value can 'leak' into the qTD structure for the next request for transferring 20480 bytes (which should start at offset 0, not 816).
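To spot the bad request in a longer log without eyeballing it, something like the following could be used. It is only a sketch: it assumes the ad-hoc "requested N at offset M" line format shown above and the five-page qTD limit, and `suspicious_requests` is a hypothetical helper name.

```python
import re

PAGE_SIZE = 4096
NUM_PAGES = 5  # assumption: EHCI qTD buffer pointers 0..4
REQUEST = re.compile(r"requested (\d+) at offset (\d+)")


def suspicious_requests(log: str):
    """Return (length, offset) pairs that cannot fit in a single qTD."""
    bad = []
    for line in log.splitlines():
        match = REQUEST.search(line)  # search() tolerates the ' 44%' prefix
        if match is None:
            continue
        length, offset = int(match.group(1)), int(match.group(2))
        if offset + length > NUM_PAGES * PAGE_SIZE:
            bad.append((length, offset))
    return bad


sample = """\
requested 13 at offset 772
 44%requested 31 at offset 785
requested 20480 at offset 816
"""
print(suspicious_requests(sample))  # [(20480, 816)]
```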

I can confirm I can reproduce the problem with seabios b3fa8577 and so far not with 1.13.0, though tbh nothing jumps out at me looking at the commits in there...

ncfavier · 2022-05-03T07:40:33Z

Would you be able to bisect between b3fa8577 and 1.13.0?

raboof · 2022-05-03T07:51:03Z

> Would you be able to bisect between b3fa8577 and 1.13.0?

I could, though it's rather slow work, since the problem doesn't occur every time. Also it's not entirely clear it would help much: I noticed that when I add some debugging statements to seabios in just the right/wrong places, I can no longer reproduce the problem either... so even if we find the first commit that reproduces the problem, that might not be the commit that actually introduced the bug.
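To give a feel for why bisecting an intermittent failure is slow: a rough estimate of how many clean runs are needed before calling a commit 'good'. This is a sketch under the assumption that runs are independent and that a bad commit fails at some fixed rate; the 1-in-7 rate and the 1% threshold below are illustrative inputs, not measurements for every commit in the range.

```python
import math


def runs_needed(p_fail: float, miss_prob: float) -> int:
    """Clean runs needed so that a commit failing with probability p_fail
    per run is mislabelled 'good' with probability at most miss_prob."""
    # A bad commit survives n runs with probability (1 - p_fail) ** n.
    return math.ceil(math.log(miss_prob) / math.log(1.0 - p_fail))


print(runs_needed(1 / 7, 0.01))  # 30
```

So at that rate each bisect step needs on the order of 30 boots, and the number grows quickly if the failure is rarer.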

ncfavier · 2022-05-04T12:09:05Z

After a painful bisect I found the same commit as the person in the bug report: b3fa857 "kvm: add support for reading tsc frequency from kvmclock".

I can say quite confidently that commenting out this line (from that commit) or this one makes the problem go away.

Also, sometimes instead of the cpage error I get "non queue head request in async schedule". This all seems to point at a concurrency/timing problem.

Command used to test the issue:

~/qemu/build/qemu-system-x86_64 -bios ~/seabios/out/bios.bin -device usb-ehci -blockdev driver=file,read-only=on,filename=./openSUSE-Tumbleweed-GNOME-Live-x86_64-Snapshot20220502-Media.iso,node-name=iso -device usb-storage,drive=iso,bootindex=0 -m 1024 -enable-kvm

hit Enter and wait a few seconds after "Loading initrd...".

raboof · 2022-05-04T12:48:48Z

aha interesting... so on my machine this updates TimerKHz from 1194 (~1 MHz - seems kinda weird, not sure where that came from?) to 2400000 (~2.4 GHz, indeed my host machine cpu speed), and this value is then used when usb-ehci.c performs calls to timer_calc or ticks_to_ms.
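For illustration, here is how the same millisecond timeout turns into very different tick counts under the two TimerKHz values. This is a simplified model of the conversion (ticks = ms * kHz), not a copy of the SeaBIOS implementation; the function names mirror the ones mentioned above, but their bodies here are my assumption.

```python
def ticks_from_ms(ms: int, timer_khz: int) -> int:
    """Ticks that elapse in `ms` milliseconds at `timer_khz` kHz."""
    return ms * timer_khz


def ticks_to_ms(ticks: int, timer_khz: int) -> int:
    """Milliseconds covered by `ticks`, rounded up."""
    return -(-ticks // timer_khz)


for khz in (1194, 2_400_000):
    ticks = ticks_from_ms(10, khz)
    print(khz, ticks, ticks_to_ms(ticks, khz))
# 1194 11940 10
# 2400000 24000000 10
```

The wall-clock duration is the same in both cases; only the granularity and the counter being read differ.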

ncfavier · 2022-05-04T12:50:41Z

Also it updates TimerPort from 0x40 to 0, and commenting out that line also seems to fix the issue.

raboof · 2022-05-08T09:13:39Z

So before b3fa857 it would use the 3.5 MHz PM TIMER, and since then it uses the 2.4GHz TSC (Time-Stamp Counter) timer.

When I comment out kvmclock_init the problem indeed goes away.
When I replace it with timer_setup, which switches to TSC, I can reproduce the problem again.

I don't see any code that obviously would be impacted by a different clock: where the clock is used in EHCI it's usually in timeouts, and AFAICS we're not hitting any timeouts in this scenario in either the 'happy' or the 'problematic' case. So preliminarily it seems the different timer causes a timing difference that triggers the bug, but I don't see strong evidence yet that the timer itself is really 'wrong'.

The qemu controller uses DMA calls (put_dwords/get_dwords) to read and write what I think is called the 'overlay area' to/from main memory. It looks like that is where things go wrong: 'usually' the 'writeback' writes "0 bytes, offset X" to the 'overlay area' and immediately after that reads "20480 bytes, offset 0" from the same main memory address. However, in the problematic scenario, it writes "0 bytes, offset X" and then reads "20480 bytes, offset X" (same X).

So it looks like either seabios writes the 'new' 0 offset too early (before qemu writes the 'old' offset X) or too late (after qemu reads the new offset, which should be 0 but in the problematic case is X).

I changed qemu to retry fetching the offset when the values don't make sense (https://gitlab.com/raboof/qemu/-/commit/3692a11ff3e2b96ec596d2260e921369e8ba4729), and with that change I can still reproduce the problem:

Read qtd from e93c0, offset 1460, length 20480
Reread qtd from e93c0, offset 1460, length 20480
Reread qtd from e93c0, offset 1460, length 20480
Reread qtd from e93c0, offset 1460, length 20480

OK, not very scientific, but that kinda suggests qemu does the writeback after seabios writes the '0' offset. If that's true, then the question becomes: is seabios writing too early, or is qemu writing too late? I haven't figured out yet how EHCI is supposed to guard against such race conditions, but https://gitlab.com/raboof/qemu/-/blob/master/hw/usb/hcd-ehci.c#L1937 is making me slightly nervous ;)
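For reference, the 'offset' and 'length' in those log lines live in two different dwords of the qTD. A minimal decoding sketch, assuming the field positions from the EHCI specification (token dword: status in bits 7:0, cerr in 11:10, cpage in 14:12, total bytes in 30:16, dt in bit 31; buffer pointer 0: current offset in bits 11:0). The helper name and the example page base address are hypothetical.

```python
def decode_qtd(token: int, bufptr0: int) -> dict:
    """Split the qTD token and first buffer pointer into named fields."""
    return {
        "status": token & 0xFF,
        "active": bool(token & 0x80),  # bit 7 of status
        "cerr": (token >> 10) & 0x3,
        "cpage": (token >> 12) & 0x7,
        "total_bytes": (token >> 16) & 0x7FFF,
        "dt": (token >> 31) & 0x1,
        "offset": bufptr0 & 0xFFF,
        "page0_base": bufptr0 & 0xFFFFF000,
    }


# Example values: 20480 bytes, cerr=3, active; offset 1460 into a made-up page.
token = (20480 << 16) | (3 << 10) | 0x80
bufptr0 = 0x000E9000 | 1460
fields = decode_qtd(token, bufptr0)
print(fields["total_bytes"], fields["offset"], fields["active"])  # 20480 1460 True
```

Because the two values sit in separate dwords, they are written back by separate DMA writes, which is what leaves room for the guest to run in between.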

ncfavier · 2022-05-08T10:06:49Z

That's bone-chilling, but on a more practical note: should we just patch kvmclock_init out of seabios in nixpkgs? Is using the TSC timer supposed to make a difference? I haven't noticed a speed-up, certainly not by a factor of 1000.

raboof · 2022-05-08T10:21:38Z

> That's bone-chilling

😆

> on a more practical note: should we just patch kvmclock_init out of seabios in nixpkgs? Is using the TSC timer supposed to make a difference?

That seems a bit heavy-handed, but I guess we could use a build with CONFIG_TSC_TIMER=n.

> I haven't noticed a speed-up, certainly not by a factor of 1000.

I agree, I don't think it's supposed to change the speed at which things run, it's just another way to keep the time.

ncfavier · 2022-05-08T10:44:27Z

Right. Actually we'd need to change the seabios shipped with qemu, not the one in nixpkgs.

raboof · 2022-05-08T12:07:22Z

> Right. Actually we'd need to change the seabios shipped with qemu, not the one in nixpkgs.

I think you can override the one shipped with qemu by passing a bios = to the test - just have to use CONFIG_CSM=n (and CONFIG_TSC_TIMER=n) while building seabios

ncfavier · 2022-05-08T13:34:17Z

Opened #172059

This patch fixes a problem that caused the NixOS tests that tested booting from USB to fail periodically. Fixes NixOS#15690, fixes NixOS#104642, fixes NixOS#170803 Also submitted upstream at https://lists.nongnu.org/archive/html/qemu-devel/2022-05/msg01484.html

raboof · 2022-05-08T15:49:06Z

> Opened #172059

This is great, but I think I figured out the 'real' problem now! #172070

ncfavier · 2022-05-08T16:42:42Z

Neat!

raboof · 2022-05-08T17:03:36Z

> Neat!

Thanks a lot for finding that post by Lin Ma and bouncing ideas here, without that I'd surely have given up much earlier ;)

The 'active' bit passes control over a qTD between the guest and the controller: set to 1 by guest to enable execution by the controller, and the controller sets it to '0' to hand back control to the guest.

ehci_state_writeback write two dwords to main memory using DMA: the third dword of the qTD (containing dt, total bytes to transfer, cpage, cerr and status) and the fourth dword of the qTD (containing the offset).

This commit makes sure the fourth dword is written before the third, avoiding a race condition where a new offset written into the qTD by the guest after it observed the status going to go to '0' gets overwritten by a 'late' DMA writeback of the previous offset.

This race condition could lead to 'cpage out of range (5)' errors, and reproduced by:

./qemu-system-x86_64 -enable-kvm -bios $SEABIOS/bios.bin -m 4096 -device usb-ehci -blockdev driver=file,read-only=on,filename=/home/aengelen/Downloads/openSUSE-Tumbleweed-DVD-i586-Snapshot20220428-Media.iso,node-name=iso -device usb-storage,drive=iso,bootindex=0 -chardev pipe,id=shell,path=/tmp/pipe -device virtio-serial -device virtconsole,chardev=shell -device virtio-rng-pci -serial mon:stdio -nographic

(press a key, select 'Installation' (2), and accept the default values. On my machine the 'cpage out of range' is reproduced while loading the Linux Kernel about once per 7 attempts. With the fix in this commit it no longer fails)

This problem was previously reported as a seabios problem in https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3/ and as a nixos CI build failure in NixOS/nixpkgs#170803

Signed-off-by: Arnout Engelen <arnout@bzzt.net>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
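The effect of the write order can be replayed with a small toy model. This is a single-threaded sketch, not qemu or SeaBIOS code: it assumes the guest may run between any two DMA writes and reuses the qTD as soon as it sees the active bit cleared. The 785/31/20480 values are borrowed from the log earlier in this thread; the function names are made up.

```python
ACTIVE = 0x80


def guest_requeue(qtd: dict) -> None:
    """Guest side: once the qTD is inactive, reuse it for a fresh transfer."""
    if not qtd["status"] & ACTIVE:
        qtd["offset"] = 0
        qtd["length"] = 20480
        qtd["status"] = ACTIVE


def replay(writeback_order):
    """Apply the controller's two writeback steps in the given order."""
    qtd = {"status": ACTIVE, "length": 31, "offset": 785}
    done_offset = qtd["offset"] + qtd["length"]  # 816
    for field in writeback_order:
        if field == "status":
            qtd["status"] = 0  # hands the qTD back to the guest
            qtd["length"] = 0
        else:
            qtd["offset"] = done_offset
        guest_requeue(qtd)  # worst case: guest reacts after every write
    return qtd["length"], qtd["offset"]


print(replay(["status", "offset"]))  # (20480, 816): stale offset leaks in
print(replay(["offset", "status"]))  # (20480, 0): offset first, then status
```

With the status written first, the guest's fresh offset 0 is overwritten by the late writeback of 816, which matches the "requested 20480 at offset 816" line seen above. Writing the offset first means the guest only gets control after both writes have landed.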

Git-commit: f471e8b
References: bsc#1192115

The 'active' bit passes control over a qTD between the guest and the controller: set to 1 by guest to enable execution by the controller, and the controller sets it to '0' to hand back control to the guest.

ehci_state_writeback write two dwords to main memory using DMA: the third dword of the qTD (containing dt, total bytes to transfer, cpage, cerr and status) and the fourth dword of the qTD (containing the offset).

This commit makes sure the fourth dword is written before the third, avoiding a race condition where a new offset written into the qTD by the guest after it observed the status going to go to '0' gets overwritten by a 'late' DMA writeback of the previous offset.

This race condition could lead to 'cpage out of range (5)' errors, and reproduced by:

./qemu-system-x86_64 -enable-kvm -bios $SEABIOS/bios.bin -m 4096 -device usb-ehci -blockdev driver=file,read-only=on,filename=/home/aengelen/Downloads/openSUSE-Tumbleweed-DVD-i586-Snapshot20220428-Media.iso,node-name=iso -device usb-storage,drive=iso,bootindex=0 -chardev pipe,id=shell,path=/tmp/pipe -device virtio-serial -device virtconsole,chardev=shell -device virtio-rng-pci -serial mon:stdio -nographic

(press a key, select 'Installation' (2), and accept the default values. On my machine the 'cpage out of range' is reproduced while loading the Linux Kernel about once per 7 attempts. With the fix in this commit it no longer fails)

This problem was previously reported as a seabios problem in https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3/ and as a nixos CI build failure in NixOS/nixpkgs#170803

Signed-off-by: Arnout Engelen <arnout@bzzt.net>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Lin Ma <lma@suse.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>

mainline inclusion
commit f471e8b
category: bugfix

---------------------------------------------------------------

The 'active' bit passes control over a qTD between the guest and the controller: set to 1 by guest to enable execution by the controller, and the controller sets it to '0' to hand back control to the guest.

ehci_state_writeback write two dwords to main memory using DMA: the third dword of the qTD (containing dt, total bytes to transfer, cpage, cerr and status) and the fourth dword of the qTD (containing the offset).

This commit makes sure the fourth dword is written before the third, avoiding a race condition where a new offset written into the qTD by the guest after it observed the status going to go to '0' gets overwritten by a 'late' DMA writeback of the previous offset.

This race condition could lead to 'cpage out of range (5)' errors, and reproduced by:

./qemu-system-x86_64 -enable-kvm -bios $SEABIOS/bios.bin -m 4096 -device usb-ehci -blockdev driver=file,read-only=on,filename=/home/aengelen/Downloads/openSUSE-Tumbleweed-DVD-i586-Snapshot20220428-Media.iso,node-name=iso -device usb-storage,drive=iso,bootindex=0 -chardev pipe,id=shell,path=/tmp/pipe -device virtio-serial -device virtconsole,chardev=shell -device virtio-rng-pci -serial mon:stdio -nographic

(press a key, select 'Installation' (2), and accept the default values. On my machine the 'cpage out of range' is reproduced while loading the Linux Kernel about once per 7 attempts. With the fix in this commit it no longer fails)

This problem was previously reported as a seabios problem in https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3/ and as a nixos CI build failure in NixOS/nixpkgs#170803

Signed-off-by: Arnout Engelen <arnout@bzzt.net>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: tangbinzy <tangbin_yewu@cmss.chinamobile.com>

raboof added 1.severity: channel blocker Blocks a channel 0.kind: build failure A package fails to build labels Apr 28, 2022

raboof changed the title ~~nixos.tests.boot.biosUsb.x86_64-linux fails on hydra~~ nixos.tests.boot.biosUsb.x86_64-linux fails on hydra: "cpage out of range (5)" Apr 28, 2022

veprbl added the 6.topic: testing Tooling for automated testing of packages and modules label Apr 29, 2022

ncfavier mentioned this issue Apr 30, 2022

Random test failures in nixos.tests.boot.biosUsb #15690

Closed

zowoq removed the 1.severity: channel blocker Blocks a channel label May 1, 2022

ncfavier mentioned this issue May 8, 2022

nixos/tests/boot: fix intermittent biosUsb failure #172059

Closed

raboof mentioned this issue May 8, 2022

qemu: stabilize USB EHCI #172070

Merged

13 tasks

raboof closed this as completed in #172070 May 11, 2022

raboof mentioned this issue May 13, 2022

failure: nixos.tests.boot.biosUsb: keeps booting after USB transfer failure #172922

Closed
