nixos.tests.boot.biosUsb.x86_64-linux fails on hydra: "cpage out of range (5)" #170803

Closed
raboof opened this issue Apr 28, 2022 · 20 comments · Fixed by #172070
Labels
0.kind: build failure (A package fails to build)
6.topic: testing (Tooling for automated testing of packages and modules)

Comments

@raboof
Member

raboof commented Apr 28, 2022

https://hydra.nixos.org/build/174964149

machine # Booting from Hard Disk...
machine # 
machine # ISOLINUX 6.04 EHDD Copyright (C) 1994-2015 H. Peter Anvin et al
machine # [boot menu "NixOS", entries:]
machine #   NixOS 22.05pre-git Installer
machine #   NixOS 22.05pre-git Installer (nomodeset)
machine #   NixOS 22.05pre-git Installer (copytoram)
machine #   NixOS 22.05pre-git Installer (debug)
machine #   NixOS 22.05pre-git Installer (serial console=ttyS0,115200n8)
machine #   Memtest86+
machine # Press [Tab] to edit options
machine # Automatic boot in 10 seconds... (counting down to 1 second)
machine # Loading /boot/bzImage... cpage out of range (5)
machine # processing error - resetting ehci HC
machine # CHS: Error 0c00 reading sector 3058 (1/16/35)
machine # EDD: Error 0c00 reading sector 3058
machine # CHS: Error 0c00 reading sector 15543 (7/22/46)
machine # EDD: Error 0c00 reading sector 15543
machine # ok
machine # Loading /boot/initrd...CHS: Error 0c00 reading sector 70844 (35/4/33)
machine # EDD: Error 0c00 reading sector 70844
machine # CHS: Error 0c00 reading sector 96311 (47/24/48)
machine # EDD: Error 0c00 reading sector 96311
machine # ok
machine: connected to guest root shell
machine: (connecting took 1017.66 seconds)
(finished: waiting for the VM to finish booting, in 1017.66 seconds)
cleanup
(finished: cleanup, in 0.00 seconds)
Traceback (most recent call last):
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/bin/.nixos-test-driver-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/__init__.py", line 114, in main
    driver.run_tests()
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 146, in run_tests
    self.test_script()
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 142, in test_script
    exec(self.tests, symbols, None)
  File "<string>", line 9, in <module>
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 459, in wait_for_unit
    retry(check_active)
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 126, in retry
    if fn(False):
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 436, in check_active
    info = self.get_unit_info(unit, user)
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 462, in get_unit_info
    status, lines = self.systemctl('--no-pager show "{}"'.format(unit), user)
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 493, in systemctl
    return self.execute("systemctl {}".format(q))
  File "/nix/store/7m0m07v3yhj81l2p1sbj8krlzximmd21-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 541, in execute
    self.shell.send(out_command.encode())
BrokenPipeError: [Errno 32] Broken pipe
machine # cProbing EDD (edd=off to disable)... oc

Specifically:

Loading /boot/bzImage... cpage out of range (5)

For comparison, a successful run loads bzImage without errors:

machine # ISOLINUX 6.04 EHDD Copyright (C) 1994-2015 H. Peter Anvin et al
machine # [boot menu "NixOS", entries:]
machine #   NixOS 22.05pre-git Installer
machine #   NixOS 22.05pre-git Installer (nomodeset)
machine #   NixOS 22.05pre-git Installer (copytoram)
machine #   NixOS 22.05pre-git Installer (debug)
machine #   NixOS 22.05pre-git Installer (serial console=ttyS0,115200n8)
machine #   Memtest86+
machine # Press [Tab] to edit options
machine # Automatic boot in 10 seconds... (counting down to 1 second)
machine # Loading /boot/bzImage... ok
machine # Loading /boot/initrd...ok
@raboof added the labels 1.severity: channel blocker (Blocks a channel) and 0.kind: build failure (A package fails to build) Apr 28, 2022
@raboof changed the title from 'nixos.tests.boot.biosUsb.x86_64-linux fails on hydra' to 'nixos.tests.boot.biosUsb.x86_64-linux fails on hydra: "cpage out of range (5)"' Apr 28, 2022
@veprbl added the label 6.topic: testing (Tooling for automated testing of packages and modules) Apr 29, 2022
@ncfavier
Member

Might this be related to #15690?

@raboof
Member Author

raboof commented Apr 29, 2022

Possibly, the Error 0c00 reading sector certainly overlaps, and it seems cpage out of range (5) is also USB-related. I guess the question now is which is the cause and which is the effect?

@vcunat
Member

vcunat commented Apr 29, 2022

Even in the past weeks the job has been failing occasionally. I usually don't even look into the log anymore unless it failed twice. Locally I don't get an error on the same commit. Once Hydra's queue runner gets fixed, we'll see if it succeeds.

@raboof
Member Author

raboof commented Apr 30, 2022

> Even in the past weeks the job has been failing occasionally. I usually don't even look into the log anymore unless it failed twice. Locally I don't get an error on the same commit. Once Hydra's queue runner gets fixed, we'll see if it succeeds.

Yes, I completely agree. Let's keep this ticket to track the instability, though.

> Possibly, the Error 0c00 reading sector certainly overlaps, and it seems cpage out of range (5) is also USB-related. I guess the question now is which is the cause and which is the effect?

Ok, I now think the cpage out of range (5) is the cause and the Error 0c00 reading sector is the effect: AFAICT what happens is:

So it seems like there is something wonky in the USB communication between SeaBIOS and qemu. I had a bit of a look but the implementations look reasonable on both sides at first glance. I wonder if we could run this test with SeaBIOS debugging enabled, but using a different SeaBIOS was not as simple as adding bios = "${pkgs.seabios}/Csm16.bin"; to the biosUsb test ;). Anyone more well-versed in qemu?

@ncfavier
Member

ncfavier commented Apr 30, 2022

Thanks for looking into this! I found this bug report https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3

@raboof
Member Author

raboof commented May 2, 2022

> Thanks for looking into this! I found this bug report https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3

Great find! Indeed with that ISO it seems easier to reproduce (while loading the kernel - if you pass '-nographic' you can hit 'enter' and select '2' to load from the console) - perhaps it's just bigger? I also found it sometimes just hangs instead of hitting the cpage error, which might be what causes #15690.

Adding some additional diagnostics shows:

reading 20480 bytes from offset 1652 at page 0

Indeed that's an invalid USB packet: you can fit 20480 bytes in there, but only if you start at offset 0, since there are only 5 pages (cpage 0 to 4) and each page is 4096 bytes.
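
To make that arithmetic concrete, here is a minimal Python sketch of how a transfer gets split over the qTD buffer pages. It is a simplified model, not the qemu code: `walk_pages`, `PAGE_SIZE` and `NUM_PAGES` are names made up for this illustration, and it assumes the five 4096-byte buffer pages (cpage 0 to 4) that an EHCI qTD provides.

```python
PAGE_SIZE = 4096  # bytes per qTD buffer page
NUM_PAGES = 5     # assumption: buffer pages are indexed cpage 0..4


def walk_pages(total_bytes, offset, cpage=0):
    """Split a transfer into (cpage, length) chunks.

    Raises ValueError when the transfer would need a page past the last one.
    """
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        if cpage >= NUM_PAGES:
            raise ValueError(f"cpage out of range ({cpage})")
        # Only the first page is entered at a non-zero offset.
        length = min(remaining, PAGE_SIZE - offset)
        chunks.append((cpage, length))
        remaining -= length
        if remaining > 0:
            cpage += 1
            offset = 0
    return chunks


if __name__ == "__main__":
    print(walk_pages(20480, 0))
    try:
        walk_pages(20480, 1652)
    except ValueError as err:
        print("error:", err)
```

With offset 0 the 20480 bytes fill pages 0 to 4 exactly. With offset 1652 the first page only has room for 2444 bytes, so the last 1652 bytes would need a sixth page, which is where the `(5)` in the error message comes from.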

Adding a bunch of logging to qemu, seeing the following pattern:

 44%requested 31 at offset 785
requested 20480 at offset 0
requested 20480 at offset 0
requested 20480 at offset 0
requested 3584 at offset 0
requested 13 at offset 772
requested 31 at offset 785
requested 512 at offset 0
requested 13 at offset 772
 44%requested 31 at offset 785
requested 20480 at offset 0
requested 20480 at offset 0
requested 20480 at offset 0
requested 3584 at offset 0
requested 13 at offset 772
requested 31 at offset 785
requested 512 at offset 0
requested 13 at offset 772
 44%requested 31 at offset 785
requested 20480 at offset 816

It seems that qemu(?) 'writes back' the new offset (785+31=816), and somehow that value can 'leak' into the qTD structure for the next request for transferring 20480 bytes (which should start at offset 0, not 816).
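
For anyone who wants to spot this pattern in a longer log, a small Python sketch along these lines may help. It assumes log lines of the form `requested N at offset M` as shown above (that format comes from ad-hoc debug statements, not from stock qemu), and `find_oversized` is a hypothetical helper name. It flags requests that cannot fit in five 4096-byte pages and checks whether the bad offset equals the previous request's offset plus length.

```python
import re

QTD_CAPACITY = 5 * 4096  # assumption: five 4096-byte buffer pages per qTD
REQUEST = re.compile(r"requested (\d+) at offset (\d+)")


def find_oversized(log_text):
    """Return (line_number, length, offset, leaked_from_previous) per bad request."""
    bad = []
    prev = None  # (length, offset) of the previous request, if any
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = REQUEST.search(line)  # search() tolerates prefixes like ' 44%'
        if not match:
            continue
        length, offset = int(match.group(1)), int(match.group(2))
        if offset + length > QTD_CAPACITY:
            leaked = prev is not None and offset == prev[0] + prev[1]
            bad.append((lineno, length, offset, leaked))
        prev = (length, offset)
    return bad


SAMPLE = """\
 44%requested 31 at offset 785
requested 20480 at offset 0
requested 3584 at offset 0
requested 13 at offset 772
 44%requested 31 at offset 785
requested 20480 at offset 816
"""

if __name__ == "__main__":
    for entry in find_oversized(SAMPLE):
        print(entry)
```

On the excerpt above this reports only the last line, and the `leaked_from_previous` flag is true because 816 is exactly 785 + 31.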

I can confirm I can reproduce the problem with seabios b3fa8577 and so far not with 1.13.0, though tbh nothing jumps out at me looking at the commits in there...

@ncfavier
Member

ncfavier commented May 3, 2022

Would you be able to bisect between b3fa8577 and 1.13.0?

@raboof
Member Author

raboof commented May 3, 2022

> Would you be able to bisect between b3fa8577 and 1.13.0?

I could, though it's rather slow work, since the problem doesn't occur every time. Also it's not entirely clear it would help much: I noticed that when I add some debugging statements to seabios in just the right/wrong places, I can no longer reproduce the problem either... so even if we find the first commit that reproduces the problem, that might not be the commit that actually introduced the bug.
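
A rough feel for why bisecting an intermittent failure is slow: if a bad commit only fails with some probability per run, you need several clean runs before a 'good' verdict can be trusted. The Python sketch below does that arithmetic under the assumption that runs are independent. The 1-in-7 failure rate is only an assumed figure for illustration, and `runs_needed` is a hypothetical helper, not part of any bisect tooling.

```python
import math


def runs_needed(fail_rate, max_miss_prob):
    """Clean runs required so that a bad commit slips through as 'good'
    with probability at most max_miss_prob, assuming independent runs."""
    if not 0 < fail_rate < 1:
        raise ValueError("fail_rate must be strictly between 0 and 1")
    if not 0 < max_miss_prob < 1:
        raise ValueError("max_miss_prob must be strictly between 0 and 1")
    # A bad commit passes n runs in a row with probability (1 - fail_rate) ** n.
    return math.ceil(math.log(max_miss_prob) / math.log(1 - fail_rate))


if __name__ == "__main__":
    n = runs_needed(1 / 7, 0.05)
    print(f"runs per bisect step for 95% confidence: {n}")
```

Under those assumptions every bisect step needs about 20 passing boots before a commit can be marked good, and that still says nothing about timing-sensitive bugs that hide when the code is perturbed.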

@ncfavier
Member

ncfavier commented May 4, 2022

After a painful bisect I found the same commit as the person in the bug report: b3fa857 "kvm: add support for reading tsc frequency from kvmclock".

I can say quite confidently that commenting out this line (from that commit) or this one makes the problem go away.

Also, sometimes instead of the cpage error I get "non queue head request in async schedule". This all seems to point at a concurrency/timing problem.

Command used to test the issue:

~/qemu/build/qemu-system-x86_64 -bios ~/seabios/out/bios.bin -device usb-ehci -blockdev driver=file,read-only=on,filename=./openSUSE-Tumbleweed-GNOME-Live-x86_64-Snapshot20220502-Media.iso,node-name=iso -device usb-storage,drive=iso,bootindex=0 -m 1024 -enable-kvm

hit Enter and wait a few seconds after "Loading initrd...".

@raboof
Member Author

raboof commented May 4, 2022

aha interesting... so on my machine this updates TimerKHz from 1194 (~1 MHz - seems kinda weird, not sure where that came from?) to 2400000 (~2.4 GHz, indeed my host machine cpu speed), and this value is then used when usb-ehci.c performs calls to timer_calc or ticks_to_ms.
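
To put the two values side by side, here is a tiny Python sketch of the unit conversion only: a timer running at f kHz ticks f times per millisecond. It does not claim to mirror how SeaBIOS computes its deadlines; `ticks_for_ms` is a hypothetical helper and the two constants are simply the values reported above.

```python
def ticks_for_ms(timer_khz, ms):
    """Number of timer ticks that make up `ms` milliseconds at `timer_khz` kHz."""
    return timer_khz * ms


OLD_KHZ = 1194        # TimerKHz reported before the commit
NEW_KHZ = 2_400_000   # TimerKHz reported after the commit

if __name__ == "__main__":
    for khz in (OLD_KHZ, NEW_KHZ):
        print(f"{khz} kHz: 1 ms = {ticks_for_ms(khz, 1)} ticks")
    print(f"ratio: about {NEW_KHZ / OLD_KHZ:.0f}x")
```

So the same timeout in milliseconds corresponds to roughly 2000 times more ticks after the change, while the wall-clock duration it represents stays the same.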

@ncfavier
Member

ncfavier commented May 4, 2022

Also it updates TimerPort from 0x40 to 0, and commenting out that line also seems to fix the issue.

@raboof
Member Author

raboof commented May 8, 2022

So before b3fa857 it would use the 3.5 MHz PM TIMER, and since then it uses the 2.4GHz TSC (Time-Stamp Counter) timer.

When I comment out kvmclock_init the problem indeed goes away.
When I replace it with timer_setup, which switches to TSC, I can reproduce the problem again.

I don't see any code that obviously would be impacted by a different clock: where the clock is used in EHCI it's usually in timeouts, and AFAICS we're not hitting any timeouts in this scenario in either the 'happy' or the 'problematic' case. So preliminarily it seems the different timer causes a timing difference that triggers the bug, but I don't see strong evidence yet that the timer itself is really 'wrong'.

The qemu controller uses DMA calls (put_dwords/get_dwords) to read and write what I think is called the 'overlay area' to/from main memory. It looks like that is where things go wrong: 'usually' the 'writeback' writes "0 bytes, offset X" to the 'overlay area' and immediately after that reads "20480 bytes, offset 0" from the same main memory address. However, in the problematic scenario, it writes "0 bytes, offset X" and then reads "20480 bytes, offset X" (same X).

So it looks like either seabios writes the 'new' 0 offset too early (before qemu writes the 'old' offset X) or too late (after qemu reads the new offset, which should be 0 but in the problematic case is X).

I changed qemu to retry fetching the offset when the values don't make sense (https://gitlab.com/raboof/qemu/-/commit/3692a11ff3e2b96ec596d2260e921369e8ba4729), and with that change I can still reproduce the problem:

Read qtd from e93c0, offset 1460, length 20480
Reread qtd from e93c0, offset 1460, length 20480
Reread qtd from e93c0, offset 1460, length 20480
Reread qtd from e93c0, offset 1460, length 20480

OK, not very scientific, but that kinda suggests qemu does the writeback after seabios writes the '0' offset. If that's true, then the question becomes: is seabios writing too early, or is qemu writing too late? I haven't figured out yet how EHCI is supposed to guard against such race conditions, but https://gitlab.com/raboof/qemu/-/blob/master/hw/usb/hcd-ehci.c#L1937 is making me slightly nervous ;)

@ncfavier
Member

ncfavier commented May 8, 2022

That's bone-chilling, but on a more practical note: should we just patch kvmclock_init out of seabios in nixpkgs? Is using the TSC timer supposed to make a difference? I haven't noticed a speed-up, certainly not on a factor of 1000.

@raboof
Member Author

raboof commented May 8, 2022

> That's bone-chilling

😆

> on a more practical note: should we just patch kvmclock_init out of seabios in nixpkgs? Is using the TSC timer supposed to make a difference?

That seems a bit heavy-handed, but I guess we could use a build with CONFIG_TSC_TIMER=n.

> I haven't noticed a speed-up, certainly not on a factor of 1000.

I agree, I don't think it's supposed to change the speed at which things run; it's just another way to keep time.

@ncfavier
Member

ncfavier commented May 8, 2022

Right. Actually we'd need to change the seabios shipped with qemu, not the one in nixpkgs.

@raboof
Member Author

raboof commented May 8, 2022

> Right. Actually we'd need to change the seabios shipped with qemu, not the one in nixpkgs.

I think you can override the one shipped with qemu by passing a bios = to the test - just have to use CONFIG_CSM=n (and CONFIG_TSC_TIMER=n) while building seabios

@ncfavier
Member

ncfavier commented May 8, 2022

Opened #172059

raboof added a commit to raboof/nixpkgs that referenced this issue May 8, 2022
This patch fixes a problem that caused the NixOS tests that tested booting
from USB to fail periodically.

Fixes NixOS#15690, fixes NixOS#104642, fixes NixOS#170803

Also submitted upstream at https://lists.nongnu.org/archive/html/qemu-devel/2022-05/msg01484.html
@raboof raboof mentioned this issue May 8, 2022
@raboof
Member Author

raboof commented May 8, 2022

> Opened #172059

This is great, but I think I figured out the 'real' problem now! #172070

@ncfavier
Member

ncfavier commented May 8, 2022

Neat!

@raboof
Member Author

raboof commented May 8, 2022

> Neat!

Thanks a lot for finding that post by Lin Ma and bouncing ideas here, without that I'd surely have given up much earlier ;)

kraxel pushed a commit to kraxel/qemu that referenced this issue Jun 9, 2022
The 'active' bit passes control over a qTD between the guest and the
controller: set to 1 by guest to enable execution by the controller,
and the controller sets it to '0' to hand back control to the guest.

ehci_state_writeback writes two dwords to main memory using DMA:
the third dword of the qTD (containing dt, total bytes to transfer,
cpage, cerr and status) and the fourth dword of the qTD (containing
the offset).

This commit makes sure the fourth dword is written before the third,
avoiding a race condition where a new offset written into the qTD
by the guest after it observed the status going to '0' gets
overwritten by a 'late' DMA writeback of the previous offset.

This race condition could lead to 'cpage out of range (5)' errors,
and can be reproduced by:

./qemu-system-x86_64 -enable-kvm -bios $SEABIOS/bios.bin -m 4096 -device usb-ehci -blockdev driver=file,read-only=on,filename=/home/aengelen/Downloads/openSUSE-Tumbleweed-DVD-i586-Snapshot20220428-Media.iso,node-name=iso -device usb-storage,drive=iso,bootindex=0 -chardev pipe,id=shell,path=/tmp/pipe -device virtio-serial -device virtconsole,chardev=shell -device virtio-rng-pci -serial mon:stdio -nographic

(press a key, select 'Installation' (2), and accept the default
values. On my machine the 'cpage out of range' is reproduced while
loading the Linux Kernel about once per 7 attempts. With the fix in
this commit it no longer fails)

This problem was previously reported as a seabios problem in
https://mail.coreboot.org/hyperkitty/list/seabios@seabios.org/thread/OUTHT5ISSQJGXPNTUPY3O5E5EPZJCHM3/
and as a nixos CI build failure in
NixOS/nixpkgs#170803

Signed-off-by: Arnout Engelen <arnout@bzzt.net>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
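
The ordering argument in the commit message above can be played through with a small deterministic model. The Python sketch below is only an illustration under simplifying assumptions: the qTD is reduced to an `active` flag and an `offset`, the guest is assumed to poll right after every controller write (the worst case), and `simulate` and the field names are made up for this example rather than taken from qemu or SeaBIOS.

```python
def simulate(writeback_order):
    """Play one controller writeback against a guest that queues the next
    transfer as soon as it sees the qTD is no longer active.

    Returns the offset the controller would read for the next transfer.
    """
    qtd = {"active": 1, "offset": 785}               # 31-byte transfer in flight
    writeback = {"active": 0, "offset": 785 + 31}    # values the controller writes back
    guest_queued = False

    for field in writeback_order:
        qtd[field] = writeback[field]                # one DMA write by the controller
        # Worst case: the guest looks at the qTD right after this write.
        if not guest_queued and qtd["active"] == 0:
            qtd["offset"] = 0                        # next transfer starts at offset 0
            qtd["active"] = 1                        # hand the qTD back to the controller
            guest_queued = True

    return qtd["offset"]


if __name__ == "__main__":
    print("status first, then offset:", simulate(["active", "offset"]))
    print("offset first, then status:", simulate(["offset", "active"]))
```

When the status is written first, the guest's fresh offset 0 is overwritten by the late writeback of 816, which is the stale value seen in the failing requests. When the offset is written first, the guest cannot act until the status write has landed, so its offset 0 survives.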
kraxel pushed a commit to kraxel/qemu that referenced this issue Jun 10, 2022
bonzini pushed a commit to qemu/qemu that referenced this issue Jun 10, 2022
kraxel pushed a commit to kraxel/qemu that referenced this issue Jun 13, 2022
kraxel pushed a commit to kraxel/qemu that referenced this issue Jun 14, 2022
dfaggioli pushed a commit to openSUSE/qemu that referenced this issue Sep 16, 2022
Git-commit: f471e8b
References: bsc#1192115

Signed-off-by: Arnout Engelen <arnout@bzzt.net>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Lin Ma <lma@suse.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
dfaggioli pushed a commit to openSUSE/qemu that referenced this issue May 12, 2023
Git-commit: f471e8b
References: bsc#1192115

dfaggioli pushed a commit to openSUSE/qemu that referenced this issue Jul 27, 2023
Git-commit: f471e8b
References: bsc#1192115

dfaggioli pushed a commit to openSUSE/qemu that referenced this issue Jul 28, 2023
Git-commit: f471e8b
References: bsc#1192115

dfaggioli pushed a commit to openSUSE/qemu that referenced this issue Jul 28, 2023
Git-commit: f471e8b
References: bsc#1192115

dfaggioli pushed a commit to openSUSE/qemu that referenced this issue Jul 28, 2023
Git-commit: f471e8b
References: bsc#1192115

dfaggioli pushed a commit to openSUSE/qemu that referenced this issue Jul 31, 2023
Git-commit: f471e8b
References: bsc#1192115

dfaggioli pushed a commit to openSUSE/qemu that referenced this issue Aug 7, 2023
Git-commit: f471e8b
References: bsc#1192115

it-is-a-robot pushed a commit to openeuler-mirror/qemu that referenced this issue Nov 12, 2023
mainline inclusion
commit f471e8b
category: bugfix

---------------------------------------------------------------

Signed-off-by: Arnout Engelen <arnout@bzzt.net>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

Signed-off-by: tangbinzy <tangbin_yewu@cmss.chinamobile.com>