GPU passthrough broken in 4.2 (GPU has fallen off the bus / xen_pt_check_bar_overlap) #8783

Garbage4F · 2023-12-20T22:52:45Z

Qubes OS release

4.2, kernel 6.1.62-1.qubes.fc37.x86_64 (kernel-latest gives the same behavior)

Brief summary

I have used GPU passthrough extensively in Qubes since 4.0. The same hardware worked in 4.0 and 4.1, but passthrough no longer works with Linux HVMs in 4.2.

Steps to reproduce

dom0

Fresh install of 4.2 (with nomodeset=0, otherwise the installer fails for me)
Update templates and dom0
Create HVM gpu_test
Set HVM memory to 64GB
Assign 8 VCPU
Uncheck "Include in memory balancing"
Select "(provided by qube)" for kernel

AppVM

Boot from CD-ROM, install Debian (tested with 11 and 12)
Enable non-free repos and firmware, install NVIDIA driver 525.147.05 (known good with this GPU)

Attach devices in dom0 to the gpu_test AppVM:

# gpu
qvm-device pci attach -p gpu_test dom0:0c_00.0 
# gpu audio
qvm-device pci attach -p gpu_test dom0:0c_00.1 
# usb controller
qvm-device pci attach -p gpu_test dom0:0b_00.0 
# ssd
qvm-device pci attach -p gpu_test dom0:41_00.0
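
The identifiers passed to qvm-device above are the dom0 lspci addresses with the colon replaced by an underscore and a `dom0:` prefix. A minimal sketch of that conversion follows; `bdf_to_qvm` is a hypothetical helper name, not a Qubes tool, and the attach commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: convert an lspci-style address (bus:device.function)
# into the backend:ident form used by the qvm-device commands in this report.
bdf_to_qvm() {
    # 0c:00.0 -> dom0:0c_00.0
    printf 'dom0:%s\n' "$(printf '%s' "$1" | tr ':' '_')"
}

# Print (do not run) the attach command for each device given to gpu_test.
for bdf in 0c:00.0 0c:00.1 0b:00.0 41:00.0; do
    echo "qvm-device pci attach -p gpu_test $(bdf_to_qvm "$bdf")"
done
```

Keeping the list in lspci form means the same four addresses can be reused unchanged for the rd.qubes.hide_pci boot parameter.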

Patched /usr/libexec/xen/boot/qemu-stubdom-linux-rootfs as per issue #4321 in QubesOS/qubes-issues (AppVM with GPU pass-through crashes when more than 3.5 GB (3584MB) of RAM is assigned to it), with the following inserted at line 160:

vm_name=$(xenstore-read "/local/domain/$domid/name")
if [ $(echo "$vm_name" | grep -iEc '^gpu_' ) -eq 1 ]; then
      dm_args=$(echo "$dm_args" | sed -n '1h;2,$H;${g;s/\(-machine\nxenfv\)/\1,max-ram-below-4g=2G/g;p}')
fi
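
To see what the sed expression in this patch does, it can be run against a sample argument list outside the stubdomain. The sketch below assumes `dm_args` holds one argument per line (which is what the `-machine\nxenfv` pattern implies) and uses a made-up four-line list; the real list in the stubdomain is longer. It also assumes a sed that understands `\n` in the pattern, such as GNU sed.

```shell
#!/bin/sh
# Sample argument list, one argument per line (illustrative values only).
dm_args=$(printf '%s\n' '-machine' 'xenfv' '-m' '65536')

# Same sed program as in the patch: collect every line in the hold space,
# then on the last line rewrite "-machine<newline>xenfv" in a single pass.
patched=$(echo "$dm_args" | sed -n '1h;2,$H;${g;s/\(-machine\nxenfv\)/\1,max-ram-below-4g=2G/g;p}')

printf '%s\n' "$patched"
```

The hold-space trick is needed because sed normally works line by line, and the option name and its value sit on two separate lines.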

Update dom0 bootloader, add:
iommu=soft amd_iommu=on rd.qubes.hide_pci=0c:00.0,0c:00.1,0b:00.0,41:00.0
Reboot dom0
Start the HVM
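
After the reboot it may be worth confirming that the kernel actually received the parameter. A minimal sketch: `hidden_pci` is a hypothetical helper that extracts the rd.qubes.hide_pci list from a command-line string. On a live dom0 the string would come from /proc/cmdline; the example feeds it the parameters from this report.

```shell
#!/bin/sh
# Hypothetical helper: print the devices listed in rd.qubes.hide_pci,
# one per line, given a kernel command line as the first argument.
hidden_pci() {
    # $1 is deliberately unquoted so the command line splits into words.
    for word in $1; do
        case "$word" in
            rd.qubes.hide_pci=*)
                printf '%s\n' "${word#rd.qubes.hide_pci=}" | tr ',' '\n'
                ;;
        esac
    done
}

# Parameters from this report; on dom0 use: hidden_pci "$(cat /proc/cmdline)"
cmdline='iommu=soft amd_iommu=on rd.qubes.hide_pci=0c:00.0,0c:00.1,0b:00.0,41:00.0'
hidden_pci "$cmdline"
```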

Expected behavior

Xorg or greeter displays on the GPU assigned to the HVM

Actual behavior

No display on the HVM.

PCI devices showing in dom0:

$ lspci | egrep -i "(0c:00.0|0c:00.1|0b:00.0|41:00.0)"
0b:00.0 USB controller: Fresco Logic FL1100 USB 3.0 Host Controller (rev 10)
0c:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (rev a1)
0c:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
41:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO

PCI devices showing in HVM:

user@debian:~$ lspci | egrep "00:0[6789].0"
00:06.0 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
00:07.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
00:08.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (rev a1)
00:09.0 USB controller: Fresco Logic FL1100 USB 3.0 Host Controller (rev 10)

/etc/X11/xorg.conf is configured for the correct device:

Section "Device"
    Identifier     "Card0"
    Driver         "nvidia"
    BusID "0:08:0"
EndSection
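
For reference, xorg.conf(5) gives the BusID form as `PCI:bus:device:function`, and Xorg reads the fields as decimal numbers, whereas lspci prints them in hexadecimal. For 00:08.0 the digits happen to be the same either way, but a conversion avoids mistakes with higher bus numbers. A minimal sketch, with `lspci_to_busid` as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: convert an lspci address (hex bus:device.function)
# into an Xorg BusID string with decimal fields.
lspci_to_busid() {
    bus=${1%%:*}        # text before the first ':'
    rest=${1#*:}        # text after the first ':'
    dev=${rest%%.*}     # text before the '.'
    fn=${rest#*.}       # text after the '.'
    printf 'PCI:%d:%d:%d\n' "$((0x$bus))" "$((0x$dev))" "$((0x$fn))"
}

lspci_to_busid 00:08.0   # the GPU as seen inside the HVM
```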

Possibly relevant errors in /var/log/xen/console/guest-gpu_test-dm.log:

764:[2023-12-20 23:04:55] [00:09.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:08.0] Region: 1 (addr: 0xe0000000, len: 0x10000000)
765:[2023-12-20 23:04:55] [00:09.0] xen_pt_region_update: Warning: Region: 0 (addr: 0xe1900000, len: 0x10000) is overlapped.
766:[2023-12-20 23:04:55] [00:09.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:08.0] Region: 1 (addr: 0xe0000000, len: 0x10000000)
767:[2023-12-20 23:04:55] [00:09.0] xen_pt_region_update: Warning: Region: 4 (addr: 0xe1910000, len: 0x1000) is overlapped.
768:[2023-12-20 23:04:55] [00:09.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:08.0] Region: 1 (addr: 0xe0000000, len: 0x10000000)
769:[2023-12-20 23:04:55] [00:09.0] xen_pt_region_update: Warning: Region: 2 (addr: 0xe1911000, len: 0x1000) is overlapped.
770:[2023-12-20 23:04:55] [00:08.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:09.0] Region: 0 (addr: 0xe1900000, len: 0x10000)
771:[2023-12-20 23:04:55] [00:08.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:09.0] Region: 2 (addr: 0xe1911000, len: 0x1000)
772:[2023-12-20 23:04:55] [00:08.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:09.0] Region: 4 (addr: 0xe1910000, len: 0x1000)
773:[2023-12-20 23:04:55] [00:08.0] xen_pt_region_update: Warning: Region: 1 (addr: 0xe0000000, len: 0x10000000) is overlapped.
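
These warnings say that the 256 MiB GPU BAR at 0xe0000000 and the USB controller BARs around 0xe1900000 cover the same guest addresses. The arithmetic can be checked with a short sketch; `overlaps` is a hypothetical helper, and the addresses and lengths are copied from the log lines above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if [a, a+alen) and [b, b+blen) intersect.
# Arguments: a alen b blen (hex constants such as 0xe0000000 are accepted).
overlaps() {
    [ $(( $1 < $3 + $4 && $3 < $1 + $2 )) -eq 1 ]
}

gpu_addr=0xe0000000; gpu_len=0x10000000   # [00:08.0] Region 1
usb_addr=0xe1900000; usb_len=0x10000      # [00:09.0] Region 0

if overlaps "$gpu_addr" "$gpu_len" "$usb_addr" "$usb_len"; then
    echo "regions overlap"
else
    echo "regions are disjoint"
fi
```

The GPU region ends at 0xf0000000, so all three USB controller regions reported in the log fall inside it.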

and in the HVM's syslog:

user@debian:~$ sudo journalctl -b | egrep -i "(nvrm|nvidia)" | head -50

Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:06.0/sound/card0/input7
Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:06.0/sound/card0/input8
Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:06.0/sound/card0/input9
Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:06.0/sound/card0/input10
Dec 20 22:47:12 debian kernel: nvidia: loading out-of-tree module taints kernel.
Dec 20 22:47:12 debian kernel: nvidia: module license 'NVIDIA' taints kernel.
Dec 20 22:47:12 debian kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Dec 20 22:47:12 debian kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 242
Dec 20 22:47:12 debian kernel: nvidia 0000:00:08.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 20 22:47:12 debian kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
Dec 20 22:47:12 debian audit[564]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian audit[564]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian kernel: audit: type=1400 audit(1703108832.820:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian kernel: audit: type=1400 audit(1703108832.820:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
Dec 20 22:47:13 debian systemd-modules-load[271]: Inserted module 'nvidia_drm'
Dec 20 22:47:13 debian kernel: [drm] [nvidia-drm] [GPU ID 0x00000008] Loading driver
Dec 20 22:47:13 debian kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:08.0 on minor 1
Dec 20 22:47:13 debian systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Dec 20 22:47:13 debian nvidia-persistenced[642]: Started (642)
Dec 20 22:47:13 debian kernel: NVRM: Xid (PCI:0000:00:08): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Dec 20 22:47:13 debian kernel: NVRM: GPU 0000:00:08.0: GPU has fallen off the bus.
Dec 20 22:47:13 debian kernel: NVRM: A GPU crash dump has been created. If possible, please run
                               NVRM: nvidia-bug-report.sh as root to collect this data before
                               NVRM: the NVIDIA kernel module is unloaded.

(the last message repeats multiple times)
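
A quick way to tell whether a given boot hit the same failure is to count the Xid 79 lines. A minimal sketch, with `count_xid79` as a hypothetical helper that reads log text on stdin; on the guest the input would come from the journalctl command above, and the sample here is one line copied from that output.

```shell
#!/bin/sh
# Hypothetical helper: count NVRM Xid 79 ("fallen off the bus") reports on stdin.
count_xid79() {
    grep -c 'NVRM: Xid (PCI:[0-9a-f:]*): 79,'
}

sample="Dec 20 22:47:13 debian kernel: NVRM: Xid (PCI:0000:00:08): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus."
printf '%s\n' "$sample" | count_xid79
```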

The above is true regardless of whether max-ram-below-4g is 2G or 3.5G. I also tested the "known good" stubroot from Qubes 4.1; that doesn't work either and gives different errors.

I've also tried every solution listed in issue #4321 that doesn't involve the nouveau drivers or recompiling parts of Xen, which is unfortunately beyond my ken.

Help!


Garbage4F · 2023-12-20T23:56:23Z

Replacing the nvidia driver with nouveau gives the following:

[    12.920] (II) NOUVEAU driver Date:   Sat Jan 23 12:24:42 2021 -0500
[    12.920] (II) NOUVEAU driver for NVIDIA chipset families :
[    12.920] 	RIVA TNT            (NV04)
[    12.920] 	RIVA TNT2           (NV05)
[    12.920] 	GeForce 256         (NV10)
[    12.920] 	GeForce 2           (NV11, NV15)
[    12.920] 	GeForce 4MX         (NV17, NV18)
[    12.920] 	GeForce 3           (NV20)
[    12.920] 	GeForce 4Ti         (NV25, NV28)
[    12.920] 	GeForce FX          (NV3x)
[    12.920] 	GeForce 6           (NV4x)
[    12.920] 	GeForce 7           (G7x)
[    12.920] 	GeForce 8           (G8x)
[    12.920] 	GeForce 9           (G9x)
[    12.920] 	GeForce GTX 2xx/3xx (GT2xx)
[    12.920] 	GeForce GTX 4xx/5xx (GFxxx)
[    12.920] 	GeForce GTX 6xx/7xx (GKxxx)
[    12.920] 	GeForce GTX 9xx     (GMxxx)
[    12.920] 	GeForce GTX 10xx    (GPxxx)
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.921] (EE) Unknown chipset: NV174
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.921] (EE) Unknown chipset: NV174
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.921] (EE) Unknown chipset: NV174
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.922] (EE) Unknown chipset: NV174
[    12.922] (II) [drm] nouveau interface version: 1.3.1
[    12.922] (EE) Unknown chipset: NV174
[    12.922] (EE) No devices detected.
[    12.922] (EE) 
Fatal server error:
[    12.922] (EE) no screens found(EE) 
[    12.922] (EE)

Garbage4F · 2023-12-21T15:29:41Z

I saw this post by @neowutran and set the following in my BIOS (MSI X399 SLI PLUS):

Re-Size BAR Support: [Disabled] -> [Enabled]
which forced the following changes:

Windows 10 WHQL Support: [Disabled] -> [Enabled]
Above 4G memory/Crypto Currency mining: [Disabled] -> [Enabled]
Restore PCIE Registers: [Disabled] -> [Enabled]
CSM Support: [Enabled] -> [Disabled]

This rendered the system unbootable (it hung in the BIOS). I was able to reset the CMOS, fix the efibootmgr settings, convert from Legacy to UEFI boot, and then reset all the above BIOS settings.

Same result with max-ram-below-4g set to 2G and to 3.5G: GPU has fallen off the bus :( Attached are the nvidia bug report logs for the system in both configurations.

nvidia-bug-report.log-3.5G-RESIZE-BAR-DOM0.gz
nvidia-bug-report.log-3.5G-NO-RESIZE-BAR-DOM0.gz

Any and all ideas greatly appreciated! ty

Garbage4F · 2023-12-21T22:56:28Z

As a workaround, I was able to get a copy of the affected HVM to work correctly with the open-gpu-kernel-modules driver, which replaces the nvidia kernel driver but keeps the userland components from NVIDIA-Linux-x86_64-545.29.06.run.

I needed to disable resize BAR in dom0's BIOS; otherwise the HVM would not boot with >2GB RAM, regardless of the max-ram-below-4g value.

neowutran · 2023-12-22T07:44:42Z

For "I needed to disable resize BAR in dom0's bios otherwise the hvm would not boot with >2GB RAM regardless of the max-ram-below-4g value" you could try the patch I was suggesting here: QubesOS/qubes-vmm-xen#172

If it works as intended, there is no need to define max-ram-below-4g, and it will work correctly regardless of the resize BAR value.

matheusd · 2023-12-28T14:18:47Z

May be a dupe of #8631

* origin/pr/65:
  Fix integer overflow in qemu patch "hw-xen-xen_pt-Save-back-data-only-for-declared-regis"

Fixes QubesOS/qubes-issues#8631
Fixes QubesOS/qubes-issues#8783
Fixes QubesOS/qubes-issues#9003

qubesos-bot · 2024-03-27T03:35:44Z

Automated announcement from builder-github

The package vmm-xen-stubdom-linux has been pushed to the r4.2 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

qubesos-bot · 2024-03-27T03:36:03Z

Automated announcement from builder-github

The package vmm-xen-stubdom-linux has been pushed to the r4.2 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

Garbage4F added the "P: default" and "T: bug" labels Dec 20, 2023

Garbage4F mentioned this issue Dec 21, 2023

Fix guest memory corruption caused by hvmloader QubesOS/qubes-vmm-xen#172

Closed

andrewdavidwong added the "C: other", "hardware support", "needs diagnosis" and "affects-4.2" labels Dec 21, 2023

marmarek closed this as completed in QubesOS/qubes-vmm-xen-stubdom-linux@f498195 Mar 14, 2024

qubesos-bot mentioned this issue Mar 14, 2024

vmm-xen-stubdom-linux v4.2.11 (r4.2) QubesOS/updates-status#4453

Closed

qubesos-bot added the r4.2-host-cur-test label Mar 14, 2024

andrewdavidwong added the "C: Xen" and "diagnosed" labels and removed the "C: other" and "needs diagnosis" labels Mar 14, 2024

qubesos-bot added the r4.2-vm-bookworm-stable label Mar 27, 2024

qubesos-bot added r4.2-vm-trixie-stable r4.2-host-stable and removed r4.2-host-cur-test labels Mar 27, 2024

qubesos-bot mentioned this issue Apr 18, 2024

vmm-xen-stubdom-linux v4.2.11 (r4.3) QubesOS/updates-status#4527

Closed
