GPU passthrough broken in 4.2 (GPU has fallen off the bus / xen_pt_check_bar_overlap ) #8783

Closed
Garbage4F opened this issue Dec 20, 2023 · 7 comments
Labels
affects-4.2 This issue affects Qubes OS 4.2. C: Xen diagnosed Technical diagnosis has been performed (see issue comments). hardware support P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. r4.2-host-stable r4.2-vm-bookworm-stable r4.2-vm-trixie-stable T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

Garbage4F commented Dec 20, 2023

Qubes OS release

4.2, kernel 6.1.62-1.qubes.fc37.x86_64 (kernel-latest shows the same behavior)

Brief summary

I have used GPU passthrough extensively in Qubes since 4.0. The same hardware worked in 4.0 and 4.1 but no longer works with Linux HVMs in 4.2.

Steps to reproduce

dom0

  • Fresh install 4.2 (with nomodeset=0 otherwise installer fails for me)
  • Update templates and dom0
  • Create HVM gpu_test
  • Set HVM memory to 64GB
  • Assign 8 VCPU
  • Uncheck "Include in memory balancing"
  • Select "(provided by qube)" for kernel

Appvm

  • Boot from cdrom, install debian (tested with 11 and 12)
  • Enable non-free repos and firmware, install nvidia driver 525.147.05 (known good with GPU)

Attach devices in dom0 to gpu_test appvm:

# gpu
qvm-device pci attach -p gpu_test dom0:0c_00.0 
# gpu audio
qvm-device pci attach -p gpu_test dom0:0c_00.1 
# usb controller
qvm-device pci attach -p gpu_test dom0:0b_00.0 
# ssd
qvm-device pci attach -p gpu_test dom0:41_00.0 

Patched /usr/libexec/xen/boot/qemu-stubdom-linux-rootfs as described in QubesOS/qubes-issues#4321 ("AppVM with GPU pass-through crashes when more than 3.5 GB (3584MB) of RAM is assigned to it"), with the following inserted at line 160:

vm_name=$(xenstore-read "/local/domain/$domid/name")
if [ $(echo "$vm_name" | grep -iEc '^gpu_' ) -eq 1 ]; then
      dm_args=$(echo "$dm_args" | sed -n '1h;2,$H;${g;s/\(-machine\nxenfv\)/\1,max-ram-below-4g=2G/g;p}')
fi
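
To see what the sed expression in that snippet does, here is a minimal sketch that runs the same substitution on a made-up argument list. The helper name add_max_ram_below_4g and the sample arguments are assumptions for illustration only, not the real stubdom command line. It also assumes dm_args holds one argument per line, as the original expression implies, and it relies on GNU sed.

```sh
# Hypothetical helper: append ",max-ram-below-4g=<size>" to the "xenfv"
# machine type in a newline-separated argument list read from stdin.
add_max_ram_below_4g() {
    size=$1
    # 1h;2,$H collects every input line in the hold space. On the last line
    # the whole list is copied back (g) so the pattern can match across the
    # newline between "-machine" and "xenfv".
    sed -n '1h;2,$H;${g;s/\(-machine\nxenfv\)/\1,max-ram-below-4g='"$size"'/g;p}'
}

# Made-up sample arguments, not the real stubdom command line.
printf '%s\n' -machine xenfv -m 4096 | add_max_ram_below_4g 2G
```

The second line of the output should carry the max-ram-below-4g suffix, while the other arguments pass through unchanged.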
  • Update dom0 bootloader, add:
    iommu=soft amd_iommu=on rd.qubes.hide_pci=0c:00.0,0c:00.1,0b:00.0,41:00.0
  • Reboot dom0
  • Start the HVM
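
The steps above write the same four devices in two notations: lspci style (0c:00.0) for rd.qubes.hide_pci and qvm-device style (dom0:0c_00.0) for the attach commands. A small sketch of keeping the two in sync from one list follows; the helper name bdf_to_qvm is hypothetical, and the loop only prints the commands rather than running them.

```sh
# Hypothetical helper: convert an lspci-style address (0c:00.0) into the
# form used by qvm-device in this report (dom0:0c_00.0).
bdf_to_qvm() {
    printf 'dom0:%s\n' "$(printf '%s' "$1" | tr ':' '_')"
}

# Devices taken from this report: GPU, GPU audio, USB controller, SSD.
devices="0c:00.0 0c:00.1 0b:00.0 41:00.0"

# Print (do not run) the attach command for each device.
for bdf in $devices; do
    echo "qvm-device pci attach -p gpu_test $(bdf_to_qvm "$bdf")"
done

# Print the matching kernel parameter as a comma-separated list.
echo "rd.qubes.hide_pci=$(printf '%s' "$devices" | tr ' ' ',')"
```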

Expected behavior

Xorg or greeter displays on the GPU assigned to the HVM

Actual behavior

No display on the HVM.

PCI devices showing in dom0:

$ lspci | egrep -i "(0c:00.0|0c:00.1|0b:00.0|41:00.0)"
0b:00.0 USB controller: Fresco Logic FL1100 USB 3.0 Host Controller (rev 10)
0c:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (rev a1)
0c:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
41:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO

PCI devices showing in HVM:

user@debian:~$ lspci | egrep "00:0[6789].0"
00:06.0 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
00:07.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
00:08.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (rev a1)
00:09.0 USB controller: Fresco Logic FL1100 USB 3.0 Host Controller (rev 10)

/etc/X11/xorg.conf is configured for the correct device:

Section "Device"
    Identifier     "Card0"
    Driver         "nvidia"
    BusID "0:08:0"
EndSection

Possibly relevant errors in /var/log/xen/console/guest-gpu_test-dm.log:

764:[2023-12-20 23:04:55] [00:09.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:08.0] Region: 1 (addr: 0xe0000000, len: 0x10000000)
765:[2023-12-20 23:04:55] [00:09.0] xen_pt_region_update: Warning: Region: 0 (addr: 0xe1900000, len: 0x10000) is overlapped.
766:[2023-12-20 23:04:55] [00:09.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:08.0] Region: 1 (addr: 0xe0000000, len: 0x10000000)
767:[2023-12-20 23:04:55] [00:09.0] xen_pt_region_update: Warning: Region: 4 (addr: 0xe1910000, len: 0x1000) is overlapped.
768:[2023-12-20 23:04:55] [00:09.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:08.0] Region: 1 (addr: 0xe0000000, len: 0x10000000)
769:[2023-12-20 23:04:55] [00:09.0] xen_pt_region_update: Warning: Region: 2 (addr: 0xe1911000, len: 0x1000) is overlapped.
770:[2023-12-20 23:04:55] [00:08.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:09.0] Region: 0 (addr: 0xe1900000, len: 0x10000)
771:[2023-12-20 23:04:55] [00:08.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:09.0] Region: 2 (addr: 0xe1911000, len: 0x1000)
772:[2023-12-20 23:04:55] [00:08.0] xen_pt_check_bar_overlap: Warning: Overlapped to device [00:09.0] Region: 4 (addr: 0xe1910000, len: 0x1000)
773:[2023-12-20 23:04:55] [00:08.0] xen_pt_region_update: Warning: Region: 1 (addr: 0xe0000000, len: 0x10000000) is overlapped.
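
The warnings are consistent with the addresses they print: the 256 MiB GPU region at 0xe0000000 spans up to 0xf0000000, which contains all three USB controller regions. A minimal sketch for checking this by hand follows; the function name regions_overlap is hypothetical, and the values are copied from the log lines above.

```sh
# Hypothetical helper: succeed (exit status 0) if the half-open ranges
# [addr_a, addr_a+len_a) and [addr_b, addr_b+len_b) intersect.
# Arguments: addr_a len_a addr_b len_b (hex values with a 0x prefix work).
regions_overlap() {
    a_start=$(($1)); a_end=$(($1 + $2))
    b_start=$(($3)); b_end=$(($3 + $4))
    [ "$a_start" -lt "$b_end" ] && [ "$b_start" -lt "$a_end" ]
}

# GPU [00:08.0] region 1 against USB controller [00:09.0] region 0.
if regions_overlap 0xe0000000 0x10000000 0xe1900000 0x10000; then
    echo "overlap"
else
    echo "no overlap"
fi
```

By the same check, the USB controller's own regions 0 and 4 (0xe1900000 and 0xe1910000) only touch end to start and do not overlap each other.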

and in the system log:

user@debian:~$ sudo journalctl -b | egrep -i "(nvrm|nvidia)" | head -50

Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:06.0/sound/card0/input7
Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:06.0/sound/card0/input8
Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:06.0/sound/card0/input9
Dec 20 22:47:12 debian kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:06.0/sound/card0/input10
Dec 20 22:47:12 debian kernel: nvidia: loading out-of-tree module taints kernel.
Dec 20 22:47:12 debian kernel: nvidia: module license 'NVIDIA' taints kernel.
Dec 20 22:47:12 debian kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Dec 20 22:47:12 debian kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 242
Dec 20 22:47:12 debian kernel: nvidia 0000:00:08.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 20 22:47:12 debian kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
Dec 20 22:47:12 debian audit[564]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian audit[564]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian kernel: audit: type=1400 audit(1703108832.820:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian kernel: audit: type=1400 audit(1703108832.820:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=564 comm="apparmor_parser"
Dec 20 22:47:12 debian kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
Dec 20 22:47:13 debian systemd-modules-load[271]: Inserted module 'nvidia_drm'
Dec 20 22:47:13 debian kernel: [drm] [nvidia-drm] [GPU ID 0x00000008] Loading driver
Dec 20 22:47:13 debian kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:08.0 on minor 1
Dec 20 22:47:13 debian systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Dec 20 22:47:13 debian nvidia-persistenced[642]: Started (642)
Dec 20 22:47:13 debian kernel: NVRM: Xid (PCI:0000:00:08): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Dec 20 22:47:13 debian kernel: NVRM: GPU 0000:00:08.0: GPU has fallen off the bus.
Dec 20 22:47:13 debian kernel: NVRM: A GPU crash dump has been created. If possible, please run
                               NVRM: nvidia-bug-report.sh as root to collect this data before
                               NVRM: the NVIDIA kernel module is unloaded.

(the last message repeats multiple times)
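
Since the message repeats, it can help to pull just the Xid codes out of the log and count them. The following is a sketch only; the helper name extract_xids is hypothetical, and the pattern is written against the exact line format shown above.

```sh
# Hypothetical helper: read log lines on stdin and print the numeric Xid
# code from each "NVRM: Xid (PCI:...): <code>, ..." line.
extract_xids() {
    sed -n 's/.*NVRM: Xid (PCI:[^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Sample line copied from the log above; prints 79.
sample="Dec 20 22:47:13 debian kernel: NVRM: Xid (PCI:0000:00:08): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus."
printf '%s\n' "$sample" | extract_xids

# On the affected guest, something like this would count occurrences:
#   sudo journalctl -b | extract_xids | sort | uniq -c
```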

The above holds regardless of whether max-ram-below-4g is 2G or 3.5G. I also tested the "known good" stubroot from Qubes 4.1; it doesn't work either and gives different errors.

I've also tried every solution listed in issue #4321 that doesn't involve the nouveau drivers or recompiling parts of Xen, which is unfortunately out of my ken.

Help!

@Garbage4F Garbage4F added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Dec 20, 2023

Garbage4F commented Dec 20, 2023

Replacing the nvidia driver with nouveau gives the following:

[    12.920] (II) NOUVEAU driver Date:   Sat Jan 23 12:24:42 2021 -0500
[    12.920] (II) NOUVEAU driver for NVIDIA chipset families :
[    12.920] 	RIVA TNT            (NV04)
[    12.920] 	RIVA TNT2           (NV05)
[    12.920] 	GeForce 256         (NV10)
[    12.920] 	GeForce 2           (NV11, NV15)
[    12.920] 	GeForce 4MX         (NV17, NV18)
[    12.920] 	GeForce 3           (NV20)
[    12.920] 	GeForce 4Ti         (NV25, NV28)
[    12.920] 	GeForce FX          (NV3x)
[    12.920] 	GeForce 6           (NV4x)
[    12.920] 	GeForce 7           (G7x)
[    12.920] 	GeForce 8           (G8x)
[    12.920] 	GeForce 9           (G9x)
[    12.920] 	GeForce GTX 2xx/3xx (GT2xx)
[    12.920] 	GeForce GTX 4xx/5xx (GFxxx)
[    12.920] 	GeForce GTX 6xx/7xx (GKxxx)
[    12.920] 	GeForce GTX 9xx     (GMxxx)
[    12.920] 	GeForce GTX 10xx    (GPxxx)
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.921] (EE) Unknown chipset: NV174
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.921] (EE) Unknown chipset: NV174
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.921] (EE) Unknown chipset: NV174
[    12.921] (II) [drm] nouveau interface version: 1.3.1
[    12.922] (EE) Unknown chipset: NV174
[    12.922] (II) [drm] nouveau interface version: 1.3.1
[    12.922] (EE) Unknown chipset: NV174
[    12.922] (EE) No devices detected.
[    12.922] (EE) 
Fatal server error:
[    12.922] (EE) no screens found(EE) 
[    12.922] (EE) 

Garbage4F (author) commented

I saw this post by @neowutran and changed the following in my BIOS (MSI X399 SLI PLUS):

Re-Size BAR Support: [Disabled] -> [Enabled]

which forced the following changes:

Windows 10 WHQL Support: [Disabled] -> [Enabled]
Above 4G memory/Crypto Currency mining: [Disabled] -> [Enabled]
Restore PCIE Registers: [Disabled] -> [Enabled]
CSM Support: [Enabled] -> [Disabled]

This rendered the system unbootable (it hung in the BIOS). I was able to reset the CMOS, fix the efibootmgr settings, convert from Legacy to UEFI boot, and then set all of the above BIOS settings again.

Same result with max-ram-below-4g set to 2G and to 3.5G: GPU has fallen off the bus :( Attached are the nvidia-bug-report logs for the system in both configurations.

nvidia-bug-report.log-3.5G-RESIZE-BAR-DOM0.gz
nvidia-bug-report.log-3.5G-NO-RESIZE-BAR-DOM0.gz

Any and all ideas greatly appreciated! ty

@andrewdavidwong andrewdavidwong added C: other hardware support needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. affects-4.2 This issue affects Qubes OS 4.2. labels Dec 21, 2023
Garbage4F (author) commented

As a workaround, I was able to get a copy of the affected HVM working with the open-gpu-kernel-modules driver, which replaces the nvidia kernel driver but keeps the userland components from NVIDIA-Linux-x86_64-545.29.06.run.

I needed to disable Resize BAR in dom0's BIOS; otherwise the HVM would not boot with more than 2GB of RAM, regardless of the max-ram-below-4g value.

neowutran commented

Regarding "I needed to disable resize BAR in dom0's bios otherwise the hvm would not boot with >2GB RAM regardless of the max-ram-below-4g value": you could try the patch I suggested here: QubesOS/qubes-vmm-xen#172

If it works as intended, there is no need to define max-ram-below-4g, and it will work correctly regardless of the Resize BAR setting.

matheusd commented

May be a dupe of #8631

marmarek added a commit to QubesOS/qubes-vmm-xen-stubdom-linux that referenced this issue Mar 14, 2024
* origin/pr/65:
  Fix integer overflow in qemu patch "hw-xen-xen_pt-Save-back-data-only-for-declared-regis" Fixes QubesOS/qubes-issues#8631 Fixes QubesOS/qubes-issues#8783 Fixes QubesOS/qubes-issues#9003
@andrewdavidwong andrewdavidwong added C: Xen diagnosed Technical diagnosis has been performed (see issue comments). and removed C: other needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Mar 14, 2024
qubesos-bot commented

Automated announcement from builder-github

The package vmm-xen-stubdom-linux has been pushed to the r4.2 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

