
Windows 10 crashes while installing nvidia drivers #9003

Closed
fore567 opened this issue Mar 2, 2024 · 32 comments
Labels
  • affects-4.2 This issue affects Qubes OS 4.2.
  • C: windows-vm
  • diagnosed Technical diagnosis has been performed (see issue comments).
  • P: default Priority: default. Default priority for new issues, to be replaced given sufficient information.
  • r4.2-host-stable
  • r4.2-vm-bookworm-stable
  • r4.2-vm-trixie-stable
  • T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@fore567

fore567 commented Mar 2, 2024


Qubes OS release

R4.2

Brief summary

Qubes 4.1.2 does not crash while installing the nvidia drivers, but on Qubes 4.2 Windows crashes with SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (Nvlddmkm.sys). Nvidia RTX 2070.

Steps to reproduce

Pass the GPU through to a Windows 10 qube and install the nvidia drivers.

Expected behavior

Windows not crashing

Actual behavior

Windows 10 crashes while installing the nvidia drivers with SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (Nvlddmkm.sys).

@fore567 fore567 added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Mar 2, 2024
@DemiMarie DemiMarie added C: windows-vm needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Mar 2, 2024
@neowutran

I was trying to debug this issue for quite some time and ended up writing an article on my website: https://web.neowutran.ovh/qubes/articles/nvidia.pdf

@DemiMarie

@neowutran Nice! Looks like the problem is that writes to the command field are denied. There are two possible solutions:

  1. Allow these writes.
  2. Emulate them such that the guest reads the value it last wrote, but the writes are not sent to the hardware.

@marmarek which of those solutions do you think is better?

@marmarek
Member

marmarek commented Mar 2, 2024

@neowutran awesome writeup :)

@marmarek which of those solutions do you think is better?

Neither. See below.

First, let's take a closer look at what the write actually is.

os_pci_write_dword: write value 1049603 to offset 4
os_pci_write_dword: write value 1049607 to offset 4

BTW @neowutran it would be much more convenient to log those values in hex (%x instead of %u).

EDIT: skip to the next chapter, the analysis here is correct, but doesn't lead to the real issue

I don't know what the value before was (I'm sure the driver read it just before writing, so logging the reads might be useful too), but I guess it was just flipping bit 2 off and back on. That bit is bus mastering, aka DMA. If my hypothesis about turning it off and then back on is correct, then the end state is the same as the initial state, so it shouldn't break anything (regardless of whether it was passed through to the hardware or not). But since avoiding the write fixed the situation, that's probably not the case.
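
For reference, the two decimal values decode to hex as follows (the decode below is just an illustration, it is not part of the logged output); they indeed differ only in bit 2:

#include <stdio.h>

/* Quick decode of the two logged decimal values (illustration only). */
int main(void)
{
    printf("%x\n", 1049603); /* 0x100403: command = 0x0403 (I/O, memory, INTx-disable), bus master off */
    printf("%x\n", 1049607); /* 0x100407: same, plus bit 2 (PCI_COMMAND_MASTER) set */
    return 0;                /* the upper 16 bits, 0x0010, overlap the status register */
}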

All the config space writes (and reads) should end up in QEMU, which decides which one goes to the hardware, which one is emulated (so VM sees the written value, but hardware doesn't get it) and which one is discarded. That decision is for each bit separately. For the command register specifically, it's here: https://github.com/qemu/qemu/blob/master/hw/xen/xen_pt_config_init.c#L614-L623

    {
        .offset     = PCI_COMMAND,
        .size       = 2,
        .init_val   = 0x0000,
        .res_mask   = 0xF880,
        .emu_mask   = 0x0743,
        .init       = xen_pt_common_reg_init,
        .u.w.read   = xen_pt_word_reg_read,
        .u.w.write  = xen_pt_cmd_reg_write,
    },

What should happen is that res_mask defines which bits are reserved (writes there are ignored) and emu_mask defines which bits are emulated. There can also be a ro_mask - bits which are read from the hardware but whose writes are ignored - but it isn't used for this register. Bits that are neither reserved nor emulated should go to the hardware. The 0x0743 mask does not include bit 2, so a write to PCI_COMMAND_MASTER should be passed through to the hardware. The patch you identified should not change the behavior for bits that are passed through to the hardware. It is possible there is some mistake in it, but that would likely cause much broader breakage.
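
A very rough sketch of the mask handling described above (illustrative only; the real logic lives in hw/xen/xen_pt.c and the per-register handlers in xen_pt_config_init.c, and is more involved):

#include <stdint.h>

/* Illustrative sketch, not actual QEMU code: split one 16-bit command-register
 * write into the part kept in QEMU's emulated copy and the part sent to hardware. */
static uint16_t cmd_write_sketch(uint16_t guest_val, uint16_t hw_val,
                                 uint16_t res_mask, uint16_t emu_mask,
                                 uint16_t *emulated_copy)
{
    uint16_t blocked = res_mask | emu_mask;  /* bits that never reach the device */

    /* Emulated bits are stored so the guest later reads back what it wrote. */
    *emulated_copy = (*emulated_copy & ~emu_mask) | (guest_val & emu_mask);

    /* Everything else is passed through: e.g. bit 2 (PCI_COMMAND_MASTER),
     * which is outside both res_mask (0xF880) and emu_mask (0x0743). */
    return (hw_val & blocked) | (guest_val & ~blocked);
}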

As a side note, you can get more info from pciback about actual read/writes to the config space that happen (after QEMU apply the masks) by enabling debug of pciback in dom0:

echo file drivers/xen/xen-pciback/conf_space.c +p > /sys/kernel/debug/dynamic_debug/control

The actual issue (I think):
write_dword is about writing 32 bits (double word). But the command register is 16 bits. There are two options:

  1. The driver intended to write to just the command register - in that case, it should use the write_word function.
  2. The driver intended to write to both the command and status registers (the status register comes right after the command register). In that case, the driver should use two write_word calls (though the register emulation code in QEMU would also handle a 32-bit write that modifies two 16-bit registers).

I think the first case is much more likely. The behavior here was indeed changed (fixed) by the patch you identified. Before the patch, the ro_mask was not properly enforced, and bits there were more or less treated as if they were in emu_mask (changed bits were stored in QEMU and the next read would return the changed value even for read-only bits; but the changed value wasn't passed through to the hardware). After the patch, the ro_mask should be properly enforced so writes to those bits are ignored.

The status register is defined as:

    {
        .offset     = PCI_STATUS,
        .size       = 2,
        .init_val   = 0x0000,
        .res_mask   = 0x0007,
        .ro_mask    = 0x06F8,
        .rw1c_mask  = 0xF900,
        .emu_mask   = 0x0010,
        .init       = xen_pt_status_reg_init,
        .u.w.read   = xen_pt_word_reg_read,
        .u.w.write  = xen_pt_word_reg_write,
    },

The value written to it (after discarding the bits for the command register) would be 0x10. It sets the single emulated bit, but clears all the others. When ro_mask was not properly enforced, this caused most of the bits to be cleared in QEMU (not a big deal since they are outside of emu_mask, so they would be read from the hardware anyway), but bit 4 was also set even though it shouldn't be allowed (due to being in ro_mask). The patched version should properly refuse (or rather: ignore) this bit change, which may confuse the driver (it reads back something other than what it has written).

The proper solution would be to first verify whether writing just to the command register was intended. If so, it should use the 16-bit write function. You can easily find where that write comes from by adding WARN_ON(offset == 4) in that driver and observing the kernel logs - you should get a stack trace leading to the right place. There is likely a matching read nearby - it also needs to be changed to the 16-bit one.
But if writing to the status register too was intended (IMO highly unlikely), first of all it should be two 16-bit writes. But then we would need a better understanding of what those writes are for, because the status register is mostly about reading device state, not changing it (though there could also be some nvidia-specific behavior that abuses the PCI spec...).

@DemiMarie

Ah, so this is an Nvidia driver bug that happens to go uncaught on bare hardware.

@neowutran

neowutran commented Mar 2, 2024

@marmarek You are indeed correct, I just did a quick test:
Right before the writes to offset 4, there is a read of offset 4.
Where the driver tried to write a dword at offset 4, I instead forced it to write a word.
The driver works as expected.

NV_STATUS NV_API_CALL os_pci_write_dword(
    void *handle,
    NvU32 offset,
    NvU32 value
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
        return NV_ERR_NOT_SUPPORTED;
    printk(KERN_ALERT "NEOWUTRAN os_pci_write_dword : try to write %x %x \n",offset, value);
    if (offset != 4){
        pci_write_config_dword( (struct pci_dev *) handle, offset, value);
    }else{
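        /* note: the 32-bit value is implicitly truncated to 16 bits here, so only the command-register half is written */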
        pci_write_config_word( (struct pci_dev *) handle, offset, value);
    }
    return NV_OK;
}

@andrewdavidwong
Member

Ah, so this is an Nvidia driver bug that happens to go uncaught on bare hardware.

Should this be closed with R: upstream issue (Resolution: This issue pertains to software that the Qubes OS Project does not develop or control)?

@andrewdavidwong andrewdavidwong added the affects-4.2 This issue affects Qubes OS 4.2. label Mar 2, 2024
@fore567
Author

fore567 commented Mar 3, 2024

Ah, so this is an Nvidia driver bug that happens to go uncaught on bare hardware.

Can this be fixed in Qubes? It didn't crash in older versions, and I can also install the nvidia drivers in other hypervisors without Windows crashing.

@marmarek
Member

marmarek commented Mar 3, 2024

@neowutran can you log the values read there? And also, can you log both reads and writes with the stubdomain without that patch?

@neowutran

I used your trace command (thanks a lot, didn't know it existed)

echo file drivers/xen/xen-pciback/conf_space.c +p > /sys/kernel/debug/dynamic_debug/control

Without the patch, the driver tries to initialize itself multiple times before giving up. It does tens of thousands of reads and writes to the PCI configuration space.

Full logs: https://web.neowutran.ovh/marmarek_gpu_pciconfig_logs_not_patched

The most interesting part is:

--
[111340.627664] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[111340.627673] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100007
[111340.627735] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x8
--
[111340.981856] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4
[111340.981869] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4 = 7
[111340.981984] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4
[111340.981995] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4 = 7
[111340.982033] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 407
[111340.982123] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a
--
[111341.419350] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[111341.419381] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100407
[111341.423220] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[111341.423241] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100407
[111341.485521] pciback 0000:01:00.0: xen-pciback: write request 4 bytes at 0x4 = 100403
[111341.487172] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a

With the patch: https://web.neowutran.ovh/marmarek_gpu_pciconfig_logs_patched
Write to offset 4 with the patch:

[111997.410644] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 3
[111997.411418] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[111997.411860] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 400
[111997.412127] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[111997.412482] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 400
[111997.412804] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[111997.413368] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 400
[111997.414912] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[111997.415093] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 400
[111997.415360] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[111997.415675] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 400
[111997.418419] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[111997.975102] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[111998.477511] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 3
[111998.560862] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.173504] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 407
[112002.173800] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.176317] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.177092] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.177393] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.178999] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.179338] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.180696] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.183410] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.184124] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.184425] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.185175] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.240707] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112002.244339] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 7
[112007.411147] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 407
[112007.870619] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[112007.877160] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 407

What are you looking for?

@marmarek
Member

marmarek commented Mar 4, 2024

Can you also check with the older stubdomain (that worked before) and the unpatched driver?

@neowutran

neowutran commented Mar 4, 2024

Old stubdom, driver not patched: https://web.neowutran.ovh/marmarek_gpu_pciconfig_logs_old-stubdom_driver-not-patched

I have a hard time understanding the logs.

Recent stubdom, driver not patched (crashing): extract of the logs around the first reads matching the value "10040" for offset 4:

[111340.990921] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x8
[111340.990930] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x8 = a1
[111340.990992] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xe
[111340.990999] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xe = 80
[111341.419350] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[111341.419381] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100407
[111341.423220] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[111341.423241] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100407
[111341.485521] pciback 0000:01:00.0: xen-pciback: write request 4 bytes at 0x4 = 100403
[111341.487172] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a
[111341.487197] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a = 81
[111341.487406] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a
[111341.487427] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a = 81
[111341.487581] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a
[111341.487596] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a = 81
[111341.487659] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x6a = 80
[111341.487854] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x6a = 80
[111341.487974] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4
[111341.487989] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4 = 403
[111341.489797] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34
[111341.489812] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34 = 60

Same thing, but with an old stubdom (without the RO enforcement and without MSI-X support):

[114715.551793] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x8
[114715.551804] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x8 = a1
[114715.553307] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xe
[114715.553320] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xe = 80
[114715.993682] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[114715.993728] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100407
[114715.996961] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[114715.996984] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100407
[114715.997168] pciback 0000:01:00.0: xen-pciback: write request 4 bytes at 0x4 = 100403
[114716.000774] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[114716.000826] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100403
[114716.000941] pciback 0000:01:00.0: xen-pciback: write request 4 bytes at 0x4 = 100407
[114716.001173] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x2c
[114716.001181] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x2c = 167a10de
[114716.037147] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0xc
[114716.037182] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0xc = 800010
[114716.037443] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0xa
[114716.037454] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0xa = 300
[114716.037688] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34
[114716.037701] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34 = 60

In both cases, the write to the "command" field succeeds; on the next read of offset 0x4 we can see that the value has been modified accordingly.
For the overflow into the "status" field, it always seems to be the same value; the write to the "status" field is probably a mistake on nvidia's side.
Maybe trying to write something to the "status" field triggers some special nvidia behavior, but that seems unlikely, since for my driver patch both "ignoring the write to offset 4" and "replacing the dword write to offset 4 with a word write" end up working.
Nothing uses the return value of the "pci_ * config * " functions.

Either I am confusing myself, or there is another kind of interaction between the driver and the hardware that I do not know about.

If these logs don't make sense to you, I can redo them later.

@marmarek
Member

marmarek commented Mar 4, 2024

[111341.485521] pciback 0000:01:00.0: xen-pciback: write request 4 bytes at 0x4 = 100403

So, here it disables bus mastering. Is that the last write to this register before failing? Can you correlate whether it happens before or after the failure?
Interestingly, the next read from the register is 2 bytes, which suggests it's from another place in the code base (maybe even in the core Linux PCI subsystem, outside of the nvidia driver), but the next 4-byte read from there is much later.
In the patched log, when looking also at reads, I see this:

[112007.864522] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4
[112007.864557] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100407
[112007.870370] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4
[112007.870396] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4 = 407
[112007.870619] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 403
[112007.876896] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4
[112007.876913] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4 = 403
[112007.877160] pciback 0000:01:00.0: xen-pciback: write request 2 bytes at 0x4 = 407

So, it reads 4 bytes (the 2-byte read is, I guess, from qemu to get the ro bits' values), writes 2 bytes (I guess this was originally 4 bytes). But in the non-patched version I don't see a read of that value immediately after, only some time later. I guess we don't see the full picture here; there may also be some MMIO accesses in between that don't go through QEMU or pciback.

But looking into it a bit deeper, I think xen-pciback ignores the higher bits of that write anyway: it calls the write variant of the appropriate size, and the command register is registered with size 2.
I'm equally confused...

To be clear, I'm still doing all this analysis to figure out whether the issue is only in the nvidia driver, or also in some of the Qubes code (the patched qemu being the most likely candidate).

@neowutran

neowutran commented Mar 4, 2024

The very last thing (read or write, doesn't matter whether dword/word/byte) before the first formal message indicating a driver crash ("fallen off the bus") is:

[    6.013960] NEOWUTRAN os_pci_write_dword : try to write 4 100403
### Below, it is not a crash, just a callstack print I put to debug the thing 
[    6.013968] ------------[ cut here ]------------
[    6.013969] WARNING: CPU: 1 PID: 616 at /home/user/NVIDIA-Linux-x86_64-545.29.06/kernel/nvidia/os-pci.c:175 os_pci_write_dword+0x78/0x80 [nvidia]
[    6.014083] Modules linked in: nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_ct nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c intel_rapl_msr intel_rapl_com>
[    6.014121] CPU: 1 PID: 616 Comm: Xorg Tainted: P        W  OE      6.7.3-arch1-1 #1 b8291227ebee24f0bec9b3471a94151938512264
[    6.014123] Hardware name: Xen HVM domU, BIOS 4.17.3 03/01/2024
[    6.014124] RIP: 0010:os_pci_write_dword+0x78/0x80 [nvidia]
[    6.014232] Code: 4c 89 e7 45 31 ed e8 d7 cc 1c c1 48 8b 74 24 20 48 c7 c7 80 69 55 c0 e8 86 f0 00 00 44 89 e8 5b 5d 41 5c 41 5d e9 13 d7 9a c1 <0f> 0b eb ce 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[    6.014233] RSP: 0018:ffffc900018f38e8 EFLAGS: 00010246
[    6.014235] RAX: 0000000000000035 RBX: 0000000000000004 RCX: 0000000000000027
[    6.014236] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88839f3216c0
[    6.014237] RBP: 0000000000100403 R08: 0000000000000000 R09: ffffc900018f3788
[    6.014238] R10: 0000000000000003 R11: ffff8883e6ffdfe8 R12: ffff888101413000
[    6.014238] R13: 0000000000000056 R14: 0000000000000004 R15: 0000000000000000
[    6.014240] FS:  00007ffff70b15c0(0000) GS:ffff88839f300000(0000) knlGS:0000000000000000
[    6.014241] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.014242] CR2: 00007ffff7e96790 CR3: 0000000106a38000 CR4: 0000000000750ef0
[    6.014243] PKRU: 55555554
[    6.014244] Call Trace:
[    6.014245]  <TASK>
[    6.014246]  ? os_pci_write_dword+0x78/0x80 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.014354]  ? __warn+0x81/0x130
[    6.014356]  ? os_pci_write_dword+0x78/0x80 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.014465]  ? report_bug+0x171/0x1a0
[    6.014468]  ? handle_bug+0x3c/0x80
[    6.014470]  ? exc_invalid_op+0x17/0x70
[    6.014471]  ? asm_exc_invalid_op+0x1a/0x20
[    6.014474]  ? os_pci_write_dword+0x78/0x80 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.014596]  ? os_pci_write_dword+0x45/0x80 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.014704]  _nv049294rm+0xfc/0x170 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.014854]  _nv040313rm+0x27/0x80 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.015009]  _nv013425rm+0xc1/0x130 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.015236]  _nv043858rm+0x59/0x60 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.015444]  _nv043857rm+0xcb/0x1c0 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.015645]  _nv026084rm+0x2c/0x70 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.015924]  ? _nv025958rm+0x20/0x20 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.016231]  _nv014953rm+0x50/0x80 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.016397]  _nv035515rm+0x81/0x3c0 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.016629]  _nv000708rm+0x9af/0x1f60 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.016782]  rm_init_adapter+0xcd/0xf0 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.016930]  nv_start_device+0x5a4/0xab0 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.017043]  nv_open_device+0xc5/0x1c0 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.017159]  nvidia_open+0x1bf/0x2c0 [nvidia 0f10eba124e8277b5c81f9120f9b2ac726697a8c]
[    6.017567]  chrdev_open+0xd1/0x250
[    6.017570]  ? __pfx_chrdev_open+0x10/0x10
[    6.017572]  do_dentry_open+0x203/0x590
[    6.017575]  path_openat+0xd7d/0x1260
[    6.017577]  ? security_inode_alloc+0x28/0x90
[    6.017579]  ? srso_alias_return_thunk+0x5/0xfbef5
[    6.017582]  do_filp_open+0xb3/0x160
[    6.017586]  do_sys_openat2+0xab/0xe0
[    6.017588]  ? srso_alias_return_thunk+0x5/0xfbef5
[    6.017590]  __x64_sys_openat+0x57/0xa0
[    6.017592]  do_syscall_64+0x64/0xe0
[    6.017594]  ? srso_alias_return_thunk+0x5/0xfbef5
[    6.017595]  ? syscall_exit_to_user_mode+0x2b/0x40
[    6.017597]  ? srso_alias_return_thunk+0x5/0xfbef5
[    6.017598]  ? do_syscall_64+0x70/0xe0
[    6.017600]  ? syscall_exit_to_user_mode+0x2b/0x40
[    6.017601]  ? srso_alias_return_thunk+0x5/0xfbef5
[    6.017602]  ? do_syscall_64+0x70/0xe0
[    6.017603]  ? syscall_exit_to_user_mode+0x2b/0x40
[    6.017605]  ? srso_alias_return_thunk+0x5/0xfbef5
[    6.017606]  ? do_syscall_64+0x70/0xe0
[    6.017608]  ? srso_alias_return_thunk+0x5/0xfbef5
[    6.017609]  ? exc_page_fault+0x7f/0x180
[    6.017611]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[    6.017613] RIP: 0033:0x7ffff7a96c02
[    6.017617] Code: 83 e2 40 75 53 89 f0 f7 d0 a9 00 00 41 00 74 48 80 3d 71 d7 0d 00 00 74 6c 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 92 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[    6.017618] RSP: 002b:00007fffffffd900 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[    6.017620] RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007ffff7a96c02
[    6.017621] RDX: 0000000000080002 RSI: 00007fffffffd990 RDI: 00000000ffffff9c
[    6.017622] RBP: 00007fffffffd990 R08: 0000000000000064 R09: 0000000000000000
[    6.017624] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fffffffd990
[    6.017624] R13: 00007ffff6a22a60 R14: 00007ffff6a22a60 R15: 00007ffff6a22a60
[    6.017627]  </TASK>
[    6.017628] ---[ end trace 0000000000000000 ]---
####### Below the crash notification 
[    6.056162] NVRM: Xid (PCI:0000:00:05): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[    6.056175] NVRM: GPU 0000:00:05.0: GPU has fallen off the bus.
[    6.056241] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[    6.056864] NVRM: VM: nv_free_pages: 0xa
[    6.056867] NVRM: VM: nv_free_pages:3609: 0x0000000058669eee, 10 page(s), count = 1, flags = 0x00000005, page_table = 0x00000000845e625a

However, between '[ 6.017628]' and '[ 6.056162]' there is quite some time, so something interesting is probably happening in between, and it is not related to a call to pci_XX_config_XXX from the driver.

I see no way the driver could tell the difference between a dword write (command + status, but with status ignored) and a word write (one of the driver patches that works, writing to the command register only). The driver crashes before re-reading the value.
So it can't be on the driver side?

For the hardware, there is no difference between the RO enforcement and the patch forcing a word write instead of a dword: either way, the GPU never sees a write to the 'status' field.

So yeah, some layer between the driver and the hardware is not happy about the attempted write to an RO field and trashes the GPU state.

I am unable to find any reference to offset 0x6A

[111341.487172] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a
[111341.487197] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a = 81
[111341.487406] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a
[111341.487427] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a = 81
[111341.487581] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a
[111341.487596] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6a = 81

in the guest log.
So those reads are either done by something other than the driver in the guest, or they are not being done by the guest at all. Or I am confusing myself and 0x6A is not related to 'pci_config'.

@neowutran

On the stubdom side, when the driver crashes I see a lot of these messages:

...
[2024-03-04 17:50:25] [00:05.0] xen_pt_msi_set_enable: disabling MSI.

[2024-03-04 17:50:25] [00:05.0] msi_msix_disable: Unbind MSI with pirq 87, gvec 0x0

[2024-03-04 17:50:25] [00:05.0] msi_msix_disable: Unmap MSI pirq 87

[2024-03-04 17:50:25] [00:05.0] xen_pt_msgctrl_reg_write: setup MSI (register: 81).

[2024-03-04 17:50:25] [00:05.0] msi_msix_setup: requested pirq 87 for MSI (vec: 0x0, entry: 0x0)

[2024-03-04 17:50:25] [00:05.0] xen_pt_msi_setup: MSI mapped with pirq 87.

[2024-03-04 17:50:25] [00:05.0] msi_msix_update: Updating MSI with pirq 87 gvec 0x0 gflags 0x7057 (entry: 0x0)

[2024-03-04 17:50:25] [00:05.0] xen_pt_msi_set_enable: disabling MSI.

[2024-03-04 17:50:25] [00:05.0] msi_msix_disable: Unbind MSI with pirq 87, gvec 0x0

[2024-03-04 17:50:25] [00:05.0] msi_msix_disable: Unmap MSI pirq 87

[2024-03-04 17:50:25] [00:05.0] xen_pt_msgctrl_reg_write: setup MSI (register: 81).

[2024-03-04 17:50:25] [00:05.0] msi_msix_setup: requested pirq 87 for MSI (vec: 0x0, entry: 0x0)

[2024-03-04 17:50:25] [00:05.0] xen_pt_msi_setup: MSI mapped with pirq 87.

[2024-03-04 17:50:25] [00:05.0] msi_msix_update: Updating MSI with pirq 87 gvec 0x0 gflags 0x7057 (entry: 0x0)
[2024-03-04 17:50:25] [00:05.0] xen_pt_msi_set_enable: disabling MSI.

[2024-03-04 17:50:25] [00:05.0] msi_msix_disable: Unbind MSI with pirq 87, gvec 0x0
...

Definitely related to the issue, but I don't know whether it is useful.

@neowutran

I decided to get new log files.
Before trying to get the log files, in dom0 I ran

echo file drivers/xen/xen-pciback/conf_space.c +p > /sys/kernel/debug/dynamic_debug/control

For the nvidia driver, I modified the os-pci.c file to print more logs. The modified functions are below:

NV_STATUS NV_API_CALL os_pci_read_byte(
    void *handle,
    NvU32 offset,
    NvU8 *pReturnValue
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
    {
        *pReturnValue = 0xff;
        return NV_ERR_NOT_SUPPORTED;
    }
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_byte : try to read %u \n",offset);
    pci_read_config_byte( (struct pci_dev *) handle, offset, pReturnValue);
    return NV_OK;
}

NV_STATUS NV_API_CALL os_pci_read_word(
    void *handle,
    NvU32 offset,
    NvU16 *pReturnValue
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
    {
        *pReturnValue = 0xffff;
        return NV_ERR_NOT_SUPPORTED;
    }
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_word : try to read %x \n",offset);
    pci_read_config_word( (struct pci_dev *) handle, offset, pReturnValue);
    return NV_OK;
}

NV_STATUS NV_API_CALL os_pci_read_dword(
    void *handle,
    NvU32 offset,
    NvU32 *pReturnValue
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
    {
        *pReturnValue = 0xffffffff;
        return NV_ERR_NOT_SUPPORTED;
    }
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_dword : try to read %x \n",offset);
    WARN_ON(offset == 4);
     pci_read_config_dword( (struct pci_dev *) handle, offset, pReturnValue);
    return NV_OK;
}

NV_STATUS NV_API_CALL os_pci_write_byte(
    void *handle,
    NvU32 offset,
    NvU8 value
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
        return NV_ERR_NOT_SUPPORTED;
    printk(KERN_ALERT "NEOWUTRAN os_pci_write_byte : try to write %x %x \n",offset, value);
    pci_write_config_byte( (struct pci_dev *) handle, offset, value);
    return NV_OK;
}

NV_STATUS NV_API_CALL os_pci_write_word(
    void *handle,
    NvU32 offset,
    NvU16 value
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
        return NV_ERR_NOT_SUPPORTED;
    printk(KERN_ALERT "NEOWUTRAN os_pci_write_word : try to write %x %x \n",offset, value);
    pci_write_config_word( (struct pci_dev *) handle, offset, value);
    return NV_OK;
}

NV_STATUS NV_API_CALL os_pci_write_dword(
    void *handle,
    NvU32 offset,
    NvU32 value
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
        return NV_ERR_NOT_SUPPORTED;
    printk(KERN_ALERT "NEOWUTRAN os_pci_write_dword : try to write %x %x \n",offset, value);
    WARN_ON(offset == 4);
    pci_write_config_dword( (struct pci_dev *) handle, offset, value);
    return NV_OK;
}

Before each of the following tests, in dom0 I ran sudo dmesg -C, so the dom0 log file only contains the logs between the moment I launched the guest startup command and the moment the guest's Xfce terminal popped up.

To gather the logs in dom0, I executed the command sudo dmesg -cw | tee /tmp/logs
To gather the logs in the guest, I executed the command sudo dmesg -P | tee /tmp/logs

First test: With the most recent release of qubes-vmm-xen-stubdom-linux.
dom0: https://web.neowutran.ovh/v2_dom0-logs_uptodate-stubdom_non-patched-nvidia
guest: https://web.neowutran.ovh/v2_guest-logs_uptodate-stubdom_non-patched-nvidia

Second test: With 4.2.6 version of qubes-vmm-xen-stubdom-linux.
dom0: https://web.neowutran.ovh/v2_dom0-logs_4.2.6-stubdom_non-patched-nvidia
guest: https://web.neowutran.ovh/v2_guest-logs_4.2.6-stubdom_non-patched-nvidia

@marmarek
Member

marmarek commented Mar 5, 2024

For the nvidia driver, I modified the os-pci.c file to print more logs. The modified functions are below:

While at it, it may also be useful to log the value that was read, as seen by the driver (the *pReturnValue value).
(I haven't read the new logs yet)

@neowutran

(
Do you have any references to good diagrams or documentation that explain the following things?

  • What are the layers between the driver and the hardware (guest driver -> guest kernel -> qemu -> xen stubdom -> dom0 -> hardware)?
  • What are the different ways of interacting with hardware on the PCI bus and how do they work? ( MSI / MSIX / IRQ / MMIO / ... ? )

There are a lot of things that I don't understand about this, and it is hard to find good and up-to-date explanations.

)

@neowutran

functions that I have modified:

NV_STATUS NV_API_CALL os_pci_read_byte(
    void *handle,
    NvU32 offset,
    NvU8 *pReturnValue
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
    {
        *pReturnValue = 0xff;
        return NV_ERR_NOT_SUPPORTED;
    }
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_byte : try to read %u \n",offset);
    pci_read_config_byte( (struct pci_dev *) handle, offset, pReturnValue);
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_byte : result : offset %x, value %x \n",offset, *pReturnValue);
    return NV_OK;
}

NV_STATUS NV_API_CALL os_pci_read_word(
    void *handle,
    NvU32 offset,
    NvU16 *pReturnValue
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
    {
        *pReturnValue = 0xffff;
        return NV_ERR_NOT_SUPPORTED;
    }
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_word : try to read %x \n",offset);
    pci_read_config_word( (struct pci_dev *) handle, offset, pReturnValue);
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_word : result : offset %x, value %x \n",offset, *pReturnValue);
    return NV_OK;
}

NV_STATUS NV_API_CALL os_pci_read_dword(
    void *handle,
    NvU32 offset,
    NvU32 *pReturnValue
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
    {
        *pReturnValue = 0xffffffff;
        return NV_ERR_NOT_SUPPORTED;
    }
    printk(KERN_ALERT "NEOWUTRAN os_pci_read_dword : try to read %x \n",offset);
    WARN_ON(offset == 4);
       pci_read_config_dword( (struct pci_dev *) handle, offset, pReturnValue);
       printk(KERN_ALERT "NEOWUTRAN os_pci_read_dword : result : offset %x, value %x \n",offset, *pReturnValue);
    return NV_OK;
}

First test: With the most recent release of qubes-vmm-xen-stubdom-linux.
dom0: https://web.neowutran.ovh/v3_dom0-logs_uptodate-stubdom_non-patched-nvidia
guest: https://web.neowutran.ovh/v3_guest-logs_uptodate-stubdom_non-patched-nvidia

Second test: With 4.2.6 version of qubes-vmm-xen-stubdom-linux.
dom0: https://web.neowutran.ovh/v3_dom0-logs_4.2.6-stubdom_non-patched-nvidia
guest: https://web.neowutran.ovh/v3_guest-logs_4.2.6-stubdom_non-patched-nvidia

(didn't check the logs yet, maybe tomorrow)

@marmarek
Member

marmarek commented Mar 5, 2024

Do you have any references to good diagrams or documentation that explain the following things?

I don't have any, but for config space it works more or less this way:

  1. guest driver does read/write
  2. it's trapped into Xen, which then asks QEMU what to do about it
  3. QEMU processes the request, it may apply some filtering, emulation etc (code linked before in this issue)
  4. if QEMU decides to pass the request through to the real hardware (it may also choose to completely emulate or ignore the request), then it accesses the real config space via sysfs (/sys/bus/pci/devices/.../config) - see the sketch after this list
  5. since QEMU runs in a stubdomain, the sysfs access goes to xen-pcifront (in the stubdomain) which communicates with xen-pciback (in dom0)
  6. xen-pciback in dom0 may apply some more filtering, and then attempts the write to the actual hardware
  7. the write by the dom0 kernel is trapped into Xen again, but this time it is really forwarded to the hardware
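
To illustrate step 4: a minimal sketch (not QEMU's actual code; the device path is just an example) of reading the command register through that sysfs file:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Example BDF; inside the stubdomain this sysfs file is backed by xen-pcifront. */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/config", O_RDONLY);
    uint16_t cmd;

    if (fd < 0 || pread(fd, &cmd, sizeof(cmd), 0x4) != (ssize_t)sizeof(cmd))
        return 1;
    printf("command register: 0x%04x\n", cmd);
    close(fd);
    return 0;
}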

For MMIO access (all BARs), the flow is much simpler, since that device memory is mapped directly to the guest address space, so there is no trapping and emulation involved for the most part (QEMU or Xen may decide to trap some of the pages anyway, like the page for MSI-X config, but I don't think it's relevant here). Similarly for DMA, thanks to IOMMU, all the translation is done by the CPU, so no extra emulation is involved.

Interrupts are a bit more complicated (there are different ways how device can report interrupts, and the different ways how they can be delivered to the guest), but I hope it isn't related to this issue either.

@marmarek
Member

marmarek commented Mar 6, 2024

I've checked the last set of logs and I'm confused. The values read and written, both from the driver perspective and from the dom0 perspective, look the same before the failure. So, it doesn't seem to be QEMU changing any of those values in unintended ways. So, my hypothesis that some read-only bit was not enforced is weaker - the values read back by the driver are the same in both cases.
There is one difference - with the new stubdomain I see an extra set of reads:

[82175.158954] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x6
[82175.158964] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x6 = 10
[82175.158988] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34
[82175.158998] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34 = 60
[82175.159021] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x60
[82175.159033] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x60 = 6801
[82175.159060] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x61
[82175.159072] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x61 = 68
[82175.159096] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x68
[82175.159108] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x68 = 7805
[82175.159129] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x69
[82175.159141] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x69 = 78
[82175.159162] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x78
[82175.159169] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x78 = 10
[82175.159191] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x79
[82175.159199] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x79 = b4
[82175.159221] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb4
[82175.159231] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb4 = 9
[82175.159254] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb5
[82175.159264] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb5 = 0

This looks like enumerating capabilities. And if I'm not mistaken, the timestamp suggests it happens even before Linux starts - so, likely the stubdomain itself. This is more likely related to either the qemu update or the stubdomain's Linux update, not to the problematic patch.

BTW, I've checked also the 0x6a register that is read/written directly after writing to the control register. It's MSI Message Control, and the bit 0 is "MSI Enable". So, I guess the write of 0x80 there is already part of a teardown on error, especially since it looks to be done from the Linux's PCI core code, not from nvidia directly (there is no log entry about that write on the guest side). And while at it, looking at the capabilities enumeration, this device does not support MSI-X, so (theoretically at least) the MSI-X code in stubdomain is irrelevant. FYI the capabilities here are:

  • cap 0x01 (power management) at offset 0x60
  • cap 0x05 (MSI) at offset 0x68
  • cap 0x10 (PCI Express) at offset 0x78
  • cap 0x09 (vendor specific) at offset 0xb4

Honestly, I'm running short on ideas. One remaining idea is that some internal qemu state changes differently due to the patch. Indeed, one of the reasons for the patch is that write-handling functions would see the previous value too. But I don't see any function using this feature right now. It could also be about enforcing read-only bits in some other register that is written but not read back afterwards (so the difference doesn't show up in the log), or maybe a register that is written from outside of the nvidia driver (so, again, it doesn't show up in the log, as you added logging in the nvidia module only). But both of those hypotheses are quite weak, as changing the write to the command register to be 16 bits instead of 32 bits fixes the issue.

Are you absolutely sure it is just this patch that breaks it? Have you tried stubdomains that differ with only the 0005-hw-xen-xen_pt-Save-back-data-only-for-declared-regis.patch patch being applied or not?

While the fix in the nvidia driver has been identified (at least in the Linux version), and in fact it's already fixed in the nvidia-open driver, I'm still confused about why this patch breaks it. I suspect there is some subtle side effect that may affect other devices too...

So, I have some more ideas, a bit more crazy this time:

  • (easy) check lspci -vvxxxs 0000:00:05.0 inside the guest on both stubdomain versions; if you can, try to collect it also before loading nvidia driver for the first time; I suspect you'll get the same output, but maybe? ...
  • (harder) in the nvidia module, read value of both command and status registers, before and after writing 4 bytes to 0x4; theoretically it shouldn't matter if you read it with two os_pci_read_word() calls or one os_pci_read_dword() (and my patch doesn't change read path) so the latter sounds easier

This may answer the question if there is a difference in the emulated bits of those registers...
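
For the second (harder) idea, a rough sketch of what that could look like in os_pci_write_dword() (based on the debug version of the function posted above; the log format is just illustrative):

NV_STATUS NV_API_CALL os_pci_write_dword(
    void *handle,
    NvU32 offset,
    NvU32 value
)
{
    NvU32 before, after;
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
        return NV_ERR_NOT_SUPPORTED;
    if (offset == 4)
    {
        /* Read command+status around the write to see whether any bit
         * (emulated or not) reads back differently than what was written. */
        pci_read_config_dword( (struct pci_dev *) handle, 4, &before);
        pci_write_config_dword( (struct pci_dev *) handle, offset, value);
        pci_read_config_dword( (struct pci_dev *) handle, 4, &after);
        printk(KERN_ALERT "NEOWUTRAN cmd/status: before %x, written %x, after %x \n", before, value, after);
        return NV_OK;
    }
    pci_write_config_dword( (struct pci_dev *) handle, offset, value);
    return NV_OK;
}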

@neowutran

I've checked the last set of logs and I'm confused. The values read and written both from the driver perspective and from dom0 perspective looks the same before the failure. So, it doesn't seem to be QEMU changing any of those values in unintended ways. So, my hypothesis that some read-only bit was not enforced is weaker - values that are read back by the driver are the same in both cases. There is one difference - with new stubdomain I see extra set of reads:

[82175.158954] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x6
[82175.158964] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x6 = 10
[82175.158988] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34
[82175.158998] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x34 = 60
[82175.159021] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x60
[82175.159033] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x60 = 6801
[82175.159060] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x61
[82175.159072] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x61 = 68
[82175.159096] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x68
[82175.159108] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x68 = 7805
[82175.159129] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x69
[82175.159141] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x69 = 78
[82175.159162] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x78
[82175.159169] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x78 = 10
[82175.159191] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x79
[82175.159199] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0x79 = b4
[82175.159221] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb4
[82175.159231] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb4 = 9
[82175.159254] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb5
[82175.159264] pciback 0000:01:00.0: xen-pciback: read 1 bytes at 0xb5 = 0

This looks like enumerating capabilities. And if I'm not mistaken, the timestamp suggests it happens even before Linux starts - so, likely stubdomain itself. This is more likely related to either qemu update or stubdomain's Linux update, not to the problematic patch.

Maybe the enumeration of capabilities was added by the qemu update or the stubdomain Linux update.
However, I am sure it is the "xen-xen_pt-Save-back-data-only-for-declared-regis.patch" patch that creates the error.

BTW, I've checked also the 0x6a register that is read/written directly after writing to the control register. It's MSI Message Control, and the bit 0 is "MSI Enable". So, I guess the write of 0x80 there is already part of a teardown on error, especially since it looks to be done from the Linux's PCI core code, not from nvidia directly (there is no log entry about that write on the guest side). And while at it, looking at the capabilities enumeration, this device does not support MSI-X, so (theoretically at least) the MSI-X code in stubdomain is irrelevant. FYI the capabilities here are:

* cap 0x01 (power management) at offset 0x60

* cap 0x05 (MSI) at offset 0x68

* cap 0x10 (PCI Express) at offset 0x78

* cap 0x09 (vendor specific) at offset 0xb4

Honestly, I'm running short on ideas. One remaining idea is that some internal qemu state changes differently due to the patch. Indeed, one of the reason for the patch is that write-handling functions would see previous value too. But I don't see any function using this feature right now. It could be also about enforcing read-only bits in some other register that is written but not read back afterwards (so, the difference doesn't show up in the log) or maybe register that is written from outside of the nvidia driver (so, again, doesn't show up in the log, as you add logging in the nvidia module only). But both of those hypothesis are quite weak, as changing write to the command register to be 16 bits instead of 32 bits fixes the issue.

Are you absolutely sure it is just this patch that breaks it? Have you tried stubdomains that differ with only the 0005-hw-xen-xen_pt-Save-back-data-only-for-declared-regis.patch patch being applied or not?

Yes and yes, and that was the main lead for my writeup. But I will redo that in the following days to be absolutely sure; the logs are quite weird.

While the fix in the nvidia driver has been identified (at least in the Linux version), and in fact it's already fixed in the nvidia-open driver, I'm still confused why this patch breaks it. I suspect there is some subtle side effect that may affect other devices too...

For the Windows driver, there is a working patch in the writeup.
However, I spent a lot more time debugging and patching the Windows driver than the Linux one.
The thing that confused me is that the nvidia driver never does any write to the PCI config space (HalSetBusDataByOffset is never called, and that is the only kernel function related to PCI config writes referenced in the IAT of the Windows driver).
On Windows, the crash happens in two functions related to reading the PCI configuration, called by "nvDumpConfig".
I did not identify the exact assembly lines and conditions that trigger the crash; instead I patched the Windows driver to not call those problematic functions.

So, I have some more ideas, a bit more crazy this time:

* (easy) check `lspci -vvxxxs 0000:00:05.0` inside the guest on both stubdomain versions; if you can, try to collect it also before loading `nvidia` driver for the first time; I suspect you'll get the same output, but maybe? ...

* (harder) in the nvidia module, read value of both command and status registers, before and after writing 4 bytes to 0x4; theoretically it shouldn't matter if you read it with two `os_pci_read_word()` calls or one `os_pci_read_dword()` (and my patch doesn't change read path) so the latter sounds easier

This may answer the question if there is a difference in the emulated bits of those registers...

I will test that in the following days, but I think we are both pretty confident that we likely won't find anything suspicious.

If I am sure that the issue is created by "0005-hw-xen-xen_pt-Save-back-data-only-for-declared-regis.patch",
and you are sure that your patch doesn't / can't change anything related to the read path,
combined with the fact that the Windows driver never calls "HalSetBusDataByOffset"...

Maybe we need a patch in qemu to add more logs, like printing something every time the RO mask would actually block a write.
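
Something like the sketch below is what I mean (not an actual qemu patch; where exactly to hook it into the xen_pt write handlers is left open):

#include <stdint.h>
#include <stdio.h>

/* Illustrative helper only: report whenever a config-space write would have
 * changed bits that the RO mask forces to keep their current value. */
static void log_blocked_ro_bits(uint32_t offset, uint32_t guest_val,
                                uint32_t kept_val, uint32_t ro_mask)
{
    uint32_t blocked = (guest_val ^ kept_val) & ro_mask;

    if (blocked)
        fprintf(stderr, "cfg write at 0x%x: RO bits 0x%x blocked (guest wrote 0x%x, kept 0x%x)\n",
                offset, blocked, guest_val, kept_val);
}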


I did another comparison; it's best to use "diff" between the data below.
Using the latest stubdom version, I started a guest with the nvidia-open driver on it.
Then I ran "sudo lspci -vv -s 01:00.0" in dom0.

01:00.0 VGA compatible controller: NVIDIA Corporation AD103 [GeForce RTX 4080] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 167a
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 24
	Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at f800000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at fc00000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at f000 [size=128]
	Expansion ROM at f5000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0a000  Data: 4000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <16us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: pciback
	Kernel modules: nouveau

latest stubdom, nvidia driver (crashing)

01:00.0 VGA compatible controller: NVIDIA Corporation AD103 [GeForce RTX 4080] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 167a
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 24
	Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at f800000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at fc00000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at f000 [size=128]
	Expansion ROM at f5000000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 00000000fee07000  Data: 4000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: pciback
	Kernel modules: nouveau

stubdom 4.2.6, nvidia driver

01:00.0 VGA compatible controller: NVIDIA Corporation AD103 [GeForce RTX 4080] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 167a
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 24
	Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at f800000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at fc00000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at f000 [size=128]
	Expansion ROM at f5000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee06000  Data: 4000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: pciback
	Kernel modules: nouveau

stubdom 4.2.6, nvidia-open driver

01:00.0 VGA compatible controller: NVIDIA Corporation AD103 [GeForce RTX 4080] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 167a
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 24
	Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at f800000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at fc00000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at f000 [size=128]
	Expansion ROM at f5000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee04000  Data: 4000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: pciback
	Kernel modules: nouveau

@marmarek
Copy link
Member

marmarek commented Mar 6, 2024

For lspci, also include -xxx to print registers not understood by lspci, and also get that output from inside the guest to see the emulated values. That said, the above listings look as expected - the differences appear to be related only to the teardown after the failed init...

As for extra logging in the stubdomain, this may indeed make sense - maybe even verbose logging. First of all, add -DXEN_PT_DEBUG_PCI_CONFIG_ACCESS=1 to the qemu cflags. But also add logging of the final value in the emulated register (*(uint32_t*)(d->config + addr)) somewhere near the end of xen_pt_pci_write_config(), and maybe log the value at the start of the function too, just in case. We already see what is being written to the hardware in the pciback logs.
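
Something along these lines should do (just a sketch, assuming the upstream xen_pt_pci_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int len) signature and the existing XEN_PT_LOG macro; the exact placement is up to you):

/* near the top of xen_pt_pci_write_config() in hw/xen/xen_pt.c */
XEN_PT_LOG(d, "write_config enter: addr=0x%04x len=%d emulated=0x%08x\n",
           addr, len, *(uint32_t *)(d->config + addr));

/* ... existing register-group handling and the write towards the device ... */

/* just before returning, log what ended up in the emulated copy */
XEN_PT_LOG(d, "write_config exit:  addr=0x%04x len=%d emulated=0x%08x\n",
           addr, len, *(uint32_t *)(d->config + addr));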

@neowutran
Copy link

We are probably closer to finding the root issue, but I am even more confused.
It does indeed seem to come from the qemu side. I will probably dig into it more deeply later.

Some information:

New stubdom branches I am using for testing:
https://github.com/neowutran/qubes-vmm-xen-stubdom-linux/tree/issue9003withRO
https://github.com/neowutran/qubes-vmm-xen-stubdom-linux/tree/issue9003withoutRO

New (instrumented) implementation of os_pci_write_dword:

NV_STATUS NV_API_CALL os_pci_write_dword(
    void *handle,
    NvU32 offset,
    NvU32 value
)
{
    if (offset >= NV_PCIE_CFG_MAX_OFFSET)
        return NV_ERR_NOT_SUPPORTED;

    /* Debug instrumentation: dump command, status and the combined dword
       before a write that touches the command register at offset 0x4 */
    if (offset == 4){
      NvU16 command = 0;
      NvU16 status = 0;
      NvU32 command_status = 0;
      pci_read_config_word( (struct pci_dev *) handle, 4, &command);
      pci_read_config_word( (struct pci_dev *) handle, 6, &status);
      pci_read_config_dword( (struct pci_dev *) handle, 4, &command_status);
      printk(KERN_ALERT "NEOWUTRAN ; PRE_WRITE 4 ; command: %x ; status: %x ; command and status: %x ; %x \n",command, status, command_status, command_status == status << 2 & command);
    }

    printk(KERN_ALERT "NEOWUTRAN os_pci_write_dword : try to write %x %x \n",offset, value);

    pci_write_config_dword( (struct pci_dev *) handle, offset, value);

    /* Same dump after the write, to see what actually ended up in the register */
    if (offset == 4){
      NvU16 command = 0;
      NvU16 status = 0;
      NvU32 command_status = 0;
      pci_read_config_word( (struct pci_dev *) handle, 4, &command);
      pci_read_config_word( (struct pci_dev *) handle, 6, &status);
      pci_read_config_dword( (struct pci_dev *) handle, 4, &command_status);
      printk(KERN_ALERT "NEOWUTRAN ; POST_WRITE 4 ; command: %x ; status: %x ; command and status: %x ; %x \n",command, status, command_status, command_status == status << 2 & command);
    }
    return NV_OK;
}

Are you absolutely sure it is just this patch that breaks it? Have you tried stubdomains that differ only in whether the 0005-hw-xen-xen_pt-Save-back-data-only-for-declared-regis.patch patch is applied?

100% sure.

(harder) in the nvidia module, read the value of both the command and status registers, before and after writing 4 bytes to 0x4; theoretically it shouldn't matter whether you read them with two os_pci_read_word() calls or one os_pci_read_dword() (and my patch doesn't change the read path), so the latter sounds easier

One example from the guest kernel logs:

[  112.218859] NEOWUTRAN ; PRE_WRITE 4 ; command: 403 ; status: 10 ; command and status: 100403 ; 0 
[  112.218861] NEOWUTRAN os_pci_write_dword : try to write 4 100403 
[  112.311262] NEOWUTRAN ; POST_WRITE 4 ; command: 0 ; status: 10 ; command and status: 100000 ; 0 

Associated qemu log:

[2024-03-07 19:46:45] [00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00000403 len=2

[2024-03-07 19:46:45] [00:05.0] xen_pt_pci_read_config: address=0x0006 val=0x00000010 len=2

[2024-03-07 19:46:45] [00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00100403 len=4

[2024-03-07 19:46:45] [00:05.0] NEOWUTRAN QEMU 1: 4 ; 100403
[00:05.0] xen_pt_pci_write_config: address=0x0004 val=0x00100403 len=4

[2024-03-07 19:46:45] [00:05.0] NEOWUTRAN QEMU 2: 4 ; 100403
[00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00000000 len=2

[2024-03-07 19:46:45] [00:05.0] xen_pt_pci_read_config: address=0x0006 val=0x00000010 len=2

[2024-03-07 19:46:45] [00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00100000 len=4

Associated xen-pciback logs:

[253819.464453] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4 = 403
[253819.464626] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6 = 10
[253819.464861] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100403
[253819.464979] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100403
[253819.556406] pciback 0000:01:00.0: xen-pciback: write request 4 bytes at 0x4 = 100403
[253819.556860] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x4 = 403
[253819.557101] pciback 0000:01:00.0: xen-pciback: read 2 bytes at 0x6 = 10
[253819.557264] pciback 0000:01:00.0: xen-pciback: read 4 bytes at 0x4 = 100403

I get a strong 'WTF' from the following line: after the write, the guest reads the command register back as 0, even though the pciback log above shows the hardware still returning 0x403, so the zeroed value must be coming from the emulated copy of the register in qemu:

[  112.311262] NEOWUTRAN ; POST_WRITE 4 ; command: 0 ; status: 10 ; command and status: 100000 ; 0 

@neowutran
Copy link

neowutran commented Mar 8, 2024

Another interesting observation (although, after more thinking, #9003 (comment) is the more interesting one):

I removed the nvidia driver and started the guest, both with and without the RO patch.

Beginning of the qemu logs with the RO patch

... 
 [00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00000000 len=2
 [00:05.0] NEOWUTRAN QEMU 1: 4 ; 100000
 [00:05.0] xen_pt_pci_write_config: address=0x0004 val=0x00000000 len=2
 [00:05.0] NEOWUTRAN QEMU 2: 4 ; 100000
...
[00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00000003 len=2

Beginning of the qemu logs without the RO patch

...
[00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00000000 len=2
[00:05.0] NEOWUTRAN QEMU 1: 4 ; 100000
[00:05.0] xen_pt_pci_write_config: address=0x0004 val=0x00000000 len=2
[00:05.0] NEOWUTRAN QEMU 2: 4 ; 100000
...
[00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00000000 len=2

@neowutran
Copy link

neowutran commented Mar 9, 2024

I did some more tests with another custom version of the stubdom: https://github.com/neowutran/qubes-vmm-xen-stubdom-linux/tree/issue9003withROmodified

Guest logs:

grep -A 4 "PRE_WRITE" /tmp/dmesg 
[Sat Mar  9 17:56:01 2024] NEOWUTRAN ; PRE_WRITE 4 ; command: 407 ; status: 10 ; command and status: 100407 ; 0 
[Sat Mar  9 17:56:02 2024] NEOWUTRAN os_pci_write_dword : try to write 4 100403 
[Sat Mar  9 17:56:03 2024] NEOWUTRAN ; POST_WRITE 4 ; command: 0 ; status: 10 ; command and status: 100000 ; 0 
[Sat Mar  9 17:56:03 2024] NVRM: Xid (PCI:0000:00:05): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Sat Mar  9 17:56:03 2024] NVRM: GPU 0000:00:05.0: GPU has fallen off the bus.

qemu logs

[2024-03-09 17:56:01] [00:05.0] xen_pt_pci_read_config: address=0x0004 val=0x00100407 len=4
[2024-03-09 17:56:02] [00:05.0] xen_pt_pci_write_config: address=0x0004 val=0x00100403 len=4
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU 1: 4 ; 100407 ; val: 100403
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU mask1: 743 ; 0 ; 743 ; ffffffff ; 4
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU mask2: 743 ; 0 ; ffc3 ; ffffffff ; 4
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU mask3: 743 ; 0 ; ffc3 ; ffffffff ; 4
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU word write: 100407 ; ffffffff ; 4
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask1: 0 ; 2 ; 4 ; 4
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask2: ffff ; 2 ; 4 ; 4
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU mask1: 10 ; 6f8 ; 6f8 ; ffff ; 6
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU mask2: 10 ; 6f8 ; 6ff ; ffff ; 6
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU mask3: 10 ; 6f8 ; 6ff ; ffff ; 6
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU word write: 10 ; ffff ; 6
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask1: ffff ; 2 ; 6 ; 6
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask2: ffffffff ; 2 ; 6 ; 6
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask A42: 0 ; ffffffff ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask B42: 1 ; ffffff ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask A42: 1 ; ffffff ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask B42: 2 ; ffff ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask A42: 2 ; ffff ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask B42: 3 ; ff ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask A42: 3 ; ff ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU emul mask B42: 4 ; 0 ;
[2024-03-09 17:56:02] [00:05.0] NEOWUTRAN QEMU ACTUAl WRITE: 4 ; 100403 ; val: 0 ; mask : 0 ; original val: 100403 ; len: 4; index: 0

(Note: I removed a section of this comment that was incorrect.)

@neowutran
Copy link

I identified an integer overflow bug here: https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/main/qemu/patches/0005-hw-xen-xen_pt-Save-back-data-only-for-declared-regis.patch#L69

#include <stdint.h>
#include <stdio.h>

// gcc XXX.c ; ./a.out

int main(void) {

 int emul_len = 4;
 uint32_t write_val = 0x100403;

 // Integer overflow: with emul_len == 4 this is 1 << 32, which is undefined for a 32-bit int
 uint32_t mask1 = ((1 << (emul_len * 8)) - 1);
 printf("%x %x \n", mask1, write_val & mask1);

 // The value here is probably calculated at compile time using a 64-bit type, so the overflow doesn't occur?
 uint32_t mask2 = ((1 << (4 * 8)) - 1);
 printf("%x %x \n", mask2, write_val & mask2);

 // Fixed: do the shift in 64 bits before truncating back to 32 bits
 uint32_t mask3 = (((uint64_t)1 << (emul_len * 8)) - 1);
 printf("%x %x \n", mask3, write_val & mask3);

 return 0;
}

Fixing the integer overflow seems to solve the driver issue ( https://github.com/neowutran/qubes-vmm-xen-stubdom-linux/blob/issue9003tryfix3/qemu/patches/0005-hw-xen-xen_pt-Save-back-data-only-for-declared-regis.patch#L69 ).
However, I still don't fully understand whether the patch does what it is supposed to do.
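
To connect this back to the "val: 0 ; mask : 0" qemu debug line above, here is a minimal standalone sketch (variable names are mine, it only mirrors the mask computation from the patch) of how the overflowed mask ends up dropping the whole written value on x86, where the shift count is effectively taken modulo 32:

#include <stdint.h>
#include <stdio.h>

// gcc XXX.c ; ./a.out
int main(void)
{
 /* volatile so the compiler doesn't fold the shift at compile time,
    mimicking the runtime emul_len in qemu */
 volatile int emul_len = 4;            /* full 32-bit write to offset 0x4 */
 uint32_t original_val = 0x100403;     /* what the nvidia driver writes   */

 /* buggy mask from the patch: 1 << 32 is undefined for a 32-bit int;
    in practice the shift count wraps and the mask collapses to 0 */
 uint32_t buggy_mask = (1 << (emul_len * 8)) - 1;

 /* fixed mask: do the shift in 64 bits, then truncate */
 uint32_t fixed_mask = (uint32_t)(((uint64_t)1 << (emul_len * 8)) - 1);

 /* with the buggy mask, the value saved back into the emulated config
    space is 0 - so the command register then reads back as 0 in the guest */
 printf("buggy: mask=%08x saved=%08x\n", buggy_mask, original_val & buggy_mask);
 printf("fixed: mask=%08x saved=%08x\n", fixed_mask, original_val & fixed_mask);
 return 0;
}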

@fore567
Copy link
Author

fore567 commented Mar 14, 2024

Thanks for fixing the issue. Will this be available in Qubes OS immediately as an update?

@neowutran
Copy link

It is available in the testing repository.

@andrewdavidwong andrewdavidwong added diagnosed Technical diagnosis has been performed (see issue comments). and removed needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Mar 14, 2024
@KeithHanson
Copy link

@marmarek @neowutran You two are the heroes we need! 😁

I hit this today after spending two days of trial and error (much of it because I'm new to Qubes, though not to virtualization or qemu).

I wanted to report that after running this in dom0:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

And after restarting, my 4060 Ti works in both a debian12 CUDA-focused VM (genai dev stuff) and a win10 VM with passthrough enabled, with NO changes to xen.xml, and with 8 and 16 GB of RAM (without crashing)!

Many thanks!!

@qubesos-bot
Copy link

Automated announcement from builder-github

The package vmm-xen-stubdom-linux has been pushed to the r4.2 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update
