Atheros 928x PCI passthrough not working #3609
Comments
awokd
Feb 19, 2018
Currently attempting to get Qubes 4.0's xen-hvm-stubdom-linux running on Xen 4.8.3 to see if it's a stubdom issue.
schnurentwickler
Feb 19, 2018
I had issues with Atheros as well. Once the computer had been in standby, the Atheros device was unusable, even after reboots. Only a full shutdown, followed by a boot with the power supply attached, brought it back to work.
I could not solve the power supply issue, but I managed the standby issue with nohwcrypt as a module option. See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/568090
Maybe Xen has more trouble loading and assigning a device that responds strangely even on normal Linux setups.
I could not get an Atheros device to work in Qubes 3.2. The first information page for a Qubes release should note that Atheros device modules are best avoided.
awokd
Feb 19, 2018
It's not all Atheros devices; I have a 9565 that works with Qubes 4.0 (although I never tested suspend). But you are probably right and the list of not working ones is longer than just 928x. I know Intel has issues with sleep mode too.
andrewdavidwong added the bug and C: other labels on Feb 20, 2018
andrewdavidwong added this to the Release 4.0 milestone on Feb 20, 2018
marmarek
Feb 20, 2018
Member
This is weird. The difference from a plain Fedora setup may be the use of a stubdomain at all. Running a Linux-based stubdomain requires some libxl patching, but the mini-os based one should work out of the box on a non-Qubes system.
Another thing we do differently is enabling the e820_host option in the guest configuration. You can disable it with qvm-features sys-net pci-e820-host ''. I doubt it will help, but those are the differences I'm aware of.
/cc @HW42
awokd
Feb 21, 2018
Tried qvm-features sys-net pci-e820-host '' but unfortunately, no effect.
Not sure if it's relevant, but the working version appears to be using MSI-X, while under Qubes it's MSI or legacy. I'm basing this observation on the IRQ numbers only; I'm not entirely sure how to interpret the lspci -vvv output.
To summarize:
| Version | Result |
|---|---|
| Debian 9 | works |
| Xen 4.6 PV | fails (per Stackoverflow link) |
| Xen 4.8.3 dom0 | works |
| Xen 4.8.3 PV | spent 6 hours trying to get it to boot and xl console to connect, will try again later (not a tech support request but the learning curve sure is steep) |
| Xen 4.8.3 HVM | works except can't scan wireless networks |
| Xen 4.8.3 HVM traditional stubdomain | fault inside the stubdomain even with a very basic config and nothing passed through |
| Qubes 4.0 PV | fails similarly to Stackoverflow link |
| Qubes 4.0 HVM | fails with svm.c domain_crash |
| Qubes 4.0 HVM w/9565 | works |
HW42
Feb 21, 2018
@awokd: Could you please post lspci -vvv -xxxx -s XX:XX.X (replace XX:XX.X with the device) from both dom0 as well as from inside the VM.
HW42
Feb 21, 2018
FWIW: The ath9k card I have lying around (AR9287 according to lspci) works for me.
HW42
Feb 21, 2018
@awokd: You wrote "Xen 4.8.3 HVM" works. Could you try to pass pci=nomsi to the VM kernel and see if it still works?
awokd
Feb 21, 2018
Attached the files. I'm able to boot the Qubes domU with the ath9k module blacklisted. My AR9280 is on a corebooted AMD machine, and so was the card of the other user who told me about the AR9287 not working. Tried pci=nomsi on the Xen domU and it had no effect: I verified it on the boot log options line and the IRQ was still 36. I had also tried that before on the Qubes domU with no change, still the svm.c crash.
awokd
Feb 21, 2018
I should clarify what I mean by "works" for the Xen HVM: the ath9k driver loads without crashing and I can poke at the card with iw commands and set and get data. I can't actually scan wireless networks, but that looks like a common problem with multiple possible solutions, so I haven't spent much time on it yet.
HW42
Feb 22, 2018
> My AR9280 is on a corebooted AMD and the other user that told me about the AR9287 not working was as well.
AFAIK @h01ger also has problems with an ath9k card on a coreboot machine. That has an Intel CPU. So this sounds like a coreboot problem. Can you try this on a non-coreboot machine (or even better, stock BIOS on the same machine)?
h01ger
Feb 22, 2018
On Wed, Feb 21, 2018 at 04:15:53PM -0800, HW42 wrote:
> AFAIK @h01ger also has problems with an ath9k card on a coreboot machine. That has an Intel CPU. So this sounds like a coreboot problem. Can you try this on a non-coreboot machine (or even better, stock BIOS on the same machine)?
one problem with thinkpads is that they only allow intel wlan cards with
the stock bios. IOW, you need coreboot to use those ath9k cards, and
with pure debian they work well. I gave one ath9k card to marmarek, but
I think he wasn't able to test it just yet.
my ath9k card is also not inside a laptop right now, but I hope to
change this soon.
--
cheers,
Holger
HW42
Feb 22, 2018
> one problem with thinkpads is, that they only allow intel wlan cards with the stock bios.
Ugh.
Let's see what @awokd reports.
awokd
Feb 22, 2018
Yes, it's a Lenovo too with a whitelist firmware, so I couldn't run this card on it if I flash it back. But should the domU's lspci output differ between Qubes and Xen?
qdomu: Capabilities blocks 40, 50, 60, legacy INT(?)
xdomu: Capability block 40, MSI-X
awokd
Feb 22, 2018
This could be an edge case too, in which case I apologize for wasting everyone's time. But I've seen similar reports of MSI interrupts being flaky on some devices under Qubes over the past few months I've been working on this (not solidly, but still...). Maybe it's a duplicate issue?
PS I've edited the test results table above with additional results I forgot to include.
one example
and #3217
[ 2.361791] iwlwifi 0000:00:01.0: Xen PCI mapped GSI17 to IRQ27
[ 2.365431] iwlwifi 0000:00:01.0: pci frontend enable msi failed for dev 0:8
[ 2.365465] iwlwifi 0000:00:01.0: Xen PCI frontend error: -22!
[ 2.365694] iwlwifi 0000:00:01.0: pci_enable_msi failed - -22
and #3235
Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: Xen PCI mapped GSI18 to IRQ26
Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: pci frontend enable msi failed for dev 0:0
Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: Xen PCI frontend error: -22!
Oct 27 07:56:09 sys-net kernel: iwlwifi 0000:00:00.0: pci_enable_msi failed - -22
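
The -22 in the log lines above is a negated errno value. A minimal Python sketch for decoding it follows; the regex and the helper name decode_errno are assumptions made for illustration and are not part of any Qubes or Xen tooling.

```python
import errno
import os
import re

# Matches a trailing negative error code such as "failed - -22" or "error: -22!".
ERR_RE = re.compile(r"(?:error:|failed -)\s*-(\d+)")


def decode_errno(line):
    """Return (number, symbolic name, message) for the code in a log line, or None."""
    match = ERR_RE.search(line)
    if match is None:
        return None
    num = int(match.group(1))
    # errno.errorcode maps numbers to names; os.strerror gives the platform's message.
    return num, errno.errorcode.get(num, "UNKNOWN"), os.strerror(num)


line = "iwlwifi 0000:00:01.0: pci_enable_msi failed - -22"
print(decode_errno(line))
```

Error 22 is EINVAL, so the request to enable MSI was rejected as an invalid argument. The log lines alone do not say which argument or why.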
HW42
Feb 22, 2018
> But should the domU's lspci output differ between Qubes and Xen?
That's expected since vanilla Xen doesn't use a stubdom by default (and we have a custom Linux based stubdom).
> qdomu: Capabilities blocks 40, 50, 60, legacy INT(?)
> xdomu: Capability block 40, MSI-X
Are you sure you didn't swap xdomu.txt and qdomu.txt? I would expect them the other way around.
Also, I don't see MSI-X in either (in Qubes that's expected). Why do you think it's using MSI-X?
awokd
Feb 22, 2018
Yes, I'm sure I didn't swap them. Note the lack of Kernel driver in use: ath9k in the Qubes one.
Because it's on IRQ 36. My understanding is Legacy interrupt values go up to 16, MSI up to 32, and MSI-X up to 2048 (but maybe that is folklore).
HW42
Feb 22, 2018
Anyway, I think it's not interrupt related but rather this:
Region 0: Memory at f0100000 (64-bit, non-prefetchable) [disabled] [size=64K]
Note the [disabled]. Please post xl dmesg (ideally with loglvl=all). Dom0 dmesg doesn't hurt either, but it's probably not needed.
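
To compare BAR state between the dom0 and domU dumps without reading them by eye, a small parser for lspci "Region" lines may help. This is a rough sketch: it assumes the line layout shown above, and the function name parse_region and the dictionary keys are made up for illustration.

```python
import re

# Layout assumed from the quoted line:
#   Region N: Memory at <hex> (<attributes>) [flag] [flag] ...
REGION_RE = re.compile(
    r"Region (?P<index>\d+): Memory at (?P<addr>[0-9a-f]+) "
    r"\((?P<attrs>[^)]*)\)(?P<flags>(?: \[[^\]]*\])*)"
)


def parse_region(line):
    """Parse one lspci memory Region line into a dict, or return None."""
    match = REGION_RE.search(line)
    if match is None:
        return None
    flags = re.findall(r"\[([^\]]*)\]", match.group("flags"))
    size = next((f.split("=", 1)[1] for f in flags if f.startswith("size=")), None)
    return {
        "index": int(match.group("index")),
        "address": int(match.group("addr"), 16),
        "attrs": [a.strip() for a in match.group("attrs").split(",")],
        "disabled": "disabled" in flags,
        "size": size,
    }


line = "Region 0: Memory at f0100000 (64-bit, non-prefetchable) [disabled] [size=64K]"
print(parse_region(line))
```

Running it over every Region line of both dumps and diffing the results would show whether [disabled] appears only inside the Qubes VM.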
marmarek
Feb 22, 2018
Member
> I gave one ath9k card to marmarek, but I think he wasnt able to test it just yet.
I've tried and the card isn't even visible on lspci in dom0. But it may be something with my laptop...
awokd
Feb 22, 2018
That [disabled] is interesting. I'd assumed it was an artefact of Qubes hiding PCI devices but when I tested Xen with xen-pciback.hide=(02:00.0) just now, it continued to be enabled.
Attaching the xl dmesg from both.
h01ger
Feb 22, 2018
On Thu, Feb 22, 2018 at 10:18:02AM +0000, Marek Marczykowski-Górecki wrote:
> I gave one ath9k card to marmarek, but I think he wasnt able to test it just yet.
> I've tried and the card isn't even visible on lspci in dom0. But it may be something with my laptop...
what model is that? iirc it was an x230 with coreboot, right? that would
be quite strange indeed...
--
cheers,
Holger
marmarek
Feb 22, 2018
Member
Yes. I'll try another card in that slot (the slot that is working with the intel wifi is too small for this one).
awokd
Feb 24, 2018
@HW42: Noticed something else in that qdomu.txt file: it has the 50 and 60 MSI capabilities but only the standard PCI configuration space (the -xxxx dump only goes up to 0xff). In qdom0 the dump shows the PCIe extended config space. Attempting to follow the logic in xen-4.8.3/tools/qemu-xen/hw/pci/pcie.c was uninformative, so I'm not sure if one has anything to do with the other (or if I'm even in the right area). Could this also be related to the [disabled] memory?
The "missing" configuration space also seems to line up with the range of memory registers the driver crashes on when it attempts to write.
marmarek
Mar 7, 2018
Member
OK, I've tried the card in another slot and it is visible. It crashes sys-net in a very similar way: EPT violation (-w-/r-x). When I switch sys-net to PV, it also crashes, but with a more useful message, very similar to the one from Stack Overflow:
[ 4.324539] BUG: unable to handle kernel paging request at ffffc90001c70040
[ 4.324585] IP: iowrite32+0x2b/0x30
[ 4.324607] PGD 18818067 P4D 18818067 PUD 18817067 PMD 11beb067 PTE 80100000f1500075
[ 4.324665] Oops: 0003 [#1] SMP NOPTI
[ 4.324688] Modules linked in: ath9k(+) ath9k_common ath9k_hw mac80211 ath cfg80211 rfkill e1000e ptp pps_core intel_rapl x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel intel_rapl_perf pcspkr xen_pcifront xenfs xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn u2mfn(O) xen_blkfront
[ 4.324842] CPU: 0 PID: 233 Comm: kworker/0:2 Tainted: G O 4.14.18-1.pvops.qubes.x86_64 #1
[ 4.324891] Workqueue: events work_for_cpu_fn
[ 4.324918] task: ffff88001059db80 task.stack: ffffc900019d4000
[ 4.324952] RIP: e030:iowrite32+0x2b/0x30
[ 4.324973] RSP: e02b:ffffc900019d7cc0 EFLAGS: 00010296
[ 4.325008] RAX: 0000000000000000 RBX: ffff880010f78028 RCX: 0000000000000005
[ 4.325048] RDX: ffffc90001c70040 RSI: ffffc90001c70040 RDI: 0000000000000000
[ 4.325077] RBP: ffff880010f78078 R08: 0000000000000000 R09: 00000000ffffff90
[ 4.325090] R10: 000000000000003f R11: 0000000000000000 R12: ffffffffc03467d0
[ 4.325104] R13: 0000000000000002 R14: 0000000000000100 R15: ffff880010f78028
[ 4.325127] FS: 0000000000000000(0000) GS:ffff880013a00000(0000) knlGS:0000000000000000
[ 4.325142] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.325153] CR2: ffff80000078a800 CR3: 0000000010a58000 CR4: 0000000000042660
[ 4.325175] Call Trace:
[ 4.325195] ath9k_enable_mib_counters+0x4a/0x80 [ath9k_hw]
[ 4.325212] ath9k_hw_init+0x632/0xb00 [ath9k_hw]
[ 4.325226] ? __queue_work+0x420/0x420
[ 4.325241] ath9k_init_device+0x5fb/0xdb0 [ath9k]
[ 4.325256] ? request_threaded_irq+0xfa/0x160
[ 4.325272] ath_pci_probe+0x20e/0x3d0 [ath9k]
[ 4.325287] local_pci_probe+0x3f/0x90
[ 4.325297] ? __schedule+0x3d3/0x850
[ 4.325307] work_for_cpu_fn+0x10/0x20
[ 4.325318] process_one_work+0x181/0x390
[ 4.325328] worker_thread+0x1d7/0x3c0
[ 4.325337] kthread+0xfc/0x130
[ 4.325347] ? process_one_work+0x390/0x390
[ 4.325357] ? kthread_create_on_node+0x70/0x70
[ 4.325368] ret_from_fork+0x35/0x40
[ 4.325378] Code: 48 81 fe ff ff 03 00 48 89 f2 77 1f 48 81 fe 00 00 01 00 76 07 0f b7 d6 89 f8 ef c3 48 c7 c6 5c 8d 0d 82 48 89 d7 e9 95 fe ff ff <89> 3e c3 66 90 48 81 ff ff ff 03 00 77 28 48 81 ff 00 00 01 00
[ 4.325431] RIP: iowrite32+0x2b/0x30 RSP: ffffc900019d7cc0
[ 4.325441] CR2: ffffc90001c70040
[ 4.325452] ---[ end trace 4c9dd820b875aec9 ]---
[ 4.325460] Kernel panic - not syncing: Fatal exception
[ 4.325472] Kernel Offset: disabled
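
The PTE value and the "Oops: 0003" code in the trace above can be decoded with the standard x86-64 page-table bit layout. The sketch below only interprets the bits as printed; the helper names are illustrative, and it makes no claim about which layer (Xen, pciback, or the guest kernel) set up the mapping.

```python
# Standard x86-64 page-table entry flag bits.
PTE_FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
    3: "write-through",
    4: "cache-disable",
    5: "accessed",
    6: "dirty",
    63: "no-execute",
}

FRAME_MASK = 0x000FFFFFFFFFF000  # bits 12..51 hold the page frame address


def decode_pte(pte):
    """Split a PTE value into its frame address and the names of the set flags."""
    flags = [name for bit, name in PTE_FLAGS.items() if (pte >> bit) & 1]
    return {"frame": pte & FRAME_MASK, "flags": flags}


def decode_fault_code(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # 1: protection fault, 0: page not present
        "write": bool(code & 2),         # 1: the access was a write
        "user_mode": bool(code & 4),     # 1: the access came from user mode
    }


pte = decode_pte(0x80100000F1500075)
print(hex(pte["frame"]), pte["flags"])
print(decode_fault_code(0x0003))
```

Read this way, the fault is a kernel-mode write to a page that is present but not marked writable, with frame address 0xf1500000. That fits the iowrite32 in the call trace, though it does not by itself explain why the mapping is read-only.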
h01ger
Mar 10, 2018
I pointed @nbd168 at this and this is what he said:
> I don't believe that the driver writes into wrong memory areas
> I rather think that the pci ranges are not set up correctly
> which is why legitimate accesses are blocked
> but I know too little about pci to know what exactly happens there
> but the register writes on addr < 0x1000 are definitely valid
> who/what is setting up those pci ranges?
> i think the BARs which the pci driver reads from the config registers
> so either the BARs are broken themselves, or they are interpreted differently
awokd
Mar 10, 2018
https://lists.gt.net/xen/devel/439033?page=last
> Is this BAR the same BAR which has the MSI-X table in? For safety, Xen
> has to trap and emulate updates to the MSI/MSI-X configuration. It is
> possible that that logic has gone wrong.
Looks like that thread might be from the same Stackoverflow poster. Seems like his MSI-X interrupts might have been disabled as well. Can I force them somewhere in Qubes? Maybe it's an upstream bug that only shows up with legacy interrupts, but I still don't get why my device and others' are falling back to using legacy ints under Qubes HVM but not Xen.
marmarek
Mar 10, 2018
Member
MSI/MSI-X is broken in PV mode (#3217). But on Qubes HVM, MSI should work...
Relevant changes (possibly breaking MSI for PV) were part of XSA-237. But it was only about explicit enabling MSI/MSI-X by a hypercall, not direct config space write. The point about some trap on config space seems plausible.
There is possibly related code in Xen sources in arch/x86/hvm/vmsi.c, especially functions listed in msixtbl_mmio_ops structure.
I don't have that card plugged in anywhere right now to verify that hypothesis or collect more info. If you have, try collecting lspci -vv output before inserting the module. And also look at the address at which write fails. If that matches MSI address from lspci output, that's probably it.
awokd
Mar 14, 2018
Looking at the PCI bridge in front of the empty slot, it says the same thing under Xen and Qubes:
I/O behind bridge: 0000f000-00000fff [empty]
Memory behind bridge: fff00000-000fffff [empty]
Prefetchable memory behind bridge: fff00000-000fffff [empty]
Crash is at:
(XEN) svm.c:1540:d9v0 SVM violation gpa 0x000000f2020040, mfn 0xf0100, type 5
A different PCI bridge reports Cap [a0] with an MSI address, but the one associated with 02:00.0 reports no MSI capabilities (at least without a module installed.) Oddly, when I put in a different (Express v2) module, it gets an MSI-X interrupt assigned inside the Qubes HVM, the bridge still reports no MSI capabilities, but the device works perfectly.
I'll keep digging, thank you for the suggestions!
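
For reference, the addresses in that SVM violation line can be split into page number and offset with a few shifts. This is a minimal sketch assuming 4 KiB pages; the regex and the name decode_violation are illustrative only.

```python
import re

PAGE_SHIFT = 12  # 4 KiB pages on x86

VIOLATION_RE = re.compile(r"gpa (0x[0-9a-f]+), mfn (0x[0-9a-f]+), type (\d+)")


def decode_violation(line):
    """Extract guest page, in-page offset and machine base address from the log line."""
    match = VIOLATION_RE.search(line)
    if match is None:
        return None
    gpa = int(match.group(1), 16)
    mfn = int(match.group(2), 16)
    return {
        "guest_page": gpa >> PAGE_SHIFT,
        "page_offset": gpa & ((1 << PAGE_SHIFT) - 1),
        "machine_base": mfn << PAGE_SHIFT,
        "type": int(match.group(3)),
    }


line = "(XEN) svm.c:1540:d9v0 SVM violation gpa 0x000000f2020040, mfn 0xf0100, type 5"
for key, value in decode_violation(line).items():
    print(key, hex(value))
```

The machine base works out to 0xf0100000, the same address lspci reports for Region 0, and the in-page offset is 0x40, which is the MIB Control register offset mentioned in the issue description. That appears consistent with the failing access being a write into the first page of the device's BAR.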
awokd commented Feb 19, 2018
Qubes OS version:
R4.0
Affected TemplateVMs:
Steps to reproduce the behavior:
Try to attach AR9280 to sys-net or other HVM. AR9287 also reported to have same behavior.
Expected behavior:
ath9k driver loads without crashing
Actual behavior:
ath9k driver crashes HVM with
General notes:
Filing this here because passing through the same device on the same hardware to an HVM on Xen 4.8.2 and 4.8.3 on Fedora 26 works, as does using it in dom0 in that configuration and under stock Debian Stretch. Not sure if it affects a "broad range" of users as much as Intel wireless, though if there's a bug in handling this type of PCI device it could also affect other similar devices under Qubes. The fix could well be to buy a new device, but it might be helpful to understand why it doesn't work.
https://stackoverflow.com/questions/38387504/xen-guest-atheros-wifi-driver-load-causes-memory-paging-failure has a good description of the problem. He was encountering it under Xen 4.6 instead of Qubes, but I had the same issue (kernel crash instead of domU) when trying to pass it through to a PV under Qubes:
According to the datasheet, this device uses a PCI Express 1.0a Configuration space of 0x00-0x62, DMA accessed registers from 0x0000-0x0FFC, and other registers from 0x1000-0x98FC. For example, the offset 0x40 PCI Express Configuration space register is used for Power Management Capability, while offset 0x0040 DMA device register is used for MIB Control. It has a single 64K BAR and no defined I/O port.
It's that first page of DMA registers that is causing problems. From Xen's perspective, the VM is trying to do an IO write to a page flagged as memory mapped (if I understand the error right), so it crashes. I verified this by commenting out the first couple register writes that were to offsets <0x1000 in the ath9k driver and recompiling it. The crash then occurred later in the driver initialization, but at a different <0x1000 location. Multiple writes to >0x1000 locations during driver initialization were processed successfully.
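
The register layout described above can be written down as a small lookup, which makes it easy to see which 4 KiB page of the 64K BAR a given driver write lands in. This is a sketch built only from the ranges quoted from the datasheet; the function name and region labels are made up for illustration.

```python
PAGE_SIZE = 0x1000      # 4 KiB pages
BAR_SIZE = 64 * 1024    # single 64K BAR, per the description above


def classify_offset(offset):
    """Return (register region, page index) for a byte offset into the BAR."""
    if not 0 <= offset < BAR_SIZE:
        raise ValueError(f"offset {offset:#x} is outside the 64K BAR")
    if offset < 0x1000:
        region = "DMA registers"      # 0x0000-0x0FFC
    elif offset <= 0x98FC:
        region = "other registers"    # 0x1000-0x98FC
    else:
        region = "outside the listed ranges"
    return region, offset // PAGE_SIZE


for off in (0x0040, 0x0FFC, 0x1000, 0x98FC):
    print(hex(off), classify_offset(off))
```

Every offset below 0x1000 falls in page 0, so all of the failing writes hit the same page while the successful ones are on pages 1 and up.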
Related issues: