Hardware reset during installation and boot of R4.2 on Ryzen 9 7950X #8322

Open
Eric678 opened this issue Jul 4, 2023 · 19 comments
Labels
affects-4.2 This issue affects Qubes OS 4.2. C: installer C: usb proxy hardware support needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. P: major Priority: major. Between "default" and "critical" in severity. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@Eric678

Eric678 commented Jul 4, 2023

Qubes OS release

R4.2.0-rc1 + Ryzen 9 7950X + Gigabyte X670E motherboard

Brief summary

Installation proceeds normally until just after "Configure networking", when the hardware resets.
Subsequent boots reset just after the disk password is entered.

Steps to reproduce

Run a default installation of R4.2.0-rc1.

Expected behavior

No hardware resets.

Actual behavior

As noted.

The problem appears to be caused by a single USB controller being mapped into sys-usb.
There are 5 USB controllers across the CPU and the 670 chipset; only one causes a problem.
It is the last one in the devices list, at address 37:00.0.

The workaround is to add the qubes.skip_autostart option to the Linux kernel boot parameters at any boot after installation, then unmap this controller from sys-usb once the system is up.
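
The unmap step can be scripted from dom0 once the system is up. A minimal sketch, assuming the Qubes 4.2 `qvm-pci` device naming of `dom0:BB_DD.F` (an underscore in place of the colon between bus and device); `qubes_pci_ident` and `detach_command` are hypothetical helper names, and the command is only printed, not executed:

```python
import re

def qubes_pci_ident(bdf: str, backend: str = "dom0") -> str:
    """Convert an lspci-style address such as '37:00.0' to 'dom0:37_00.0'.

    The 'backend:BB_DD.F' format is an assumption about qvm-pci naming.
    An optional PCI domain prefix ('0000:') is accepted and dropped.
    """
    m = re.fullmatch(
        r"(?:[0-9a-fA-F]{4}:)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])", bdf
    )
    if m is None:
        raise ValueError(f"not a PCI address: {bdf!r}")
    bus, dev, fn = m.groups()
    return f"{backend}:{bus.lower()}_{dev.lower()}.{fn}"

def detach_command(qube: str, bdf: str) -> list:
    """Build (but do not run) the command that removes a controller from a qube."""
    return ["qvm-pci", "detach", qube, qubes_pci_ident(bdf)]

if __name__ == "__main__":
    # 37:00.0 is the controller address reported in this issue.
    print(" ".join(detach_command("sys-usb", "37:00.0")))
```

The printed command would be run in a dom0 terminal while sys-usb is halted, which is the state qubes.skip_autostart leaves it in.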

I suspect that it is the on-CPU controller used for the mouse and keyboard, since others running different VM systems on the same CPU report that mapping a USB controller with active devices causes a hardware reset.

@Eric678 Eric678 added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Jul 4, 2023
@andrewdavidwong andrewdavidwong added C: installer hardware support C: usb proxy needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Jul 5, 2023
@andrewdavidwong andrewdavidwong added this to the Release 4.2 milestone Jul 5, 2023
@DemiMarie

I suspect this will need a hardware quirk in the installer.

@Eric678
Author

Eric678 commented Jul 6, 2023

A simpler workaround turns out to be leaving the IOMMU disabled during installation (the above motherboard defaults to auto and does not know about Qubes). The installation then exits, seconds before the point where the hardware reset would occur, with a missing IOMMU error starting sys-firewall (presumably it was actually sys-net). Installation exits cleanly and one can immediately log in and remove the USB controller from sys-usb. I have no idea what I am missing out on in the install, and this technically invalidates all further testing of R4.2. It does seem to work rather well, actually...

@andrewdavidwong andrewdavidwong added the affects-4.2 This issue affects Qubes OS 4.2. label Aug 8, 2023
@andrewdavidwong andrewdavidwong removed this from the Release 4.2 milestone Aug 13, 2023
@Eric678
Author

Eric678 commented Sep 23, 2023

Quick check on rc3: the problem is still there. However, a clean install can be made by adding the "qubes.skip_autostart" option to vmlinuz on the 2nd pass of installation. The installer does take notice, but oddly sys-usb is not started while sys-firewall and sys-net are, which is probably a bug. Just take the last USB controller out of sys-usb, start it, and proceed as normal. The only problem I am having with rc3 is USB devices being a bit flaky, which may be related to whatever this problem is.

@DemiMarie

How would one add the needed quirk to Anaconda?

@0spinboson

Is there a phase during installation where the installer boots sys-usb after assigning all usb devices to it?

@marmarek
Member

> How would one add the needed quirk to Anaconda?

I don't think it's the right thing to do, at least with the current info here. It would potentially leave dom0 exposed to some USB devices, while user would have impression they are all isolated in sys-usb (since that was selected during install). The proper solution is ofc make it not crash. But as a workaround user can choose to not create sys-usb during install, and later create it by hand and remove the device from there. This way they will know some device is excluded and there is no risk of leaving it in dom0 without user knowledge. Such instruction should also explain the risk.

But, if the device really should stay in dom0, not as a workaround for a crash, but as really intended behavior, then we have a mechanism for that - rd.qubes.dom0_usb=37:00.0 (example value) option to the kernel. It will leave this controller in dom0, and also salt will respect this setting when creating sys-usb. It can be added to the kernel at the start of installation in grub menu (anaconda will carry the kernel option to the final system too), or maybe somewhere within anaconda automatically (of which I'm very much not convinced it's the right thing to do).
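
For illustration, the option can be merged into an existing kernel command line as below. A minimal sketch: `add_kernel_option` is a hypothetical helper, `37:00.0` is the example value from the paragraph above, and the starting command line is made up.

```python
def add_kernel_option(cmdline: str, option: str) -> str:
    """Return cmdline with option appended, replacing any option with the same key."""
    key = option.split("=", 1)[0]
    # Drop an earlier occurrence of the same key so the result is idempotent.
    tokens = [t for t in cmdline.split() if t.split("=", 1)[0] != key]
    tokens.append(option)
    return " ".join(tokens)

if __name__ == "__main__":
    # Hypothetical starting command line; only the appended option comes from this thread.
    print(add_kernel_option("ro quiet", "rd.qubes.dom0_usb=37:00.0"))
```

The same helper works for flag-style options such as qubes.skip_autostart, which has no value part.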

@DemiMarie

Has this been reported to Gigabyte? I wonder if SMM is getting an interrupt it did not expect to get and crashes as a result.

@DemiMarie

>> How would one add the needed quirk to Anaconda?
>
> I don't think it's the right thing to do, at least with the current info here. It would potentially leave dom0 exposed to some USB devices, while user would have impression they are all isolated in sys-usb (since that was selected during install). The proper solution is ofc make it not crash. But as a workaround user can choose to not create sys-usb during install, and later create it by hand and remove the device from there. This way they will know some device is excluded and there is no risk of leaving it in dom0 without user knowledge. Such instruction should also explain the risk.

What if the device was attached to nothing? Don’t assign it to sys-usb, but don’t assign it to any other qube (including dom0) either. Assign it to Xen’s quarantine domain. That might avoid the crash without the security consequences.

Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.
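
One way to see which reset mechanisms Linux would consider for the controller is the `reset_method` attribute that recent kernels (5.15 and later, as far as I know) expose in sysfs. A minimal sketch, assuming that attribute is present in dom0 and that the address carries the PCI domain prefix (`0000:37:00.0` is the example controller from this thread); `reset_methods` is a hypothetical helper name:

```python
from pathlib import Path

def reset_methods(bdf: str, sysfs_root: str = "/sys") -> list:
    """Return the reset methods listed for a PCI device, e.g. ['pm', 'bus'].

    An empty list means the attribute is missing, unreadable, or empty.
    """
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "reset_method"
    try:
        return path.read_text().split()
    except OSError:
        return []

if __name__ == "__main__":
    # Address is the example controller from this issue, with domain prefix.
    print(reset_methods("0000:37:00.0"))
```

If `pm` shows up in that list for the problem controller and not for the others, that would be a data point for the PM reset theory; it would not prove it.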

@marmarek
Member

> Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.

That's highly unlikely. A much more likely cause is either dom0 or xen panic...

And still, I don't want to waste time on elaborate workarounds (there are already a few simple ones in this thread) until we know for sure that a proper fix is not achievable.

@DemiMarie

>> Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.
>
> That's highly unlikely. A much more likely cause is either dom0 or xen panic...
>
> And still, I don't want to waste time on elaborate workarounds (there are already a few simple ones in this thread) until we know for sure that a proper fix is not achievable.

Is “assign to quarantine domain” simple or elaborate?

@brxken128

This is reproducible on my 7950X with an Asus Strix X670E-F, so I don't think it's Gigabyte-specific. I also have a 7900XTX which may not be helping things.

@Tehvan

Tehvan commented Oct 16, 2023

Also happens to me on 7950X with Asrock X670E Steel Legend. I have two USB controllers that cause a reboot -- 16:00.4 and 17:00.0

@neowutran

I have the same issue with my Asus Strix X670E-F.
I have one "USB controller" that always causes a reboot: 12:00.0

However, I am not sure what it really is. I tried every USB port on my setup, and everything works without this "USB controller".

(
I have two unused internal USB 2.0 ports on my motherboard.
I have one USB controller that I can pass through in Qubes OS, but this controller never receives any USB device. I suspect it is the USB controller for my two unused internal USB 2.0 ports.
)

For people having this issue: are you missing any USB port or functionality without the "USB controller" that you cannot pass through?

Result of "sudo lspci -vvs 12:00.0"

12:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b8 (prog-if 30 [XHCI])
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 46
	Region 0: Memory at fc000000 (64-bit, non-prefetchable) [size=1M]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
		Vector table: BAR=0 offset=000fe000
		PBA: BAR=0 offset=000ff000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [450 v1] Lane Margining at the Receiver <?>
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci

The uncommon lines in this:

  • "Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-"
  • No "Latency" line
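
Those flags can be pulled out of such a dump mechanically when comparing controllers across boards. A minimal sketch, assuming `lspci -vv` output in the format shown above; `pm_summary` is a hypothetical helper, and the sample is three lines copied from that dump. As far as I understand the PCI power management capability, `NoSoftRst-` means the function may lose its state on a D3hot to D0 transition, and `FLReset-` means Function Level Reset is not advertised.

```python
import re

def pm_summary(lspci_text: str) -> dict:
    """Extract power state, NoSoftRst and FLReset flags from `lspci -vv` text."""
    summary = {"power_state": None, "no_soft_reset": None, "flr": None}
    # Power Management status line, e.g. "Status: D3 NoSoftRst- PME-Enable+ ..."
    m = re.search(r"Status:\s+(D[0-3](?:hot|cold)?)\s+NoSoftRst([+-])", lspci_text)
    if m:
        summary["power_state"] = m.group(1)
        summary["no_soft_reset"] = m.group(2) == "+"
    # Function Level Reset capability bit from the PCIe DevCap block.
    m = re.search(r"FLReset([+-])", lspci_text)
    if m:
        summary["flr"] = m.group(1) == "+"
    return summary

SAMPLE = """\
    Capabilities: [50] Power Management version 3
        Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
"""

if __name__ == "__main__":
    print(pm_summary(SAMPLE))
```

Running the same function over the dumps of the controllers that do pass through cleanly would show whether the D3 state and the missing FLR are actually specific to the problem device.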

@Tehvan

Tehvan commented Oct 21, 2023

On mine, 17:00.0 is the motherboard LED controller. But since there is no problem when not using sys-usb, it should be a passthrough problem (i.e. IOMMU groups), right?

@0spinboson

iommu groups or soft reset

@Eric678
Author

Eric678 commented Oct 21, 2023

4.2-rc4 6.5.6: still there. The behavior is different: the install was normal, I left the machine for the 2nd pass, and when I returned much later it was shut down. Bringing it up with qubes.skip_autostart, there were 3 USB controllers in sys-usb that were unknown, and all had to be removed for it to start. I am guessing not everything made it to disk before the reset.
2nd try, passing qubes.skip_autostart to the 2nd pass: it completed the Anaconda progress bar, dropped back to the console, finished systemd-tmpfiles-clean.service, then was stuck at "Job initial-setup.service/start running" for a couple of hours before I reset the machine. I took the last USB controller out of sys-usb and all seemed OK. All USB ports appear to be working (13 exposed on the outside of the motherboard, including mouse and keyboard, plus 1 I am using on the motherboard internally).
There is definitely a problem with writing to USB storage devices, which I will post separately.

[ed] While writing up that issue I had a different event: an instant power off while typing here. I had been doing various tests on USB ports and had left a storage device plugged into one of the controllers on the 670 chipset. On trying to boot I got the same power off after entering the disk password. Suspecting sys-usb, I took a couple more devices out and could then get up and running. I then noticed the USB drive on the back panel, removed it, and could put those devices back in sys-usb and boot OK. So it looks like all it takes to cause a reset or power off on start is a device plugged into a port that is mapped to sys-usb. I did plug the mouse and keyboard into the only 2 USB 2.0/1.1 ports, which are on a USB 2.0 hub directly on the CPU, hence my original suspicion.

@andrewdavidwong andrewdavidwong added P: major Priority: major. Between "default" and "critical" in severity. and removed P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Oct 22, 2023
@Eric678
Author

Eric678 commented Dec 4, 2023

rc5-latest test did not get very far: debian-12-xfce: qubes.PostInstall service failed. See attached. No other reports? Media OK.
Installing encrypted on SATA SSD while another copy (current stable) encrypted on different drive. This worked above for rc4.
20231203

@Eric678
Author

Eric678 commented Dec 30, 2023

4.2.0 6.6.2 did not have the above installation problem. I still get a power off when starting sys-usb if the last USB controller is mapped. I am not getting the power off/reset if a storage device is plugged into another controller when sys-usb is started; however, sys-usb does go into a loop of "device available" and "device removed" notifications every second, which is cleared by removing the storage device.
Note that sys-net and sys-firewall are autostarted even if qubes.skip_autostart is passed to the kernel.

@krystian-hebel

I can see the same on a Supermicro M11SDV-4C-LN4F. Here's the serial log from an attempted boot that resulted in a hard restart:
xen.log

No panic, nothing unexpected in the last lines. I'm not sure why the first lines (5th and 6th) look as they do. I had an issue with another Supermicro board (X11-something) where the output was heavily modified by the BMC (lines printed out of order with heavy jumping with ANSI escape codes, \n without \r or \n after each character depending on BIOS settings etc.), but here everything seems to work reliably, except those two lines.

I can start the OS with qubes.skip_autostart and sys-usb starts only with USB controller disabled. Unfortunately, this platform has just one controller and most likely I'll need it at some point.

SergiiDmytruk added a commit to TrenchBoot/openqa-tests-qubesos that referenced this issue Apr 2, 2024
Need this for Supermicro MBD-M11SDV-4C-LN4F which resets if sys-usb is
in use.

See
QubesOS/qubes-issues#8322 (comment)

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>