dom0 boot loop with kernel-latest-5.14.15 #7089

Closed
icequbes1 opened this issue Nov 26, 2021 · 4 comments
Labels
C: kernel
hardware support
P: default (Priority: default. Default priority for new issues, to be replaced given sufficient information.)
T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)

Comments

@icequbes1

icequbes1 commented Nov 26, 2021

Qubes OS release

R4.0, stable + security-testing repos enabled

Brief summary

After kernel-latest-5.14.15 was pushed to stable on November 22, 2021, an AMD Ryzen machine that uses amdgpu enters a boot loop. OP's hypothesis is that this is related to AMD/amdgpu, but this has not been proven.

Steps to reproduce

  1. Power on

Expected behavior

  1. Disk passphrase is requested

Actual behavior

  1. Disk passphrase is not requested
  2. Machine immediately reboots

Additional info

Initial comments were provided in [0], where I stated:

  1. xen loads
  2. dom0 loads, kernel output shown
  3. dom0: "Please enter passphrase for disk...." text-mode prompt seen very briefly
  4. dom0 kernel: amdgpu: Topology: Add APU mode
  5. dom0 kernel: fb0: switching to amdgpudrmfb from EFI VGA
  6. reboot

On successful boots with a working dom0 kernel (5.13.6-1), lots of amdgpu output follows immediately after this point, so my guess is that something in amdgpu broke between 5.13.6 and 5.14.15.

After a coarse bisect by kernel version (with xen 4.8.5-36), the observations are as follows:

  1. kernel-latest-5.13.6: no issues
  2. kernel-latest-5.14.10: no issues
  3. kernel-latest-5.14.15: dom0 boot loop
  4. kernel-latest-5.14.16: dom0 boot loop
  5. kernel-latest-5.14.17: no issues
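The observations above can be reduced mechanically to the version pairs where behavior changed. The following is a minimal sketch, not part of the original report: the function names are illustrative, and it only orders the tested package versions; it does not bisect kernel commits.

```python
# Observations copied from the list above: version -> True if dom0 booted.
observations = {
    "5.13.6": True,
    "5.14.10": True,
    "5.14.15": False,
    "5.14.16": False,
    "5.14.17": True,
}

def version_key(version):
    """Turn '5.14.15' into (5, 14, 15) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))

def find_transitions(results):
    """Return two lists of adjacent (older, newer) version pairs:
    regressions (good -> bad) and fixes (bad -> good)."""
    ordered = sorted(results, key=version_key)
    regressions, fixes = [], []
    for older, newer in zip(ordered, ordered[1:]):
        if results[older] and not results[newer]:
            regressions.append((older, newer))
        elif not results[older] and results[newer]:
            fixes.append((older, newer))
    return regressions, fixes

regressions, fixes = find_transitions(observations)
print("regressed between:", regressions)
print("fixed between:", fixes)
```

The result only brackets the change between tested versions: untested releases between 5.14.10 and 5.14.15 could still be either good or bad.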

Based on the above, the fix appears to have landed in 5.14.17. The 5.14.17 changelog does reveal two notable changes in amdgpu, one of which describes a certain feature as "severely broken" and reverts it [1]. I have not bisected by kernel commit to verify this and cannot state that this is the reason for the boot loop; only that the boot loop was no longer observed in this version.

@icequbes1 icequbes1 added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Nov 26, 2021
@ydirson

ydirson commented Nov 27, 2021

@icequbes1 which AMD GPU(s) do you have on this machine ?

@icequbes1
Author

icequbes1 commented Nov 27, 2021 via email

@andrewdavidwong andrewdavidwong added C: kernel hardware support needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Nov 28, 2021
@andrewdavidwong andrewdavidwong added this to the Release 4.0 updates milestone Nov 28, 2021
@isodude

isodude commented Dec 21, 2021

Note:

I've been down this road: there are at least 5 properly broken commits between 5.14 and 5.16. There's

  • A patch for shifting early memory allocation
  • PCI/MSI broke, fixed somewhere around 5.15-rc4 I think
  • Offlining CPUs broke in 5.15-rc4+, solved in 5.16 I think (affecting both Xen and non-Xen)
    ..

The PCI problem is top of my mind and can be isolated by adding amdgpu.msi=0 to the kernel command line.
The early memory allocation thing needs to be patched.
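
For anyone scripting the change, here is a minimal sketch of adding `amdgpu.msi=0` to a command-line string without duplicating the option. It only manipulates a string; where the dom0 kernel command line is actually stored depends on how the machine boots, and the helper name `add_kernel_param` and the sample command line are hypothetical.

```python
def add_kernel_param(cmdline, param):
    """Append param to a space-separated kernel command line,
    dropping any existing setting of the same key first."""
    key = param.split("=", 1)[0]
    kept = [tok for tok in cmdline.split() if tok.split("=", 1)[0] != key]
    return " ".join(kept + [param])

# Hypothetical existing command line, for illustration only.
current = "root=/dev/mapper/example ro quiet"
print(add_kernel_param(current, "amdgpu.msi=0"))
```

Dropping an existing `amdgpu.msi=...` token before appending keeps the function idempotent, so running it twice does not stack conflicting values.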

If you want you can fetch the log by:
Xen cmdline: console=vga vga=keep noreboot
Linux cmdline: console=hvc0
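
Once the output has been captured as text (for example transcribed from a photo of the screen), a small helper can pull out the last lines before the reboot that mention the driver. This is a sketch under assumptions: the function name is illustrative, and the sample text is assembled from the messages quoted earlier in this issue, not from a real capture.

```python
def last_lines_mentioning(log_text, needle, context=2):
    """Return the last line containing needle (case-insensitive),
    preceded by up to `context` lines of leading context."""
    lines = log_text.splitlines()
    for index in range(len(lines) - 1, -1, -1):
        if needle.lower() in lines[index].lower():
            return lines[max(0, index - context): index + 1]
    return []

# Sample assembled from messages quoted in this issue; a real capture will differ.
sample = """\
Please enter passphrase for disk....
amdgpu: Topology: Add APU mode
fb0: switching to amdgpudrmfb from EFI VGA
"""
for line in last_lines_mentioning(sample, "amdgpu"):
    print(line)
```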

I found that my camera had OCR support, which helped a lot :) I can pinpoint the patches if needed.

@icequbes1
Author

icequbes1 commented May 1, 2022

Closing, as a later kernel-latest version (5.16) did not exhibit the issue.

@andrewdavidwong andrewdavidwong removed the needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. label May 2, 2022
4 participants