No Qubes/VMs starting - libxenlight failed to create new-domain in 4.2.1 #9150

Closed
scallyob opened this issue Apr 24, 2024 · 21 comments
Labels
affects-4.2, C: Xen, diagnosed, P: major, R: declined, T: bug

Comments

@scallyob

Qubes OS release

4.2.1

Brief summary

I did a full update on April 23 and rebooted April 24. First reboot since April 1.
Now no Qubes/VMs will start.

Steps to reproduce

  1. Update dom0 and all VMs.
  2. reboot

Expected behavior

VMs set to autostart start up

Actual behavior

"libxenlight failed to create new-domain" pop ups for sys-net, sys-firewall, etc

qvm-ls - shows all VMs halted

/var/log/libvirt/libxl/libxl-driver.log shows:

libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 1: Stubdom 2 for 1 startup: startup timed out
libxl: libxl_create.c:1975: domcreate_devmodel_started: Domain 1:device model did not start -9

This repeats many times with the Domain # changing

(There are also errors related to a PCI device, but these are present for previous boots as well and are not new. The errors above did not appear in the log until today.)
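
To see how often the timeout occurs and for which domains, the log can be scanned for the first of the two lines. This is a minimal Python sketch, assuming the message format matches the line quoted above (other libxl versions may word it differently); `find_stubdom_timeouts` is a hypothetical helper name, not part of any Qubes or Xen tool.

```python
import re

# Pattern derived from the log line quoted in this report; this is an
# assumption about the format, not a documented libxl interface.
TIMEOUT_RE = re.compile(
    r"Domain (\d+): Stubdom (\d+) for \d+ startup: startup timed out"
)


def find_stubdom_timeouts(lines):
    """Return (domain_id, stubdom_id) pairs for each stubdomain startup timeout."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits
```

In dom0 the function could be given the open log file, for example `find_stubdom_timeouts(open("/var/log/libvirt/libxl/libxl-driver.log"))`; reading that file may require root.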

scallyob added the P: default and T: bug labels on Apr 24, 2024
@krystian-hebel

krystian-hebel commented Apr 24, 2024

Good thing I refreshed the list of issues; I was about to report the same. I can confirm that VMs were restarted right after the upgrade and worked until the following boot.

I also checked that after removing all network controllers from sys-net and changing its type to PVH all VMs can be started (obviously without networking), so this seems to be a problem with HVM.

This happened on KGPE-D16 with Opteron 6282 SE, with ASUS firmware 3001 (i.e. no coreboot).

@scallyob
Author

KGPE-D16 with Opteron 6282 SE

Oh no, I'm on same hardware!

@scallyob
Author

Kernel downgrade in dom0 and sys-net did not produce different results.

@marmarek
Member

What can you see in /var/log/xen/console/guest-sys-net-dm.log ?

@scallyob
Author

There's a lot in there, not sure what to look for. Don't see errors or warnings.
(talked about this here: https://forum.qubes-os.org/t/no-qubes-vms-boot-after-latest-updates/26033/14)

@marmarek
Member

Errors can be buried quite deep there... look for anything after starting qemu.

@scallyob
Author

I see it start qemu, with a bunch of options, each time it tries to boot. Nothing obvious to me after that that seems like a problem.

@krystian-hebel

What can you see in /var/log/xen/console/guest-sys-net-dm.log ?

For me it ends with:

[image: screenshot of the end of the log]

If the full log might contain something useful, I can try to get it tomorrow; without networking it will require some finesse.

@marmarek
Member

Indeed nothing obvious there... But one worrying thing is the timing: the "Rescanning PCI Frontend" messages are on stubdomain cleanup, and based on timestamps it's pretty close to starting qemu. AFAIR the startup timeout is 10s, but usually the stubdomain startup takes less than 1s. This is a pretty old system; it may be that a recent workaround for speculative-execution bugs made it significantly slower.

Is it with current-testing (in dom0) enabled or not? Best to identify which update specifically broke it. dnf history may help, but my guess is the Xen package. There is also a Xen update in current-testing since yesterday, maybe this one will help?

@krystian-hebel

Is it with current-testing (in dom0) enabled or not?

No, just default ones.

I also think this may be caused by Xen. That would explain why the initial restart of the VMs succeeded and they failed to come up only after a full reboot. I will check different versions later.

@scallyob
Author

Is it with current-testing (in dom0) enabled or not?

No.

Tried downgrading to: xen-hvm-stubdom-linux-4.2.9-1.fc37.x86_64.rpm AND xen-hvm-stubdom-linux-full-4.2.9-1.fc37.x86_64.rpm

No change.

Tried downgrading xen, xen-hypervisor, xen-libs, xen-licenses and xen-runtime to 4.17.3-4, but it gave me this error:

The operation would result in removing the following protected packages: qubes-core-dom0

So not sure how to proceed.

@marmarek
Member

Tried downgrading xen, xen-hypervisor, xen-libs, xen-licenses and xen-runtime to 4.17.3-4, but it gave me this error:

The operation would result in removing the following protected packages: qubes-core-dom0

So not sure how to proceed.

The xen package needs to match the exact version of python3-xen, so you need to downgrade that one too.
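
One way to spot such a mismatch before attempting the downgrade is to compare the installed versions of the related packages. The sketch below assumes its input is the output of `rpm -q --qf '%{NAME} %{VERSION}-%{RELEASE}\n'` for the packages named in this thread; the helper name `mismatched_versions` and the sample version strings in the usage note are illustrative only.

```python
def mismatched_versions(rpm_output):
    """Group package names by version-release.

    Returns an empty dict when every listed package shares one version,
    otherwise a dict mapping each version to the packages that have it.
    """
    by_version = {}
    for line in rpm_output.splitlines():
        if not line.strip():
            continue  # ignore blank lines in the rpm output
        name, version = line.split(maxsplit=1)
        by_version.setdefault(version.strip(), []).append(name)
    return by_version if len(by_version) > 1 else {}
```

For example, input listing `xen 4.17.3-4` and `python3-xen 4.17.4-1` would return both versions with their packages, which points at python3-xen as the package still to be downgraded.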

@scallyob
Author

scallyob commented Apr 24, 2024

Ok, that allowed the downgrade, now everything is booting.

CORRECTION: everything seems to boot EXCEPT for sys-usb

start failed: Timed out during operation: cannot acquire state change lock (held by monitor=shutdown-event-20) [Fixed: caused by switching from HVM to PV]

andrewdavidwong added the C: Xen, P: major, needs diagnosis, and affects-4.2 labels and removed the P: default label on Apr 25, 2024
@krystian-hebel

The problem is still present in 4.17.4-1 from current-testing.

@krystian-hebel

krystian-hebel commented Apr 26, 2024

All VMs start when booting with spec-ctrl=no-ibpb-entry, which makes me believe this is a performance issue that may have been present for some time, but hidden until XSA-455.

I've noticed the same problem (at least I think it's the same, but I didn't do as much testing as on the KGPE) on an HP t630 with a slightly newer CPU. In that case, sys-usb was failing, which didn't allow me to log in, and I didn't have a PS/2 keyboard at hand. spec-ctrl=no-ibpb-entry helped in that case as well. (The KGPE doesn't have sys-usb because the OS is installed on a USB drive.)

@marmarek is it possible to relax the startup timeout to see if it helps?

@marmarek
Member

The timeout is hardcoded in libxl to 30s (look for LIBXL_STUBDOM_START_TIMEOUT). If 30s is not enough to start, I doubt relaxing it will result in a working system (even if it will start, the stubdomain will likely be too slow for sys-net/sys-usb to work at all...)

@marmarek
Member

So, I'm afraid there is not much hope for this old-ish system... The only way to make the system kinda-usable has a tradeoff with security here, by disabling the mitigation for PV domains (which should mean just stubdomains, make sure you don't have any really untrusted PV qubes) with spec-ctrl=ibpb-entry=no-pv. It does mean that a stubdomain will be able to mount the attack, potentially leaking memory of any other VM (so isolation of sys-net/sys-usb and any other HVM becomes weaker). If that is not an acceptable risk, blame AMD for making a buggy CPU, and replace it with something newer...
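
After rebooting with this option, one way to confirm that Xen actually received it is to look at the `xen_commandline` field printed by `xl info` in dom0. The following is a rough sketch under that assumption; the helper names are hypothetical, and the check only matches the option written as its own `spec-ctrl=...` token, not combined with other comma-separated spec-ctrl settings.

```python
def xen_commandline(xl_info_text):
    """Return the Xen boot options from `xl info` output as a list of tokens."""
    for line in xl_info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "xen_commandline":
            return value.split()
    return []


def has_xen_option(xl_info_text, option="spec-ctrl=ibpb-entry=no-pv"):
    """True if `option` appears as a separate token on the Xen command line."""
    return option in xen_commandline(xl_info_text)
```

The text to check could be captured with `subprocess.run(["xl", "info"], capture_output=True, text=True, check=True).stdout`, run as root in dom0.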

@andyhhp

andyhhp commented Apr 26, 2024

Yeah sorry... you need https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf and the update from Feb this year in order to have an AMD CPU not needing this mitigation for safety

@scallyob
Author

by disabling the mitigation for PV domains (which should mean just stubdomains, make sure you don't have any really untrusted PV qubes) with spec-ctrl=ibpb-entry=no-pv

  1. So this would be a better medium-term solution than excluding xen from updates? (Short-term that is what I am doing: excluding xen updates. Long-term I guess I need to look into buying a new computer.)
  2. For someone who doesn't know how to do this, how hard is this to do?

Thanks for looking into this, despite the disappointing conclusion.

@RA-Kooi

RA-Kooi commented Apr 26, 2024

  1. So this would be a better medium-term solution than excluding xen from updates? (Short-term that is what I am doing: excluding xen updates. Long-term I guess I need to look into buying a new computer.)

Yes, by not updating Xen you will remain exposed to vulnerabilities in Xen itself as well as to this CPU bug.

  2. For someone who doesn't know how to do this, how hard is this to do?

Depending on how Xen is started it could differ, but chances are there's a file called xen.cfg in /boot. In this file you will see something like this:

[xen]
options=

Simply append spec-ctrl=ibpb-entry=no-pv to the end of the options line.
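
The same edit can be scripted. This is a minimal Python sketch, assuming the file really contains an `options=` line as shown above; `append_xen_option` is a hypothetical helper, and it is idempotent so that running it twice does not duplicate the option. On systems that boot Xen through GRUB rather than xen.cfg, the option would instead go on the Xen command line in the GRUB configuration (typically `GRUB_CMDLINE_XEN_DEFAULT` in /etc/default/grub), which this sketch does not handle.

```python
def append_xen_option(cfg_text, option="spec-ctrl=ibpb-entry=no-pv"):
    """Append `option` to every options= line that does not already contain it."""
    out = []
    for line in cfg_text.splitlines():
        if line.startswith("options="):
            current = line[len("options="):].split()
            if option not in current:
                # Rebuild the line with the new option at the end.
                line = "options=" + " ".join(current + [option])
        out.append(line)
    return "\n".join(out) + "\n"
```

Keep a backup copy of the original file before writing the result back, since a broken Xen command line can prevent the system from booting.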

andrewdavidwong added the diagnosed and R: declined labels and removed the needs diagnosis label on Apr 26, 2024

This issue has been closed as "declined." This means that the issue describes a legitimate bug (in the case of bug reports) or proposal (in the case of enhancements and tasks), and it is actionable, at least in principle. Nonetheless, it has been decided that no action will be taken on this issue. Here are some examples of reasons why an issue may be declined:

  • No solution can be found.
  • The proposed action is not possible.
  • The proposed action would weaken security to an unacceptable degree.
  • The proposed action would be too costly (in time, money, or other resources) relative to the benefits it would provide.
  • The proposed action would make some things better while making other things worse, and the trade-off is not worthwhile.

These are just general examples. If the specific reason for this particular issue being declined has not already been provided, please feel free to leave a comment below asking for an explanation.

We respect the time and effort you have taken to file this issue, and we understand that this outcome may be unsatisfying. Please accept our sincere apologies and know that we greatly value your participation and membership in the Qubes community.

If anyone reading this believes that this issue was closed in error or that the resolution of "declined" is not accurate, please leave a comment below saying so, and the Qubes team will review this issue again. For more information, see How issues get closed.

github-actions bot closed this as not planned on Apr 26, 2024
6 participants