Sometimes qubes don't start the first time - If Whonix/Kicksecure Hardening Features for Testers Enabled #7959

Closed
TNTBOMBOM opened this issue Dec 27, 2022 · 21 comments
Labels
affects-4.1 (This issue affects Qubes OS 4.1.)
C: Xen
P: default (Default priority for new issues, to be replaced given sufficient information.)
R: self-closed (Voluntarily closed by the person who opened it before another resolution occurred.)
T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)

Comments

@TNTBOMBOM

How to file a helpful issue

Qubes OS release

4.1

Brief summary

Sometimes a qube/app-VM doesn't start on the first attempt and tells me to check /var/log/xen/console

Steps to reproduce

Just start the qube/appvm

Expected behavior

Should start normally

Actual behavior

/var/log/xen/console:

whonix-ws.txt

Workaround

Just start it again.

TNTBOMBOM added the P: default and T: bug labels on Dec 27, 2022
@marmarek
Member

The qube seems to start very slowly. I see a few services that take a long time to start (over 10 s), including "SUID, SGID, Capability and File Permission Hardening" and "Boot Clock Randomization" (the names seem to be mangled in the log). @adrelanos, do you see anything in those that could potentially take a long time?
One idea might be to increase the initial memory for such a qube.
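
For reference, a hedged sketch of how that could be done from dom0 with qvm-prefs (whonix-ws-16-dvm is only a placeholder qube name):

# Show the current initial and maximum memory for the qube.
qvm-prefs whonix-ws-16-dvm memory
qvm-prefs whonix-ws-16-dvm maxmem
# Raise the initial memory so early boot services have more headroom.
qvm-prefs whonix-ws-16-dvm memory 800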

@TNTBOMBOM
Author

TNTBOMBOM commented Dec 27, 2022 via email

@marmarek
Member

marmarek commented Dec 27, 2022

Why is it not starting in the first place?

There is a start timeout of 60s. You can increase it if you want (qrexec_timeout property), but it shouldn't be necessary on a normally functioning system.
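
For reference, increasing that property from dom0 could look like this (the qube name is a placeholder):

# Default is 60 seconds; double it for the affected qube.
qvm-prefs whonix-ws-16-dvm qrexec_timeout 120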

If it is a RAM issue, then is the dynamic 400-4000 range not really that dynamic, especially at boot time?

Dynamic memory management kicks in when the relevant service starts within the qube; before that, the qube stays at the initial amount.

@adrelanos
Member

I am looking at the log and ignoring everything before "Logfile Opened" (that is, everything from before the previous shutdown); that is useful for seeing just the last startup sequence.

I didn't see any slowness during the boot process.

[2022-12-27 12:16:43] [    1.961577] acpi LNXCPU:7d: Still not present
[2022-12-27 12:16:43] [    1.961598] acpi LNXCPU:7e: Still not present
[2022-12-27 12:16:43] [    1.961618] acpi LNXCPU:7f: Still not present

Is this an issue?

[2022-12-27 12:16:45] Waiting for /dev/xvda* devices...
[2022-12-27 12:16:45] Qubes: Doing R/W setup for TemplateVM...
[2022-12-27 12:16:46] [    4.496819]  xvdc: xvdc1 xvdc3
[2022-12-27 12:16:46] Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
[2022-12-27 12:16:46] no label, UUID=5b08d43b-9911-4c6f-8db7-f057762e86b4
[2022-12-27 12:16:46] Qubes: done.
[2022-12-27 12:16:46] [    4.837659] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[2022-12-27 12:16:46] Waiting for /dev/xvdd device...
[2022-12-27 12:16:46] [    4.860732] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[2022-12-27 12:16:46] [    4.870951] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[2022-12-27 12:16:46] mount: /tmp/modules: WARNING: device write-protected, mounted read-only.
[2022-12-27 12:16:46] [    5.063948] EXT4-fs (xvda3): re-mounted. Opts: (null). Quota mode: none.
[2022-12-27 12:16:46] switch_root: failed to mount moving /dev to /sysroot/dev: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /dev
[2022-12-27 12:16:46] switch_root: failed to mount moving /proc to /sysroot/proc: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /proc
[2022-12-27 12:16:46] switch_root: failed to mount moving /sys to /sysroot/sys: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /sys
[2022-12-27 12:16:46] switch_root: failed to mount moving /run to /sysroot/run: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /run
[2022-12-27 12:16:48] libbpf: failed to find valid kernel BTF
[2022-12-27 12:16:48] libbpf: Error loading vmlinux BTF: -3
[2022-12-27 12:16:48] libbpf: failed to load object 'iterators_bpf'
[2022-12-27 12:16:48] libbpf: failed to load BPF skeleton 'iterators_bpf': -3
[2022-12-27 12:16:48] Failed load could be due to wrong endianness

This seems to be a major issue?

@adrelanos
Member

Since when does this issue happen?

[ ***  ] A start job is running for SUID, SGID, Capability and File Permission Hardening (37s / no limit)
[2022-12-27 12:17:33] [  OK  ] Finished SUID, SGID, Capability and File Permission Hardening.

SUID Disabler and Permission Hardener:
https://www.kicksecure.com/wiki/SUID_Disabler_and_Permission_Hardener

At the time of writing, this is an opt-in, testers-only feature. It might be in use for testing here.

In theory there could be a bug that makes it run infinitely (an infinite loop) or for a very long time, but it could also be a follow-up problem caused by the mount-related issue above.

[2022-12-27 12:17:39] [  *** ] (8 of 8) A start job is running for Boot Clock Randomization (46s / no limit)
[2022-12-27 12:17:40] [   ***] (8 of 8) A start job is running for Boot Clock Randomization (46s / no limit)

Boot Clock Randomization:
https://www.kicksecure.com/wiki/Boot_Clock_Randomization

This is a stable feature that has never caused boot issues. It has far less complexity than the script above: the systemd unit runs bash code that is probably under 200 lines. I don't see how it could run very slowly unless there are major underlying system issues that make even simple Linux utilities slow.

For example, it runs od -An -N2 -i /dev/random, and that command should finish in less than a second. But if the system is broken (extremely slow), then anything will take ages, including Boot Clock Randomization.
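
As a quick sanity check inside the affected qube, timing that exact command should show whether it is the bottleneck:

# Should complete almost instantly on a healthy system.
time od -An -N2 -i /dev/random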

On handling corner cases:

I am happy to make the source code more resilient against such system issues. For example, perhaps all of these systemd units could have a timeout instead of no timeout. The unit is expected to finish within a second, but if it takes, say, longer than 20 seconds (20 times longer than expected), systemd could kill it. That wouldn't fix the underlying system issues, but it might prevent the Qubes VM boot process from being blocked. The user would still end up with a system that isn't functional according to the vendor specification; in this case, the user's system would lack Boot Clock Randomization. A fail-safe implementation could intentionally break networking if Boot Clock Randomization fails (a systemd reverse dependency).
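
A minimal sketch of that timeout idea, assuming the unit is named bootclockrandomization.service (the actual unit name may differ), applied in the Template so it persists:

# Create a drop-in that bounds the start job to 20 seconds.
sudo mkdir -p /etc/systemd/system/bootclockrandomization.service.d
sudo tee /etc/systemd/system/bootclockrandomization.service.d/30_timeout.conf >/dev/null <<'EOF'
[Service]
# Normally finishes within a second; kill the start job after 20 seconds.
TimeoutStartSec=20
EOF
sudo systemctl daemon-reload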

@marmarek
Member

[2022-12-27 12:16:46] switch_root: failed to mount moving /dev to /sysroot/dev: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /dev
[2022-12-27 12:16:46] switch_root: failed to mount moving /proc to /sysroot/proc: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /proc
[2022-12-27 12:16:46] switch_root: failed to mount moving /sys to /sysroot/sys: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /sys
[2022-12-27 12:16:46] switch_root: failed to mount moving /run to /sysroot/run: Invalid argument
[2022-12-27 12:16:46] switch_root: forcing unmount of /run

I think switch_root expects a full-blown systemd in the initramfs, with all those filesystems mounted, but we have just a simple script that mounts the rootfs. systemd mounts those filesystems later anyway, so this should be harmless.

@marmarek
Member

At the time of writing, this is an opt-in, testers-only feature. It might be in use for testing here.

Maybe adding set -x somewhere (where?) will help with debugging?

andrewdavidwong added the C: Xen and needs diagnosis labels on Dec 30, 2022
andrewdavidwong added this to the Release 4.1 updates milestone on Dec 30, 2022
@adrelanos
Member

Could you please comment in

set -x

in the file

/usr/libexec/security-misc/permission-hardening

? @TNTBOMBOM

@TNTBOMBOM
Author

TNTBOMBOM commented Jan 9, 2023 via email

@adrelanos
Member

The log should now be more verbose after a reboot. If so, could you please share it?

@TNTBOMBOM
Author

The log should now be more verbose after a reboot. If so, could you please share it?

var-log-xen-console(verbose).txt

@marmarek
Member

marmarek commented Jan 9, 2023

I don't see any more details there. You need to make the change in the template, otherwise it gets reverted on reboot. You may also want to add systemd.journald.forward_to_console=1 to kernelopts, to get the full journal on the console (it will help with getting it out in case of a failed VM startup).
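
For example, from dom0 (the qube name is a placeholder; keep whatever options are already set rather than overwriting them):

# Check the current value first.
qvm-prefs whonix-ws-16-dvm kernelopts
# Then append the journald option to the existing options, e.g.:
qvm-prefs whonix-ws-16-dvm kernelopts "<existing options> systemd.journald.forward_to_console=1"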

@TNTBOMBOM
Author

Check this log:

25-1logs.txt

If nothing new shows up, then it has not been caught by this logging method, and we need to use something else.

@adrelanos
Member

  • The systemd journal retrieval method is functional.
  • SUID Disabler and Permission Hardener timing. Too slow?
[2023-01-25 03:57:28]          Starting SUID, SGID, Capability and File Permission Hardening...
[2023-01-25 03:57:36] [   16.761646] systemd[1]: Finished SUID, SGID, Capability and File Permission Hardening.
  • SUID Disabler and Permission Hardener isn't blocking either, since, as seen in the journal, other actions are performed at the same time.
  • The timing doesn't show any issues.
  • I don't see how it could be the cause of this issue.

I don't see anything wrong in that log.


My earlier instructions for enabling debugging of SUID Disabler and Permission Hardener were probably wrong. More detailed instructions:

In Template:

sudoedit /usr/libexec/security-misc/permission-hardening

Change from

#!/bin/bash

to

#!/bin/bash
set -x

These instructions won't survive updates of security-misc.

For instructions that do survive updates (they log a bit less, but probably still enough):

sudoedit /etc/permission-hardening.d/10_user.conf

Add.

set -x

Save.

Then please add a new log here.
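
One way to capture such a log (the qube name is a placeholder; whether the console log or the journal is preferred here is an assumption):

# In dom0: the qube's console log, which also receives the forwarded journal if enabled.
sudo cat /var/log/xen/console/guest-whonix-ws-16-dvm.log
# Or inside the qube, once it has started: dump the current boot's journal to a file.
sudo journalctl -b > /tmp/journal.txt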


Boot Clock Randomization changes the clock as per log:

[2023-01-25 03:57:37] [ 17.849934] start[938]: - 68 049155879

(minus 68 seconds and 049155879 nanoseconds.)

But that has been happening for all Whonix users for years, so I doubt it is causing this issue. It might, however, confuse the timing in the logs. The journal timestamps nevertheless seem unchanged, so either the journal has a special way of ensuring timestamps (a separate clock) or Boot Clock Randomization isn't functional. The latter I will investigate, but that's unrelated to this ticket.

You could disable Boot Clock Randomization / sdwdate for testing. Instructions:
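
As a rough, hedged sketch (not the linked instructions), assuming the units are named bootclockrandomization.service and sdwdate.service (verify the exact names on your system); run in the Template so it persists:

# Temporarily prevent these units from starting, for testing only.
sudo systemctl mask bootclockrandomization.service sdwdate.service
# Revert after testing:
#   sudo systemctl unmask bootclockrandomization.service sdwdate.service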

@adrelanos
Member

Sometimes a qube/app-VM doesn't start on the first attempt and tells me to check /var/log/xen/console

Does this also happen for VMs based on Templates other than Kicksecure or Whonix?

Does this also happen when not using SUID Disabler and Permission Hardener?

@unman
Member

unman commented Jan 30, 2023 via email

@TNTBOMBOM
Author

My earlier instructions for enabling debugging of SUID Disabler and Permission Hardener were probably wrong. More detailed instructions:

Ok check now:

6-2logs.txt

Does this also happen for VMs based on Templates other than Kicksecure or Whonix?

No, not yet.

Does this also happen when not using SUID Disabler and Permission Hardener?

yes.

@adrelanos
Member

Thank you!

There's nothing specifically wrong with SUID Disabler and Permission Hardener on your system that would require further debugging. I recommend not using this testers-only feature for now. Documentation:

https://www.kicksecure.com/w/index.php#Disable_SUID_Disabler_and_Permission_Hardener

Undoing its modifications shouldn't be required; only the systemd unit needs to be disabled so that it won't add boot delay.
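
As a sketch, assuming the unit is named permission-hardening.service (check the exact name and the linked documentation); run in the Template:

# Keep the modifications, but stop the unit from running at boot.
sudo systemctl disable permission-hardening.service
# Or mask it so it cannot be started at all:
#   sudo systemctl mask permission-hardening.service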

SUID Disabler and Permission Hardener performance and Qubes integration issues will be tracked here:
https://forums.whonix.org/t/suid-disabler-and-permission-hardener/7706/80

I'll post an update there once a solution has been implemented.

Does this also happen when not using SUID Disabler and Permission Hardener?

yes.

Therefore there is no need to concentrate too much on SUID Disabler and Permission Hardener in this ticket. I recommend not using this testers-only feature for now, as it might contribute additional boot delay until the root cause is found.

Are you using an SSD or HDD?

Does this happen with "stock" Whonix after a Template re-installation?

Any other "unusual" modifications?

Please mention any testers-only features or other "rare" modifications in bug reports, as these can have a huge effect on the ability to reproduce issues.

To debug further, a new log after disabling SUID Disabler and Permission Hardener, or from a "stock" Whonix, is required, should this issue still be happening.

andrewdavidwong changed the title from "Sometimes Qube-VM doesnt start from the first time" to "Sometimes qubes don't start the first time" on Feb 6, 2023
@TNTBOMBOM
Author

Are you using an SSD or HDD?

SSD

Does this happen with "stock" Whonix after a Template re-installation?

I don't remember it happening to me.

Any other "unusual" modifications?

Nope

To debug further, a new log after disabling SUID Disabler and Permission Hardener, or from a "stock" Whonix, is required, should this issue still be happening.

Ok will add.

@TNTBOMBOM
Author

TNTBOMBOM commented Jul 19, 2023

Ok will add.

logsfresh.txt

The problem doesn't happen with a freshly installed Whonix.

andrewdavidwong added the affects-4.1 label on Aug 8, 2023
andrewdavidwong removed this from the Release 4.1 updates milestone on Aug 13, 2023
@TNTBOMBOM
Author

Closing, as it has been determined which components reproduce this.

TNTBOMBOM changed the title from "Sometimes qubes don't start the first time" to "Sometimes qubes don't start the first time - If Whonix/Kicksecure Hardening Features for Testers Enabled" on Oct 3, 2023
andrewdavidwong added the R: self-closed label and removed the needs diagnosis label on Oct 4, 2023