Storage pool exceptions handled ungracefully #5723

Open
cfcs opened this issue Mar 13, 2020 · 2 comments
Labels
C: core, P: default, T: enhancement

Comments

cfcs commented Mar 13, 2020

Qubes OS version:

4.0 (according to /etc/fedora-release)

Affected component(s) or functionality:

qubes-core-admin -> qubesd -> storage pool initialization


Steps to reproduce the behavior:

  1. Write a storage driver, install your fork of qubes-core-admin
  2. Catch the exception during import (otherwise the result is a complete mess) and re-raise it wrapped in qubes.storage.StoragePoolException("unable to load") (see the sketch after this list)
  3. Observe that qubes/storage/__init__.py doesn't catch the exception on the line that says pool = self.vm.app.get_pool(volume_config['pool'])
  4. sudo journalctl -fu qubesd spams "failed to start" messages.
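
A minimal sketch of what step 2 might look like in the driver module, assuming the failing dependency is wrapped at import time (the libzfs_core import is only an illustrative stand-in for whatever actually breaks; the real driver in the linked PR may do this differently):

import qubes.storage

# Module-level guard: if the backend bindings cannot be imported (e.g. the
# DKMS module failed to build), re-raise as a StoragePoolException so qubesd
# gets an exception type it nominally knows about.
try:
    import libzfs_core  # hypothetical backend dependency
except Exception as import_error:
    raise qubes.storage.StoragePoolException('unable to load') from import_error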

I tentatively "fixed" it with the patch below, which turned out not to be a good idea (see the notes in the sections below):

try:
    pool = self.vm.app.get_pool(volume_config['pool'])
except qubes.exc.QubesException as e:
    self.vm.log.error('unable to load pool {!r}'.format(e))
    return None

Expected or desired behavior:

Storage pool drivers should be allowed to fail (temporarily, ideally) without bringing down the whole system, and without the affected VMs/pools being erased from qubes.xml. (I took a backup of my qubes.xml before trying to fix this, which turned out to be a good idea, since my "fix" did exactly that.)

This applies both when loading the storage pool module itself (which I believe this bug report is about) and during init_volume / init_pool.

The affected pools/VMs should still be visible, even if they are not startable, or at the very least they should reappear once the transient error is corrected.

Actual behavior:

qubesd fails to start; as a consequence, no VMs are able to start and the qvm-* commands fail.

This happened when updating to a new kernel + new kernel-devel package which included "support" for a GCC plugin to collect extra entropy (somewhere in include/linux/random.h), guarded by an ifdef. For some reason the ifdef guard resulted in the new code being pulled into the DKMS build, but a symbol (latent_entropy) was missing, causing DKMS compilation of the ZFS module to fail. That's a separate issue, but I suspect there will be more build failures in the future, so I'd like to have a solution to this problem.

My failed attempts to fix this resulted in the pools being "forgotten", I think because I returned None where qubesd expects an updated copy of the dict that tracks storage pool parameters, causing that None to be serialized in place of the original data.

I'm not sure if this was caused by qubes/storage/__init__.py or somewhere else (after my patch), but that line definitely failed. I'm not super inclined to repeat the experiment if the issue can be resolved without further information, since it takes some time to conduct safely, but I can do that if there's no other path forward.

General notes:

I'm not sure if I'm doing it wrong / throwing the wrong exception from the wrong place, but I would appreciate some help understanding how to do this gracefully.


I have consulted the following relevant documentation:

I did not :-(

I am aware of the following related, non-duplicate issues:

The patchset which led to this (ZFS storage pools):
QubesOS/qubes-core-admin#289

@marmarek (Member) commented:

1. install your fork of `qubes-core-admin`

This is unnecessary. The storage subsystem is designed around entry points specifically to allow shipping storage drivers in other packages; no need to fork anything. See also https://dev.qubes-os.org/projects/core-admin/en/latest/qubes-storage.html#storage-pool-driver-api
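
For illustration, a driver shipped as a separate package might register itself roughly like this in its setup.py (the 'qubes.storage' entry point group is the one the in-tree drivers use; the package, module, and class names below are invented):

# setup.py of a hypothetical out-of-tree storage driver package
import setuptools

setuptools.setup(
    name='qubes-storage-zfs',
    version='0.1',
    packages=['qubes_storage_zfs'],
    entry_points={
        'qubes.storage': [
            # pool driver name -> module:class implementing the Pool API
            'zfs = qubes_storage_zfs.pool:ZFSPool',
        ],
    },
)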

Storage pool drivers should be allowed to fail (temporarily, ideally)

When accessing the storage? Yes. But when importing/initializing the object? I'm not so sure.
I'd say raising an exception when loading the storage driver is a bug in that storage driver.

Anyway, indeed the failure mode here could be improved.
The solution would need some more changes than you propose, because many places assume vm.storage and vm.volumes are always properly initialized. One of those places is serializing them back to qubes.xml, which you've noticed, but there are many more like it.
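
To make the "fail when accessing the storage, not when importing" idea concrete, here is a rough sketch, assuming the driver subclasses qubes.storage.Pool and defers the error to the point where the pool is actually used (the ZFSPool class name and the availability check are illustrative inventions, not part of the API):

import os
import qubes.storage

class ZFSPool(qubes.storage.Pool):
    def init_volume(self, vm, volume_config):
        # Importing the module and constructing the pool object always
        # succeeds; an unusable backend (e.g. the kernel module failed to
        # build) is only reported once the pool is actually used.
        if not self._backend_available():
            raise qubes.storage.StoragePoolException(
                'ZFS backend unavailable for pool {!r}'.format(self.name))
        ...  # create and return the volume object here

    def _backend_available(self):
        # Invented helper: a cheap check that the ZFS kernel module is loaded.
        return os.path.exists('/dev/zfs')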


cfcs commented Mar 13, 2020

@marmarek thank you for the pointer to entry points, that is super helpful!

Re: not raising when importing: that sounds reasonable to me, but if you only throw the exceptions at init time there are, as you say, multiple call sites, and some of them catch exceptions and have weird default behaviors (for instance, somewhere it falls back to the default storage pool if initialization fails -- which does not seem like an ideal way to handle the case where an existing VM lives on the missing pool).

(EDIT: To be clear, I agree that defining error semantics for pool/volume initialization and raising the exceptions at those points is the clean way to go, and that no exceptions should be thrown on import.)

@andrewdavidwong added the C: core, P: default, and T: enhancement labels Mar 16, 2020
@andrewdavidwong added this to the Far in the future milestone Mar 16, 2020
@andrewdavidwong removed this from the Release TBD milestone Aug 13, 2023