
Qubes 4 Admin API is way too inefficient #3293

Closed
qubesuser opened this issue Nov 8, 2017 · 13 comments · Fixed by QubesOS/qubes-core-admin#339
Labels
C: core · P: default · r4.1-dom0-cur-test · T: enhancement
Milestone
Release 4.1
Comments

@qubesuser

Qubes OS version:

4.0-rc2

Steps to reproduce the behavior:

  1. Run "time qvm-ls"

Expected behavior:

The output should be shown near-instantly, taking under 100ms.

Actual behavior:

It takes more than 2 seconds on my machine.

General notes:

The current Qubes API design, which requires a separate call for each property, just doesn't scale.

A single call should exist, and be used, to get all data for all VMs or for any subset of VMs (including a single one).

Proposed design:

  1. Add the API calls to qubesd.
  2. Whenever a property of a VM is requested, fetch all of that VM's data and cache it.
  3. Add a method to precache the properties of a set of VMs, and change clients to use it (a rough sketch follows this list).
  4. Add a method to discard all or a subset of cached properties for a VM, and change "dynamic" clients to use it.
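
As a rough illustration of steps 2-4, a client-side cache around a single bulk call might look like the sketch below. This is not the actual implementation: the method name admin.vm.property.GetAll is borrowed from the change eventually merged (see the commits at the end of this thread), and the response format here is an assumption.

```python
# Minimal sketch of the proposed client-side cache (names and response
# format are illustrative, not the final wire protocol).
class CachedVM:
    def __init__(self, app, name):
        self.app = app          # a qubesadmin.Qubes() instance
        self.name = name
        self._props = None      # None = nothing cached yet

    def get(self, prop):
        if self._props is None:
            self.prefetch()     # step 2: first access caches everything
        return self._props[prop]

    def prefetch(self):
        # step 3: fetch all properties of this VM in a single call
        raw = self.app.qubesd_call(self.name, 'admin.vm.property.GetAll')
        self._props = dict(
            line.split(' ', 1) for line in raw.decode().splitlines())

    def invalidate(self):
        # step 4: "dynamic" clients drop the cache when an event arrives
        self._props = None
```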
@andrewdavidwong andrewdavidwong added C: core T: enhancement Type: enhancement. A new feature that does not yet exist or improvement of existing functionality. labels Nov 9, 2017
@andrewdavidwong andrewdavidwong added this to the Release 4.1 milestone Nov 9, 2017
@marmarek
Member

marmarek commented Nov 9, 2017

Well, qvm-ls is an extreme example. I think this is the only case requiring thousands of calls, because it downloads almost the full state of the system. The current design of the Admin API is deliberately this granular to allow precise control using qrexec policy; this is the reason why there is no "get all values" call. A long-running process can download the state once, then observe events (the admin.Events call) to update its state when it changes. As for caching, the dbus service serves a similar purpose - using events to update its internal state. Currently it doesn't keep track of all the properties, but that should be doable. Implementing qvm-ls on top of that also seems doable. There is already a stub for something similar (mostly a debugging tool): qui-ls.

But on the Admin API level we prefer to keep it at the current granularity level, even if that means lower performance.
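
For reference, the download-once-then-watch-events pattern described above can be expressed with the qubesadmin library roughly as follows. This is a sketch: the wildcard handler registration and the exact event keyword arguments are assumptions.

```python
# Sketch: fetch the state once, then keep it fresh via the admin.Events
# stream (wrapped by qubesadmin.events.EventsDispatcher).
import asyncio
import qubesadmin
import qubesadmin.events

app = qubesadmin.Qubes()
state = {vm.name: {} for vm in app.domains}   # initial full download

def on_event(subject, event, **kwargs):
    # property-set:<name> events carry the new value; update the cache
    if event.startswith('property-set:'):
        state.setdefault(str(subject), {})[kwargs['name']] = kwargs.get('newvalue')

dispatcher = qubesadmin.events.EventsDispatcher(app)
dispatcher.add_handler('*', on_event)         # '*' = subscribe to all events
asyncio.run(dispatcher.listen_for_events())
```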

@qubesuser
Author

qubesuser commented Nov 9, 2017

I have managed to get qvm-ls down to around 200ms on a system with 100 defined VMs.

Going lower than 100-200ms is probably impossible without rewriting both the client and qubesd in an efficient language like Rust, which is not really worth the effort, at least at the moment.

The patch set does so by:

  • adding a single API call that downloads the data for all VMs
  • storing VM libvirt state in qubesd and updating it in response to libvirt events
  • optimizing the Python code in both client and server

Regarding the qrexec policy issue, I'd suggest using qrexec just to allow/disallow the whole Admin API for specific VMs, and then using permission logic inside qubesd for any finer control, specifically hiding VMs that users should not be able to see (there are already events that can filter VMs and properties). This also makes it possible, for example, to dynamically define users that own VMs and automatically let them control only their own VMs, which would be impossible to express easily in qrexec policy.

It's also possible to load the qrexec policy into qubesd and apply it to subtasks of requests, although this doesn't seem worth the effort, since it wouldn't support well many use cases that require a dynamic security policy anyway.

At any rate, I don't think taking several seconds just for something like qvm-ls is a reasonable design for a production-ready system, so some sort of solution seems needed.

Preliminary pull requests:
Server changes: QubesOS/qubes-core-admin#165
Client changes: QubesOS/qubes-core-admin-client#36

Of course, in addition to qvm-ls, the changes make Qubes faster as a whole since they optimize several code paths used all over the place (such as the libvirt status queries and property code).

@qubesuser
Author

qubesuser commented Nov 9, 2017

BTW, current qvm-ls performance with my patch set, with ~100 VMs on a ~3GHz Intel machine (running in dom0 with the default configuration):
single call: 200ms
one call per VM: 600ms
one call per property: 1150ms
one call per property, without my patches: 2000ms+

If one really wants to preserve qrexec policy at VM granularity, a possible long-term solution could be to add the GetAll calls to fetch all properties of a single VM (or of the global Qubes object), but not GetAllData, and then replace qvm-ls with a dbus-based tool.

In the meantime, one could add GetAllData, and have the client automatically switch to GetAll if it's not available or denied, and in turn switch to Get if GetAll is not available (I changed my patches to do this; see the sketch below). GetAllData can then be removed or disabled later with no ill effects.

Even with dbus-based tools, requiring seconds and thousands of qrexec calls (each of which currently spawns at least one process on both the client and dom0 sides) just for the dbus service to get the initial value of all properties doesn't seem acceptable in the long term, so I think we want at least the GetAll call.
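
The fallback chain described above could look roughly like this on the client side (a sketch: parse_all and parse_getall are hypothetical parsers, and QubesDaemonNoResponseError as the signal for "not available or denied" is an assumption):

```python
# Sketch of the GetAllData -> GetAll -> Get degradation; method names
# follow the proposal above.
from qubesadmin.exc import QubesDaemonNoResponseError

def fetch_all(app, vm_names, prop_names):
    try:
        # best case: one call for all VMs and all properties
        return parse_all(app.qubesd_call('dom0', 'admin.vm.GetAllData'))
    except QubesDaemonNoResponseError:
        pass                                  # absent or denied by policy
    result = {}
    for vm in vm_names:
        try:
            # next best: one call per VM
            raw = app.qubesd_call(vm, 'admin.vm.property.GetAll')
            result[vm] = parse_getall(raw)
        except QubesDaemonNoResponseError:
            # last resort: one call per property
            result[vm] = {p: app.qubesd_call(vm, 'admin.vm.property.Get', p)
                          for p in prop_names}
    return result
```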

@qubesuser
Author

Another thing to consider is that admin.vm.List has pretty much the same policy/security implications as the proposed admin.vm.GetAllData (which is exactly like admin.vm.List but also sends all properties for each VM), since in Qubes properties aren't particularly sensitive (the name and state are usually more sensitive than the properties, in fact).

So a way to prevent a domain from seeing all VMs in admin.vm.List is needed anyway for advanced configurations, and that can be applied to admin.vm.GetAllData as well.

Policy can't be applied per-property, but I don't think you can do that currently anyway even with just admin.vm.property.Get.

Thus, I'm thinking there might be no real downside to having GetAllData?

@jpouellet
Contributor

jpouellet commented Nov 12, 2017

since in Qubes properties aren't particularly sensitive (the name and state are usually more sensitive than the properties, in fact).

I'm not convinced of their non-sensitivity.

For properties expected to be set by a management VM and read by that same management VM: fine, whatever.

However:

  • backup_timestamp: the time of day could be a side channel informing whether to break into the user's home, office, trusted friend's place, etc., for an adversary who for some reason wants physical control over all backups made
  • label: knowing this allows more convincing UI spoofing - it otherwise needs to be guessed or leaked via bad opsec (e.g. screenshots), and might be guessed wrong
  • xid: a side channel, as the NN in DispNN used to be

and others (potentially unknowable others from 3rd party extensions) might indeed be sensitive depending on the given user's real-world situation.

So a way to prevent a domain from seeing all VMs in admin.vm.List is needed anyway for advanced configurations, and that can be applied to admin.vm.GetAllData as well.

+1

As much as I like the idea of being able to put all policy in a separate front-layer such that qubesd isn't even reachable unless a particular action is allowed, I too fail to see how such policies could be expressed without increasing the complexity of the policy format.

For example, what if you wanted to say "Allow work-mgmt to set work-email's TemplateVM, but only to TemplateVMs also managed by work-mgmt"?

It seems to me we'll have to either increase policy format complexity (even beyond {created,managed}-by-x) (undesirable) or have more policy evaluation also in core-admin (also undesirable), or possibly both? I don't like it to begin with, but if we need to do it for .List, why not for .GetAll too.

Policy can't be applied per-property, but I don't think you can do that currently anyway even with just admin.vm.property.Get.

Actually, policy is intended to be applicable per-property (e.g. /etc/qubes-rpc/policy/admin.vm.property.Get+netvm). More examples in https://www.qubes-os.org/news/2017/06/27/qubes-admin-api/
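
For context, such a per-property policy is just a regular qrexec policy file; a minimal example might look like this (the VM name is made up, and the exact target keyword varies between releases):

```
# /etc/qubes-rpc/policy/admin.vm.property.Get+netvm
work-mgmt dom0 allow
$anyvm $anyvm deny
```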

@marmarek
Member

It seems to me we'll have to either increase policy format complexity (even beyond {created,managed}-by-x) (undesirable) or have more policy evaluation also in core-admin (also undesirable), or possibly both? I don't like it to begin with, but if we need to do it for .List, why not for .GetAll too.

Yes, the policy cannot be applied to the value being set. This is the trade-off we've made between flexibility and complexity of the policy. If, for example, some use case needs to limit which templates a particular Mgmt VM can use, you need to write an extension handling the appropriate admin-permission event (sketched below). In that handler you already get the values, after preliminary validation (including the standard policy - so only some Mgmt VMs can reach that code).
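
A hedged sketch of such an extension, against the qubes-core-admin extension API (the admin-permission:&lt;method&gt; event name follows the Admin API documentation; the property check and the allowed values are illustrative):

```python
# Server-side extension vetoing template changes after standard policy
# has already passed. The veto condition is made up for illustration.
import qubes.api
import qubes.ext

class RestrictTemplates(qubes.ext.Extension):
    @qubes.ext.handler('admin-permission:admin.vm.property.Set')
    def on_property_set(self, vm, event, arg, **kwargs):
        # arg is the property being set; the requested new value arrives
        # in kwargs, already past preliminary validation
        if arg == 'template':
            if kwargs.get('newvalue') not in ('work-tpl-1', 'work-tpl-2'):
                raise qubes.api.PermissionDenied()
```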

@zander

zander commented Dec 23, 2017

A couple of thoughts;

Using the local socket from C++ makes the list of requests pretty fast: it takes 30ms to get all the info into my app. The catch here is that this is inside dom0, so no qrexec is used.

Now, adding qrexec seems to involve a lot of shell execution on the dom0 side, and then the time can skyrocket; without actually measuring this I can't tell. In essence I do agree with the OP that something can be done to make this much smoother.

The main issue I see with the current setup is that the overhead is too big. This is not much different from what the OP concluded, but I disagree that you need new API calls; that doesn't scale either. You'll end up with a million similar-but-not-identical API calls. A GetAllData call, for example, overlaps in functionality with the property getter call. More API calls are not the best idea, as they work around the problem instead of fixing it. Next week you'll want a GetAllDataExceptNetwork call...

So lets focus on the actual overhead instead.

What happens today;

  • the qubesd process expects one socket connection per API call.
  • qubesd doesn't start processing the call until the caller closes the write channel.
  • A full disconnect is done when the last data of the reply has been sent.

This is extremely wasteful, since setting up the actual connection, especially via qrexec, probably takes 90% of the time.

What I do in a similar solution is this;

  • We package a single API call, including all its arguments, into a single "Message".
  • QubesD answers with a similar Message.
  • Individual messages carry metadata, most importantly the size of the message; this way both parties know when a question or a reply is finished.
  • We no longer have to close the connection after a single question/reply. A long series of questions/replies can be streamed over the same connection, eliminating the overhead of reconnects, handshakes, re-authentication, etc. (a minimal framing sketch follows this comment).

The downside of this approach is that, much as the OP suggested, you need more permission management inside qubesd. I don't know if that is already the case - whether tags, for instance, are handled in qubesd or in qrexec. This will need to be handled in qubesd. From a security perspective this is a good idea anyway (keep security as close to the source as possible to limit the attack surface)...

I did this in a previous project and ended up designing a tagged streaming data format, so I'll point to it here as you may want to use it: spec and bindings for several languages (including Python) here.

In that solution I wrote a tag with ID zero as a separator. If that tag comes along, you know the question or answer is 'done'.
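
To make the framing concrete, here is a minimal length-prefixed variant of the same idea in Python (illustrative only: this is neither the actual qubesd wire format nor zander's tagged format, which uses a zero-ID tag as the separator instead of a length header):

```python
# Length-prefixed framing over one persistent connection: a 4-byte
# big-endian length header, then the payload. Both parties always know
# where a question or reply ends, so the connection never has to be
# closed just to signal "done".
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    sock.sendall(struct.pack('>I', len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('peer closed mid-message')
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack('>I', recv_exact(sock, 4))
    return recv_exact(sock, length)
```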

@marmarek
Member

Actually, most of the time is spent in the qrexec-policy process, called to evaluate the policy at each call. Each time, a new (Python) process is started, libraries are imported, etc. In extreme situations it can take up to 300ms per call...
There are some things that can be done without an architecture change, for example getting rid of pkg_resources, as already done in core-admin-client. But if that isn't enough, we can switch to a policy-handling daemon instead of a per-call process. The latter will not happen for 4.0, more likely for 4.1.

@marmarek
Member

The above comment is about the non-dom0 case.

@zander

zander commented Dec 24, 2017

the most time is spent on qrexec-policy process

That indeed confirms my worries.

if that wouldn't be enough, we can switch to policy handling daemon, instead of per-call process.

Or an in-the-middle solution where you have a simple daemon that is pre-loaded and can fork off a handling process (sketched below). This avoids lots of threads and config-file caching issues, while taking advantage of the fact that the forked process doesn't need to load Python or any libraries.

It would be interesting to see how fast that is, and if its fast enough.

The solution I described is indeed a longer-term refactor, one that would need a separate handshake. It would likely be orders of magnitude faster than the current approach.
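
The pre-loaded fork model could look roughly like this (a toy sketch: the socket path and the request handling are made up; the point is only that each fork inherits a warm interpreter with every library already imported):

```python
# Toy pre-forking policy daemon: expensive imports happen once at startup,
# then each request is served by a cheap fork() of the warm process.
import os
import signal
import socket

# ... heavyweight policy-evaluation modules would be imported here, once ...

SOCKET_PATH = '/run/policy-daemon.sock'   # illustrative path

def handle(conn):
    request = conn.recv(4096)             # placeholder protocol
    conn.sendall(b'allow\n')              # placeholder verdict
    conn.close()

def serve():
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)  # auto-reap children
    srv = socket.socket(socket.AF_UNIX)
    srv.bind(SOCKET_PATH)
    srv.listen(16)
    while True:
        conn, _ = srv.accept()
        if os.fork() == 0:                # child: no import cost at all
            srv.close()
            handle(conn)
            os._exit(0)
        conn.close()                      # parent goes back to accept()
```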

@marmarek
Member

@DemiMarie can you provide some absolute numbers? For example on my system:

[marmarek@dom0 ~]$ time qvm-ls|wc -l
76              

real	0m0.996s
user	0m0.137s
sys	0m0.080s

marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Dec 5, 2019
Allow getting all the VM properties with one call. This greatly improves
performance for applications retrieving many/all of them (qvm-ls,
Qubes Manager, etc.)

QubesOS/qubes-issues#5415
Fixes QubesOS/qubes-issues#3293
@marmarek marmarek reopened this Dec 7, 2019
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.1.6-1.fc29 has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@andrewdavidwong andrewdavidwong added the P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. label Dec 9, 2019
marmarek added a commit to marmarek/qubes-core-admin-client that referenced this issue Apr 20, 2020
The socket protocol is adjusted to match the qrexec socket service protocol.

QubesOS/qubes-issues#3293
marmarek added a commit to marmarek/qubes-core-admin-client that referenced this issue Apr 20, 2020
Power state changes are signaled with events too, so it is possible to
cache them and update/invalidate the cache with events.
Additionally, admin.vm.List returns a power state, so the cache can be
populated early. In particular, this greatly improves qvm-ls performance -
it eliminates admin.vm.CurrentState calls entirely.

QubesOS/qubes-issues#3293
marmarek added a commit to marmarek/qubes-core-admin-client that referenced this issue Apr 20, 2020
Both tools issue a large number of Admin API calls and greatly benefit
from a cache filled with a single per-VM Admin API call
(admin.vm.property.GetAll). In the case of qvm-ls, this also saves multiple
admin.vm.CurrentState calls (the power state is given in the admin.vm.List
response too).

QubesOS/qubes-issues#3293
marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Apr 20, 2020
Remove the intermediate qubesd-query-fast proxy process.
This requires changing the socket protocol to match what qrexec sends
in the header.

Fixes QubesOS/qubes-issues#3293
marmarek added a commit to marmarek/qubes-core-admin-client that referenced this issue May 18, 2020
An application that runs EventsDispatcher can safely use the cache too,
which greatly improves performance, because the cache is then properly
updated/invalidated when needed.
Instead of modifying each application to explicitly enable the cache based
on this simple rule, make it implicit when EventsDispatcher is created.

Do not enable caching when EventsDispatcher is created only temporarily,
as in wait_for_domain_shutdown.

QubesOS/qubes-issues#3293
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.1.12-1.fc32 has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update
