Qubes 4 Admin API is way too inefficient #3293
Comments
andrewdavidwong added the C: core and enhancement labels on Nov 9, 2017
andrewdavidwong added this to the Release 4.1 milestone on Nov 9, 2017
marmarek (Member) commented on Nov 9, 2017:
Well, qvm-ls is an extreme example. I think it is the only case requiring thousands of calls, because it downloads almost the full state of the system. The current design of the Admin API is intentionally this granular, to allow precise control using qrexec policy; this is the reason there is no "get all values" call. A long-running process can download the state once and then observe events (the admin.Events call) to update its copy when something changes. As for caching, the dbus service serves a similar purpose, using events to update its internal state. Currently it doesn't keep track of all the properties, but that should be doable, and implementing qvm-ls on top of it also seems feasible. There is already a stub for something similar (mostly a debugging tool): qui-ls.
But at the Admin API level we prefer to keep the current granularity, even if that means lower performance.
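To illustrate the "download once, then follow events" pattern described above, here is a minimal sketch of a client-side cache. The helpers fetch_full_state() and subscribe_to_admin_events() are hypothetical placeholders, not actual qubesadmin API, and the event names are approximate:

```python
import asyncio

class VmStateCache:
    """Local copy of VM properties, refreshed from the admin.Events stream."""

    def __init__(self):
        self.properties = {}   # {vm_name: {property_name: value}}

    async def run(self):
        # One expensive full download at startup...
        self.properties = await fetch_full_state()            # hypothetical helper
        # ...then only incremental updates from the event stream.
        async for event in subscribe_to_admin_events():        # hypothetical helper
            if event.name.startswith('property-set'):
                self.properties.setdefault(event.vm, {})[event.kwargs['name']] = event.kwargs['newvalue']
            elif event.name == 'domain-delete':
                self.properties.pop(event.vm, None)

# Usage sketch: asyncio.run(VmStateCache().run())
```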
qubesuser commented on Nov 9, 2017:
I have managed to get qvm-ls down to around 200ms on a system with 100 defined VMs.
Going lower than 100-200 ms is probably impossible without rewriting both the client and qubesd in an efficient language like Rust, which is probably not worth the effort, at least at the moment.
It does so by:
- adding a single API call that downloads the data for all VMs
- storing VM libvirt state in qubesd and updating it in response to libvirt events
- optimizing the Python code in both client and server
Regarding the qrexec policy issue, I'd suggest using qrexec just to allow or deny the whole Admin API for specific VMs, and then using permission logic inside qubesd for any finer control, specifically hiding VMs that users should not be able to see (there are already events that can filter VMs and properties). This also makes it possible to dynamically define users that own VMs and have them automatically control only their own VMs, which would be hard to express in qrexec policy.
It's also possible to load the qrexec policy in qubesd and apply it to subtasks of requests, although this doesn't seem worth the effort, since it still wouldn't handle many use cases that require a dynamic security policy.
At any rate, I don't think taking several seconds just for something like qvm-ls is a reasonable design for a production-ready system, so some sort of solution seems needed.
Preliminary pull requests:
Server changes: QubesOS/qubes-core-admin#165
Client changes: QubesOS/qubes-core-admin-client#36
Of course, in addition to qvm-ls, the changes make Qubes faster as a whole since they optimize several code paths used all over the place (such as the libvirt status queries and property code).
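To make the "single API call that downloads the data for all VMs" idea concrete, here is a rough sketch of how such a bulk response could be assembled inside qubesd. The function name, the wire format, and the app.domains / property_list() attributes are written from memory of the core-admin code, not taken from the linked pull requests, so treat them as an approximation:

```python
def get_all_data(app):
    """Return one text blob with every property of every VM: one qrexec
    round-trip for the caller instead of one call per VM per property."""
    lines = []
    for vm in app.domains:                      # assumes a qubes.Qubes()-like object
        for prop in vm.property_list():
            try:
                value = getattr(vm, prop.__name__)
            except AttributeError:
                continue                        # property has no value set
            lines.append('{} {}={}'.format(vm.name, prop.__name__, value))
    return '\n'.join(lines)
```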
qubesuser commented on Nov 9, 2017:
BTW, current qvm-ls performance with my patch set, with ~100 VMs on a ~3 GHz Intel machine (running it in dom0 with the default configuration):
- single call: 200 ms
- one call per VM: 600 ms
- one call per property: 1150 ms
- one call per property, without my patches: 2000 ms+
If one really wants to preserve qrexec policy at VM granularity, a possible long-term solution could be to add GetAll calls that return all properties for a single VM or for the global Qubes object (but not GetAllData), and then replace qvm-ls with a dbus-based tool.
In the meantime, one could add GetAllData and have the client automatically fall back to GetAll if it is unavailable or denied, and in turn fall back to Get if GetAll is unavailable (I changed my patches to do this; a sketch of the fallback follows below). GetAllData can then later be removed or disabled with no ill effects.
Even with dbus-based tools, requiring seconds and thousands of qrexec calls (each of which currently spawns at least one process on both the client and dom0) just for the dbus service to fetch the initial value of all properties doesn't seem acceptable in the long term, so I think we want the GetAll call at least.
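A minimal sketch of the fallback behaviour described above: try the coarse call first and degrade to finer-grained calls if it is denied or unimplemented. The call() helper, the parsers, and the exception types are illustrative, not the actual qubesadmin client API:

```python
def fetch_vm_properties(call, vm_name):
    # call(method, dest) is assumed to perform one Admin API request and to
    # raise PermissionError / NotImplementedError when denied or unsupported.
    try:
        return parse_getall(call('admin.vm.property.GetAll', vm_name))    # hypothetical parser
    except (PermissionError, NotImplementedError):
        pass
    # Fall back to one request per property (the slow path that exists today).
    props = {}
    for name in parse_list(call('admin.vm.property.List', vm_name)):      # hypothetical parser
        props[name] = call('admin.vm.property.Get+' + name, vm_name)
    return props
```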
qubesuser commented on Nov 9, 2017:
Another thing to consider is that admin.vm.List has pretty much the same policy/security implications as the proposed admin.vm.GetAllData (which is exactly like admin.vm.List but also sends all properties for each VM), since in Qubes the properties aren't particularly sensitive (the VM's name and state are usually more sensitive than its properties).
So a way to prevent a domain from seeing all VMs in admin.vm.List is needed anyway for advanced configurations, and the same mechanism can be applied to admin.vm.GetAllData.
Policy can't be applied per-property, but I don't think that is possible today even with just admin.vm.property.Get.
Thus, I'm thinking there might be no real downside to having GetAllData?
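For illustration, hiding VMs from a management VM would presumably live in the qrexec policy files; something along these lines, using 4.0-style policy syntax as I remember it (file path, VM name, and tag are illustrative only, and the same file could exist for GetAllData):

```
# /etc/qubes-rpc/policy/admin.vm.List   -- illustrative only
# Let work-mgmt list only the VMs it is supposed to manage; deny everyone else.
work-mgmt  $tag:managed-by-work-mgmt  allow
$anyvm     $anyvm                     deny
```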
jpouellet (Contributor) commented on Nov 12, 2017:
> since in Qubes props aren't particularly sensitive (and the name and state are in fact more sensitive than the props usually).
I'm not convinced of their non-sensitivity.
For the properties expected to be set by a management VM and read by that same management VM, fine, whatever.
However:
- backup_timestamp: the time of day could be a side channel into whether to break into the user's home, office, a trusted friend's place, etc., if someone for some reason wants physical control over all backups made
- label: knowing this allows more convincing UI spoofing; it otherwise has to be guessed or leaked via bad opsec (e.g. screenshots), and might be guessed wrong
- xid: a side channel, as the NN in DispNN used to be
and others (potentially unknowable ones from 3rd-party extensions) might indeed be sensitive depending on the given user's real-world situation.
> So a way to prevent a domain from seeing all VMs in admin.vm.List is needed anyway for advanced configurations, and that can be applied to admin.vm.GetAllData as well.
+1
As much as I like the idea of being able to put all policy in a separate front layer, so that qubesd isn't even reachable unless a particular action is allowed, I too fail to see how such policies could be expressed without increasing the complexity of the policy format.
For example, what if you wanted to say "Allow work-mgmt to set work-email's TemplateVM, but only to TemplateVMs also managed by work-mgmt"?
It seems to me we'll have to either increase policy format complexity (even beyond {created,managed}-by-x) (undesirable), or have more policy evaluation also in core-admin (also undesirable), or possibly both. I don't like it to begin with, but if we need to do it for .List, why not for .GetAll too.
> Policy can't be applied per-property, but I don't think you can do that currently anyway even with just admin.vm.property.Get.
Actually, policy is intended to be applicable per-property (e.g. /etc/qubes-rpc/policy/admin.vm.property.Get+netvm). More examples in https://www.qubes-os.org/news/2017/06/27/qubes-admin-api/
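For reference, a per-property policy file of the kind mentioned above could look roughly like this (only the file path comes from the comment; the VM name and rules are illustrative):

```
# /etc/qubes-rpc/policy/admin.vm.property.Get+netvm   -- illustrative only
# Let a hypothetical gui-vm read every VM's netvm property, deny the rest.
gui-vm   $anyvm  allow
$anyvm   $anyvm  deny
```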
marmarek (Member) commented on Nov 12, 2017:
> It seems to me we'll have to either increase policy format complexity (even beyond {created,managed}-by-x) (undesirable) or have more policy evaluation also in core-admin (also undesirable), or possibly both? I don't like it to begin with, but if we need to do it for .List, why not for .GetAll too.
Yes, the policy cannot be applied to the value being set. This is the trade-off we've made between flexibility and complexity of the policy. If in some use case you need, for example, to limit which templates a particular Mgmt VM can use, then you need to write an extension handling the appropriate admin-permission event. In that handler you already get the values after preliminary validation (including the standard policy, so only some Mgmt VMs can even reach that code).
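A rough sketch of what such an extension could look like. The decorator, event name, handler signature, and exception follow the core-admin extension mechanism as I understand it and have not been checked against the actual code, so treat every identifier as an assumption:

```python
import qubes.api
import qubes.ext

class LimitTemplateChoice(qubes.ext.Extension):
    # Fired by qubesd while handling admin.vm.property.Set with argument
    # 'template', after the qrexec policy has already allowed the call.
    @qubes.ext.handler('admin-permission:admin.vm.property.Set+template')
    def on_set_template(self, vm, event, **kwargs):
        new_template = kwargs.get('newvalue')        # requested template name
        if new_template is not None and not allowed_for(vm, new_template):  # hypothetical check
            raise qubes.api.PermissionDenied(
                'template {} not allowed for this management VM'.format(new_template))
```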
zander commented on Dec 23, 2017:
A couple of thoughts:
Using the local socket from C++ makes the list of requests pretty fast: it takes 30 ms to get all the info into my app. The catch here is that this is inside dom0, so no qrexec is used.
Now, adding qrexec seems to do a lot of shell execution on the dom0 side, and then the time can skyrocket. Without actually measuring this I can't tell. In essence I agree with the OP that something can be done to make this much smoother.
The main issue I see with the current setup is that the overhead is too big. This is not much different from what the OP concluded, but I disagree that you need new API calls; that doesn't scale either. You'll end up with a million similar-but-not-the-same API calls. A GetAllData call, for example, overlaps in functionality with the property getter call. More API calls are not the best idea, as they work around the problem instead of fixing it. Next week you'll want a GetAllDataExceptNetwork call...
So let's focus on the actual overhead instead.
What happens today:
- The qubesd process expects one socket connection per API call.
- qubesd doesn't start processing the call until the caller closes the write channel.
- A full disconnect is done when the last data of the reply has been sent.
This is extremely wasteful, since setting up the actual connection, especially via qrexec, probably takes 90% of the time.
What I do in a similar solution is this (a framing sketch follows below):
- A single API call, including all its arguments, is packaged in a single "Message".
- qubesd answers with a similar Message.
- Individual messages carry metadata, most importantly the size of the message, so both parties know when a question or a reply is finished.
- The connection no longer has to be closed after a single question/reply; a long series of questions and replies can be streamed over the same connection, eliminating the overhead of reconnects, handshakes, re-authentication, etc.
The downside of this approach is that, much like the OP suggested, you need more permission management inside qubesd. I don't know whether that is already the case, for instance whether tags are handled in qubesd or in qrexec; it would need to be handled in qubesd. From a security perspective that is a good idea anyway (keep security as close to the source as possible to limit the attack surface).
As I did this in a previous project and ended up designing a tagged streaming data format, I will point to it here in case you want to use it: spec and bindings for several languages (including Python) here.
In that solution I used a tag with ID zero as a separator: when that tag comes along, you know the question or answer is done.
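The tagged format linked above is not reproduced here; instead, a minimal Python sketch of the general idea, using length-prefixed messages so that many requests and replies can share one connection:

```python
import struct

def send_message(sock, payload: bytes):
    # A 4-byte big-endian length header tells the peer where this message ends,
    # so the connection never has to be closed just to delimit a request/reply.
    sock.sendall(struct.pack('>I', len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('peer closed the connection mid-message')
        buf += chunk
    return buf

def recv_message(sock) -> bytes:
    (length,) = struct.unpack('>I', recv_exact(sock, 4))
    return recv_exact(sock, length)
```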
marmarek (Member) commented on Dec 23, 2017:
Actually, most of the time is spent in the qrexec-policy process, which is called to evaluate the policy for each call. Each time, a new (Python) process is started, libraries are imported, etc. In extreme situations it can take up to 300 ms per call...
There are some things that can be done without an architecture change, for example getting rid of pkg_resources, as already done in core-admin-client. But if that isn't enough, we can switch to a policy-handling daemon instead of a per-call process. The latter will not happen for 4.0; more likely for 4.1.
A follow-up note: the above comment is about the non-dom0 case.
zander commented on Dec 24, 2017:
> the most time is spent on qrexec-policy process
That indeed confirms my worries.
> if that wouldn't be enough, we can switch to policy handling daemon, instead of per-call process.
Or an in-the-middle solution where you have a simple daemon that is pre-loaded and can fork off a handling process (sketched below). This avoids lots of threads and config-file caching issues, while taking advantage of the fact that the forked process doesn't need to load Python or any libraries. It would be interesting to see how fast that is, and whether it's fast enough.
The solution I described earlier is a longer-term refactor, one that would need a separate handshake. It would likely be orders of magnitude faster than the current approach.
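A minimal sketch of that in-the-middle approach: a daemon that imports everything once and then forks a short-lived child per request, so the per-call cost is a fork() rather than a fresh Python start-up. The socket path and handle_request() are hypothetical:

```python
import os
import signal
import socket

def serve(sock_path='/run/example-policy.sock'):        # illustrative path
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)        # auto-reap finished children
    server = socket.socket(socket.AF_UNIX)
    server.bind(sock_path)
    server.listen(16)
    while True:
        conn, _ = server.accept()
        if os.fork() == 0:
            # Child: inherits the already-imported interpreter state, so it can
            # start evaluating the request immediately.
            handle_request(conn)                         # hypothetical policy evaluation
            conn.close()
            os._exit(0)
        conn.close()                                     # parent keeps listening
```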
marmarek (Member) commented on Apr 30, 2018:
@DemiMarie can you provide some absolute numbers? For example, on my system:

```
[marmarek@dom0 ~]$ time qvm-ls|wc -l
76

real    0m0.996s
user    0m0.137s
sys     0m0.080s
```
qubesuser commented on Nov 8, 2017 (original issue description):
Qubes OS version:
4.0-rc2
Steps to reproduce the behavior:
Run qvm-ls in dom0.
Expected behavior:
The output should be shown instantly, with time spent under 100ms.
Actual behavior:
It takes more than 2 seconds on my machine.
General notes:
The current Qubes API design, which requires a separate call for each property, just doesn't work.
A single call should exist and be used to get all data for all VMs, or for any set of VMs (including a single one).
Proposed design: