Qubes 4 Admin API is way too inefficient #3293
Comments
andrewdavidwong added the C: core and enhancement labels on Nov 9, 2017
andrewdavidwong added this to the Release 4.1 milestone on Nov 9, 2017
marmarek (Member) commented on Nov 9, 2017:
Well, qvm-ls is an extreme example. I think it is the only case requiring thousands of calls, because it downloads almost the full state of the system. The current design of the Admin API is intentionally this granular, to allow precise control using qrexec policy; this is the reason there is no "get all values" call. A long-running process can download the state once and then observe events (the admin.Events call) to update its copy when something changes. As for caching, the dbus service serves a similar purpose, using events to update its internal state. Currently it doesn't keep track of all the properties, but that should be doable, and implementing qvm-ls on top of it also seems feasible. There is already a stub for something similar (mostly a debugging tool): qui-ls.
But at the Admin API level we prefer to keep the current granularity, even if that means lower performance.
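To illustrate the "download once, then follow events" pattern described above, here is a minimal sketch of a client-side cache. The helpers fetch_full_state() and subscribe_to_admin_events() are hypothetical placeholders, not actual qubesadmin API, and the event names are approximate:

```python
import asyncio

class VmStateCache:
    """Local copy of VM properties, refreshed from the admin.Events stream."""

    def __init__(self):
        self.properties = {}   # {vm_name: {property_name: value}}

    async def run(self):
        # One expensive full download at startup...
        self.properties = await fetch_full_state()            # hypothetical helper
        # ...then only incremental updates from the event stream.
        async for event in subscribe_to_admin_events():        # hypothetical helper
            if event.name.startswith('property-set'):
                self.properties.setdefault(event.vm, {})[event.kwargs['name']] = event.kwargs['newvalue']
            elif event.name == 'domain-delete':
                self.properties.pop(event.vm, None)

# Usage sketch: asyncio.run(VmStateCache().run())
```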
qubesuser commented on Nov 9, 2017:
I have managed to get qvm-ls down to around 200ms on a system with 100 defined VMs.
Going lower than 100-200 ms is probably impossible without rewriting both the client and qubesd in an efficient language like Rust, which is probably not worth the effort, at least at the moment.
It does so by:
- adding a single API call that downloads the data for all VMs
- storing VM libvirt state in qubesd and updating it in response to libvirt events
- optimizing the Python code in both client and server
Regarding the qrexec policy issue, I'd suggest using qrexec just to allow or deny the whole Admin API for specific VMs, and then using permission logic inside qubesd for any finer control, specifically hiding VMs that users should not be able to see (there are already events that can filter VMs and properties). This also makes it possible to dynamically define users that own VMs and have them automatically control only their own VMs, which would be hard to express in qrexec policy.
It's also possible to load the qrexec policy in qubesd and apply it to subtasks of requests, although this doesn't seem worth the effort, since it still wouldn't handle many use cases that require a dynamic security policy.
At any rate, I don't think taking several seconds just for something like qvm-ls is a reasonable design for a production-ready system, so some sort of solution seems needed.
Preliminary pull requests:
Server changes: QubesOS/qubes-core-admin#165
Client changes: QubesOS/qubes-core-admin-client#36
Of course, in addition to qvm-ls, the changes make Qubes faster as a whole since they optimize several code paths used all over the place (such as the libvirt status queries and property code).
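To make the "single API call that downloads the data for all VMs" idea concrete, here is a rough sketch of how such a bulk response could be assembled inside qubesd. The function name, the wire format, and the app.domains / property_list() attributes are written from memory of the core-admin code, not taken from the linked pull requests, so treat them as an approximation:

```python
def get_all_data(app):
    """Return one text blob with every property of every VM: one qrexec
    round-trip for the caller instead of one call per VM per property."""
    lines = []
    for vm in app.domains:                      # assumes a qubes.Qubes()-like object
        for prop in vm.property_list():
            try:
                value = getattr(vm, prop.__name__)
            except AttributeError:
                continue                        # property has no value set
            lines.append('{} {}={}'.format(vm.name, prop.__name__, value))
    return '\n'.join(lines)
```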
qubesuser commented on Nov 9, 2017:
BTW, current qvm-ls performance with my patch set, with ~100 VMs on a ~3 GHz Intel machine (running it in dom0 with the default configuration):
- single call: 200 ms
- one call per VM: 600 ms
- one call per property: 1150 ms
- one call per property, without my patches: 2000 ms+
If one really wants to preserve qrexec policy at VM granularity, a possible long-term solution could be to add GetAll calls that return all properties for a single VM or for the global Qubes object (but not GetAllData), and then replace qvm-ls with a dbus-based tool.
In the meantime, one could add GetAllData and have the client automatically fall back to GetAll if it is unavailable or denied, and in turn fall back to Get if GetAll is unavailable (I changed my patches to do this; a sketch of the fallback follows below). GetAllData can then later be removed or disabled with no ill effects.
Even with dbus-based tools, requiring seconds and thousands of qrexec calls (each of which currently spawns at least one process on both the client and dom0) just for the dbus service to fetch the initial value of all properties doesn't seem acceptable in the long term, so I think we want the GetAll call at least.
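A minimal sketch of the fallback behaviour described above: try the coarse call first and degrade to finer-grained calls if it is denied or unimplemented. The call() helper, the parsers, and the exception types are illustrative, not the actual qubesadmin client API:

```python
def fetch_vm_properties(call, vm_name):
    # call(method, dest) is assumed to perform one Admin API request and to
    # raise PermissionError / NotImplementedError when denied or unsupported.
    try:
        return parse_getall(call('admin.vm.property.GetAll', vm_name))    # hypothetical parser
    except (PermissionError, NotImplementedError):
        pass
    # Fall back to one request per property (the slow path that exists today).
    props = {}
    for name in parse_list(call('admin.vm.property.List', vm_name)):      # hypothetical parser
        props[name] = call('admin.vm.property.Get+' + name, vm_name)
    return props
```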
qubesuser commented on Nov 9, 2017:
Another thing to consider is that admin.vm.List has pretty much the same policy/security implications as the proposed admin.vm.GetAllData (which is exactly like admin.vm.List but also sends all properties for each VM), since in Qubes the properties aren't particularly sensitive (the VM's name and state are usually more sensitive than its properties).
So a way to prevent a domain from seeing all VMs in admin.vm.List is needed anyway for advanced configurations, and the same mechanism can be applied to admin.vm.GetAllData.
Policy can't be applied per-property, but I don't think that is possible today even with just admin.vm.property.Get.
Thus, I'm thinking there might be no real downside to having GetAllData?
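For illustration, hiding VMs from a management VM would presumably live in the qrexec policy files; something along these lines, using 4.0-style policy syntax as I remember it (file path, VM name, and tag are illustrative only, and the same file could exist for GetAllData):

```
# /etc/qubes-rpc/policy/admin.vm.List   -- illustrative only
# Let work-mgmt list only the VMs it is supposed to manage; deny everyone else.
work-mgmt  $tag:managed-by-work-mgmt  allow
$anyvm     $anyvm                     deny
```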
jpouellet (Contributor) commented on Nov 12, 2017:
> since in Qubes props aren't particularly sensitive (and the name and state are in fact more sensitive than the props usually).
I'm not convinced of their non-sensitivity.
For the properties expected to be set by a management VM and read by that same management VM, fine, whatever.
However:
- backup_timestamp: the time of day could be a side channel into whether to break into the user's home, office, a trusted friend's place, etc., if someone for some reason wants physical control over all backups made
- label: knowing this allows more convincing UI spoofing; it otherwise has to be guessed or leaked via bad opsec (e.g. screenshots), and might be guessed wrong
- xid: a side channel, as the NN in DispNN used to be
and others (potentially unknowable ones from 3rd-party extensions) might indeed be sensitive depending on the given user's real-world situation.
> So a way to prevent a domain from seeing all VMs in admin.vm.List is needed anyway for advanced configurations, and that can be applied to admin.vm.GetAllData as well.
+1
As much as I like the idea of being able to put all policy in a separate front layer, so that qubesd isn't even reachable unless a particular action is allowed, I too fail to see how such policies could be expressed without increasing the complexity of the policy format.
For example, what if you wanted to say "Allow work-mgmt to set work-email's TemplateVM, but only to TemplateVMs also managed by work-mgmt"?
It seems to me we'll have to either increase policy format complexity (even beyond {created,managed}-by-x) (undesirable), or have more policy evaluation also in core-admin (also undesirable), or possibly both. I don't like it to begin with, but if we need to do it for .List, why not for .GetAll too.
> Policy can't be applied per-property, but I don't think you can do that currently anyway even with just admin.vm.property.Get.
Actually, policy is intended to be applicable per-property (e.g. /etc/qubes-rpc/policy/admin.vm.property.Get+netvm). More examples in https://www.qubes-os.org/news/2017/06/27/qubes-admin-api/
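For reference, a per-property policy file of the kind mentioned above could look roughly like this (only the file path comes from the comment; the VM name and rules are illustrative):

```
# /etc/qubes-rpc/policy/admin.vm.property.Get+netvm   -- illustrative only
# Let a hypothetical gui-vm read every VM's netvm property, deny the rest.
gui-vm   $anyvm  allow
$anyvm   $anyvm  deny
```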
marmarek (Member) commented on Nov 12, 2017:
> It seems to me we'll have to either increase policy format complexity (even beyond {created,managed}-by-x) (undesirable) or have more policy evaluation also in core-admin (also undesirable), or possibly both? I don't like it to begin with, but if we need to do it for .List, why not for .GetAll too.
Yes, the policy cannot be applied to the value being set. This is the trade-off we've made between flexibility and complexity of the policy. If in some use case you need, for example, to limit which templates a particular Mgmt VM can use, then you need to write an extension handling the appropriate admin-permission event. In that handler you already get the values after preliminary validation (including the standard policy, so only some Mgmt VMs can even reach that code).
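A rough sketch of what such an extension could look like. The decorator, event name, handler signature, and exception follow the core-admin extension mechanism as I understand it and have not been checked against the actual code, so treat every identifier as an assumption:

```python
import qubes.api
import qubes.ext

class LimitTemplateChoice(qubes.ext.Extension):
    # Fired by qubesd while handling admin.vm.property.Set with argument
    # 'template', after the qrexec policy has already allowed the call.
    @qubes.ext.handler('admin-permission:admin.vm.property.Set+template')
    def on_set_template(self, vm, event, **kwargs):
        new_template = kwargs.get('newvalue')        # requested template name
        if new_template is not None and not allowed_for(vm, new_template):  # hypothetical check
            raise qubes.api.PermissionDenied(
                'template {} not allowed for this management VM'.format(new_template))
```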
zander commented on Dec 23, 2017:
A couple of thoughts:
Using the local socket from C++ makes the list of requests pretty fast: it takes 30 ms to get all the info into my app. The catch here is that this is inside dom0, so no qrexec is used.
Now, adding qrexec seems to do a lot of shell execution on the dom0 side, and then the time can skyrocket. Without actually measuring this I can't tell. In essence I agree with the OP that something can be done to make this much smoother.
The main issue I see with the current setup is that the overhead is too big. This is not much different from what the OP concluded, but I disagree that you need new API calls; that doesn't scale either. You'll end up with a million similar-but-not-the-same API calls. A GetAllData call, for example, overlaps in functionality with the property getter call. More API calls are not the best idea, as they work around the problem instead of fixing it. Next week you'll want a GetAllDataExceptNetwork call...
So let's focus on the actual overhead instead.
What happens today:
- The qubesd process expects one socket connection per API call.
- qubesd doesn't start processing the call until the caller closes the write channel.
- A full disconnect is done when the last data of the reply has been sent.
This is extremely wasteful, since setting up the actual connection, especially via qrexec, probably takes 90% of the time.
What I do in a similar solution is this (a framing sketch follows below):
- A single API call, including all its arguments, is packaged in a single "Message".
- qubesd answers with a similar Message.
- Individual messages carry metadata, most importantly the size of the message, so both parties know when a question or a reply is finished.
- The connection no longer has to be closed after a single question/reply; a long series of questions and replies can be streamed over the same connection, eliminating the overhead of reconnects, handshakes, re-authentication, etc.
The downside of this approach is that, much like the OP suggested, you need more permission management inside qubesd. I don't know whether that is already the case, for instance whether tags are handled in qubesd or in qrexec; it would need to be handled in qubesd. From a security perspective that is a good idea anyway (keep security as close to the source as possible to limit the attack surface).
As I did this in a previous project and ended up designing a tagged streaming data format, I will point to it here in case you want to use it: spec and bindings for several languages (including Python) here.
In that solution I used a tag with ID zero as a separator: when that tag comes along, you know the question or answer is done.
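The tagged format linked above is not reproduced here; instead, a minimal Python sketch of the general idea, using length-prefixed messages so that many requests and replies can share one connection:

```python
import struct

def send_message(sock, payload: bytes):
    # A 4-byte big-endian length header tells the peer where this message ends,
    # so the connection never has to be closed just to delimit a request/reply.
    sock.sendall(struct.pack('>I', len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('peer closed the connection mid-message')
        buf += chunk
    return buf

def recv_message(sock) -> bytes:
    (length,) = struct.unpack('>I', recv_exact(sock, 4))
    return recv_exact(sock, length)
```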
marmarek (Member) commented on Dec 23, 2017:
Actually, most of the time is spent in the qrexec-policy process, which is called to evaluate the policy for each call. Each time, a new (Python) process is started, libraries are imported, etc. In extreme situations it can take up to 300 ms per call...
There are some things that can be done without an architecture change, for example getting rid of pkg_resources, as already done in core-admin-client. But if that isn't enough, we can switch to a policy-handling daemon instead of a per-call process. The latter will not happen for 4.0; more likely for 4.1.
A follow-up note: the above comment is about the non-dom0 case.
zander commented on Dec 24, 2017:
> the most time is spent on qrexec-policy process
That indeed confirms my worries.
> if that wouldn't be enough, we can switch to policy handling daemon, instead of per-call process.
Or an in-the-middle solution where you have a simple daemon that is pre-loaded and can fork off a handling process (sketched below). This avoids lots of threads and config-file caching issues, while taking advantage of the fact that the forked process doesn't need to load Python or any libraries. It would be interesting to see how fast that is, and whether it's fast enough.
The solution I described earlier is a longer-term refactor, one that would need a separate handshake. It would likely be orders of magnitude faster than the current approach.
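A minimal sketch of that in-the-middle approach: a daemon that imports everything once and then forks a short-lived child per request, so the per-call cost is a fork() rather than a fresh Python start-up. The socket path and handle_request() are hypothetical:

```python
import os
import signal
import socket

def serve(sock_path='/run/example-policy.sock'):        # illustrative path
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)        # auto-reap finished children
    server = socket.socket(socket.AF_UNIX)
    server.bind(sock_path)
    server.listen(16)
    while True:
        conn, _ = server.accept()
        if os.fork() == 0:
            # Child: inherits the already-imported interpreter state, so it can
            # start evaluating the request immediately.
            handle_request(conn)                         # hypothetical policy evaluation
            conn.close()
            os._exit(0)
        conn.close()                                     # parent keeps listening
```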
marmarek (Member) commented on Apr 30, 2018:
@DemiMarie can you provide some absolute numbers? For example, on my system:

```
[marmarek@dom0 ~]$ time qvm-ls|wc -l
76

real    0m0.996s
user    0m0.137s
sys     0m0.080s
```
qubesuser commented on Nov 8, 2017 (original issue description):
Qubes OS version:
4.0-rc2
Steps to reproduce the behavior:
Run qvm-ls in dom0.
Expected behavior:
The output should be shown instantly, with time spent under 100ms.
Actual behavior:
It takes more than 2 seconds on my machine.
General notes:
The current Qubes API design, which requires a separate call for each property, just doesn't work.
A single call should exist and be used to get all data for all VMs, or for any set of VMs (including a single one).
Proposed design: