
dispVM halts prematurely with custom .desktop file involving qrexec-client-vm #3318

Closed
joshuathayer opened this issue Nov 17, 2017 · 19 comments
Labels
C: core · P: major (Priority: major. Between "default" and "critical" in severity.) · T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)

Comments

@joshuathayer

Qubes OS version:

4.0rc2 (dom0 and templates up to date with testing repos)

Affected TemplateVMs:

At least fedora-25


Steps to reproduce the behavior:

I have a situation where I'm trying to open a particular file format in a disposable VM, from a (Whonix-based) AppVM, eg:

$ qvm-open-in-vm '$dispvm:work' ./foo.custom-ext

In my work VM I've configured xdg to associate .custom-ext to a python script. That script processes the given file, and as a side effect there's a call to qrexec-client-vm, eg:

...
p = subprocess.Popen(["qrexec-client-vm", "target-vm", "custom.Process"], stdin=subprocess.Pipe)
p.communicate("message")
...

Expected behavior:

I'd expect:

  • a disposable VM based on my work VM to start
  • my script to start running
  • the RPC call in the script to successfully complete
  • the script to continue processing until it exits
  • finally, the dispVM to shut down

Actual behavior:

My script seems to run as far as the call to qrexec (which completes successfully), and then the entire VM gets shut down. While the machine is shutting down, my script is still running: judging by writes to STDERR, it manages to emit a varying number of lines before the machine is down.

General notes:

Removing the Popen call from my script allows it to run and complete normally.

Running the command on a non-disposable VM allows the script to run and complete normally, eg:

$ qvm-open-in-vm work ./foo.custom-ext

works great.

I suspected the issue may have had something to do with my script's STDIN/OUT/ERR being passed to the spawned child, but the following didn't change behavior at all:

...
fd = os.open("/dev/null", os.O_RDWR)
p = subprocess.Popen(["qrexec-client-vm", "target-vm", "custom.Process"],
      stdin=subprocess.Pipe, stdout=fd, stderr=fd)
...

For debugging, I tried making other calls that involve a fork() in my processing script, and those calls seemed to complete without error and without causing the VM to halt. Only the call to qrexec-client-vm caused the machine to halt.

Also for debugging I tried not using a pipe for stdin (instead using the same /dev/null filehandle), and that also didn't change matters.

I'm stumped. I'm not sure this is a Qubes bug as much as my not understanding something, but I do expect the behavior I noted above. Thanks for taking a look!


Related issues:

@marmarek
Member

Just to be sure - have you checked the script isn't simply crashing? For example it should be subprocess.PIPE, not subprocess.Pipe. Have you tried running this in non-DisposableVM first?

@joshuathayer
Author

@marmarek, yes, I've run this on a non-dispVM (the "base" VM the dispVM is based on) and it runs without problem.

And right, subprocess.PIPE is correct... sorry for the typo here. It's correct in my script. This code is public (part of the SecureDrop project), so I'll commit the branch I'm working on tomorrow and link to the code in question here.

@joshuathayer
Author

I've come up with a minimal example of this behavior. It seems to be triggered when making qrexec calls to a disposable target VM, where the target VM then makes its own RPC calls back to the source VM in the course of handling the initial RPC call.

In this example we're going to make an RPC call from personal to a disposable VM based on work. On personal, we'll run qrexec-client-vm work bug.PartOne, then the invoked script will call qrexec-client-vm personal bug.PartTwo. We'll see that while that process works fine on a non-disposable VM, it fails when the target VM is disposable.

On work, create /rw/config/bug.PartOne with the following, and chmod 0755 it:

#!/usr/bin/python

import subprocess
import sys
import time

sys.stderr.write("Hello from bug-part-one. Going to qrexec bug.PartTwo now\n")

subprocess.Popen(["qrexec-client-vm","personal","bug.PartTwo"])

i = 0
while i < 10:
  time.sleep(1)
  subprocess.Popen(["qrexec-client-vm","personal","bug.PartTwo"])
  i += 1     

Add the following to /rw/config/rc.local on work:

ln -s /rw/config/bug.PartOne /etc/qubes-rpc/

And also run that command in your running work VM.

On personal, create /rw/config/bug.PartTwo with the following, and chmod 0755 it:

#!/usr/bin/python

import sys

sys.stderr.write("Hello from bug part two!\n")

Although it's not strictly necessary for this demo, to /rw/config/rc.local on personal add:

ln -s /rw/config/bug.PartTwo /etc/qubes-rpc/

and also run that command on your running personal VM.

On dom0, create /etc/qubes-rpc/policy/bug.PartOne and /etc/qubes-rpc/policy/bug.PartTwo with the following:

$anyvm $anyvm allow

And allow work to be a dispvm template:

$ qvm-prefs work template-for-dispvms True

Now we're all set up. On personal, run:

$ qrexec-client-vm 'work' bug.PartOne

You should see the expected output:

$ qrexec-client-vm 'work' bug.PartOne
Hello from bug-part-one. Going to qrexec bug.PartTwo now
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
$ _

OK, that's what we expect. Now run the same thing on a disposable VM using work as a template:

[user@personal ~]$ qrexec-client-vm '$dispvm:work' bug.PartOne
Hello from bug-part-one. Going to qrexec bug.PartTwo now
Hello from bug part two!

The call never returns, and the disposable VM that was created has crashed.

I believe that if bug.PartTwo is run from the disposable VM targeting a different VM from the initial source (devel, for example), the process runs fine.

marmarek added a commit to marmarek/qubes-core-agent-linux that referenced this issue Nov 21, 2017
… port

(unconfirmed theory)
Only one side of the connection should report MSG_CONNECTION_TERMINATED
- the one where vchan port was allocated. This is especially important
for VM-VM connection, because when two such connections are established
between the same domains, in opposite directions - both will share the
same vchan port number, so qrexec-daemon will get confused which
connection was terminated.

QubesOS/qubes-issues#3318
@marmarek
Member

I think this is about qrexec connection cleanup. When you establish two qrexec connections between the same pair of domains (and those are the only connections), but in opposite directions, they will be assigned the same vchan port number. This is normally OK, because the connection direction differs (a different VM acts as the server). But the connection cleanup code got confused and reported the wrong connection as being closed. The only code using that report in practice is the cleanup of DispVMs.

I've pushed a preliminary attempt to fix this, but it doesn't work yet.
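To make the collision described above concrete, here is a toy model (an illustrative sketch, not the actual qrexec-daemon code; the allocator class and names are invented for the example). Each daemon scans for the first free port in its own table only, so two daemons serving opposite-direction connections between the same pair of domains can hand out the same number:

```python
# Toy model of per-daemon "first free port" allocation. Assumption:
# data ports start at 513 and each qrexec-daemon tracks only the
# ports it allocated itself, with no coordination between daemons.

VCHAN_BASE_PORT = 513


class ToyDaemon:
    """Stand-in for one VM's qrexec-daemon port table."""

    def __init__(self, name):
        self.name = name
        self.used = set()

    def allocate_port(self):
        # First-free scan over this daemon's own table.
        port = VCHAN_BASE_PORT
        while port in self.used:
            port += 1
        self.used.add(port)
        return port


source = ToyDaemon("sourcevm")
disp = ToyDaemon("dispvm")

# sourcevm's daemon allocates a port for the original connection...
p1 = source.allocate_port()
# ...and dispvm's daemon, unaware of sourcevm's table, allocates a
# port for the nested call going the opposite direction.
p2 = disp.allocate_port()

print(p1, p2)  # both 513: same (domain pair, port), only direction differs
```

With both connections identified by the same port, cleanup code that matches on port number alone cannot tell which of the two was actually closed.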

@joshuathayer
Author

@marmarek thanks for your prompt attention! I suspected something along those lines, but had no idea where to start looking... I'll be interested to take a close look at your patch. Is there documentation anywhere about setting up a development environment for building and testing these core libraries? I'd rather be submitting patches than reporting bugs ;)

@marmarek
Member

See here: https://www.qubes-os.org/doc/development-workflow/
For testing I suggest using a cloned template (to avoid bricking too many pieces at once).
There is also (really poor :/) documentation of qrexec internals: https://www.qubes-os.org/doc/qrexec3/#qrexec-protocol-details

As for the bug - the dom0 counterpart is here: https://github.com/QubesOS/qubes-core-admin-linux/blob/master/qrexec/qrexec-daemon.c#L387-L517 and https://github.com/QubesOS/qubes-core-admin-linux/blob/master/qrexec/qrexec-client.c#L713-L805. Unfortunately the logic around vchan port allocation is quite convoluted (the second link), mostly because dom0 is special - it does not have its own qrexec-agent/qrexec-daemon pair.

BTW if you want to install qrexec service in one VM only, you can use /usr/local/etc/qubes-rpc (which is part of /rw), instead of ln -s in rc.local.

@joshuathayer
Author

Thanks so much, I'll spend some time with that tomorrow.

@andrewdavidwong added the T: bug and C: core labels and removed the C: other label on Mar 31, 2018
@unman
Member

unman commented Aug 2, 2019

@andrewdavidwong
The test scripts provided by @joshuathayer now seem to work as expected, and the output is the same between 'target' and 'dispVM:target'.
Can be closed.

@rustybird

rustybird commented Aug 2, 2019

@unman:

The test scripts provided by @joshuathayer now seem to work as expected, and the output is the same between 'target' and 'dispVM:target'.

Are you on R4.1? I'm using R4.0 current-testing, and the one-liner test command from #4273 (which is a duplicate of this bug) still crashes the DispVM.

@marmarek
Member

marmarek commented Aug 2, 2019

For me this one-liner test works fine.
VM: qubes-core-agent-4.0.46
dom0: qubes-core-dom0-linux-4.0.18

@unman
Member

unman commented Aug 2, 2019 via email

@rustybird

Hmm, strange. I have qubes-core-agent-4.0.47 in the Fedora 30 based sourcevm (and qubes-core-dom0-linux-4.0.18 in dom0). Its default_dispvm is also Fedora 30 based.

I run qvm-run-vm --dispvm '/usr/lib/qubes/qrexec-client-vm \$default qubes.Filecopy </dev/null; echo sleeping supposedly forever; sleep inf' in sourcevm, wait a few seconds for the DispVM to start, then there's a Filecopy dom0 permission popup where I choose sourcevm as the target. The still running qvm-run-vm command quickly prints sleeping supposedly forever. After about 2 seconds, there's a Domain disp123 has halted notification. After another 5 seconds or so, qvm-run-vm exits, and /var/log/qubes/qrexec.disp123.log in dom0 contains libvchan_is_eof.

@marmarek
Member

marmarek commented Aug 2, 2019

OK, I can confirm it's still broken. I think it depends on something during source VM startup. After some source VM reboots it works reliably, but after others, it's broken.

@rustybird

Just a shot in the dark, but it seems plausible if there is indeed some VM startup timing factor: my computer is pretty slow, and the bug always triggers. So maybe putting your (presumably faster) system under high CPU and/or I/O load during e.g. sourcevm startup would also more consistently reproduce it for you.

@marmarek
Member

marmarek commented Aug 2, 2019

Yes, it's likely.

I guess it's related to the connection identifier / vchan port number, which are assigned on a first-free basis. See #3318 (comment)

@pwmarcz

pwmarcz commented Feb 10, 2020

I think I can confirm what happens in this case.

I added a debug print to qrexec-daemon whenever it allocated a port. Then I reproduced the issue:

sourcevm $ qvm-run-vm --dispvm '/usr/lib/qubes/qrexec-client-vm \$default qubes.Filecopy </dev/null; echo sleeping supposedly forever; sleep inf'

Here is the daemon log from source VM (with identifiers replaced):

connect_port: 513, connect_domain: DISP_VM, type: MSG_EXEC_CMDLINE
releasing port 513 to DISP_VM
eintr

And here is one from the disp VM:

connect_port: 513, connect_domain: SOURCE_VM, type: MSG_EXEC_CMDLINE
connect_port: 514, connect_domain: 0, type: MSG_EXEC_CMDLINE
connect_port: 515, connect_domain: 0, type: MSG_SERVICE_CONNECT
releasing port 515 to 0
eintr
releasing port 514 to 0
connect_port: 514, connect_domain: 0, type: MSG_SERVICE_CONNECT
eintr
releasing port 513 to SOURCE_VM, terminating client
libvchan_is_eof

Looks like there are two connections over port 513:

  • original qvm-run, in which SOURCE_VM is the vchan server
    • this has a corresponding qrexec-client -W connected to DISP_VM and waiting for end (because this is a dispvm)
  • nested qrexec-client-vm, in which DISP_VM is the vchan server

Then, the second qrexec-client-vm finishes and disconnects from the qrexec-agent for DISP_VM, which notifies the qrexec-daemon for DISP_VM about the connection end, which (incorrectly) kills the connected qrexec-client -W, just because it has the same port number.

@pwmarcz

pwmarcz commented Feb 10, 2020

Initially I thought this was a different kind of collision: the "allocated port" is sometimes allocated for a vchan client (in the case of MSG_EXEC_CMDLINE), and sometimes for a vchan server (in the case of MSG_SERVICE_CONNECT). I still think that can cause problems, but here we have a case of two MSG_EXEC_CMDLINE requests.

I think a good move would be to prevent any kinds of collisions between two daemons. In other words:

  • When asking for domX <--> domY connection, the domX daemon and the domY daemon should never issue the same port number.

A simple scheme to do that would be:

  • If X < Y, then domX daemon issues only odd port numbers (513, 515, 517...) for a connection to domY.
  • Otherwise, domX daemon issues only even port numbers (514, 516, 518...) to domY.

I ran a proof-of-concept and it seems to solve this specific bug.

With the current protocol, I think it's cleaner than tracking where a given connection came from, who was running a server, etc.
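The parity scheme above can be sketched in a few lines. This is an illustrative model of the proposal, not the merged patch; the function name, the base port of 513, and the use of plain integer domain IDs are assumptions for the example. Each daemon only ever issues ports from its own parity class for a given peer, so the two daemons cannot collide:

```python
# Sketch of the proposed odd/even port split: for a domX <--> domY
# pair, the daemon of the lower-numbered domain issues odd ports
# (513, 515, ...) and the other issues even ports (514, 516, ...).

VCHAN_BASE_PORT = 513


def next_port(my_domid, peer_domid, used):
    """Return the first free port in this daemon's parity class."""
    port = VCHAN_BASE_PORT
    if my_domid >= peer_domid:
        port += 1          # even series: 514, 516, 518, ...
    while port in used:
        port += 2          # stay within our own parity class
    used.add(port)
    return port


# domX (id 3) connecting to domY (id 7): odd ports only.
used_x = set()
assert next_port(3, 7, used_x) == 513
assert next_port(3, 7, used_x) == 515

# domY connecting back to domX: even ports only, so no overlap
# with domX's allocations is possible.
used_y = set()
assert next_port(7, 3, used_y) == 514
```

Since the two port sets are disjoint by construction, a connection-terminated report keyed on (peer, port) again identifies a unique connection.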

pwmarcz added a commit to pwmarcz/qubes-core-qrexec that referenced this issue Feb 11, 2020
Should prevent the situations where a wrong vchan connection is
terminated, which causes e.g. premature termination of a DispVM.

See longer discussion in QubesOS/qubes-issues#3318.
@rustybird

rustybird commented May 8, 2020

@marmarek: Have you considered cherry-picking this patch for R4.0? It applies cleanly in qubes-core-admin-linux (after adjusting the file path).

marmarek pushed a commit to QubesOS/qubes-core-admin-linux that referenced this issue May 9, 2020
Should prevent the situations where a wrong vchan connection is
terminated, which causes e.g. premature termination of a DispVM.

See longer discussion in QubesOS/qubes-issues#3318.
@marmarek
Member

Backported, included in QubesOS/updates-status#1787
