Qubes doesn't start VMs on 20% of boots #3626

Open
3hhh opened this Issue Feb 24, 2018 · 7 comments

3hhh commented Feb 24, 2018

Qubes OS version:

Present since 4.0-rc3, currently with 4.0-rc4.

Affected component(s):

VMs, dom0 starts fine

Steps to reproduce the behavior:

Boot the machine and get unlucky; for me this happens in roughly 20% of cases.

Expected behavior:

VMs can be started

Actual behavior:

VMs seem to start (a notification appears), but they apparently hang at some point; all VMs are affected equally.

Attempting to run qvm-run -p vm 'echo bla' returns

gntshr: error: ioctl failed: No space left on device
Failed to start data vchan server

Some suspicious entries from a dom0 sudo journalctl -b0:

Multiple instances of

Feb 24 12:50:27 dom0 qmemman.daemon.reqhandler[2067]: EOF

and multiple instances of

Feb 24 12:55:46 dom0 qubesd[2055]: unhandled exception while calling src=b'[VM name]' meth=b'qubes.NotifyUpdates' dest=b'dom0' arg=b'' len(untrusted_payload)=2
Feb 24 12:55:46 dom0 qubesd[2055]: Traceback (most recent call last):
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib/python3.5/site-packages/qubes/api/__init__.py", line 262, in respond
Feb 24 12:55:46 dom0 qubesd[2055]:     untrusted_payload=untrusted_payload)
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib64/python3.5/asyncio/futures.py", line 381, in __iter__
Feb 24 12:55:46 dom0 qubesd[2055]:     yield self  # This tells Task to wait for completion.
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 310, in _wakeup
Feb 24 12:55:46 dom0 qubesd[2055]:     future.result()
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib64/python3.5/asyncio/futures.py", line 294, in result
Feb 24 12:55:46 dom0 qubesd[2055]:     raise self._exception
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Feb 24 12:55:46 dom0 qubesd[2055]:     result = coro.send(None)
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib64/python3.5/asyncio/coroutines.py", line 210, in coro
Feb 24 12:55:46 dom0 qubesd[2055]:     res = func(*args, **kw)
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib/python3.5/site-packages/qubes/api/misc.py", line 123, in qubes_notify_updates
Feb 24 12:55:46 dom0 qubesd[2055]:     if self.src.storage.outdated_volumes:
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib/python3.5/site-packages/qubes/storage/__init__.py", line 564, in outdated_volumes
Feb 24 12:55:46 dom0 qubesd[2055]:     if volume.is_outdated():
Feb 24 12:55:46 dom0 qubesd[2055]:   File "/usr/lib/python3.5/site-packages/qubes/storage/__init__.py", line 238, in is_outdated
Feb 24 12:55:46 dom0 qubesd[2055]:     raise self._not_implemented("is_outdated")
Feb 24 12:55:46 dom0 qubesd[2055]: NotImplementedError: Volume FileVolume has is_outdated() not implemented

qubesd seemed to be up and running. A qubesd restart didn't change anything.
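
A filter along these lines pulls such entries out of the dom0 boot journal (the grep pattern is only illustrative; any pattern matching the qmemman/qubesd lines works):

# dom0: show qmemman / qubesd messages from the current boot
sudo journalctl -b0 --no-pager | grep -E 'qmemman|qubesd'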

General notes:

debian-8 and debian-9 templates were tested

Related issues:

#3616 sounds similar wrt the outcome, but I think it's a different issue.

marmarek (Member) commented Feb 24, 2018

Attempting to run qvm-run -p vm 'echo bla' returns

gntshr: error: ioctl failed: No space left on device
Failed to start data vchan server

This looks related. How many VMs do you have running? Check also how many qrexec-client processes are in dom0 (should be about 1 per VM).
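
For example (just a sketch; it assumes the dom0 process names qrexec-daemon and qrexec-client as shipped with Qubes 4.0):

# dom0: count qrexec processes; the daemon count should match the number of running VMs
pgrep -c qrexec-daemon
pgrep -c qrexec-client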

3hhh commented Mar 8, 2018

Yesterday I ran into this one again.

This looks related. How many VMs do you have running?

It's directly after boot, so just 4.

Check also how many qrexec-client processes are in dom0 (should be about 1 per VM).

Yes, there was exactly one qrexec-daemon per VM; same for qrexec-client (before I started trying qvm-run commands).

Another interesting fact:

xl console [vm] 
Unable to attach console.
3hhh commented Mar 8, 2018

qvm-ls, however, reported the [vm] to be running. No custom init scripts had run though, e.g. no NetworkManager applet for sys-net etc.

marmarek (Member) commented Mar 8, 2018

Another interesting fact:

xl console [vm]
Unable to attach console.

Use sudo xl console -t pv [vm]

What virt_mode does this VM have? (qvm-prefs [vm] virt_mode)
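
To check all VMs at once, a loop like this works (a sketch; it assumes qvm-ls --raw-list prints one VM name per line):

# dom0: print the virt_mode of every VM (skip dom0, which has no such property)
for vm in $(qvm-ls --raw-list); do
    [ "$vm" = dom0 ] && continue
    echo "$vm: $(qvm-prefs "$vm" virt_mode)"
done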

3hhh commented Mar 8, 2018

Use sudo xl console -t pv [vm]

I think I also tried it with sudo, but OK, I'll try again the next time it happens.

What virt_mode does this VM have? (qvm-prefs [vm] virt_mode)

3× pvh (one of them a disposable), 1× hvm (sys-usb as a disposable).

I also sent you the full journalctl output from the original issue description via e-mail - maybe you can see more in it than I do (I didn't want to post it all online).

3hhh commented Mar 14, 2018

sudo xl console -t pv [vm]

That was the right command, thanks!
All of my VMs seemed to be up and running fine according to their consoles, but the GUI never showed up.

Thus I believe I identified the root cause:
As part of #3149 I was starting VMs via the WM autostart (awesome, in my case), which apparently caused a race condition between the start of the Qubes GUI daemon (or whatever else is started from /etc/xdg/autostart) and my autostarted VMs.

I had just run into this 7 times in a row; once I removed the VM autostart, waited a few seconds after my X session came up and then started the VMs manually, it worked 2 times in a row. Time will tell.

I don't really understand this though: in the default Qubes config the VMs are started before the xdg autostart anyway, so why would I have to wait for it?

For reference some suspicious entries in the journal of one of the VMs (might indicate another unrelated issue though):

...
Mar 14 20:38:47 d-sys-usb systemd[1]: Started Session c27 of user user.
Mar 14 20:38:57 d-sys-usb qrexec-agent[633]: pid 1298 exited with -1
Mar 14 20:38:57 d-sys-usb qrexec-agent[633]: eintr
Mar 14 20:39:11 d-sys-usb qrexec-agent[633]: executed user:QUBESRPC qubes.WaitFo
Mar 14 20:39:11 d-sys-usb qrexec-agent[1317]: pam_unix(qrexec:session): session 
Mar 14 20:39:11 d-sys-usb systemd[1]: Started Session c28 of user user.
Mar 14 20:39:20 d-sys-usb qrexec-agent[633]: pid 1317 exited with -1
Mar 14 20:39:20 d-sys-usb qrexec-agent[633]: eintr
Mar 14 20:39:34 d-sys-usb qrexec-agent[633]: executed user:QUBESRPC qubes.WaitFo
Mar 14 20:39:34 d-sys-usb qrexec-agent[1338]: pam_unix(qrexec:session): session 
Mar 14 20:39:34 d-sys-usb systemd[1]: Started Session c29 of user user.
Mar 14 20:39:44 d-sys-usb qrexec-agent[633]: pid 1338 exited with -1
Mar 14 20:39:44 d-sys-usb qrexec-agent[633]: eintr
Mar 14 20:39:58 d-sys-usb qrexec-agent[633]: executed user:QUBESRPC qubes.WaitFo
Mar 14 20:39:58 d-sys-usb qrexec-agent[1392]: pam_unix(qrexec:session): session 
Mar 14 20:39:58 d-sys-usb systemd[1]: Started Session c30 of user user.
...
Mar 14 20:47:06 d-sys-usb kernel: xenbus: xen store gave: unknown error E2BIG
Mar 14 20:47:06 d-sys-usb qrexec-agent[633]: Data vchan connection failed
Mar 14 20:47:06 d-sys-usb qrexec-agent[633]: eintr
Mar 14 20:47:06 d-sys-usb qrexec-agent[633]: Cannot setup watch on /local/domain
Mar 14 20:47:06 d-sys-usb qrexec-agent[633]: Data vchan connection failed
Mar 14 20:47:06 d-sys-usb kernel: xenbus: xen store gave: unknown error E2BIG
...
3hhh commented Mar 27, 2018

As part of #3149 I was starting VMs via the WM autostart (awesome, in my case), which apparently caused a race condition between the start of the Qubes GUI daemon (or whatever else is started from /etc/xdg/autostart) and my autostarted VMs.

I had just run into this 7 times in a row; once I removed the VM autostart, waited a few seconds after my X session came up and then started the VMs manually, it worked 2 times in a row. Time will tell.

Nowadays I call this script from the awesome WM autostart:

#!/bin/bash

#start all the XFCE autostart stuff
#for some stupid reason this apparently must happen BEFORE VMs are started -_- (as of 2018-03 Qubes otherwise crashes in 20% of the boots, cf. https://github.com/QubesOS/qubes-issues/issues/3626)
#dex doesn't block so we need to wait (it also screws with pipes, so we don't give it any)
pgrep -f -a mqui.tray | grep -v $$ || ( dex-autostart -a -e XFCE < /dev/null &> /dev/null ; sleep 10 )

#start VMs
#make sure they are not started by Qubes (qvm-prefs)
#qubes will automatically start all VMs further downstream
qvm-start --skip-if-running sys-usb &
pid=$!
qvm-start --skip-if-running sys-firewall
wait $pid

So far this fixed the issue.
It would be interesting to understand the root cause though.
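
For comparison, the stock mechanism that the script's "qvm-prefs" comment refers to is the per-VM autostart property, which lets qubesd itself (rather than the WM session) bring the VMs up at boot. A sketch (the autostart property exists in Qubes 4.0):

# dom0: have Qubes start these VMs at boot instead of the WM autostart
qvm-prefs sys-usb autostart True
qvm-prefs sys-firewall autostart True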
