Occasionally qubes core gets deadlocked #1636
Comments
rootkovska added the bug, C: core, P: blocker labels on Jan 15, 2016
rootkovska added this to the Release 3.1 milestone on Jan 15, 2016
marmarek added the duplicate label on Jan 15, 2016
marmarek (Member) commented on Jan 15, 2016
This is a duplicate of #1389, already fixed; just uploaded the package to current-testing.
marmarek closed this on Jan 15, 2016
rootkovska (Member) commented on Feb 21, 2016
Sadly this still affects me on the latest rc3 :/ It is easily reproducible in an environment where lots of VMs are running and there is a shortage of available memory, which apparently causes the DispVM launch to fail. In that case the whole qubes core gets deadlocked, as described above.
FWIW, here is a fragment of the log from the DispVM that got started but whose vchan never connected:
Feb 18 09:39:40 fedora-23-ylw-dvm prepare-dvm.sh[374]: Closing windows...
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: USER PID ACCESS COMMAND
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: /rw: root kernel mount /rw
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: done.
Feb 18 09:39:41 fedora-23-ylw-dvm su[679]: pam_unix(su-l:session): session closed for user user
Feb 18 09:39:41 fedora-23-ylw-dvm audit[679]: USER_END pid=679 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:session_close grantors=pam_keyinit,pam_keyinit,pam_limits,pam_systemd,pam_unix acct="user" exe="/usr/bin/su" hostname=? addr=? terminal=? res=success'
Feb 18 09:39:41 fedora-23-ylw-dvm audit[679]: CRED_DISP pid=679 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/bin/su" hostname=? addr=? terminal=? res=success'
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: Waiting for save/restore...
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: Failed to read /qubes-restore-complete
Feb 18 09:39:41 fedora-23-ylw-dvm qubes-gui[474]: libvchan_is_eof
Feb 18 09:39:41 fedora-23-ylw-dvm qubesdb-daemon[181]: vchan closed
Feb 18 09:39:41 fedora-23-ylw-dvm qubesdb-daemon[181]: reconnecting
Feb 18 09:40:37 fedora-23-ylw-dvm systemd[405]: Time has been changed
Feb 18 09:40:37 fedora-23-ylw-dvm pulseaudio[802]: vchan module loading
Feb 18 09:40:37 fedora-23-ylw-dvm pulseaudio[802]: play libvchan_fd_for_select=18, ctrl=0x55cc68d2d3a0
Manually shutting down the half-started VM helped, but otherwise the whole system is unusable due to locked qubes.
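For illustration, such a half-started DispVM can be torn down from dom0 with xl (the domain name below is just a placeholder; xl list shows the actual one):
xl list
xl destroy <name-of-the-stuck-dispvm>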
rootkovska reopened this on Feb 21, 2016
marmarek (Member) commented on Feb 21, 2016
What exact versions of Qubes packages do you have (both dom0 and that template)?
marmarek (Member) commented on Feb 21, 2016
Also, how long did you wait? There is a 60s timeout on the qrexec connection (configurable using the qrexec_timeout VM pref).
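For example, assuming the R3.x qvm-prefs syntax, the timeout could be lowered for a given VM from dom0 like this (the VM name and the 10s value are just placeholders):
qvm-prefs -s <vmname> qrexec_timeout 10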
marmarek (Member) commented on Feb 21, 2016
Ok, found one case where the timeout wasn't enforced. In R3.0+ there are two vchan connections in qrexec:
- the control connection (always between dom0 and the VM)
- the data connection (between VMs directly, in most cases)
The missing timeout check was in the second one. In the normal case it should connect instantly (once the control connection is already set up), so any unusual delay here is an error.
In the case of DispVM startup there are two consecutive data connections: qubes.WaitForSession, then qubes.OpenInVM (or whatever other service was requested in that DispVM). This means that a 60s timeout per data connection results in a 2-minute "freeze" when DispVM startup fails. Maybe this timeout should be much smaller? Like 10s (which is already way above the normal connection time)?
This is all about handling the error condition in a user-friendly way, regardless of what the actual error is. The actual DispVM startup problem is a separate issue, IMHO with much lower priority (not a release blocker).
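As a rough, generic way to check whether a frozen DispVM launch is merely waiting out these timeouts (a diagnostic suggestion, not part of the fix), one can look at how long the qrexec processes in dom0 have been running:
ps -eo pid,etime,cmd | grep '[q]rexec'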
|
Ok, found one case where timeout wasn't enforced. In R3.0+ there are two vchan connection in qrexec:
Missing timeout check was in the second one. In normal case, it should connect instantly (when control connection is already set up), so any unusual delay here is an error. This all is about dealing with error condition in user-friendly way, regardless of what the actual error is. Actual DispVM startup problem is a separate issue, IMHO with much lower priority (not a release blocker). |
rootkovska (Member) commented on Feb 22, 2016
Yeah, I think any timeout longer than 3-5 secs doesn't make sense, because it feels like an eternity for the user, and the user will be more inclined to restart the system than to wait that long. Especially since this effectively locks down all the other operations, such as starting other VMs or even listing them (which of course is a serious UX bug even if we reduce the timeouts to 3s or so).
marmarek (Member) commented on Feb 22, 2016
DispVM startup can take 10-20s depending on the hardware, so even a 10s timeout doesn't look suspicious. I have a slow machine and will test a 3s timeout.
Anyway, I agree that this locking mechanism isn't ideal. This is going to improve in R4.1 (with the introduction of "qubesd" and the management API), but we will also need to solve it somehow in R4.0: #1729.
marmarek closed this in marmarek/qubes-core-admin-linux@f8d8368 on Feb 23, 2016
Commits referencing this issue were added to marmarek/old-qubes-core-admin on Feb 23, 2016
marmarek (Member) commented on Feb 23, 2016
Automated announcement from builder-github
The package qubes-core-dom0-linux-3.1.8-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
marmarek added the r3.1-dom0-cur-test label on Feb 23, 2016
marmarek (Member) commented on Feb 28, 2016
Automated announcement from builder-github
The package qubes-core-dom0-linux-3.1.8-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
rootkovska opened this issue on Jan 15, 2016
I cannot find a reliable way to reproduce it, but have already run into this bug on 2 different, freshly installed 3.1-rc2 systems within just 2 days(!).
I suspect this happens after an unsuccessful attempt to start a DispVM (see #1621) -- the system is then left with a process keeping a lock on qubes.xml, preventing any other qvm-* tools, the manager, etc., from working until the misbehaving process gets manually killed. Here are some snippets from the last session I encountered today:
I then manually killed just the 16007 process, and this restored the system back to normal (e.g. it then executed all the scheduled qvm-run's), although a few other processes related to that DispVM were still hanging:
For hygienic reasons I also xl destroy'ed the disp2, which also killed these remaining processes; afterwards lsof | grep qubes.xml showed nothing.
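To spell out the recovery steps above as commands (the PID and domain name are the ones from this particular session and will differ):
lsof | grep qubes.xml
kill 16007
xl destroy disp2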
I think it's very important to track down and fix this bug, because a less advanced user would have no choice but to reboot their system, possibly losing all their unsaved work. I'm thus giving it top priority.