Mysterious GUId exit #16
Comments
marmarek added the bug and C: gui-virtualization labels on Mar 8, 2015
marmarek added the P: critical label on Mar 8, 2015
marmarek added this to the Release 1 Alpha 2 milestone on Mar 8, 2015
marmarek (Member) commented on Mar 8, 2015
Comment by rafal on 7 Apr 2010 10:37 UTC
This is really weird. There is a "libvchan_is_eof" line in the qubes_guid log. This means the qubes_guid process noticed that the "server_closed" flag is set in the shared page. The problem is that this flag is never set by the VM-side agent code (it never plans to exit).
Moreover, the libvchan_is_eof() function, when executed client-side (so, by the qubes_guid process), would also set the "client_closed" flag and signal the server. The server should then wake up, see this flag, and exit as well. This does not happen; the server is still alive.
My current guess is that somehow the shared page is no longer mapped in the qubes_guid process, and some garbage is there instead, which would imply a Xen or kernel bug. The problem is that this bug is very hard to reproduce, so debugging is difficult.
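For context, the client-side check described above is roughly of this shape. This is a minimal sketch with assumed struct layout and helper names, not the actual libvchan code:

/* Sketch of the client-side EOF check described above; the struct
 * layout and notify_peer() are illustrative assumptions. */
struct vchan_ring {
    unsigned int debug;          /* canary, expected to be 0x11223344 */
    unsigned int server_closed;  /* set by the server when it exits */
    unsigned int client_closed;  /* set by the client when it exits */
};

/* stands in for the real event-channel notification to the peer */
static void notify_peer(void) { }

static int is_eof_client_side(volatile struct vchan_ring *ring)
{
    if (ring->server_closed) {
        ring->client_closed = 1;  /* mark our side closed... */
        notify_peer();            /* ...and wake the server so it exits too */
        return 1;
    }
    return 0;
}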
marmarek (Member) commented on Mar 8, 2015
Comment by joanna on 11 Apr 2010 20:50 UTC
Got lucky today and reproduced it again! This time we can see the "ring" page contains some trash, e.g. the "debug" canary is not 0x11223344:
libvchan_is_eof:
ctrl=0x17f35c0 ring=0x7f051f671000 ring_ref=0xa57ed evfd=0x7 evport=0x2c, debug=0xed3c81b0 server_closed=0xed1cfc53 client_closed=0x1
release_all_mapped_mfns running
Currently the best theory, suggested by Rafal, is that the "ring" page mmap'ed by the VM is being swapped out from time to time, even though we explicitly lock it via mlock() in ring_init().
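The locking pattern in question looks roughly like this (a sketch with assumed names; the real ring_init() may differ):

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of allocating and locking the ring page, as described above.
 * Note that mlock() only guarantees residency (no swap-out) of the
 * virtual page; it does not pin which physical frame (MFN) backs it. */
static void *alloc_ring_page(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *ring = NULL;
    if (posix_memalign(&ring, pagesize, pagesize))
        return NULL;
    if (mlock(ring, pagesize)) {
        free(ring);
        return NULL;
    }
    return ring;
}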
The output of the 'free' command in the VM after this situation occurred confirms that some swap has indeed been in use in the VM:
~# free -m
             total       used       free     shared    buffers     cached
Mem:           397        373         23          0         25        111
-/+ buffers/cache:        236        160
Swap:          999          3        996
I modified guid to loop and continuously log the state of the ring page when this mysterious exit condition occurs -- see this commit in 'guid_exit_debugging' branch:
http://qubes-os.org/gitweb/?p=joanna/gui.git;a=commitdiff;h=95b2596a969aade93ba6ab0d8a33914415116448
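The added logging is essentially a loop of this shape (a sketch; the struct layout is an assumption, see the commit above for the real code):

#include <stdio.h>
#include <unistd.h>

#define RING_DEBUG_CANARY 0x11223344u

struct ring_page {               /* assumed layout, for illustration */
    unsigned int debug;
    unsigned int server_closed;
    unsigned int client_closed;
};

/* Keep logging the ring state once the mysterious exit condition hits,
 * so we can tell whether the garbage is transient or permanent. */
static void log_ring_state(volatile struct ring_page *ring)
{
    for (;;) {
        fprintf(stderr,
                "debug=0x%x server_closed=0x%x client_closed=0x%x (%s)\n",
                ring->debug, ring->server_closed, ring->client_closed,
                ring->debug == RING_DEBUG_CANARY ? "canary OK" : "garbage");
        sleep(1);
    }
}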
The next time the situation occurs, if we see that the ring state returns to normal after some time (e.g. the debug canary is 0x11223344 again), it would suggest the page was only temporarily unavailable at the PFN that guid thought it should be at, and would most likely mean it had been swapped out.
On the other hand, if we see only garbage in all the following logs, it might suggest that the page has either been swapped out (and there was just no request to pull it back in, or it was swapped back in at a different PFN?), or that the VM kernel has somehow moved this page to some other PFN for some unknown reason... (what could that be?)
Perhaps the best solution would be to allocate this page from within a kernel driver, to prevent the kernel from swapping it out or moving it to another PFN. (Do we need to additionally lock it somehow, e.g. pretend it's a DMA buffer, to prevent any potential PFN rearrangement?)
marmarek (Member) commented on Mar 8, 2015
Comment by joanna on 12 Apr 2010 18:11 UTC
Ok, we're so much closer now. We have added some extra debugging info:
http://qubes-os.org/gitweb/?p=joanna/gui.git;a=commitdiff;h=31651bd6d1f177f293ddf9c99ebb941b495eb769
We have now determined that the MFN (sorry, I meant MFN when writing PFN in the previous comment) for the "ring" page can indeed change on the AppVM side.
We used to believe that by using mlock() on the ring buffer we could pin the buffer to its MFN, but apparently this is not so:
[root@random ~]# ps ax | grep qubes_gui
  739 ?        SL     0:00 /usr/bin/qubes_gui
 1248 hvc0     S+     0:00 grep qubes_gui
[root@random ~]# kill -USR1 739
[root@random ~]# head -1 /var/log/qubes/gui_agent.log
ring_va = 0x7f62995bd000, ring_mfn = 0x7d8f2
[root@random ~]# tail -1 /var/log/qubes/gui_agent.log
ring_va = 0x7f62995bd000, ring_mfn = 0x856d8
The above log (taken in an AppVM) shows that the MFN for the ring page has changed -- when it was originally allocated it was stored at MFN = 0x7d8f2 and now (a few minutes later) the kernel (or Xen?) has moved it to another MFN = 0x856d8.
This is obviously a problem: the guid, running in Dom0, still thinks the page is at the original address, and as a result it gets garbage and shuts down the GUI for this particular domain.
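The logging shown in the transcript can be produced by a handler along these lines (a sketch: u2mfn_get_mfn_for_page() is an assumed interface to the u2mfn module, and fprintf() is used in the handler for brevity even though it is not async-signal-safe):

#include <signal.h>
#include <stdio.h>

extern void *ring_va;                        /* the mmap'ed ring page */
extern int u2mfn_get_mfn_for_page(long va);  /* assumed u2mfn helper */

/* On SIGUSR1, ask the u2mfn module which MFN currently backs the ring
 * page and append it to the log, so two samples can be compared. */
static void dump_ring_mfn(int sig)
{
    (void)sig;
    fprintf(stderr, "ring_va = %p, ring_mfn = 0x%x\n",
            ring_va, u2mfn_get_mfn_for_page((long)ring_va));
}

static void install_mfn_dump_handler(void)
{
    signal(SIGUSR1, dump_ring_mfn);  /* then: kill -USR1 <agent pid> */
}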
We also found a method that results in almost 100% reproduction:
- Start a test vm, e.g. (in Dom0):
qvm-start random
- Run this script (in Dom0):
#!/bin/sh
qvm-run random "(while true; do echo; done) | zenity --progress --pulsate"
while true; do
    qvm-run random firefox
    sleep 1
done
Shortly after a few instances of Firefox have started, zenity's progress bar should stop moving, indicating a crash of the GUI for the 'random' VM.
We suspect the MFN changes are not related to swapping, because we observed that sometimes swap had not been used at all in the VM while this MFN change still occurred. Our current best theory is that there is some kind of memory defragmentation/reclaiming performed by the VM kernel or the Xen hypervisor that modifies the MFNs.
Interestingly, we have never observed any garbage in the composition buffers, which are also usermode buffers allocated by the VM's Xorg server and which we access via their MFNs. This is strange -- does the X server use some special flag for its composition buffers?
marmarek (Member) commented on Mar 8, 2015
Comment by joanna on 13 Apr 2010 23:08 UTC
After some desperate attempts to force MFN pinning of the usermode-allocated buffer, we finally decided to modify u2mfn and libvchan to allocate the ring page in kernel mode, and now everything seems to work fine. However, the code in the testing branch is not of very high quality yet...
Anyway, there is still a question about the composition buffers: why have we never run into problems with their MFNs? Does the X server somehow pin their MFNs? It would be great if it did, as it would mean we have just solved the problem. But we need to confirm this 100%...
History of our today's debugging and testing adventures:
http://qubes-os.org/gitweb/?p=joanna/gui.git;a=shortlog;h=refs/heads/guid_exit_debugging
marmarek (Member) commented on Mar 8, 2015
Comment by joanna on 27 Apr 2010 09:41 UTC
This problem has been solved in version 1.0.1/1.1.0 of gui/vchan (sorry for de-syncing the versions) by the following two commits:
- http://qubes-os.org/gitweb/?p=mainstream/gui.git;a=commit;h=7c9515f4c4f607b22db9cca179e9f131ed83c406
- http://qubes-os.org/gitweb/?p=mainstream/gui.git;a=commit;h=cbad17a0a748d0f2420774fe20acec98d8bede7a
Essentially, they modify the u2mfn module to implement an mmap() method that allocates buffers in the kernel for usermode apps, and then they modify vchan so that it uses this mmap() to allocate its ring buffer. Thanks to this, we ensure the ring buffer's MFN can never be changed by the DomU kernel's defragmentation algorithm.
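The general shape of such an mmap() method is sketched below (illustrative only; see the two commits above for the actual u2mfn code). Because the page comes from the kernel allocator and is mapped into the process via remap_pfn_range(), it is not subject to usermode swapping or page migration:

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/gfp.h>

/* Sketch: hand a kernel-allocated page to userspace via mmap().
 * Freeing the page on release is omitted for brevity. */
static int sketch_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long kpage = get_zeroed_page(GFP_KERNEL);
    int ret;

    if (!kpage)
        return -ENOMEM;
    /* The frame backing this page is fixed for the page's lifetime,
     * so its MFN cannot change under the usermode mapping. */
    ret = remap_pfn_range(vma, vma->vm_start,
                          page_to_pfn(virt_to_page((void *)kpage)),
                          PAGE_SIZE, vma->vm_page_prot);
    if (ret)
        free_page(kpage);
    return ret;
}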
We have been testing this for the last two weeks and have never observed the problem again.
Still, it is interesting that the problem doesn't affect the composition buffers. To the best of our knowledge, they are allocated by the DomU's Xorg server via a regular malloc(), and so they should be subject to the same problem as our ring page was. But we have never seen any problem with a composition buffer's MFN being changed after its allocation. Such a problem would be visible as "garbage" in one of the AppViewer's windows. There must be some mechanism that pins the composition buffers' MFNs -- perhaps the Xorg heap is pinned? (But then again, the composition buffers are usually several MBs in size, so they should eventually be allocated via mmap(), not from the heap?)
Anyway, we have never observed this problem in practice (and we did some really extensive testing). And even if it ever occurred (i.e. the MFN of one of the composition buffers got relocated in DomU), it would not cause a guid exit/crash; it would only put garbage into one of the AppViewer windows, and the user would still be able to quickly recover from this situation, e.g. by resizing the window slightly (which should cause composition buffer re-allocation).
Thus, we consider this ticket closed.
The new rpms with updated qubes-gui-vm (1.0.1) and qubes-vchan (1.1.0) have been built and pushed into the yum repo. Users can upgrade by starting their templates and doing:
yum update 'qubes-*'
marmarek commented on Mar 8, 2015
Reported by joanna on 6 Apr 2010 23:00 UTC
Eventually managed to reproduce it! Logs from Dom0 and VM attached.
The guid in Dom0 exited:
The gui in the VM is still running:
Migrated-From: https://wiki.qubes-os.org/ticket/16