VM instance of QubesDB (sometimes) crashes on DispVM restore #1389

Closed
marmarek opened this Issue Nov 7, 2015 · 16 comments

Comments

marmarek commented Nov 7, 2015

Nov 07 10:50:05 fedora-21-dvm qubesdb-daemon[155]: vchan closed
Nov 07 10:50:05 fedora-21-dvm qubesdb-daemon[155]: reconnecting
Nov 07 10:51:02 fedora-21-dvm qubesdb-daemon[155]: xc: error: xc_gnttab_map_grant_refs: mmap failed (22 = Invalid argument): Internal error
Nov 07 10:51:02 fedora-21-dvm qubesdb-daemon[155]: vchan reconnection failed

Additionally, when it happens, the gui-agent spins in an endless loop trying to read
the QubesDB watch (waiting for the DispVM restore).

marmarek added this to the Release 3.1 milestone Nov 7, 2015

marmarek added a commit to marmarek/old-qubes-gui-agent-linux that referenced this issue Nov 8, 2015

DispVM restore: do not spin forever when QubesDB connection is closed
This is already quite a fatal situation, but do not make it even worse by
spinning in an endless loop.

QubesOS/qubes-issues#1389

marmarek added a commit to QubesOS/qubes-gui-agent-linux that referenced this issue Nov 13, 2015

DispVM restore: do not spin forever when QubesDB connection is closed
This is already quite a fatal situation, but do not make it even worse by
spinning in an endless loop.

QubesOS/qubes-issues#1389

(cherry picked from commit e2da2b8)
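
For illustration only (the actual fix is in the commit above, and the real gui-agent is written in C): the idea is to treat a closed QubesDB connection as fatal instead of retrying the watch read forever. A rough Python sketch with hypothetical names:

def wait_for_dispvm_restore(qdb):
    # qdb.read_watch() is assumed to return the fired watch path,
    # or None once the QubesDB connection has been closed.
    while True:
        event = qdb.read_watch()
        if event is None:
            # Previously this case was retried in a tight loop forever.
            raise RuntimeError("QubesDB connection closed; giving up on DispVM restore")
        if event == "/qubes-restore-complete":   # hypothetical watch path
            return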

marmarek commented Nov 26, 2015

The same applies to the qrexec-agent data connection, also on DispVM restore. So it looks like a vchan problem.

marmarek added the C: xen label Nov 26, 2015

marmarek commented Dec 13, 2015

It may cause the DispVM IP address to be wrong, because the /usr/lib/qubes/setup-ip script can't get the right one from QubesDB.

marmarek added P: critical and removed P: minor labels Dec 22, 2015

i7u commented Dec 24, 2015

A side effect of this bug seems to be:
[user@fedora-23-dvm ~]$ cat /etc/resolv.conf
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB[user@fedora-23-dvm ~]$

This happens 100% of the time when I launch a DispVM on a fresh 3.1rc1 installation.

Rudd-O commented Dec 26, 2015

Affects me too. Qubes DB crashes. Confirmed via xl console.

m-v-b commented Jan 4, 2016

Given that this issue is intermittent, there appears to be a small race condition.

I found that removing the following lines in 01QubesDisposableVm.py (in /usr/lib64/python2.7/site-packages/qubes/modules) resolves this issue:

        # close() is not really needed, because the descriptor is close-on-exec
        # anyway, the reason to postpone close() is that possibly xl is not done
        # constructing the domain after its main process exits
        # so we close() when we know the domain is up
        # the successful unpause is some indicator of it
        if qmemman_present:
            qmemman_client.close()

As the comments indicate, the postponed close() of the qmemman client is used as a synchronization point; however, this does not seem to be necessary and appears to cause the issue reported in this bug.

To prove this, I wrote a simple shell script, available at [1].

After removing the aforementioned two lines from 01QubesDisposableVm.py, my script runs to completion, and more importantly, I have never encountered this issue after the proposed modification.

I also found another option (but a rather unacceptable one). Adding a sleep after resuming the virtual machine in the same file, after the following line, also resolves the issue:

    self.libvirt_domain.resume()

[1] https://gist.github.com/m-v-b/5f156ae08d089efd757f
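
A minimal sketch of the second workaround described above (the sleep duration is arbitrary; in the real 01QubesDisposableVm.py these calls sit inside the DispVM start path, where vm stands for the DispVM object and qmemman_client for the handle whose close() lets qmemman resume, as explained in the next comment):

import time

def resume_with_delay(vm, qmemman_client, qmemman_present=True):
    # ... savefile restore happens earlier in the real code ...
    vm.libvirt_domain.resume()
    time.sleep(5)                  # crude: give the restored VM time to settle
    if qmemman_present:
        qmemman_client.close()     # this is what resumes qmemman rebalancing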

marmarek commented Jan 4, 2016

qmemman_client.close() basically resumes qmemman operations (dynamic management of VM memory). So maybe the problem is that memory used by vchan during the connection is taken away by qmemman? That would be a kernel bug...

marmarek commented Jan 5, 2016

Progress: that xc_gnttab_map_grant_refs: mmap failed is in fact a failed GNTTABOP_map_grant_ref hypercall. The actual error is GNTST_no_device_space (-7) /* Out of space in I/O MMU. */ (I got that using a kernel patch, because it isn't logged anywhere...)

marmarek commented Jan 5, 2016

Generally this is a qmemman problem:
When a VM has some memory assigned, it means two things:

  • an upper limit on the VM's memory allocation
  • a target memory size for the balloon driver

When the memory assignment is changed, both values are changed. But the actual memory usage will change only after the balloon driver balloons the VM memory up/down (gives the memory back to the hypervisor, or takes it from there). Until that happens, memory assigned but not yet allocated by the VM is considered "free" from the hypervisor's point of view (as in xl info; xl list also displays actual memory usage, not the "target memory"). In that case, the VM is free to allocate such memory (up to the assigned limit) at any time.

Exactly this happens during DispVM startup:

  1. QubesDisposableVM.start requests some memory (the initial memory size of the DispVM template - the *-dvm VM) from qmemman to start the new DispVM (e.g. 400MB)
  2. The DispVM is restored from a savefile, using only the memory that was allocated at savefile creation time (e.g. 280MB)
  3. Now the DispVM is using some memory (280MB), but is allowed to use the initial size (400MB). The difference (120MB) is considered "free".
  4. qmemman redistributes that free memory among the other VMs, leaving a 50MB safety margin
  5. The DispVM, after some time, allocates the remaining memory, draining all the memory from the Xen free pool
  6. Bad Things(tm) happen - in this case grant table operation failures
  7. qmemman adjusts the memory assignments, so everything looks fine a moment later (making debugging harder)

Note that this is nothing specific to DispVMs or savefile usage. Any VM with a misbehaving balloon driver could trigger such a problem. The root of the problem is that qmemman doesn't account for memory that is assigned to a VM but not used yet.
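
To make that accounting gap concrete, here is a small self-contained sketch; the numbers follow the example above and the names are illustrative only, not the real qmemman code:

from dataclasses import dataclass

TOTAL_HOST_MB = 4000   # made-up host size for the sketch

@dataclass
class Dom:
    name: str
    assigned_mb: int    # balloon target / upper allocation limit
    allocated_mb: int   # what the VM actually holds right now

def xen_reported_free(doms):
    # xl info reports free memory based on allocated pages only
    return TOTAL_HOST_MB - sum(d.allocated_mb for d in doms)

def truly_unassigned(doms):
    # what a race-free qmemman should treat as distributable
    return TOTAL_HOST_MB - sum(d.assigned_mb for d in doms)

doms = [
    Dom("dom0", 1000, 1000),
    Dom("work", 2500, 2500),
    Dom("dispvm", 400, 280),   # restored from savefile, balloon not caught up yet
]

print("xl info free:", xen_reported_free(doms), "MB")        # 220 MB
print("actually unassigned:", truly_unassigned(doms), "MB")  # 100 MB
# Redistributing the first number (minus the 50MB margin) hands out memory the
# DispVM is still entitled to; once its balloon driver claims it, the Xen free
# pool is drained and grant table mappings start failing.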

marmarek commented Jan 5, 2016

Debugging hint:

watch -n 0.2 xl info\;xl list

And carefully observe the screen during DispVM startup.

marmarek commented Jan 7, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.9-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

marmarek commented Jan 12, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.10-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

marmarek commented Jan 14, 2016

Apparently the race condition mentioned here is more common than I thought.

marmarek reopened this Jan 14, 2016

marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jan 14, 2016

qmemman: use try/finally to really release the lock
Currently not needed in practice, but a preparation for the next
commit(s).

QubesOS/qubes-issues#1389
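
A minimal sketch of the pattern named in the commit title, with hypothetical names (the real change is inside qmemman's request handling):

import threading

global_lock = threading.Lock()

def handle_request(do_work):
    global_lock.acquire()
    try:
        do_work()
    finally:
        # released even if do_work() raises, so one failed request can no
        # longer leave the lock held forever
        global_lock.release()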

marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jan 14, 2016

qmemman: refresh domain list holding global_lock
Retrieve the domain list only after obtaining the global lock. Otherwise an
outdated list may be used when a domain was introduced in the meantime
(starting a new domain is done with the global lock held), leading to #1389.

QubesOS/qubes-issues#1389
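
And a sketch of the ordering fix above, again with hypothetical stand-in functions: the domain list is read only after the global lock is taken, so a domain started in the meantime (which happens with the lock held) cannot be missed by the balancing pass.

import threading

global_lock = threading.Lock()

def get_domain_list():
    return []              # stand-in for querying the hypervisor

def balance(domains):
    pass                   # stand-in for the actual memory rebalancing

def do_balance():
    with global_lock:
        domains = get_domain_list()   # refreshed while holding the lock
        balance(domains)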

marmarek commented Jan 15, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.11-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

marmarek commented Feb 8, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.11-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

Rudd-O commented Feb 10, 2016

I tried to get the updates but it no workie. :-(

Rudd-O commented Feb 10, 2016

Wait, the package is already installed. Nice. I'm a klutz.

marmarek added a commit to marmarek/qubes-core-admin-linux that referenced this issue Feb 23, 2016

qrexec: add timeout for data vchan connection
When qrexec-agent crashes for any reason (for example
QubesOS/qubes-issues#1389), it will never connect back and qrexec-client
will wait forever. In the worst case this may happen while holding the
qubes.xml write lock (in the case of DispVM startup), effectively locking
the whole system.

Fixes QubesOS/qubes-issues#1636
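
The timeout pattern, sketched in Python for illustration only (the real qrexec-client is C and uses vchan rather than sockets; the timeout value here is arbitrary):

import socket

DATA_CONNECTION_TIMEOUT = 60   # seconds, arbitrary for the sketch

def wait_for_agent_connection(listening: socket.socket) -> socket.socket:
    listening.settimeout(DATA_CONNECTION_TIMEOUT)
    try:
        conn, _addr = listening.accept()
        return conn
    except socket.timeout:
        # without a bound here, a crashed agent hangs the client forever,
        # possibly while holding the qubes.xml write lock
        raise RuntimeError("timeout waiting for qrexec agent data connection")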

marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Feb 29, 2016
