Fix memory leak in qubes manager #860

Closed
marmarek opened this Issue Mar 8, 2015 · 17 comments

marmarek commented Mar 8, 2015

Reported by marmarek on 24 May 2014 02:11 UTC
After about 3 weeks, Qubes Manager uses over 1GB of virtual memory (over 200MB resident). Its window was active (not iconified) almost all the time. Visible columns:

  • type
  • label
  • name
  • state
  • template
  • netvm
  • cpu (bar)
  • memory (bar)
  • requested mem (bar)

Migrated-From: https://wiki.qubes-os.org/ticket/860

marmarek added this to the Release 2.1 (post R2) milestone Mar 8, 2015

nvesely commented Apr 25, 2015

I'm getting 851M VIRT and 63792 RES in htop while Qubes Manager is iconified, and 923M VIRT and 68036 RES while open.

tasket commented Jun 30, 2016

On R3.2rc1, it appears to gain memory in 4KB increments, at a rate of about 56KB per minute.

I wonder if moving to python 3 would solve or improve the issue.

As a quick remedy, perhaps QM could restart itself at set intervals if there has been no user input for at least N minutes?
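A minimal sketch of that restart-on-idle idea (nothing Qubes Manager actually implements; the PyQt4 wiring, the 30-minute threshold, and restarting by re-exec'ing the process are all assumptions for illustration):

import os
import sys
from PyQt4 import QtCore

IDLE_MINUTES = 30  # stand-in for the "N minutes" threshold above

class IdleRestarter(QtCore.QObject):
    """Re-exec the process after IDLE_MINUTES without keyboard/mouse input."""

    def __init__(self, app):
        super(IdleRestarter, self).__init__()
        self._timer = QtCore.QTimer(self)
        self._timer.setInterval(IDLE_MINUTES * 60 * 1000)
        self._timer.timeout.connect(self._restart)
        app.installEventFilter(self)  # observe every event in the application
        self._timer.start()

    def eventFilter(self, obj, event):
        if event.type() in (QtCore.QEvent.KeyPress,
                            QtCore.QEvent.MouseButtonPress,
                            QtCore.QEvent.MouseMove):
            self._timer.start()  # user activity: rewind the idle countdown
        return False  # never consume the event

    def _restart(self):
        os.execv(sys.executable, [sys.executable] + sys.argv)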

tasket commented Jun 30, 2016

Also, closing the QM window appears to halt the increase, while re-opening it causes it to resume.

alex-mazzariol commented Jun 30, 2016

I profiled the memory usage of the Qubes Manager and found that the numbers of "dict" and "instancemethod" instances keep increasing. Specifically, the instancemethod objects are all
<bound method QubesWatch._retry_register_watches of <qubes.qubesutils.QubesWatch object at 0x7f2bd40da2d0>>
while the ever-growing dictionaries are all
{'cb': <bound method QubesWatch._retry_register_watches of <qubes.qubesutils.QubesWatch object at 0x7f2bd40da2d0>>, 'opaque': <libvirt.virDomain object at 0x7f2bd40da710>}
The backreference graphs for these instances are nearly useless - maybe it's just me, I'm no expert at debugging Python, and being in dom0 does not really help.
Hope this helps; let me know if I can assist further.
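The comment does not name the profiler used, but objgraph (a third-party package; it renders graphs via graphviz) produces exactly this kind of per-type count and backreference graph. A minimal session might look like this:

import gc
import objgraph

gc.collect()
objgraph.show_growth(limit=10)  # record a baseline of per-type object counts
# ... let Qubes Manager run, or trigger the suspect code path ...
gc.collect()
objgraph.show_growth(limit=10)  # print only the types that grew since the baseline

# Pick one of the leaked dicts and draw what keeps it alive
# (writes backrefs.png; requires graphviz to be installed):
leaked = objgraph.by_type('dict')[-1]
objgraph.show_backrefs([leaked], max_depth=4, filename='backrefs.png')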

marmarek commented Jul 1, 2016

The code responsible for this is here: https://github.com/QubesOS/qubes-core-admin/blob/master/core/qubesutils.py#L671-L799

But first, it should be automatically unregistered after the 0.5s timeout fires:

    def _register_watches(self, libvirt_domain):
(...)
                libvirt.virEventAddTimeout(500, self._retry_register_watches,
                                           libvirt_domain)
(...)

    def _retry_register_watches(self, timer, libvirt_domain):
        libvirt.virEventRemoveTimeout(timer)
        self._register_watches(libvirt_domain)

And second, _register_watches shouldn't be called that often - even if some connection fails each time, it should just retry later (remove one timeout object, create another one - none should be leaked). Can you track where, and for which domains, _register_watches is called that often?

alex-mazzariol commented Jul 1, 2016

I added a little logging around the various calls to libvirt, like this:

def _register_watches(self, libvirt_domain):
    if libvirt_domain:
        name = libvirt_domain.name()
        if name in self._qdb:
            return
        if not libvirt_domain.isActive():
            return
        # open separate connection to Qubes DB:
        # 1. to not confuse pull() with responses to real commands sent from
        # other threads (like read, write etc) with watch events
        # 2. to not think whether QubesDB is thread-safe (it isn't)
        try:
            self._qdb[name] = QubesDB(name)
        except Error as e:
            if e.args[0] != 2:
                raise
            try:
                with open("/tmp/qwatch_log", "a") as qwatchlog:
                    qwatchlog.write("Adding timeout for " + name + ", " + str(e) + "\n")
            except:
                pass

            libvirt.virEventAddTimeout(500, self._retry_register_watches,
                                       libvirt_domain)
            return
    else:
        name = "dom0"
        self._qdb[name] = QubesDB(name)
    try:
        with open("/tmp/qwatch_log", "a") as qwatchlog:
            qwatchlog.write("Registering for " + name + "\n")
    except:
        pass
    try:
        self._qdb[name].watch('/qubes-block-devices')
    except Error as e:
        if e.args[0] == 102: # Connection reset by peer
            # QubesDB daemon not running - most likely we've connected to
            #  stale daemon which just exited; retry later
            libvirt.virEventAddTimeout(500, self._retry_register_watches,
                                       libvirt_domain)
            return
    self._qdb_events[name] = libvirt.virEventAddHandle(
        self._qdb[name].watch_fd(),
        libvirt.VIR_EVENT_HANDLE_READABLE,
        self._qdb_handler, name)

and it logged something like this when starting qubes-manager:

Registering for netvm
Registering for firewallvm
Registering for personal
Registering for work
Registering for skype-1
Adding timeout for Domain-0, (2, 'No such file or directory')
Registering for dom0
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')
Adding timeout for Domain-0, (2, 'No such file or directory')

I don't remember ever creating a VM with that name (Domain-0), and it looks like QubesDB can't find it either. A quick find / -name Domain* in dom0 doesn't yield anything, either.
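Read together with the code above, the log is consistent with the following failure loop for dom0 (a reconstruction from the quoted code; the exact leak mechanics live inside libvirt's Python event loop):

# Reconstructed failure path for the dom0 case (annotated excerpt):
name = libvirt_domain.name()     # libvirt reports dom0 as "Domain-0"
self._qdb[name] = QubesDB(name)  # raises Error(2, 'No such file or directory')
libvirt.virEventAddTimeout(500, self._retry_register_watches, libvirt_domain)
# 500ms later _register_watches runs again for the same domain; "Domain-0"
# never ends up in self._qdb, so the early return never fires and a fresh
# timeout (with its {'cb': ..., 'opaque': ...} dict) is created every iteration.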

marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jul 1, 2016

qubeswatch: use always "dom0" name for dom0
Libvirt reports dom0 as "Domain-0". Which is incompatible with how Qubes
and libxl toolstack names it ("dom0"). So handle this as a special case.

QubesOS/qubes-issues#860

marmarek commented Jul 1, 2016

This is dom0. It looks like virsh list lists it as Domain-0.
Try this commit: marmarek/qubes-core-admin@6bc7733
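The special case boils down to mapping libvirt's name for the control domain back to "dom0" (a sketch only; the exact change is in the linked commit, and the helper name here is made up for illustration):

def _qubes_name(libvirt_domain):
    # libvirt reports the control domain as "Domain-0",
    # while QubesDB and the rest of Qubes know it as "dom0"
    if libvirt_domain is None or libvirt_domain.ID() == 0:
        return "dom0"
    return libvirt_domain.name()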

alex-mazzariol commented Jul 1, 2016

In implementing your commit I added a check that libvirt_domain is a valid object before accessing the ID() method, and nothing else.

From what I can see, two watches are registered for dom0, but this does not seem to affect anything else; there is no more loop. Qubes Manager is running; I will collect new memory statistics in a few hours and report back.

alex-mazzariol commented Jul 1, 2016

After applying the patch with the mentioned difference (the added check for libvirt_domain before accessing ID()), there is no noticeable increase in any object count. This may fix the issue.

marmarek commented Jul 1, 2016

Other events worth checking (whether they cause an increase in object count):

  • starting/stopping some VM (either from Qubes Manager, or independently)
  • changing some VM setting

alex-mazzariol commented Jul 1, 2016

From quick estimates (looking at the counts I took while investigating), some other object types increase with user actions, but not nearly as much as the incorrect handling of dom0's name did.

I will produce more exact results later, but I think you can close the issue just by fixing your commit (adding a check for libvirt_domain being null) - user interactions do not leak that much memory!

marmarek commented Jul 1, 2016

Thanks!

marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jul 1, 2016

qubeswatch: use always "dom0" name for dom0
Libvirt reports dom0 as "Domain-0". Which is incompatible with how Qubes
and libxl toolstack names it ("dom0"). So handle this as a special case.
Otherwise reconnection retries leaks event object every iteration.

Fixes QubesOS/qubes-issues#860
Thanks @alex-mazzariol for help with debugging!

alex-mazzariol commented Jul 2, 2016

I'm back with some measurements, as @marmarek asked.
I measured the objects on the heap after two forced garbage collections, after performing an action and its reverse once and then three times, looking for an increase in object counts. It's the same methodology I used for the other memory leak.
First the good news: starting and stopping a VM three times instead of once does not lead to any increase in object counts, so I think that side is OK.
Then the not-so-good news: changing settings six times (three actual modifications from the base state, and three reverts) instead of twice (one modification and one revert) does leak some objects - not a big problem, though.
The settings I changed were the template and the vcpu count for a VM, and nothing else. The last time I left the vcpu count modified, so I did not even open the "advanced" tab.
It seems there were 7 more "dict" objects and 14 more "list" objects than when I changed the settings only twice. The dicts were some PyQt4.QtGui.QMenu objects and some dicts with the keys "menu", "tray_notifier_type", "tray_notifier", "blk_menu" and "blk_manager", but I can't check the exact count.
The lists contained... numbers. I think they may have something to do with the window layout, but I cannot confirm that. A lot of them were empty lists.
I don't think these leaks will lead to actual problems.
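A minimal sketch of that comparison methodology, assuming hypothetical do_action()/undo_action() helpers standing in for whatever is being exercised (e.g. changing a VM setting and reverting it):

import gc
from collections import Counter

def heap_counts():
    # two forced collections before counting, as described above
    gc.collect()
    gc.collect()
    return Counter(type(o).__name__ for o in gc.get_objects())

def counts_after(rounds):
    for _ in range(rounds):
        do_action()    # hypothetical: e.g. change a VM setting
        undo_action()  # hypothetical: revert it
    return heap_counts()

baseline = counts_after(1)  # one change/revert round
repeated = counts_after(3)  # three more rounds on top of that
# Types whose count keeps growing with extra repetitions are leak candidates:
for type_name, delta in (repeated - baseline).most_common(10):
    print("%s +%d" % (type_name, delta))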

marmarek commented Jul 12, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.2.5-1.fc23 has been pushed to the r3.2 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

marmarek commented Jul 28, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.2.7-1.fc23 has been pushed to the r3.2 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Nov 20, 2016

qubeswatch: use always "dom0" name for dom0
Libvirt reports dom0 as "Domain-0". Which is incompatible with how Qubes
and libxl toolstack names it ("dom0"). So handle this as a special case.
Otherwise reconnection retries leaks event object every iteration.

Fixes QubesOS/qubes-issues#860
Thanks @alex-mazzariol for help with debugging!

(cherry picked from commit 7e9c816)

marmarek commented Nov 20, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.18-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

marmarek commented Dec 4, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.18-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update
