Use less aggressive garbage collection #3045

Merged
merged 1 commit into from Jul 5, 2014

Conversation

kmike
Contributor

@kmike kmike commented May 6, 2014

For me it fixes #3044.

@kmike kmike changed the title Use less aggressive garbage collection. Use less aggressive garbage collection May 6, 2014
@tacaswell
Member

Does this actually garbage collect the mpl objects? The artists, axes, and figures tend to end up with circular references.

@kmike
Contributor Author

kmike commented May 6, 2014

This is what happens in current matplotlib: http://nbviewer.ipython.org/urls/gist.githubusercontent.com/kmike/93102e3fdf75dfc29631/raw/matplotlib-gc3.ipynb

This is the same example with no gc.collect() in matplotlib: http://nbviewer.ipython.org/urls/gist.githubusercontent.com/kmike/93102e3fdf75dfc29631/raw/matplotlib-gc2.ipynb

This is an example with gc.collect(1) in matplotlib: http://nbviewer.ipython.org/gist/kmike/93102e3fdf75dfc29631/matplotlib-gc.ipynb (note that subsequent gc.collect doesn't collect anything new).

So yes, I think gc.collect(1) collects some mpl objects. But note that the number of objects increases by about 5k after each plotting call even if gc.collect() is called. This is what happens:

In [28]:

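import gc
from collections import Counter

# Count live, gc-tracked objects whose class name mentions matplotlib.
cnt = Counter(
    cls for cls in
    [getattr(obj, '__class__', type(obj)) for obj in gc.get_objects()]
    if 'matplotlib' in str(cls)
)
cnt.most_common(20)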

Out[28]:

[(matplotlib.path.Path, 486),
 (matplotlib.transforms.Bbox, 382),
 (matplotlib.font_manager.FontEntry, 359),
 (matplotlib.transforms.CompositeGenericTransform, 324),
 (matplotlib.transforms.Affine2D, 320),
 (matplotlib.lines.Line2D, 216),
 (matplotlib.markers.MarkerStyle, 216),
 (matplotlib.transforms.IdentityTransform, 193),
 (matplotlib.text.Text, 172),
 (matplotlib.font_manager.FontProperties, 164),
 (matplotlib.transforms.BboxTransformTo, 160),
 (matplotlib.colors.LinearSegmentedColormap, 144),
 (matplotlib.transforms.TransformedBbox, 116),
 (matplotlib.transforms.TransformedPath, 104),
 (matplotlib.transforms.ScaledTranslation, 52),
 (matplotlib.patches.Rectangle, 48),
 (matplotlib.axis.XTick, 40),
 (matplotlib.axis.YTick, 32),
 (matplotlib.cbook.maxdict, 21),
 (matplotlib.cbook.CallbackRegistry, 20)]

Numbers increase after each "hist" call followed by full gc.collect(). So it seems gc.collect() doesn't really help, and it can be very slow. It collects some items, and gc.collect(1) seems to collect the same items.

There are a lot of moving parts (IPython, matplotlib, gc, various plotting methods, etc.), so I may be missing something and the analysis could be incorrect. But running a full garbage collection that the user can't control is a bad decision IMHO. If leftovers from matplotlib are an issue, it is always possible to run gc.collect() manually and get the same results.

gc.collect(1) is a compromise that seems to fix most of the problems gc.collect() fixes, without its huge overhead in the presence of long-lived objects.
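
The per-class counting above can be turned into a before/after comparison, which makes growth between plotting calls easier to spot. This is a minimal sketch, not code from the PR: `count_tracked` and `growth` are hypothetical helper names, and the plotting call is left as a placeholder.

```python
import gc
from collections import Counter


def count_tracked(module_prefix):
    """Count gc-tracked objects whose class is defined under module_prefix."""
    counts = Counter()
    for obj in gc.get_objects():
        cls = type(obj)
        module = getattr(cls, "__module__", None)
        if isinstance(module, str) and module.startswith(module_prefix):
            counts[f"{module}.{cls.__qualname__}"] += 1
    return counts


def growth(before, after):
    """Return only the classes whose instance count went up."""
    # Counter subtraction drops zero and negative results.
    return after - before


# Usage: snapshot, run the code under test, collect, snapshot again.
before = count_tracked("matplotlib")
# ... a plotting call would go here (placeholder) ...
gc.collect()
after = count_tracked("matplotlib")
print(growth(before, after).most_common(20))
```

Only objects tracked by the cycle collector show up in `gc.get_objects()`, so this counts containers and class instances, not every allocation.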

@tacaswell tacaswell added this to the v1.4.0 milestone May 7, 2014
@tacaswell
Member

Thank you for digging down into this.

This looks fine to me, but I think @mdboom or @efiring should take a look at it as well.

@mdboom
Member

mdboom commented May 7, 2014

@kmike: In your examples, can you clear the figure? An internal reference is still held to it in these examples... That should, at least by design, result in no leaks from the matplotlib side (though IPython may still hold references to the results of each cell). You might also try this testing outside of IPython to remove that as a factor when testing.

@kmike
Contributor Author

kmike commented May 7, 2014

@mdboom I can try to do it, but are you suggesting we keep gc.collect() if it collects more objects than gc.collect(1)? The disadvantage of gc.collect() is that it checks all objects in memory, which is potentially costly, while gc.collect(1) is bounded.

@mdboom
Member

mdboom commented May 7, 2014

clf() should leave no remaining objects around, at least by design. I haven't investigated in some time whether that's still the case. If making this change keeps more matplotlib objects around then it is a no-go. A speed penalty is always better than an unbounded memory leak. If that's the case, we may need to find another solution to the speed problem -- by introducing more weakrefs where appropriate or refactoring the code to reduce the number of cyclical references.
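
To illustrate the weakref idea in general terms: if a child holds only a weak back-reference to its parent, no cycle is formed and plain reference counting frees the parent as soon as the last outside reference goes away. The sketch below uses toy classes (`Figure`, `StrongArtist`, `WeakArtist` are illustrative stand-ins, not matplotlib's classes) and assumes CPython's reference counting.

```python
import gc
import weakref


class Figure:
    """Toy container; a stand-in, not matplotlib's Figure."""

    def __init__(self):
        self.children = []


class StrongArtist:
    def __init__(self, figure):
        self.figure = figure                 # strong back-reference: forms a cycle
        figure.children.append(self)


class WeakArtist:
    def __init__(self, figure):
        self._figure = weakref.ref(figure)   # weak back-reference: no cycle
        figure.children.append(self)

    @property
    def figure(self):
        return self._figure()                # None once the figure is gone


def freed_without_gc(artist_cls):
    """Return True if dropping the last reference frees the figure at once."""
    was_enabled = gc.isenabled()
    gc.disable()                             # rule out automatic cycle collection
    try:
        fig = Figure()
        artist_cls(fig)
        probe = weakref.ref(fig)
        del fig
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()                         # clean up any cycle left behind


print("strong back-reference freed immediately:", freed_without_gc(StrongArtist))
print("weak back-reference freed immediately:", freed_without_gc(WeakArtist))
```

The cost of this design is that every access to the parent has to handle the case where it has already been freed.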

@tacaswell
Member

My understanding is that the speed cost is due to large numbers of user objects making gc.collect() take a long time.

@WeatherGod
Member

Yeah, that is my understanding too. I think the OP makes a good point. Calling clf() has side-effects (mostly benign, but still real). Unfortunately, I am not savvy enough on the gc module to understand the implications of calling collect(1). The documentation merely refers to "generations", but I have no clue what that means.


@efiring
Member

efiring commented May 7, 2014

The docstring for gc.set_threshold() "explains" the generations and the collection scheme, but that still leaves me with a less-than-complete understanding. My interpretation is that collect(0) will look at objects that have not been checked previously; collect(1) will look at objects that have been checked exactly once; collect(2) at objects that have been checked twice or more; and collect() will look at all objects, by going through each of the three generation lists. It is not clear whether collect(1) actually operates on the generation 0 list and the generation 1 list, or only the latter. If only the latter, it would not seem to be very useful when used in isolation.
The thresholds that I see are 700, 10, and 10, meaning collect(0) is run automatically only when there have been 700 more allocations than deallocations, irrespective of the actual amount of memory involved. Allocations of some primitive objects are not counted.
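
The thresholds and the current per-generation counters can be inspected directly with the standard `gc` module. A small sketch follows; the actual numbers depend on the Python version and on what the process has already allocated, so no output is shown.

```python
import gc

# (threshold0, threshold1, threshold2): threshold0 is compared against the
# net number of tracked allocations since the last generation-0 collection.
thresholds = gc.get_threshold()

# Current counters for the three generations.
counts = gc.get_count()

print("thresholds:", thresholds)
print("counts:    ", counts)

# The thresholds can be changed at runtime; here they are simply re-applied.
gc.set_threshold(*thresholds)
```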

@kmike
Contributor Author

kmike commented May 7, 2014

gc.collect(1) checks generations <= 1.
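
A rough way to see this is to watch a reference cycle through a weak reference. The sketch below is illustrative only (`Node` and the helper functions are made-up names). The second function reports what happens to a cycle that has already survived a full collection. On CPython releases with the classic three-generation collector, such a cycle is expected to outlive `gc.collect(1)`, but that behaviour is version-dependent, so it is printed rather than asserted.

```python
import gc
import weakref


class Node:
    """Minimal object that can take part in a reference cycle."""


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a
    return a, b


def young_cycle_collected_by_collect_1():
    """A freshly created, unreachable cycle should be found by collect(1)."""
    was_enabled = gc.isenabled()
    gc.disable()                      # keep automatic collections out of the way
    try:
        a, b = make_cycle()
        probe = weakref.ref(a)
        del a, b                      # now only the cycle keeps the nodes alive
        gc.collect(1)
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()


def old_cycle_status():
    """Return (alive after collect(1), alive after collect()) for an aged cycle."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        a, b = make_cycle()
        probe = weakref.ref(a)
        gc.collect()                  # still referenced: survives and is aged
        del a, b
        gc.collect(1)
        alive_after_partial = probe() is not None
        gc.collect()
        alive_after_full = probe() is not None
        return alive_after_partial, alive_after_full
    finally:
        if was_enabled:
            gc.enable()


print("young cycle collected by collect(1):", young_cycle_collected_by_collect_1())
print("aged cycle (alive after collect(1), alive after collect()):", old_cycle_status())
```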

I'd even remove gc.collect altogether; gc.collect(1) is just to be a bit conservative.

It is true that an unbounded memory leak is worse than a speed penalty, but right now we have an O(N) speed penalty where N is the number of live user objects, not the number of matplotlib objects. And the leak is not really a leak, because the objects will eventually be collected; in the worst case users can call gc.collect() themselves to make this happen sooner. About the speed penalty they can do nothing.

A "leak" is kilobytes of temporarily allocated memory per chart (maybe megabytes in pathological cases); the speed penalty can be seconds of wait time for each executed IPython cell (see ipython/ipython#5795).
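
The cost argument can be checked with a small timing sketch. It assumes that a list of small lists is a fair stand-in for "a huge number of long-lived user objects"; the helper names are hypothetical and the timings will vary by machine and Python version, so none are quoted here.

```python
import gc
import time


def time_call(func, *args):
    """Return the wall-clock seconds taken by func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def compare_collect_cost(n_objects):
    """Time gc.collect(1) and gc.collect() with n_objects long-lived containers."""
    # Lists are tracked by the cycle collector, so each one adds work
    # to a full collection even though none of them is garbage.
    user_objects = [[i] for i in range(n_objects)]
    gc.collect()                       # age the user objects first
    partial = time_call(gc.collect, 1)
    full = time_call(gc.collect)
    del user_objects
    return partial, full


for n in (10_000, 200_000):
    partial, full = compare_collect_cost(n)
    print(f"{n:>8} user objects: collect(1) {partial * 1e3:.2f} ms, "
          f"collect() {full * 1e3:.2f} ms")
```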

@tacaswell
Member

I am in favor of merging this.

I think this explains some issues I was having with my code at the end of grad school but due to the need to graduate didn't take the time to track down.

@tacaswell
Member

@mdboom @efiring What do you want to do about this? I am in favor of merging.

@efiring
Member

efiring commented Jun 27, 2014

Responding to @mdboom's last comment: Anything we can do to ensure prompt release of memory is a good move in general, but it doesn't address the OP's problem, which is that gc.collect() can be damagingly slow if there is a huge number of user objects, regardless of whether any of them are actually collectible. I think that in the OP's case, these objects are not created by the plotting, so they are not under our control.

The problem addressed by this PR can be quite bad; I am in favor of giving it a try. It would certainly be good to have a clearer understanding of when, if ever in practice, it would lead to troublesome increases in memory consumption. My impression is that this should be very rare, so the tradeoff is worthwhile.

tacaswell added a commit that referenced this pull request Jul 5, 2014
Use less aggressive garbage collection
@tacaswell tacaswell merged commit b4a678a into matplotlib:master Jul 5, 2014
tacaswell added a commit to tacaswell/matplotlib that referenced this pull request Aug 23, 2022
Matplotlib has a large number of circular references (between figure and
manager, between axes and figure, axes and artist, figure and canvas, and ...)
so when the user drops their last reference to a `Figure` (and clears it from
pyplot's state), the objects will not be immediately deleted.

To account for this we have long (goes back to
e34a333 the "reorganize code" commit in 2004
which is the end of history for much of the code) had a `gc.collect()` in the
close logic in order to promptly clean up after ourselves.

However, unconditionally calling `gc.collect` can be a major performance
issue (see matplotlib#3044 and
matplotlib#3045) because if there are a
large number of long-lived user objects Python will spend a lot of time
checking objects that are never going away.

Instead of doing a full collection we switched to clearing out the lowest two
generations.  However, this both failed to do what we want (as most of our
objects will actually survive) and, by clearing out the first generation,
opened us up to unbounded memory usage.

In cases with a very tight loop between creating the figure and destroying
it (e.g. `plt.figure(); plt.close()`) the first generation will never grow
large enough for Python to consider running the collection on the higher
generations.  This will lead to unbounded memory usage as the long-lived
objects are never re-considered to look for reference cycles and hence are
never deleted because their reference counts will never go to zero.

closes matplotlib#23701
melissawm pushed a commit to melissawm/matplotlib that referenced this pull request Dec 19, 2022
Development

Successfully merging this pull request may close these issues.

matplotlib shouldn't call gc.collect()
6 participants