macosx backend slowdown with 1.2.0 #1563

Closed
jwoillez opened this issue Dec 5, 2012 · 45 comments

@jwoillez
Contributor

jwoillez commented Dec 5, 2012

I've noticed a major slowdown of the macosx backend between versions 1.1.1 and 1.2.0 when using pylab.show(). The following code times the call and gives very different results between the two versions:
dt ~ 454 pts/s for 1.2.0
dt ~ 168,480 pts/s for 1.1.1
That is a difference of more than two orders of magnitude. Can others test and confirm? The results did not change much when using TkAgg.

import matplotlib
#matplotlib.use("TkAgg")
matplotlib.use("MacOSX")
import pylab
import numpy
import time

N=10000

fig, axarr = pylab.subplots(1,1)
axarr.plot(numpy.random.normal(size=N))

matplotlib.interactive(True)

t0 = time.clock()
pylab.show()
t1 = time.clock()
ptPerSecond = 1.0*N/(t1-t0)

print("dt=%.3f points/s"%ptPerSecond)
@mdehoon
Contributor

mdehoon commented Dec 5, 2012

This script will not give you an accurate time measurement.
The time spent by this script consists of two parts: The time spent executing each of the commands in this script, and the time spent actually drawing the figure. The latter is done from the event loop, and takes most of the time by far.
Now if the event loop kicks in after you call t1 = time.clock(), then the reported time is the time spent running these commands, and not the time spent drawing the figure.
If the event loop kicks in as part of pylab.show(), the reported time includes the time spent drawing the figure, which is much longer. So depending on where the event loop starts to run, you get a very different answer. However, the wall-clock time between starting the script and seeing the output on your screen will be the same.
I believe that there was some change between 1.1 and 1.2 in how pylab.show() acts depending on whether matplotlib is interactive or not, but I don't remember the details now. Anyway, that may explain why you got a different timing between 1.1 and 1.2.
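One way to sidestep the event-loop ambiguity might be to time an explicit draw rather than show(). This is only a sketch of my own (not how the numbers above were obtained), using wall-clock time around canvas.draw():

# Sketch (not the original benchmark): time an explicit draw instead of show(),
# so the measurement does not depend on when the event loop starts running.
import time
import numpy
import matplotlib
matplotlib.use("MacOSX")
import matplotlib.pyplot as plt

N = 10000
fig, ax = plt.subplots()
ax.plot(numpy.random.normal(size=N))

t0 = time.time()          # wall-clock time, not CPU time
fig.canvas.draw()         # render the figure now
t1 = time.time()
print("draw: %.3f s (%.0f points/s)" % (t1 - t0, N / (t1 - t0)))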

@pelson
Member

pelson commented Dec 5, 2012

This script will not give you an accurate time measurement.

Agreed. Would you be able to provide the complete runtime @jwoillez?

@jwoillez
Contributor Author

jwoillez commented Dec 5, 2012

Not sure how I can time things any other way in non-interactive mode, since show() blocks. For now I can report no noticeable time difference up to show(), whereas it takes a fraction of a second for the window with the figure to appear in 1.1.1 and about 20 seconds in 1.2, for 10000 points. That delay increases with the number of points. If you can suggest a more accurate script, I will be happy to run it.

@dmcdougall
Member

Would it be better to put t0 = time.clock() at the top of the script?

@efiring
Member

efiring commented Dec 5, 2012

On 2012/12/05 7:14 AM, Julien Woillez wrote:

Not sure how I can time things any other way in non-interactive mode, since show() blocks. For now I can report no noticeable time difference up to show(), whereas it takes a fraction of a second for the window with the figure to appear in 1.1.1 and about 20 seconds in 1.2, for 10000 points. That delay increases with the number of points. If you can suggest a more accurate script, I will be happy to run it.

You are right, this problem does not require any fancy timing to see it. With current mpl, the macosx backend is simply unusable for plotting 10,000 points. On the same hardware, with the tkagg backend, it is almost instantaneous. This can be verified interactively with "ipython --pylab" versus "ipython --pylab=tk". I have not tried with an earlier mpl version.

@efiring
Member

efiring commented Dec 5, 2012

@jwoillez the next step is to use bisection to track down the commit that caused the slowdown. Would you be able to do this?

@dmcdougall
Member

Quoting @efiring above:

You are right, this problem does not require any fancy timing to see it. With current mpl, the macosx backend is simply unusable for plotting 10,000 points. On the same hardware, with the tkagg backend, it is almost instantaneous.

I don't see this problem. The tk backend has never worked for me so I'm using qt4agg instead. The macosx backend is quicker than the qt4 backend. Noticeably so.

@mdboom
Member

mdboom commented Dec 5, 2012

It might be helpful to track down whether path simplification is being performed.
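A quick way to check this from the user side might be the sketch below. The rcParams names are real; whether the macosx backend honors them was the open question at this point in the thread.

# Sketch: inspect the global path-simplification settings.
import matplotlib
print(matplotlib.rcParams['path.simplify'])            # True means simplification is enabled
print(matplotlib.rcParams['path.simplify_threshold'])  # how aggressively vertices are dropped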

@efiring
Member

efiring commented Dec 5, 2012

@mdboom, yes, absence of path simplification seems like a possible explanation. I haven't tried to track it down. Still, a delay of 20 s or so to plot 10,000 points seems excessive on an i7 even without path simplification. Something else seems to be going on here.

@dmcdougall, with "ipython --pylab=qt" I get the same instantaneous response as with "--pylab=tk", on Mountain Lion. But with plain "ipython --pylab", so it is using the macosx backend, it takes 23 seconds, roughly timed with a watch. The processor is doing something the whole time--the machine is drawing about 32 watts instead of its idle value of 18 or so. All I am doing is starting ipython, executing "plot(np.random.randn(10000))", and timing from when I hit return to when the plot appears. The ipython prompt returns immediately, and the plot window is created; the delay is in waiting for the actual plot to appear in the window. The same delay occurs with any pan attempt.

@dmcdougall
Member

@efiring I have tried both ipython --pylab (with backend: macosx in my matplotlibrc file) and I have also tried ipython --pylab=osx. Both are instantaneous on my 15" retina mbp, with OS X 10.8.

@pelson
Member

pelson commented Dec 5, 2012

> python support/slow_osx.py -dmacosx
1.3.x MacOSX
dt=4212.264 points/s
> python support/slow_osx.py -dmacosx
1.3.x MacOSX
dt=4024.404 points/s
> python support/slow_osx.py -dmacosx
1.3.x MacOSX
dt=4131.463 points/s


> python support/slow_osx.py -dmacosx
1.1.1 MacOSX
dt=7459.013 points/s
> python support/slow_osx.py -dmacosx
1.1.1 MacOSX
dt=7483.126 points/s
> python support/slow_osx.py -dmacosx
1.1.1 MacOSX
dt=7757.109 points/s
> python support/slow_osx.py -dmacosx
1.1.1 MacOSX
dt=7815.186 points/s

Crude measurements with 1000 pts confirm this.

@efiring
Member

efiring commented Dec 6, 2012

Bisection points to commit 51611c8 as the first that triggers the slowdown. Looking at that commit, however, I have no idea how it could cause a slowdown specifically for plotting a large number of points.

@mdehoon
Contributor

mdehoon commented Dec 6, 2012

Trying on a different Mac, I was able to replicate the slowdown with the current matplotlib in github compared to 1.1.1. It seems to be an actual problem with Quartz itself, as the slowdown depends on the exact line width of the curve being drawn. If I set the line width to 1.0 with CGContextSetLineWidth just before the call to CGContextStrokePath, the drawing is fast. If I set it to 1.00001, it is slow.
Under matplotlib 1.1.1, the line width was exactly equal to 1.0, so it is fast.
Under the current matplotlib, in gc.set_linewidth I scale the line width by dpi/72.0. With the dpi being equal to 80, the line width will be 1.111111, and drawing is slow.
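For illustration, a sketch restating the arithmetic just described:

# Sketch of the scaling described above: a 1.0 pt line at the default 80 dpi
# ends up with a non-unit device line width, which is the slow case in Quartz.
lw_points = 1.0
dpi = 80
lw_device = lw_points * dpi / 72.0
print(lw_device)   # 1.111..., i.e. not exactly 1.0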

Actually, in gc.set_linewidth, should we scale the line width by dpi/72.0? Or should we use the line width as is?

Also, could you guys run jwoillez' script and report back whether the script is painfully slow (by measuring the time between starting the script and seeing the output on screen), and which version of Mac OS X you are using? I don't remember seeing any slowdown on Mac OS X 10.8 with jwoillez' script, only with Mac OS X 10.5, so this may occur only for older Macs.

@efiring
Member

efiring commented Dec 6, 2012

@mdehoon, no need to run the script. I have verified that within my ipython session, plotting 10000 points, I get a slowdown (20 seconds or so) with figure.dpi = 80 (the default), and fast plotting (essentially instantaneous) with figure.dpi = 72. This is with the same kind of recent machine as @dmcdougall has, OS X 10.8.

Line width in mpl is in points, so assuming Quartz is working with line width in dots, the present scaling is correct. It looks like this is a major Quartz limitation that we are stuck with--slow plotting when the line width is not an integer number of dots. I suppose we could simply round it to an integer number of dots, with a minimum of zero. This probably would be a good move, since it would solve the speed problem, and I suspect it would have only a minor and acceptable effect on the screen display quality.

As a separate issue, @mdboom asked whether path simplification is used in this backend. I think the answer is "no". @mdehoon, is this correct, and if so, could path simplification be added?

@mdehoon
Contributor

mdehoon commented Dec 6, 2012

Path simplification is used in this backend. If I switch off path simplification, the script takes almost twice as long.

Some more testing revealed that Quartz is slow if the line width is greater than 1. It doesn't matter if it is an integer or not.

But it also turns out that drawing 10000 points as 100 x 100 points is much faster than drawing 10000 points at once. So we can speed up the Mac OS X backend that way. We'd have to be careful though to make sure the end result is exactly the same, in particular at the end points if alpha!=1.
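A rough user-level illustration of the splitting idea follows; this is only an approximation of my own (the real fix would live inside the backend), and the one-point overlap between segments is exactly where the end-point concern for alpha != 1 shows up, since overlapping points get stroked twice.

# Sketch: plot 10000 points as many short overlapping segments instead of one path.
import numpy
import matplotlib.pyplot as plt

N, chunk = 10000, 100
x = numpy.arange(N)
y = numpy.random.normal(size=N)

fig, ax = plt.subplots()
for start in range(0, N - 1, chunk):
    stop = min(start + chunk + 1, N)   # overlap by one point so the line stays continuous
    ax.plot(x[start:stop], y[start:stop], color='b', linewidth=1.0)
plt.show()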

@mdehoon
Contributor

mdehoon commented Dec 6, 2012

By the way, I won't be able to implement any fixes for this issue myself any time soon, so if somebody else wants to give it a try, please go ahead.

@efiring
Member

efiring commented Dec 6, 2012

Given this rather horrible behavior by Quartz, would it make sense to use Agg for the rendering instead? I realize this would be a big change, and I am in no position to contribute to it. Apart from the cost of making the change, what is the advantage of Quartz rendering?

@efiring
Member

efiring commented Dec 6, 2012

An unrelated problem with the macosx backend is that it doesn't respond continuously to pan/zoom the way the other backends do. Is this an inherent limitation? Has it already been reported?

@mdehoon
Contributor

mdehoon commented Dec 7, 2012

At this point, I would not use Agg for the rendering instead, as there are more straightforward options to explore first. My first step would be to get some feedback from the Apple developers to see if there is a simple way to get better performance. If not, simply drawing long paths as multiple shorter paths is the simplest solution. Or we could try QuickDraw instead of Quartz rendering. Also I haven't tried if switching off anti-aliasing for long paths makes a difference.

With regard to the macosx backend not responding to panning and zooming, let's open a separate issue for that, if it has not been reported yet.

@efiring
Member

efiring commented Dec 7, 2012

I don't think QuickDraw is an option. According to http://en.wikipedia.org/wiki/QuickDraw it has been deprecated since OSX 10.4.

@mdehoon
Contributor

mdehoon commented Dec 7, 2012

If I use SNAP_TRUE instead of SNAP_AUTO in the call to get_path_iterator in GraphicsContext_draw_path, I get much better (near-instantaneous) performance. Are there any other (better?) ways to tweak the call to get_path_iterator? If not, I suggest that we simply use SNAP_TRUE instead of SNAP_AUTO for long paths.
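The change proposed here is inside the C code (GraphicsContext_draw_path). For experimentation only, a user-level analogue might be the artist-level snap hint; Artist.set_snap is a real method, but whether the macosx backend honored it at the time is an assumption on my part.

# Sketch: force the snap hint on a single line and see whether drawing speeds up.
import numpy
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
line, = ax.plot(numpy.random.normal(size=10000))
line.set_snap(True)   # True = snap vertices to pixel centers, None = auto, False = never
plt.show()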

@WeatherGod
Member

Is the slow-down still gone if you have SNAP_TRUE and thicker linewidths?

@mdehoon
Contributor

mdehoon commented Dec 7, 2012

A linewidth of 1 is fastest, but a linewidth of 10 still gives an acceptable speed.

@mdboom
Member

mdboom commented Dec 7, 2012

But isn't the drawing less accurate with snapping on? Snapping is only intended for rectilinear (i.e. axis-aligned) lines, and the AUTO mode first does a test to determine if the path falls into that category. And the testing is turned off when the path has more than 1024 points, so it shouldn't be triggered in this case. It would be good to see some images with snapping on and off to determine if we're not losing quality there.

@WeatherGod
Member

I don't think the issue is that the auto-detection of path snapping is eating up cycles. I think the issue is that the quartz renderer is eating up cycles when given non-snapped points.

@mdehoon
Contributor

mdehoon commented Dec 8, 2012

Below are three screen shots.
The first figure (current.png) shows the current (slow) behavior.
The second figure (snap_true.png) shows the output if I set SNAP_TRUE in get_path_iterator in GraphicsContext_draw_path.
The third figure (simplify_threshold_tenfold.png) shows the output if I keep SNAP_AUTO, but I multiply m_path_iter.simplify_threshold() by 10 in PathCleanupIterator in src/path_cleanup.cpp.

All three look very similar to me, but current and simplify_threshold_tenfold look a bit closer to each other. Perhaps we could simplify paths more when they contain many points?

current.png
snap_true.png
simplify_threshold_tenfold.png
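For reference, the user-level knob closest to the third experiment is the rcParam below. This is only a sketch; the test above changed the threshold inside src/path_cleanup.cpp directly, not through rcParams.

# Sketch: raise the global simplification threshold before plotting.
import matplotlib
matplotlib.rcParams['path.simplify'] = True
matplotlib.rcParams['path.simplify_threshold'] = 1.0   # default is much smaller (roughly 0.1)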

@dmcdougall
Member

@mdehoon On my retina screen, the second one looks crispest to me, then again it might just be late and I'm looking at it with tired eyes.

@efiring What do you think?

@efiring
Member

efiring commented Dec 8, 2012

On 2012/12/07 6:16 PM, Damon McDougall wrote:

@mdehoon On my retina screen, the second one looks crispest to me, then again it might just be late and I'm looking at it with tired eyes.

@efiring What do you think?

The question is one of accuracy, not crispness. Snapping makes it crisp, at the cost of accuracy.

@dmcdougall
Member

So, 1) is unacceptably slow but very accurate. 2) is crisp but accurate only to within a pixel. 3) retains sub-pixel accuracy, looks similar to 1) but discards data for the purposes of speed.

How do we objectively decide which option is the most appropriate?

@mdehoon
Contributor

mdehoon commented Dec 9, 2012

I tried a different approach, which is to divide the paths into subpaths and draw each of them separately, which is much faster. This seems the cleanest solution to me. I'll do some more testing to make sure it doesn't cause problems in other types of plots.

I noticed that CLOSEPOLY is defined differently in lib/matplotlib/path.py (CLOSEPOLY = 0x4f) from src/_backend_agg.h (CLOSEPOLY = 5). In src/_backend_agg.h it says that these constants should be kept in sync with path.py. Perhaps it is better if we put these definitions in a separate .h file, and make them available to path.py via src/path_cleanup.cpp? I am asking since I would like to add EMPTY to these to signify an empty path. Then in src/_macosx.m I can distinguish more easily between an empty path, a partially finished path, and a finished path (which is what I need to draw parts of the path separately).
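For reference, the Python-side constants under discussion can be inspected directly; this is only a sketch, and the C side in src/_backend_agg.h keeps its own copy of these values.

# Sketch: print the vertex-code constants defined in lib/matplotlib/path.py.
from matplotlib.path import Path
print(Path.STOP, Path.MOVETO, Path.LINETO, Path.CURVE3, Path.CURVE4, Path.CLOSEPOLY)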

Also, can somebody merge #1562?

@dmcdougall
Member

Also, can somebody merge #1562?

Done.

@dmcdougall
Member

I tried a different approach, which is to divide the paths into subpaths and draw each of them separately, which is much faster. This seems the cleanest solution to me. I'll do some more testing to make sure it doesn't cause problems in other types of plots.

It's interesting that it's faster if you break up the path. It might be that CoreGraphics is multithreading the processing.

@mdehoon
Contributor

mdehoon commented Dec 9, 2012

As far as I know it is not due to multithreading, but it has to do with CoreGraphics needing to perform more calculations for longer paths to find out if they overlap or self-intersect.

@dmcdougall
Member

Ahhh, interesting. I wonder what speed difference, if any, there is between passing one huge line with lots of points (say, 10^4) and 10^4 - 1 lines of two points each. If they're comparable, finding the 'sweet spot' might be beneficial; a crude timing sketch follows below.

Also, on the topic of multithreading: since most Macs have at least two cores now, I wonder if we can utilise Grand Central Dispatch to ship out parts of the line to various threads. The main things I'd be concerned about there are whether the drawing is guaranteed to end before the event loop does, and ensuring that shipping out parts of the drawing doesn't implicitly change the z-order. That's probably beyond the scope of this issue -- I just wanted to think out loud.
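The following sketch is one way to look for that sweet spot; it is an assumption of mine, timing only an explicit canvas.draw(), and the absolute numbers will vary by machine and backend.

# Sketch of a crude sweet-spot search: time a draw for several chunk sizes,
# down to chunk=1 (i.e. 10^4 - 1 two-point lines).
import time
import numpy
import matplotlib.pyplot as plt

N = 10000
x = numpy.arange(N)
y = numpy.random.normal(size=N)

for chunk in (N, 1000, 100, 10, 1):
    fig, ax = plt.subplots()
    for start in range(0, N - 1, chunk):
        stop = min(start + chunk + 1, N)
        ax.plot(x[start:stop], y[start:stop], color='b')
    t0 = time.time()
    fig.canvas.draw()
    print("chunk=%5d: %.3f s" % (chunk, time.time() - t0))
    plt.close(fig)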

@mdboom
Member

mdboom commented Dec 10, 2012

The downside of splitting the path comes when alpha blending is in play (which I think @mdehoon already pointed out). Maybe it makes sense to turn off splitting when alpha != 1.0? It seems rather unlikely that someone would be alpha-blending a high-vertex-count line.

I think I prefer that to snapping everything, which does diminish accuracy.

And just curious -- is there anything in Apple's bug tracker or knowledge base about this? I wonder if there aren't other suggested workarounds.

@mdehoon
Contributor

mdehoon commented Dec 11, 2012

The suggestion from Apple developers was to split up the path.

I agree that it is unlikely that someone would be alpha-blending a line with many vertices. But then we don't have to switch off splitting when alpha != 1, since splitting is only done if the path has many vertices. (I have done some tests with splitting every 100 and every 1000 vertices; both are fast, though I want to do some more testing before committing this.) So for now I would prefer to always split the path if it has more than e.g. 1000 vertices, regardless of the alpha value. If some day we find a case where the results for alpha != 1 are unacceptably bad, we can reconsider.

@astrofrog
Contributor

Just pitching in to say I've had several local Python users tell me they were having issues with plotting even 10,000 points with the MacOS X backend. So for now, the solution is to recommend using a different backend?

@mdboom
Member

mdboom commented Feb 27, 2013

@astrofrog: is that with or without alpha? I think the problem here is alpha-specific. Usually when there's a slowdown, the first thing to check is whether path simplification is turned on and being applied.

@astrofrog
Contributor

I didn't check; I just noticed users trying to do plt.plot(x, y) with the defaults and having the backend basically hang with only 18,000 points. I'll try to do more diagnosing next time I see this happen.

@efiring
Member

efiring commented Feb 27, 2013

@mdboom, I don't think this is an alpha problem; at least based on this Issue history, it looks like @mdehoon's path-splitting fix never made it to the PR stage. Michiel verified earlier in this thread that path simplification is being used. The underlying problem is that Quartz is slow. (This also shows up in Preview; I have run into pdf documents with line plots that are rendered nearly instantaneously using evince on Linux, but that are unusably slow on the Mac with Preview.) I think that the path-splitting fix is likely the best way to get around it for mpl 1.3.

@astrofrog: yes, the workaround is to use one of the agg-based backends.

@mdehoon
Contributor

mdehoon commented Feb 28, 2013

Indeed I never got around to creating a pull request for this fix. Sorry for that. I can do so over the next 2 weeks.

@jwoillez
Contributor Author

@astrofrog, using "figure.dpi : 72" in my matplotlibrc worked for me (following @efiring's suggestion), as long as I am not playing with linewidth and alpha.
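The same workaround can be applied from code rather than matplotlibrc; a minimal sketch:

# Sketch: the in-code equivalent of "figure.dpi : 72" in matplotlibrc; with
# dpi == 72 the backend's dpi/72 line-width scaling stays at exactly 1.0.
import matplotlib
matplotlib.rcParams['figure.dpi'] = 72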

@mdehoon
Contributor

mdehoon commented Mar 9, 2013

See this pull request:
#1816
This avoids the slowdown by splitting paths into segments of 100 points and drawing each separately. Visually I cannot discern any difference between drawing the path in one go and drawing it segment by segment, but the latter is much, much faster.

@mdehoon
Contributor

mdehoon commented Mar 27, 2013

@mdboom or anybody else: Can the pull request #1816 be accepted? Then I can move on to the next issue.

@dmcdougall
Member

#1816 merged; closing.
