Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sprint] scatter plots are (reportedly) too slow #2156

Merged
merged 2 commits into from Sep 30, 2013

Conversation

mdboom
Copy link
Member

@mdboom mdboom commented Jul 22, 2013

@ChrisBeaumont reported in his great SciPy talk that scatter plots are too slow. We should get to the bottom of that 😉.

@ChrisBeaumont
Copy link
Contributor

This is essentially the issue discussed here -- calling plt.plot is about 10x faster than plt.scatter on my machine. I think this is because plot copies a stamp over and over in Agg, whereas scatter has to re-draw markers at every datapoint.

It would be great if matplotlib had some way of mapping size and/or color to data, without sacrificing all of that speed! I have to think there is some middle ground between the speed of plot and the flexibility of scatter.

@WeatherGod
Copy link
Member

Last year, john Hunter explained that he did work to optimize line
collections, which assume the same marker throughout a line, while I think
scatter makes patch collections, which can't? I don't exactly recall it
all, but he did say that some of the optimizations could be brought over to
scatter.

Ben
On Jun 26, 2013 6:18 PM, "Chris Beaumont" notifications@github.com wrote:

This is essentially the issue discussed herehttps://github.com/astrofrog/mpl-scatter-density/issues/2#issuecomment-2508221-- calling
plt.plot is about 10x faster than plt.scatter on my machine. I think this
is because plot renders a stamp over and over, whereas scatter has to
re-draw markers at every datapoint.

It would be great if matplotlib had some way of mapping size and/or
color to data, without sacrificing all of that speed! I have to think there
is some middle ground between the speed of plot and the flexibility of
scatter.


Reply to this email directly or view it on GitHubhttps://github.com//issues/2156#issuecomment-20087105
.

@mdboom
Copy link
Member Author

mdboom commented Jul 22, 2013

@ChrisBeaumont: I've attached some low-hanging optimizations to scatter.

I've benchmarked this using points of different sizes (5, 25, 125 pixel diameters) and different numbers of points (100, 1000, 10000, 100000 and 1000000). Here's the baseline timings:

baseline

Greater than linear growth -- probably mostly due to memory/cache pressures with the greater number of points. Here's the relative speedups with this patch.

combined

Three separate optimizations were found:

  1. zip was being used to combine the X and Y coordinates. np.dstack is obviously much better.

dstack

  1. When the points are all the same size, shape and color (etc...), it falls back to draw_markers, which draws the shape once as scanlines and then draws from those scanlines multiple times.

blit

  1. The transformation for each scatter point was being stored in a list of Transform objects. Storing the transformation matrices in a Nx3x3 array is much more efficient both to create and run through.

array_matrices

Beyond this, I haven't found much other low-hanging fruit that doesn't involve changing the output. I'll save that report for another issue.

@mdboom
Copy link
Member Author

mdboom commented Jul 22, 2013

The benchmark, FYI.

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from numpy import random
import io
import time
import sys

npoints = int(sys.argv[1])
size = int(sys.argv[2])

np.random.seed(0)
xy = np.random.rand(npoints, 2)
if size > 0:
    sizes = np.power(np.random.rand(npoints) * float(size), 2.0)
else:
    sizes = -size

s = time.time()
x = plt.scatter(xy[:, 0], xy[:, 1], sizes)
plt.savefig(io.BytesIO())
print(time.time() - s)

@ChrisBeaumont: I'd appreciate some testing in the context of the kind of data you're seeing, just to see how it compares to this completely random and synthetic benchmark.

width + scanlines.max_x() + 1.0,
height + scanlines.max_y() + 1.0);
height - scanlines.min_y() + 1.0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what is the purpose of the clipping change in _backend_agg.cpp? Was that a bug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. It was a bug where it would avoid drawing some legitimate things.

@ChrisBeaumont
Copy link
Contributor

Nice! I'll play around with this.

@ChrisBeaumont
Copy link
Contributor

My benchmarks are similar: before this PR:
benchmark_old

And after (note the different yaxis scaling):
benchmark

The comparison with plot is with `plot(x, y, 'o', ms=4), which produces output quite similar to default scatter.

Unfortunately, the biggest speedup in this PR (the blitting) essentially replicates what plot can do already. The compelling functionality of scatter is, IMO, the ability to map color and/or size onto data. I can envision two "medium"-hanging fruit optimizations, that might push this kind of functionality into the 10^5-6 points range:

draw_markers could accept an optional color array for each marker. I'm not positive about the pixel blending math, but I suspect it's possible to pre-render a stamp and then blit it with different tints each time? If so, this would speed up calls to scatter with constant sizes but variable colors.

Sometimes the number of combinations of colors/sizes/markers will be small, especially if they are discretized first. In these situations, scatter would benefit from caching the markers, and using the same blitting trick in draw_markers.

Do you think either is feasible? I could take a crack at them in a new PR, to earn my matplotlib/C++ merit badge

@mdboom
Copy link
Member Author

mdboom commented Jul 22, 2013

I think handling multiple colors in draw_markers should be possible. It probably requires implementing a custom blender for agg, though. Sounds, um, magical.

As for caching markers based on multiple sizes, that should also be possible. I had experimented with generating one stamp and then resampling it to different sizes, but, even with low-quality nearest neighbor resampling, doesn't gain much over just rendering the vector object directly. But if you built a discrete number of sizes and used those, that should help, I suspect. We probably want some sort of option to scatter where one specifies a number of "bins" to sort the sizes into (or maybe have an "auto" option that would do something smarter). And maybe not do anything special when the number of points is small, since it will add some preprocessing overhead.

@dpsanders
Copy link

Looks really nice!
A non-matplotlib comment: If you want to claim something about the growth compared to linear, it would be useful to add a linear growth comparison to the graph.

@ChrisBeaumont
Copy link
Contributor

Anything with curvature in a log-log plot is definitely super-linear :)

@dpsanders
Copy link

True, but even with curvature it might have not have reached linear yet by
the end of the time window shown!

On Wed, Jul 24, 2013 at 8:28 AM, Chris Beaumont notifications@github.comwrote:

Anything with curvature in a log-log plot is definitely super-linear :)


Reply to this email directly or view it on GitHubhttps://github.com//pull/2156#issuecomment-21484316
.

Dr. David P. Sanders

Profesor Titular "A" / Associate Professor
Departamento de Física, Facultad de Ciencias
Universidad Nacional Autónoma de México (UNAM)

dpsanders@gmail.com
http://sistemas.fciencias.unam.mx/~dsanders

Cubículo / office: #414, 4o. piso del Depto. de Física

Tel.: +52 55 5622 4965

… list of Transform objects.

When the collection has the same styling and only varies by offsets, use draw_markers instead.
@mdboom
Copy link
Member Author

mdboom commented Sep 30, 2013

I know @ChrisBeaumont wanted to take this even further, but I wonder if there's any objections to merging this as-is (since it appears to be an improvement on all fronts anyway, even if it doesn't go as far as it could).

@ChrisBeaumont
Copy link
Contributor

+1 on merging. I do still want to push on this more, but am mired in thesis work for the next month. In any event, that can happen in a new PR

mdboom added a commit that referenced this pull request Sep 30, 2013
[Sprint] scatter plots are (reportedly) too slow
@mdboom mdboom merged commit 379b14c into matplotlib:master Sep 30, 2013
@mdboom mdboom deleted the scatter-speed branch August 7, 2014 13:52
@tacaswell tacaswell mentioned this pull request Dec 4, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants