
ENH: speed-up mlab.contiguous_regions using numpy #4174

Merged (4 commits) Feb 28, 2015

Conversation

@jaimefrio (Contributor)

The docstring was asking for it, so here it goes. Added some tests as well.

@efiring (Member) commented Feb 27, 2015

I think you can do this more efficiently (at least for arrays with a large number of regions) by first calculating np.diff(mask) and then taking advantage of the fact that a positive value signals the start of a range of ones and a negative value signals the end. Adding 1 to those indices gets almost everything; then you just have to handle the possible start at index 0 and the possible need for an extra end at the end.

@jaimefrio (Contributor, Author)

I think you just described what my code is doing... ;-)

@efiring (Member) commented Feb 27, 2015

I know--I was trying to describe an implementation that is pure-numpy, without the loop hidden in the generator expression. It may be that your version is better for most real-world use cases. I don't know.

@jaimefrio (Contributor, Author)

I'm not sure I fully understand... You mean the list(zip) part? I only do that to match the current output exactly. We could simply return idx.reshape(-1, 2) and have a numpy array with the same structure and avoid the conversion to list. Not sure if that may break code elsewhere.

We could also add checks to see if the call to np.concatenate is needed or not, but I went for simplest implementation.
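The reshape alternative mentioned above can be sketched as follows. This is a hypothetical illustration (the function name `contiguous_regions_arr` is invented here), assuming the usual convention of half-open `(start, stop)` index pairs:

```python
import numpy as np

def contiguous_regions_arr(mask):
    """Return an (n, 2) array of half-open (start, stop) pairs (sketch)."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return np.empty((0, 2), dtype=np.intp)
    # Interior boundaries: positions where the mask flips value.
    idx = np.nonzero(mask[:-1] != mask[1:])[0] + 1
    # Prepend 0 if the mask starts True; append len(mask) if it ends True.
    if mask[0]:
        idx = np.concatenate(([0], idx))
    if mask[-1]:
        idx = np.concatenate((idx, [mask.size]))
    # Consecutive boundary pairs are the (start, stop) rows.
    return idx.reshape(-1, 2)
```

Returning this array instead of `list(zip(...))` avoids the final conversion, at the cost of changing the return type for any downstream code that expects a list of tuples.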

boundaries.append((in_region, i))
in_region = None
idx, = np.nonzero(mask[:-1] != mask[1:])
idx = np.concatenate((([0],) if mask[0] else ()) + (idx + 1,) +
@WeatherGod (Member) commented on this diff:

Could you add a comment here to the effect of "prepending index 0 and appending index len(mask) if needed"? (At least, that is my understanding of what it is doing. Too many parens.)

In fact, perhaps you could have them be lists instead of tuples, to help distinguish when I am looking at a sequence and when we are merely grouping for order of operations?

@efiring (Member) commented Feb 27, 2015

This is what I had in mind. Again, I don't know whether it is better; but it is the algorithm I normally use for this sort of thing. (I thought there was a nearly identical example already in mpl somewhere, but maybe not.)

import numpy as np

def contiguous_regions(mask):
    mask = np.asarray(mask, dtype=np.int8)
    if mask.size == 0:
        return []
    if not mask.any():
        return []
    zero = np.zeros((1,), dtype=np.int8)
    mask = np.hstack((zero, mask, zero))
    diffm = np.diff(mask)
    ind0 = list(np.nonzero(diffm == 1)[0])
    ind1 = list(np.nonzero(diffm == -1)[0])
    return list(zip(ind0, ind1))


masks = ([0, 0, 1, 1, 0, 0, 1],
         [1, 0, 1, 1, 0, 1, 0],
         [0, 0, 1, 0, 0],
         [1, 0, 1],
         [1, 1, 1],
         [1,],
         [0,],
         [],
         )


for mask in masks:
    inds = contiguous_regions(mask)
    print(mask)
    print(inds)
    for i0, i1 in inds:
        print(mask[i0:i1])
    print('')

@tacaswell tacaswell added this to the next point release milestone Feb 27, 2015
@jaimefrio (Contributor, Author)

@efiring My logic was that the index array is going to be smaller than the mask in most cases, so we will probably be better off appending and prepending to the indices than to the mask. That may not always hold on a 64-bit machine with 1-byte booleans and 8-byte integers, but we probably need not worry much either way.

What we are almost certainly going to be better off with is using a boolean array directly. When you do np.nonzero(diffm == 1) you are effectively creating a boolean array the size of diffm, then searching it for True values. Furthermore, you need to do that twice, because you also have to search for the diffm == -1 values. Also, in some of the latest versions of numpy, boolean operations are explicitly vectorized with SIMD, while integer ones rely on the ability of the compiler to autovectorize, which is not always great. So I really think boolean is the way to go, even though the overall idea is mostly the same.
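The two approaches being compared can be sketched side by side. This is an illustrative snippet (variable names are invented here), assuming an unpadded mask where a transition at position i means a region boundary at i + 1:

```python
import numpy as np

mask = np.array([0, 0, 1, 1, 0, 1], dtype=bool)

# int8 route: diff, then two separate comparisons and searches.
diffm = np.diff(mask.astype(np.int8))
starts_a = np.nonzero(diffm == 1)[0] + 1   # allocates a bool temp, scans once
ends_a = np.nonzero(diffm == -1)[0] + 1    # allocates another temp, scans again

# boolean route: one comparison finds all transitions in a single pass.
idx = np.nonzero(mask[:-1] != mask[1:])[0] + 1
```

Both routes find the same boundaries; the boolean route does it with one temporary and one scan instead of two of each.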

@WeatherGod I have made the whole concatenation much more explicit and readable and added some comments, let me know if you prefer it some other way.

@efiring (Member) commented Feb 27, 2015

@jaimefrio The problem with Boolean is that on the most recent numpy, np.diff doesn't do what I need; I discovered this the hard way not too long ago. That's why I converted to int8. But I like your most recent version, which looks both efficient and readable. Elegant, even!
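The int8 workaround can be illustrated with a short sketch. The behaviour of subtracting booleans (which np.diff relies on) has varied across NumPy versions, from deprecation warnings to hard errors, so casting first is the portable route:

```python
import numpy as np

mask = np.array([0, 0, 1, 1, 0], dtype=bool)

# Boolean subtraction is version-dependent in NumPy (deprecated, later
# removed), so cast to a signed integer type before taking differences.
d = np.diff(mask.astype(np.int8))
# In d, a +1 at position i marks a run starting at i + 1,
# and a -1 at position i marks a run ending at i + 1.
```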

@WeatherGod (Member)

@efiring, really?! Crap! I think you just pointed me in the right direction for a bug I have been trying to hunt down, which forced me to revert to v1.8.x.


@jaimefrio (Contributor, Author)

Has behavior actually changed? I believe in 1.9 it simply issues a DeprecationWarning when you try to do boolean subtraction, but behavior hasn't changed unless you explicitly turn warnings into errors. This should be the relevant change in numpy.

elif in_region is not None and not val:
boundaries.append((in_region, i))
in_region = None
# Add first and/or last index if needed
@efiring (Member) commented on this diff:

While we are polishing this bikeshed: it's probably most efficient to convert idx to a list at this point. Then the prepending and/or appending can be done with a total of 4 lines, as in my example. These operations are very fast with python lists--probably faster than np.concatenate. I suspect the final indexing and zip operation would also be at least as fast if idx is already a list.
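One way to read that suggestion is the following sketch (the function name `contiguous_regions_list` is hypothetical; this is not the merged code):

```python
import numpy as np

def contiguous_regions_list(mask):
    """Return [(start, stop), ...] using a Python list for the cheap edits."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return []
    # Interior transitions, converted to a plain list of ints up front.
    idx = [int(i) for i in np.nonzero(mask[:-1] != mask[1:])[0] + 1]
    # Prepending/appending to a short Python list is very cheap.
    if mask[0]:
        idx.insert(0, 0)
    if mask[-1]:
        idx.append(len(mask))
    # Alternating boundaries pair up into (start, stop) half-open ranges.
    return list(zip(idx[::2], idx[1::2]))
```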

@jaimefrio (Contributor, Author) replied:

Lists are faster on my system for arrays of less than 10000 indices. After that arrays take over. It does seem like the more common use case, so I guess it does make sense to change it.

efiring added a commit that referenced this pull request Feb 28, 2015
ENH: speed-up mlab.contiguous_regions using numpy
@efiring efiring merged commit f4f872b into matplotlib:master Feb 28, 2015
@efiring (Member) commented Feb 28, 2015

The Travis failure is unrelated, and looks like a Travis glitch associated with building the docs.

@jaimefrio jaimefrio deleted the contiguous_regions branch February 28, 2015 06:53
@jenshnielsen (Member)

@efiring The docs build issue is caused by the release of IPython 3.0; fixed by #4176.
