Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A general purpose STRtree which can store any Python object #1094

Closed
wants to merge 13 commits into from

Conversation

sgillies
Copy link
Contributor

@sgillies sgillies commented Mar 2, 2021

This STRtree can store almost anything and is backwards compatible with the one in 1.7. If initialized with a sequence of geometry objects, these are stored, and query and nearest return a list of geometries or a single geometry. If initialized with a sequence of (value, geom) pairs, the values are stored, and query and nearest return lists of these values or a single value. The values could be any Python object and don't need to be the same type.

In #1064 we have an STRtree that takes a sequence of geometry objects and stores their enumeration indices. Query returns a list of indices or a list of geometries (found by index from an internal list) depending on a flag.

The STRtree in #1064 and in this PR each have internal lists maintaining references to the initial data.

My understanding of the STRtree in #1064 is that it is designed to take a sequence of geometry objects (a DataSeries, yes?) from GeoPandas and store their enumeration indices. Syncronization of state between the enumeration indices in the tree and the DataFrame row indices would be required. The way GeoPandas would use the #1094 STRtree is to store DataFrame row indices directly. I believe this would make synchronization less fragile.

@coveralls
Copy link

coveralls commented Mar 2, 2021

Coverage Status

Coverage increased (+0.02%) to 85.179% when pulling ae6f2b8 on strtree-callbacks into 2086ba9 on master.

@jorisvandenbossche
Copy link
Member

@sgillies thanks for getting back to this!

Now, I think I don't yet fully understand how this query_cb would be used in practice by the user.
First, it requires manual usage of ctypes (and knowing the types of the pointers that the callback function receives), and I am not sure if we want that users have to deal with ctypes? (certainly given that we are removing the use of ctypes in Shapely 2.0).
Further, I don't think this actually allows to get an integer index as the result? The callback function can only work with the data that was inserted into the tree, and at the moment that is still hardcoded to be the actual geometry object.

@sgillies
Copy link
Contributor Author

sgillies commented Mar 3, 2021

Indeed I need to think more about the callback. We can't ask users to navigate the Python/C issues. Perhaps requiring callbacks to be a method of an Extension type would help, they could use a helper method of that type to do conversions. I'll stop waving my hands now and come up with some running code.

@sgillies
Copy link
Contributor Author

Finally got the time to get back to this and overcome some of the issues. First, this PR will let us store any Python objects we want in the tree. Geometries, numbers, strings. Even a mix, which is not something one would normally do. The query_cb function takes an ordinary Python function as a callback and all the conversion between Python and C types is done by a decorator. A related decorator is used in STRtree.nearest, which also now supports trees that contain non-geometry objects.

value = ctypes.cast(arg1, ctypes.py_object).value
userdata = ctypes.cast(arg2, ctypes.py_object).value
func(value, userdata)
return wrapper
Copy link
Contributor Author

@sgillies sgillies Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These decorators need docs if we're going to go in this direction. I assume we could write these in C for 2.0. Something like https://code.activestate.com/recipes/576731-c-function-decorator/.

shapely/strtree.py Outdated Show resolved Hide resolved
result.append(geom)
@query_callback
def callback(value, userdata):
result.append(value)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python callback function with our special decorator.


@nearest_callback
def callback(value, geom, userdata):
value_geom = geoms[values.index(value)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When non-geometry objects are stored in the tree we need to look up the geometries they were inserted with.

def test_query(geoms, query_geom, num_results):
tree = STRtree(geoms)
results = tree.query(query_geom)
assert len(results) == num_results
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a bunch of test improvements that helped in debugging.

Sean Gillies added 2 commits March 11, 2021 16:31
Reindex initdata within nearest() to speed up geom search.
@sgillies sgillies marked this pull request as ready for review March 13, 2021 22:24
Add tests of storing an enumeration
)
def test_query_idx_values(geoms, query_geom, expected):
with pytest.warns(ShapelyDeprecationWarning):
tree = STRtree(enumerate(geoms))
Copy link
Contributor Author

@sgillies sgillies Mar 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of storing the index of the geometry. Related to #1064.

def test_query_enumeration(geoms, query_geom, expected):
"""Store enumeration"""
with pytest.warns(ShapelyDeprecationWarning):
tree = STRtree(zip(enumerate(geoms), geoms))
Copy link
Contributor Author

@sgillies sgillies Mar 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also related to #1064. But instead of always only storing geometry indices in tree nodes and getting geometries through their index, we require the caller to explicitly store both index and geometry.

try:
dist.contents.value = func(value, geom, userdata)
dist.contents.value = func(value, geom)
Copy link
Contributor Author

@sgillies sgillies Mar 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a use for the userdata argument right now. If one came up, it could be added as a keyword arg.

@sgillies sgillies changed the title Offer a query_cb alternative to #1064 A general purpose STRtree that can store any Python object Mar 14, 2021
@sgillies sgillies changed the title A general purpose STRtree that can store any Python object A general purpose STRtree which can store any Python object Mar 14, 2021
@sgillies
Copy link
Contributor Author

@jorisvandenbossche @mwtoews I've made some major improvements and have rewritten the introductory comment in this PR to better explain the intent and direction. I think this implementation covers existing features, the features in #1064, and more.

@jorisvandenbossche
Copy link
Member

Thanks, with those updates, the intent and how it would be used is much clearer now ;) (and avoids using ctypes as a user!)

A few follow-up questions / comments:

  • Do you have use cases in mind where the user might want to get something else returned as either the geometry itself or either some ID? And a similar question for the ability to specify a custom callback function.
    I am mainly looking at this from the point of view of what we need in geopandas (where we wouldn't need that), so this is of course a quite limited view! But at the same time, I think we should be cautious in implementing features "because it's possible", and think about practical use case if adding this.
    (Because adding such flexible features also has a cost. We should probably need to try it out to know for sure, but my feeling says that implementing the same features in C for 2.0 would give quite some additional complexity.)

  • Regarding the ability to "store any Python objects we want in the tree": I think I mentioned this before in Store indices instead of geometries in STRtree + option to keep returning geometries #1064, but I want to mention again that to achieve this functionality, those objects don't necessarily have to be stored in the GEOS C tree, but could also be stored in the Python tree object.
    If we want to enable this functionality for the end users, I think it is possible to keep track of the values on the python side, and still only store the actual geometries in the GEOS STRtree (as is currently done in pygeos). And based on the indices returned from this wrapped C tree, the python layer can still return any Python object we want.

  • I know that the GEOS STRtree explicitly allows storing generic values that get returned. But using Python objects for that instead of indices (or actually pointers, as currently done in pygeos), will still be more complicated, and for example won't allow to release the GIL for the query (I assume). Also a generic callback function that calls a user-provided python callback won't be able to release the GIL, so that might mean we need two different code paths to support both those cases.

@jorisvandenbossche
Copy link
Member

Only read the updated top post now, so a few comments/answers on that:

In #1064 we have an STRtree that takes a sequence of geometry objects and stores their enumeration indices. Query returns a list of indices or a list of geometries (found by index from an internal list) depending on a flag.

A bit related to my comment above, but the actual API in that PR (different return value depending on a flag) is only "a" choice out of a set of options that I made when opening the PR, but is certainly something to be discussed. I think we are maybe discussing two (potentially) different aspects, which might be useful to separate. One is the actual implementation under the hood (i.e. how pygeos has implemented it in C / how #1064 does it in the current ctypes implementation), and the other is the user-facing API (eg the query method with a flag as in #1064, or a new query_indices method, or letting the user insert the value to be returned explicitly as in this PR, etc).
My feeling is that we can separate those two aspects, and could implement any user facing API we want (maybe except the custom callback) on top of a low-level C STRtree that returns indices (which is probably the simplest to implement on that lower level).

My understanding of the STRtree in #1064 is that it is designed to take a sequence of geometry objects (a DataSeries, yes?) from GeoPandas and store their enumeration indices. Syncronization of state between the enumeration indices in the tree and the DataFrame row indices would be required. The way GeoPandas would use the #1094 STRtree is to store DataFrame row indices directly. I believe this would make synchronization less fragile.

The STRtree in #1064 just takes a sequence or iterable of geometries as input, which is exactly the same as the current version. So it's not specifically targetted at receiving a geopandas Series.
Also, as a clarification, in geopandas we don't really want to store/get row indices (generic labels, which can be of any type), but rather integer positional indices (pandas allows you to index a dataframe in two ways: the actual row labels vs integer positions). This is because using integer positions is more performant (and also more robust, eg in theory you can have duplicate labels).

I also don't think this is specific to geopandas, but more general when working with arrays of geometries. I know that eg @brendan-ward has mentioned he has already been using pygeos' strtree a lot in his work for fast spatial join like operations.

@brendan-ward
Copy link
Collaborator

Thanks @sgillies for your work here and @jorisvandenbossche for keeping the conservation going!

I am likely biased, but I think keeping the underlying C STRtree limited to enumeration indices gives us a simpler API at the C level, avoids memory traps and GIL-related issues, and yet still provides a very flexible API at the Python level. Those indices can always be used to return arbitrary values (e.g., original index for a DataFrame, the geometry at each index, some other value) at the Python level: just keep an indexable array (i.e., numpy array of dtype appropriate for values) of those values in the Python tier, and return those values using the enumeration indices depending on some parameter or a specific function layered on top of the C STRtree.

My hunch is similar to @jorisvandenbossche that storing other things directly in the C STRtree introduces a good bit more complexity and isn't necessary. The slow parts are the underlying tree query and predicate operations against the results; indexing back into an arbitrary array of values afterward should be very fast. I.e., we probably wouldn't see a performance improvement by storing the values you want back directly in the C tree (those may actually have a performance impact depending on how they are tracked there).

I have been using the enumeration indices heavily for a variety of different operations; even if the geometries originate from DataFrames or arrays, sometimes I just use these directly, sometimes I use them to get the original DataFrame index, and sometimes I use them to get the geometries or something else. There is little overhead to using enumeration indices to get to any of those. So I'm a huge fan of the enumeration indices approach at the core.

we can separate those two aspects, and could implement any user facing API we want (maybe except the custom callback) on top of a low-level C STRtree that returns indices

I concur with this; the substantive part of the API discussion should be at the Python level, and there should be easy ways to return either the enumeration indices, the geometries, or arbitrary values based on the underlying enumeration indices. This can also be a rich source of functionality in Shapely and be easier for contributors than the underlying implementation in C. Both this and #1064 have elaborated the need for multiple types of results from the tree.

I'm unclear on the use cases around custom callbacks implemented in Python. My first instinct is that I'd really want to keep these at the C level. There are issues around memory management and releasing the GIL lurking here. My preference would be to identify clear use cases for other callbacks that are needed and implement those at the C level and expose those through a well-defined API (also based on enumeration indices) back to the Python tier.

Now that predicates are implemented in query and query_bulk methods at the C level, I'm having a hard time thinking of other things I'd want to handle in a tree query callback. The tree doesn't allow us to exit early on first result (which could be helpful for certain analyses), so other than tracking the results that you get back from the underlying GEOSSTRtree_query_r or some derivative of input geometries or values that you could do after getting enumeration indices from the tree, I don't know what else you'd do in a callback.

Likewise, the custom callback we had to work out in pygeos to support nearest_all (return all nearest equidistant results and not just the first, limited by an optional max distance) was a bit tricky and definitely needed to be done at the C level.

What are some of the use cases you had in mind for custom callbacks?

@sgillies
Copy link
Contributor Author

@brendan-ward @jorisvandenbossche does storing only enumeration indices require the tree to be immutable? In #1104 I'm considering a tree that supports insertion and removal of items. It seems to me that in #1064 insertion must then be implemented as a strict append to the internal list so that indices already stored in the tree remain valid. I'm not sure if that breaks any pygeos requirements. Internally, item removal would be less constrained. If the item is left in the internal array/list all the indices remain valid.

GEOS could be modified to allow us to break out of the loop at https://github.com/libgeos/geos/blob/master/src/index/strtree/SimpleSTRtree.cpp#L226 but until then there's no great way to return early using a callback. My interest in the callbacks was coming from an interest in seeing if we could shift responsibility from mapping pointer addresses stored in the tree to Python objects onto the caller. But if we're going to release the GIL we cannot do that. Maybe it's best to shelve the callback idea for now. If we do change GEOS to allow an early return, we could add a callback keyword argument to query and switch to a branch that doesn't release the GIL. Yes?

@brendan-ward
Copy link
Collaborator

@sgillies It is my understanding that STRtree trees are immutable after first query:

once the tree has been built (explicitly or on the first call to query), items may not be added or removed

I know there are ongoing changes to the STRtree implementation, but I wasn't aware of any change to that constraint.

Pygeos expects the tree inputs to be immutable.

However, in my experience, building trees is very fast even for large (>1M) datasets, so I think it is entirely feasible to simply discard and rebuild the tree instead of trying to handle inserts. Ideally, mutations should be batched to avoid rebuilding the tree on every change.

What is the use case for adding / removing items from the tree after building it?

If GEOS STRtree is modified, this would be through the addition of new functions against STRtree, so support would need to be built into the C wrapping in pygeos / shapely at that time - and that would be the place to provide an additional function to the Python tier to handle this case. In Python, it could be exposed to users as a keyword parameter if appropriate, but use different underlying C functions. This still allows us to leverage address-based indexing, release the GIL, and keep that complexity at the C level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants