New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A general purpose STRtree which can store any Python object #1094
Conversation
@sgillies thanks for getting back to this! Now, I think I don't yet fully understand how this |
Indeed I need to think more about the callback. We can't ask users to navigate the Python/C issues. Perhaps requiring callbacks to be a method of an Extension type would help, they could use a helper method of that type to do conversions. I'll stop waving my hands now and come up with some running code. |
To prevent a crash with GEOS versions < 3.6.4
Finally got the time to get back to this and overcome some of the issues. First, this PR will let us store any Python objects we want in the tree. Geometries, numbers, strings. Even a mix, which is not something one would normally do. The query_cb function takes an ordinary Python function as a callback and all the conversion between Python and C types is done by a decorator. A related decorator is used in STRtree.nearest, which also now supports trees that contain non-geometry objects. |
value = ctypes.cast(arg1, ctypes.py_object).value | ||
userdata = ctypes.cast(arg2, ctypes.py_object).value | ||
func(value, userdata) | ||
return wrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These decorators need docs if we're going to go in this direction. I assume we could write these in C for 2.0. Something like https://code.activestate.com/recipes/576731-c-function-decorator/.
result.append(geom) | ||
@query_callback | ||
def callback(value, userdata): | ||
result.append(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python callback function with our special decorator.
shapely/strtree.py
Outdated
|
||
@nearest_callback | ||
def callback(value, geom, userdata): | ||
value_geom = geoms[values.index(value)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When non-geometry objects are stored in the tree we need to look up the geometries they were inserted with.
def test_query(geoms, query_geom, num_results): | ||
tree = STRtree(geoms) | ||
results = tree.query(query_geom) | ||
assert len(results) == num_results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a bunch of test improvements that helped in debugging.
Reindex initdata within nearest() to speed up geom search.
Add tests of storing an enumeration
) | ||
def test_query_idx_values(geoms, query_geom, expected): | ||
with pytest.warns(ShapelyDeprecationWarning): | ||
tree = STRtree(enumerate(geoms)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of storing the index of the geometry. Related to #1064.
def test_query_enumeration(geoms, query_geom, expected): | ||
"""Store enumeration""" | ||
with pytest.warns(ShapelyDeprecationWarning): | ||
tree = STRtree(zip(enumerate(geoms), geoms)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also related to #1064. But instead of always only storing geometry indices in tree nodes and getting geometries through their index, we require the caller to explicitly store both index and geometry.
try: | ||
dist.contents.value = func(value, geom, userdata) | ||
dist.contents.value = func(value, geom) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a use for the userdata argument right now. If one came up, it could be added as a keyword arg.
@jorisvandenbossche @mwtoews I've made some major improvements and have rewritten the introductory comment in this PR to better explain the intent and direction. I think this implementation covers existing features, the features in #1064, and more. |
Thanks, with those updates, the intent and how it would be used is much clearer now ;) (and avoids using ctypes as a user!) A few follow-up questions / comments:
|
Only read the updated top post now, so a few comments/answers on that:
A bit related to my comment above, but the actual API in that PR (different return value depending on a flag) is only "a" choice out of a set of options that I made when opening the PR, but is certainly something to be discussed. I think we are maybe discussing two (potentially) different aspects, which might be useful to separate. One is the actual implementation under the hood (i.e. how pygeos has implemented it in C / how #1064 does it in the current ctypes implementation), and the other is the user-facing API (eg the
The STRtree in #1064 just takes a sequence or iterable of geometries as input, which is exactly the same as the current version. So it's not specifically targetted at receiving a geopandas Series. I also don't think this is specific to geopandas, but more general when working with arrays of geometries. I know that eg @brendan-ward has mentioned he has already been using pygeos' strtree a lot in his work for fast spatial join like operations. |
Thanks @sgillies for your work here and @jorisvandenbossche for keeping the conservation going! I am likely biased, but I think keeping the underlying C STRtree limited to enumeration indices gives us a simpler API at the C level, avoids memory traps and GIL-related issues, and yet still provides a very flexible API at the Python level. Those indices can always be used to return arbitrary values (e.g., original index for a DataFrame, the geometry at each index, some other value) at the Python level: just keep an indexable array (i.e., numpy array of dtype appropriate for values) of those values in the Python tier, and return those values using the enumeration indices depending on some parameter or a specific function layered on top of the C STRtree. My hunch is similar to @jorisvandenbossche that storing other things directly in the C STRtree introduces a good bit more complexity and isn't necessary. The slow parts are the underlying tree query and predicate operations against the results; indexing back into an arbitrary array of values afterward should be very fast. I.e., we probably wouldn't see a performance improvement by storing the values you want back directly in the C tree (those may actually have a performance impact depending on how they are tracked there). I have been using the enumeration indices heavily for a variety of different operations; even if the geometries originate from DataFrames or arrays, sometimes I just use these directly, sometimes I use them to get the original DataFrame index, and sometimes I use them to get the geometries or something else. There is little overhead to using enumeration indices to get to any of those. So I'm a huge fan of the enumeration indices approach at the core.
I concur with this; the substantive part of the API discussion should be at the Python level, and there should be easy ways to return either the enumeration indices, the geometries, or arbitrary values based on the underlying enumeration indices. This can also be a rich source of functionality in Shapely and be easier for contributors than the underlying implementation in C. Both this and #1064 have elaborated the need for multiple types of results from the tree. I'm unclear on the use cases around custom callbacks implemented in Python. My first instinct is that I'd really want to keep these at the C level. There are issues around memory management and releasing the GIL lurking here. My preference would be to identify clear use cases for other callbacks that are needed and implement those at the C level and expose those through a well-defined API (also based on enumeration indices) back to the Python tier. Now that predicates are implemented in Likewise, the custom callback we had to work out in pygeos to support What are some of the use cases you had in mind for custom callbacks? |
@brendan-ward @jorisvandenbossche does storing only enumeration indices require the tree to be immutable? In #1104 I'm considering a tree that supports insertion and removal of items. It seems to me that in #1064 insertion must then be implemented as a strict append to the internal list so that indices already stored in the tree remain valid. I'm not sure if that breaks any pygeos requirements. Internally, item removal would be less constrained. If the item is left in the internal array/list all the indices remain valid. GEOS could be modified to allow us to break out of the loop at https://github.com/libgeos/geos/blob/master/src/index/strtree/SimpleSTRtree.cpp#L226 but until then there's no great way to return early using a callback. My interest in the callbacks was coming from an interest in seeing if we could shift responsibility from mapping pointer addresses stored in the tree to Python objects onto the caller. But if we're going to release the GIL we cannot do that. Maybe it's best to shelve the callback idea for now. If we do change GEOS to allow an early return, we could add a callback keyword argument to query and switch to a branch that doesn't release the GIL. Yes? |
@sgillies It is my understanding that STRtree trees are immutable after first query:
I know there are ongoing changes to the STRtree implementation, but I wasn't aware of any change to that constraint. Pygeos expects the tree inputs to be immutable. However, in my experience, building trees is very fast even for large (>1M) datasets, so I think it is entirely feasible to simply discard and rebuild the tree instead of trying to handle inserts. Ideally, mutations should be batched to avoid rebuilding the tree on every change. What is the use case for adding / removing items from the tree after building it? If GEOS STRtree is modified, this would be through the addition of new functions against STRtree, so support would need to be built into the C wrapping in pygeos / shapely at that time - and that would be the place to provide an additional function to the Python tier to handle this case. In Python, it could be exposed to users as a keyword parameter if appropriate, but use different underlying C functions. This still allows us to leverage address-based indexing, release the GIL, and keep that complexity at the C level. |
This STRtree can store almost anything and is backwards compatible with the one in 1.7. If initialized with a sequence of geometry objects, these are stored, and query and nearest return a list of geometries or a single geometry. If initialized with a sequence of
(value, geom)
pairs, the values are stored, and query and nearest return lists of these values or a single value. The values could be any Python object and don't need to be the same type.In #1064 we have an STRtree that takes a sequence of geometry objects and stores their enumeration indices. Query returns a list of indices or a list of geometries (found by index from an internal list) depending on a flag.
The STRtree in #1064 and in this PR each have internal lists maintaining references to the initial data.
My understanding of the STRtree in #1064 is that it is designed to take a sequence of geometry objects (a DataSeries, yes?) from GeoPandas and store their enumeration indices. Syncronization of state between the enumeration indices in the tree and the DataFrame row indices would be required. The way GeoPandas would use the #1094 STRtree is to store DataFrame row indices directly. I believe this would make synchronization less fragile.