New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store indices instead of geometries in STRtree + option to keep returning geometries #1064
Store indices instead of geometries in STRtree + option to keep returning geometries #1064
Conversation
shapely/strtree.py
Outdated
for geom in geoms: | ||
lgeos.GEOSSTRtree_insert(self._tree_handle, geom._geom, ctypes.py_object(geom)) | ||
self._idxs = list(range(len(geoms))) | ||
for idx, geom in zip(self._idxs, geoms): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this just be
for idx, geom in enumerate(geoms):
It doesn't look like self._idxs
is used anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should add a comment about it to clarify, but we need to store self._idxs
for memory management purposes (python needs to keep track of those integers, to avoid garbage collecting them while they are stored in the C++ GEOS STRtree).
Still, this for
loop could be simplified to use enumerate
, but since we have the self._idxs
anyway, it seems a bit more explicit to use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely worthy of a comment making that clear. If we need to hold on to the indexes then the way you have it makes total sense.
I like the simplicity and explicitness of a |
I also like the keyword solution. It is straightforward and effortless to understand. |
Using a keyword for this behavior seems reasonable. I'd suggest adding a bit to the docstrings of Personally, I really like the index-oriented approach used by Is there ever a case for returning both index and geometry? It might be worth considering how this generalizes to a vectorized / bulk case as well, as in |
Yes, that's what I would advocate for as well and was planning to do (once we agree on the interface, and I still need to add docs and tests first).
I don't think there is necessarily a need to add the keyword to |
shapely/strtree.py
Outdated
@@ -84,8 +84,9 @@ def __init__(self, geoms): | |||
def _init_tree_handle(self, geoms): | |||
# GEOS STRtree capacity has to be > 1 | |||
self._tree_handle = lgeos.GEOSSTRtree_create(max(2, len(geoms))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but the argument you pass to GEOSSTRtree_create
is the capacity of a tree node, not the tree itself! Using len(geoms)
will give you a tree with a single leaf, i.e. a glorified array of geometries. 10 is a good value to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would explain this: https://gis.stackexchange.com/questions/353619/shapely-with-rtree-versus-strtree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, awesome catch Dan. I've seen that question before but had no idea what caused it, glad to see there's a reasonable explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #1070
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbaston thanks a lot for the catch!
In pygeos, we are luckily already using a fixed default of 5 (https://github.com/pygeos/pygeos/blob/master/pygeos/strtree.py#L54). Is that a good default as well, or would you recommend changing it to 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I've experimented with it I get the best performance in the 8-16 neighborhood. GEOS internally defaults to 10 and it's hard to see a reason to change it. I'm not sure why GEOS exposes this to the user in the first place; I've made the same mistake myself. postgis/postgis@1ff9c14
I dislike using functions that have different return types depending on a keyword argument. I would expect this to be a source of bugs. I haven't seen it discussed yet, maybe because it's so obvious, but storing only indexes means that only geometries in a container can be indexed. Right? And storing only integers means only containers that are indexed with an integer can be used. Python is more flexible than this. Lists and arrays require integer keys, but mappings can use any immutable object as a key. What would you all think about allowing storage of any Python object? Not just geometries or integers. Then we'd have a very general purpose spatially-indexed container that could be used in many ways. The GEOS API function https://github.com/libgeos/geos/blob/master/capi/geos_ts_c.cpp#L2979-L2988 takes a geometry and a |
I think this is similar as what Personally, I have never used this ability of rtree, so can't really judge about it. But it would be good to get some idea or feedback on whether and how this is used in rtree.
I don't fully understand the logic here. Right now the STRtree in Shapely does not accept a geometry, but an iterable of geometries. The alternative would be to have an iterable of That's certainly more flexible for the user (the current behaviour of returning the geometry would be mimicked by Some other notes:
|
Further thoughts about this?
BTW, I agree, in general, that a changing output type depending on a keyword is not always the nicest API design. In the RFC discussion, an alternative that was raised was to have two separate methods for this reason (eg To also stress another point: for me, this is mostly a discussion for the API on the python side. IMO we can perfectly keep the STRtree code at the C level to only work with integers for simplicity, and at the python level we can provide wrappers to return other values than integer indices. |
Friendly ping for this discussion. |
@jorisvandenbossche in #1094 I have another take on what the Python API in 1.8 and 2.0 might look like. Geopandas needs a method that returns a numpy array of ints (addresses) and avoids Python callbacks so the GIL can be released, yes? What would you think about that being a seperate method entirely? |
From a user perspective it would be nice to have an explicit procedure for recovering some or other identifying Further, it would be nice if this Use-case: when using various data structures, e.g.:
Then I need to use an STRTree to recover nearby adjacent nodes or buildings etc., and it is not the geometry that I'm looking for, per se, but the I realise this is bit of a hack, but at the moment (existing |
@songololo one way to do this is to create two parallel numpy arrays: one of your |
@songololo in #1112 I have a proposal for a tree that supports bringing your own integer identifiers. This is something I would like to have, too. |
To be clear, also in this PR it would be easy to add the functionality to allow passing custom But so, in my mind, the main question is not so much about capabilities, but the final API question about how to expose those different options to the user (as I listed several ways in the top post): (personally, I have a slight preference for the keyword instead of the separate methods, but I can certainly live with the other options as well as a compromise) |
😂 to muddy the waters a vote for @sgillies suggestion of keeping the return types distinct per method name as this keeps the mental abstractions cleaner and reduces potential downstream errors. By having distinct methods coupled to distinct return types my inclination is that discovery of functionality is also a bit easier for end users because it is sometimes easy to miss or otherwise gloss over the implications of keyword options. |
Just to make it easier to compare and discuss both approaches / PRs, I quickly added the ability to store custom ids with a list of >>> from shapely.geometry import Point
>>> from shapely.strtree import STRtree
>>> geoms = [Point(i, i) for i in range(10)]
# passing list of geometries
>>> tree = STRtree(geoms)
>>> tree.query(Point(2, 2).buffer(1.0))
[<shapely.geometry.point.Point at 0x7fc3a21fc550>,
<shapely.geometry.point.Point at 0x7fc3a21fc520>,
<shapely.geometry.point.Point at 0x7fc3a21fc4c0>]
# this returns "default" (0..n) indices
>>>> tree.query(Point(2, 2).buffer(1.0), return_geometries=False)
[1, 2, 3]
# passing list of (geom, id) tuples
>>> tree = STRtree((g, i+1) for i, g in enumerate(geoms))
# this now returns the custom (user-provided) ids
>>> tree.query(Point(2, 2).buffer(1.0), return_geometries=False)
[2, 3, 4] (and of course also in this PR I could easily add |
In principle this should also work for |
Just reading up on the discussions here. About the general API, I don’t think it is a good thing to change things unless awe have to. Also, a function should have a return type that is independent of its input parameters. So I like to take side with @sgillies and @songololo on that one. I’d opt for keeping the current method as is, and adding a “query_index” method. User-supplied labels would be easy to make, but I’d stear clear of the list comprehensions and make the API like “STRTree(geometries, labels=None)” defaulting “labels” to “np.arange(len(geometries))” Implementation-wise, the strtree will just contain primary keys, which can be mapped as a last step to geometries (or other labels). In pygeos, STRTree().geometries is already there and can be used for that purpose. |
@caspervdw yes, I agree that we should disentangle geoms and values and use two input sequences (one optional). Thank you. I can see how this helps efficient implementations. |
Sorry for the very slow follow-up here.
For completeness, not changing the API here in Shapely will also mean a change in API for some users, as it will mean a change for people already using pygeos (and a change for users of geopandas' The question is also what we do with the
Indeed, and that's what I am also already doing here. |
@sgillies I updated this PR after your merge of #1112. It doesn't change any user facing API now (after your refactor of that in #1112), but only refactors the implementation (how to store the items). As mentioned in #1112 (comment), this version is closer to what we would do in the shapely-2.0 branch on top of the C-based STRtree. IMO it is useful to already do this now. |
if geoms is None: | ||
# TODO should we allow this? | ||
geoms = [] | ||
|
||
geoms = list(geoms) | ||
if geoms and isinstance(geoms[0], tuple): | ||
# TODO should we allow this? | ||
# unzip the list of tuples | ||
geoms, items = list(zip(*geoms)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those two TODO items could be removed after #1172
In context of the Shapely 2.0 RFC (shapely/shapely-rfc#1) we had some discussion about storing/returning integer indices instead of geometry objects, but didn't yet finalize the exact API how we want this to look like.
With this PR, I am testing one concrete option, and which can then also trigger some discussion about this, and if we agree on the API, I can also update the RFC text about that.
Generally, I think we can always store the indices in the GEOS tree, and then on the python level we can still return geometries or indices as we want (since we store the original geometries which are inserted into the GEOS tree).
Then specifically in this PR, I am keeping the existing
query
andnearest
methods of theSTRtree
, but add areturn_geometries
keyword. Currently, I set it as default to True (to preserve existing behaviour). But then, we can also decide to raise a warning in Shapely 1.8 that this will switch to False, and have it set to False as default in Shapely 2.0 (so then by default return indices), as a way to have a deprecation cycle.Some possible alternatives / discussion points (also based on aspects mentioned in the RFC discussion):
return_geometries=True/False
good? Other ideas?query_index
andquery_geometries
)Personally, I think a keyword option in a single method is the cleanest and easiest to implement (but certainly open to the other options as well). It's also similar to rtree's
idx.intersection(..., objects=False)
.Whichever option we choose, I think it will be straightforward to add it to the pygeos-based implementation for Shapely 2.0.
This PR obviously still needs more tests.
Also closes #615 (this PR is based on @snorfalorpagus's comment there)
cc @sgillies @mwtoews @caspervdw @brendan-ward