Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store indices instead of geometries in STRtree + option to keep returning geometries #1064

Conversation

jorisvandenbossche
Copy link
Member

In context of the Shapely 2.0 RFC (shapely/shapely-rfc#1) we had some discussion about storing/returning integer indices instead of geometry objects, but didn't yet finalize the exact API how we want this to look like.

With this PR, I am testing one concrete option, and which can then also trigger some discussion about this, and if we agree on the API, I can also update the RFC text about that.

Generally, I think we can always store the indices in the GEOS tree, and then on the python level we can still return geometries or indices as we want (since we store the original geometries which are inserted into the GEOS tree).
Then specifically in this PR, I am keeping the existing query and nearest methods of the STRtree, but add a return_geometries keyword. Currently, I set it as default to True (to preserve existing behaviour). But then, we can also decide to raise a warning in Shapely 1.8 that this will switch to False, and have it set to False as default in Shapely 2.0 (so then by default return indices), as a way to have a deprecation cycle.

Some possible alternatives / discussion points (also based on aspects mentioned in the RFC discussion):

  • If we are fine with a keyword, is return_geometries=True/False good? Other ideas?
  • Keep a single class, but separate methods that return indices (eg query_index and query_geometries)
  • Separate class with the behaviour to return indices

Personally, I think a keyword option in a single method is the cleanest and easiest to implement (but certainly open to the other options as well). It's also similar to rtree's idx.intersection(..., objects=False).

Whichever option we choose, I think it will be straightforward to add it to the pygeos-based implementation for Shapely 2.0.


This PR obviously still needs more tests.

Also closes #615 (this PR is based on @snorfalorpagus's comment there)

cc @sgillies @mwtoews @caspervdw @brendan-ward

for geom in geoms:
lgeos.GEOSSTRtree_insert(self._tree_handle, geom._geom, ctypes.py_object(geom))
self._idxs = list(range(len(geoms)))
for idx, geom in zip(self._idxs, geoms):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just be

for idx, geom in enumerate(geoms):

It doesn't look like self._idxs is used anywhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add a comment about it to clarify, but we need to store self._idxs for memory management purposes (python needs to keep track of those integers, to avoid garbage collecting them while they are stored in the C++ GEOS STRtree).

Still, this for loop could be simplified to use enumerate, but since we have the self._idxs anyway, it seems a bit more explicit to use it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely worthy of a comment making that clear. If we need to hold on to the indexes then the way you have it makes total sense.

@tomplex
Copy link
Contributor

tomplex commented Jan 17, 2021

I like the simplicity and explicitness of a return_geometries kwarg being True or False. Being like rtree with the flag in the method is also a +1 for me, as a heavy user of that library.

@martinfleis
Copy link
Contributor

I also like the keyword solution. It is straightforward and effortless to understand.

@brendan-ward
Copy link
Collaborator

Using a keyword for this behavior seems reasonable. I'd suggest adding a bit to the docstrings of query and nearest to make that output more clear.

Personally, I really like the index-oriented approach used by pygeos.STRtree.query because it enables spatial joins and in my specific usage, the identity of the result is more important than the geometry (which I can always join in later); so I think that getting geometries back instead could be more of an opt-in instead of opt-out. Thus I'd suggest having default return_geometries=True for existing behavior with a deprecation warning, and longer-term flip this to return_geometries=False.

Is there ever a case for returning both index and geometry?

It might be worth considering how this generalizes to a vectorized / bulk case as well, as in pygeos.STRtree.query_bulk where the left side of the query would always be returned as indices, but the right side would be either indexes (as now in query_bulk) or have an option like this keyword to return geometries instead, or perhaps both right side indexes and geometries depending on the above question. If the bulk function meets the needs of spatial joins, then the default behavior for a singular query operation may not need to change, but does present a potential consistency issue.

@jorisvandenbossche
Copy link
Member Author

Thus I'd suggest having default return_geometries=True for existing behavior with a deprecation warning, and longer-term flip this to return_geometries=False

Yes, that's what I would advocate for as well and was planning to do (once we agree on the interface, and I still need to add docs and tests first).

It might be worth considering how this generalizes to a vectorized / bulk case as well, as in pygeos.STRtree.query_bulk

I don't think there is necessarily a need to add the keyword to query_bulk as well (unless someone comes up with a good use case, the only reason I can think of now would be for consistency, which in itself is not worth it IMO).
But you are right that with bulk_query returning indices, that can cover a large part of the use cases that need indices. Personally I would still prefer to (eventually, in Shapely 2.0) return indices in query as well, for consistency, but it's certainly a good argument.

@@ -84,8 +84,9 @@ def __init__(self, geoms):
def _init_tree_handle(self, geoms):
# GEOS STRtree capacity has to be > 1
self._tree_handle = lgeos.GEOSSTRtree_create(max(2, len(geoms)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but the argument you pass to GEOSSTRtree_create is the capacity of a tree node, not the tree itself! Using len(geoms) will give you a tree with a single leaf, i.e. a glorified array of geometries. 10 is a good value to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, awesome catch Dan. I've seen that question before but had no idea what caused it, glad to see there's a reasonable explanation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #1070

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbaston thanks a lot for the catch!
In pygeos, we are luckily already using a fixed default of 5 (https://github.com/pygeos/pygeos/blob/master/pygeos/strtree.py#L54). Is that a good default as well, or would you recommend changing it to 10?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I've experimented with it I get the best performance in the 8-16 neighborhood. GEOS internally defaults to 10 and it's hard to see a reason to change it. I'm not sure why GEOS exposes this to the user in the first place; I've made the same mistake myself. postgis/postgis@1ff9c14

@sgillies
Copy link
Contributor

sgillies commented Jan 27, 2021

I dislike using functions that have different return types depending on a keyword argument. I would expect this to be a source of bugs.

I haven't seen it discussed yet, maybe because it's so obvious, but storing only indexes means that only geometries in a container can be indexed. Right? And storing only integers means only containers that are indexed with an integer can be used. Python is more flexible than this. Lists and arrays require integer keys, but mappings can use any immutable object as a key. What would you all think about allowing storage of any Python object? Not just geometries or integers. Then we'd have a very general purpose spatially-indexed container that could be used in many ways.

The GEOS API function https://github.com/libgeos/geos/blob/master/capi/geos_ts_c.cpp#L2979-L2988 takes a geometry and a void * pointer. Our tree constructor, now, takes only a geometry. We could extend it to take either a geometry or a geometry, object tuple and store the object if it is given, else the geometry. The query method would return whatever was stored.

@sgillies sgillies self-requested a review January 27, 2021 17:11
@jorisvandenbossche
Copy link
Member Author

I think this is similar as what rtree does? It allows to insert, in addition to the integer id and bbox, also a generic (picklable) obj, and at query time, you can ask for the object instead of the id. See https://toblerity.org/rtree/tutorial.html#using-rtree-as-a-cheapo-spatial-database

Personally, I have never used this ability of rtree, so can't really judge about it. But it would be good to get some idea or feedback on whether and how this is used in rtree.
(to be fully correct, in geopandas we actually used to insert the index label of the Series as obj, but we never did anything with it afterwards).

I haven't seen it discussed yet, maybe because it's so obvious, but storing only indexes means that only geometries in a container can be indexed. Right? And storing only integers means only containers that are indexed with an integer can be used. Python is more flexible than this. Lists and arrays require integer keys, but mappings can use any immutable object as a key. What would you all think about allowing storage of any Python object?

I don't fully understand the logic here. Right now the STRtree in Shapely does not accept a geometry, but an iterable of geometries. The alternative would be to have an iterable of (geom, obj) tuples as input, I assume? (and the obj gets returned from the query)?

That's certainly more flexible for the user (the current behaviour of returning the geometry would be mimicked by (geom, geom) tuples, the proposal of returning integers by (geom, i) tuples). But I don't understand how that enables to store "any" python object.
To take the practical example of a mapping: assume you have a dict of {key: geom} structure, you still need to convert this to an iterable of (geom, key) tuples to be able to store it, which could be something like STRtree(zip(dict.keys(), dict.values())). So the interface doesn't allow to directly store the dictionary, you still have to do quite some manipulation to get to that. And this way you get the dictionary keys back from the query function. But with integer indices this is very easy to mimic as well: you pass the dict values (geoms) to the tree (STRtree(dict.values())), and can convert to integer index from the query to the dict key with list(dict.keys())[i].

Some other notes:

  • The current implementation of STRtree in pygeos actually stores the addresses of the geometry in the array of geometries, and doesn't store actual integer index objects (on suggestion of @dbaston)
  • The current implementation also releases the GIL for queries: this is something that I certainly would like to keep (or at least in case of integer indices as output), so that would probably be a complication for generic objects.
  • If we would like this generic interface of (geom, obj), I think this is still easily possible on the Python side and we don't need to actually store those obj inside the GEOS STRtree (even if it allows to do that).
    We already store the input geometries on the python class, so we could also perfectly store the objects as an array/list as well. Assume that you have an iterable of (geom, obj) tuples as input. Then getting a list of geometries and corresponding objects is basically geoms, objs = list(zip(*input)); the geoms can be stored in the GEOS STRtree, and when doing a query and if we want to return the obj instead of the integer indices we get from the underlying GEOS query, we can return [objs[i] for i in idxs] (note for simplicity I showed python code, but this could also all be vectorized/optimized with numpy arrays).

@jorisvandenbossche
Copy link
Member Author

Further thoughts about this?

I dislike using functions that have different return types depending on a keyword argument. I would expect this to be a source of bugs.

BTW, I agree, in general, that a changing output type depending on a keyword is not always the nicest API design. In the RFC discussion, an alternative that was raised was to have two separate methods for this reason (eg query_index vs query_gemetry).
Personally, I see the return_geometries keyword mostly as a good compromise given the already existing STRtree query method. Since an integer or a geometry is quite different, and it's a very explicit keyword, I personally don't think it would cause many bugs in this case.

To also stress another point: for me, this is mostly a discussion for the API on the python side. IMO we can perfectly keep the STRtree code at the C level to only work with integers for simplicity, and at the python level we can provide wrappers to return other values than integer indices.

@jorisvandenbossche
Copy link
Member Author

Friendly ping for this discussion.

@sgillies
Copy link
Contributor

sgillies commented Mar 2, 2021

@jorisvandenbossche in #1094 I have another take on what the Python API in 1.8 and 2.0 might look like. Geopandas needs a method that returns a numpy array of ints (addresses) and avoids Python callbacks so the GIL can be released, yes? What would you think about that being a seperate method entirely?

@songololo
Copy link
Contributor

songololo commented Mar 30, 2021

From a user perspective it would be nice to have an explicit procedure for recovering some or other identifying id associated with a geometry.

Further, it would be nice if this id parameter could be set explicitly to concur with data structures created by the user outside of STRTree.

Use-case: when using various data structures, e.g.:

  • a node in a graph of nodes which has a geom attribute pointing to a Point geometry;
  • a building element in a dictionary of buildings which has a geom attribute pointing to a building footprint Polygon.

Then I need to use an STRTree to recover nearby adjacent nodes or buildings etc., and it is not the geometry that I'm looking for, per se, but the id which I can then use to retrieve whatever attribute I need from my own data structures (which may or may not be a geom or an id.)

I realise this is bit of a hack, but at the moment (existing shapely versions) I overload the geometry object to contain a custom uid so that when STRTree returns the list of geoms I can parse these and pull out the identifying uid which I can then link back to the originating data structures.

@brendan-ward
Copy link
Collaborator

@songololo one way to do this is to create two parallel numpy arrays: one of your id values, and one of your geometries. Create the tree from your geometries, and query it to return indices of the geometries (as proposed here) instead of geometry objects. The returned indices can then be used to index into your array of id values and return your original ids for the geometries that resulted from the query against the tree.

@sgillies
Copy link
Contributor

@songololo in #1112 I have a proposal for a tree that supports bringing your own integer identifiers. This is something I would like to have, too.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 31, 2021

To be clear, also in this PR it would be easy to add the functionality to allow passing custom ids (without storing them in the actual underlying C tree, but just storing them in the python tree object).
I am personally certainly fine with adding such capability in general (it can indeed be more convenient than having to keep track of the mapping yourself).

But so, in my mind, the main question is not so much about capabilities, but the final API question about how to expose those different options to the user (as I listed several ways in the top post):
using a keyword like return_geometries=True/False in a single method (current state of this PR) vs separate methods like query_geoms() and query_items() (as in #1112)?

(personally, I have a slight preference for the keyword instead of the separate methods, but I can certainly live with the other options as well as a compromise)

@songololo
Copy link
Contributor

😂 to muddy the waters a vote for @sgillies suggestion of keeping the return types distinct per method name as this keeps the mental abstractions cleaner and reduces potential downstream errors. By having distinct methods coupled to distinct return types my inclination is that discovery of functionality is also a bit easier for end users because it is sometimes easy to miss or otherwise gloss over the implications of keyword options.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 31, 2021

Just to make it easier to compare and discuss both approaches / PRs, I quickly added the ability to store custom ids with a list of (geom, item) as input in this PR as well. This gives:

>>> from shapely.geometry import Point
>>> from shapely.strtree import STRtree
>>> geoms = [Point(i, i) for i in range(10)]

# passing list of geometries
>>> tree = STRtree(geoms)
>>> tree.query(Point(2, 2).buffer(1.0))
[<shapely.geometry.point.Point at 0x7fc3a21fc550>,
 <shapely.geometry.point.Point at 0x7fc3a21fc520>,
 <shapely.geometry.point.Point at 0x7fc3a21fc4c0>]
# this returns "default" (0..n) indices
>>>> tree.query(Point(2, 2).buffer(1.0), return_geometries=False)
[1, 2, 3]

# passing list of (geom, id) tuples
>>> tree = STRtree((g, i+1) for i, g in enumerate(geoms))
# this now returns the custom (user-provided) ids
>>> tree.query(Point(2, 2).buffer(1.0), return_geometries=False)
[2, 3, 4]

(and of course also in this PR I could easily add query_geoms/query_items methods instead of the keyword switch)

@songololo
Copy link
Contributor

In principle this should also work for str instances as well? (Since int and str are both quite common as unique identifiers it would be nice to be able to use str.)

@caspervdw
Copy link
Collaborator

Just reading up on the discussions here. About the general API, I don’t think it is a good thing to change things unless awe have to. Also, a function should have a return type that is independent of its input parameters. So I like to take side with @sgillies and @songololo on that one. I’d opt for keeping the current method as is, and adding a “query_index” method.

User-supplied labels would be easy to make, but I’d stear clear of the list comprehensions and make the API like “STRTree(geometries, labels=None)” defaulting “labels” to “np.arange(len(geometries))”

Implementation-wise, the strtree will just contain primary keys, which can be mapped as a last step to geometries (or other labels). In pygeos, STRTree().geometries is already there and can be used for that purpose.

@sgillies
Copy link
Contributor

sgillies commented Apr 5, 2021

@caspervdw yes, I agree that we should disentangle geoms and values and use two input sequences (one optional). Thank you. I can see how this helps efficient implementations.

@coveralls
Copy link

coveralls commented Jul 6, 2021

Coverage Status

Coverage increased (+0.05%) to 85.34% when pulling 3bc928f on jorisvandenbossche:strtree-return-indices into a228836 on Toblerity:master.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jul 6, 2021

Sorry for the very slow follow-up here.

About the general API, I don’t think it is a good thing to change things unless awe have to.

For completeness, not changing the API here in Shapely will also mean a change in API for some users, as it will mean a change for people already using pygeos (and a change for users of geopandas' GeoDataFrame.sindex.query(..), unless we keep that as it is, but which gives an inconsistency with Shapely, which is also not ideal).

The question is also what we do with the query_bulk() method, which also returns integer indices (keep as is, or should we also add _items to that name?)

Implementation-wise, the strtree will just contain primary keys, which can be mapped as a last step to geometries (or other labels).

Indeed, and that's what I am also already doing here.
I updated this PR to mimic the latest version of #1112, but using the above approach just storing primary keys in the tree, and mapping that to custom items or geometries as a last step.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jul 16, 2021

@sgillies I updated this PR after your merge of #1112. It doesn't change any user facing API now (after your refactor of that in #1112), but only refactors the implementation (how to store the items).

As mentioned in #1112 (comment), this version is closer to what we would do in the shapely-2.0 branch on top of the C-based STRtree. IMO it is useful to already do this now.

Comment on lines +109 to +117
if geoms is None:
# TODO should we allow this?
geoms = []

geoms = list(geoms)
if geoms and isinstance(geoms[0], tuple):
# TODO should we allow this?
# unzip the list of tuples
geoms, items = list(zip(*geoms))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two TODO items could be removed after #1172

@jorisvandenbossche
Copy link
Member Author

I opened a new, clean PR with just the refactor on top of #1112 that remained in this PR -> #1177. So closing this one which started as a different discussion (the exact API).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

STRTree.query that returns index instead of geometry object
9 participants