Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexing #173

Closed
morganherlocker opened this issue Dec 30, 2014 · 8 comments
Closed

indexing #173

morganherlocker opened this issue Dec 30, 2014 · 8 comments

Comments

@morganherlocker
Copy link
Member

So far I have rejected anything related to indexing. I think of it as a problem for the data layer, rather than the processing layer. That said, I think it should be considered.

pros

  • we could use the indexes internally to speed up many algorithms (turf-inside immediately comes to mind).
  • it is useful functionality, and many users will be unable to figure out how to use index modules in conjunction with turf effectively on their own
  • several indexes (s2 cells, tile quadkeys, and geohashes) are data independent, so they scale across clusters (r-tree, for example, does not)

cons

  • I am not aware of an existing convention for indexes in GeoJSON, besides the id field, which is flimsy for this
  • multiple types of indexes on the same dataset might clash
  • some common indexing schemes do not scale well in clusters (r-tree)
  • the fact that users have a hard time understanding indexes might just mean this allows them to do dumb things faster 😄
  • scalable indexes tend to rely on "universal axioms" about planet earth; turf should work work gracefully without these contextual assumptions (turf is not currently perfect on this point: units of measurement, WGS84 encoding, assumptions about precision with 32bit floats)

candidates for submodules

  • tile-cover: fast and scales across clusters
  • rbush: fast but each box needs the index in memory (this might not matter, depending on some architecture decisions we make)
  • geohash: fast and scalable, but does not work on lines or polygons (although maybe it would work fine with geohashed bbox sw & ne points)
  • tilebelt: minimum bbox quadkey would be fast and scalable (same issues as geohash)

I am 100% on the fence with this. Input greatly appreciated from anyone who understands how indexing works at a low level and people who just need to use them). This has huge architecture implications, so we really need both sides.

@tmcw
Copy link
Collaborator

tmcw commented Dec 30, 2014

I've tinkered with this with the revisions to turf-count and making turf-extent wayfaster - basically for a bunch of algorithms where we repeatedly do something expensive like point-in-polygon, doing point-in-extent first gives a serious boost. But the algorithms that benefit from this, or any other indexing strategy, seem somewhat limited for now.

we could use the indexes internally to speed up many algorithms (turf-inside immediately comes to mind).

Only for repeated checking of the same polygon - the initial processing of the index would be more expensive than the current pip algorithm.


From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.

@morganherlocker
Copy link
Member Author

@tmcw automatic indexes on the fly are probably bad (without some finicky heuristics as to when they should happen). What if indexing required a manual call (something like var myPoly = index(myPoly, 'quadkey')), that left an index embedded in the data. This would be checked for in turf subs that could use it.

From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.

Agree with this 100% from an advanced user's perspective; people who can chop processing into reasonable chunks using indexes. I am thinking more about someone trying to find which of a million points are within a polygon. Right now, a beginner is going to be using turf like PostGIS minus indexes (unusably slow in common real world situations).

For example, what if turf-inside looked for a tile index? If its there, then it does a trivial in-out quadkey check. If not, then it goes ahead with the full algorithm. This might be close to best-of-both-worlds.

@tmcw
Copy link
Collaborator

tmcw commented Dec 30, 2014

I am thinking more about someone trying to find which of a million points are within a polygon.

I think with the small optimizations I've made for this will give a pretty good bump - I'll try this out later today.

@tmcw
Copy link
Collaborator

tmcw commented Jan 23, 2015

@jfirebaugh
Copy link
Member

If there are big unknowns or tradeoffs in the choice of index datastructure, would it be possible to define algorithms like turf-inside in terms of an index protocol rather than a specific concrete implementation? Turf would then provide implementations of the protocol for rbush, tile-cover, tilebelt, etc, and users would supply their choice when instantiating a particular algorithm.

@stepankuzmin
Copy link
Contributor

Hi all! +1 for protocols as @jfirebaugh mentioned (like rbush implementation with https://github.com/jvrousseau/turf-index).

@DenisCarriere
Copy link
Member

An old issue, however very useful topic!

So far I've been using RBush for everything index related and it is VERY efficient on extremely large GeoJSON datasets.

I had to create my own "Ad Hoc" GeoJSON implementation in @turf/line-intersect, this example is a bit TOO complex for what we would need.

It would be useful to have a basic GeoJSON support implementation of RBush called @turf/index.

We would need to add some extra default properties while loading the RBush Tree to be able to retrieve the GeoJSON again (index & geojson) and yes this would destroy your memory if you are handling large Objects... (open for suggestions).

tree.insert({
    minX: minX,
    minY: minY,
    maxX: maxX,
    maxY: maxY,
    index: index,
    geojson: geojson
})

It would @returns {RBush Tree}, that way TurfJS only needs to implement loading GeometryCollection, FeatureCollection & Feature.

@DenisCarriere
Copy link
Member

Since there are many different ways to implement an index (r-tree, geohashes, etc...). It's best to leave these types of index operations/implementations outside of TurfJS.

For example, the r-tree index implementation rbush or geojson-rbush can be used.

Without going too much outside the TurfJS scope, we could include the bbox attribute to the output results for a few modules (ex: @turf/line-segment) without sacrificing too much performance loss.

The benefit of having this bbox attribute already existing to each feature would reduce significantly the number of @turf/bbox calculations done when building an geohash/ r-tree index (in the case that the geometry changes, then the bbox attribute would have to be re-calculated).
http://geojson.org/geojson-spec.html#bounding-boxes

GeoJSON BBox attribute example

  { "type": "Feature",
    "properties": {},
    "bbox": [-10.0, -10.0, 10.0, 10.0],
    "geometry": {
      "type": "Polygon",
      "coordinates": [[
        [-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, 10.0]
        ]]
      }
    ...
    }

@tmcw @morganherlocker Fair to say we can close this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants