
GeoShape precision #2803

Closed
chilling opened this issue Mar 20, 2013 · 11 comments

Comments

@chilling
Contributor

GeoShape precision

The `geo_shape` precision can currently only be set via the `tree_levels` option. Since setting this option requires detailed knowledge of the underlying data structures, a new option should be defined that sets the levels of these structures by precision. Such a precision should in turn be expressed as a distance, e.g. `50m`.

chilling added a commit to chilling/elasticsearch that referenced this issue Mar 20, 2013
So far the `geo_shape` precision could only be set via `tree_levels`. A new option `precision` now allows the levels of the underlying tree structures to be set by a distance such as `50m`. The default unit for this option is meters. If both `tree_levels` and `precision` are set, the greater of the two resulting depths is used.

## Example
```sh
curl -XPUT 'http://127.0.0.1:9200/myindex/' -d '{
    "mappings" : {
        "type1": {
            "dynamic": "false",
            "properties": {
                "location" : {
                    "type" : "geo_shape",
                    "geohash" : "true",
                    "store" : "yes",
                    "precision":"50m"
                }
            }
        }
    }
}'
```

## Changes
- GeoUtils defines the [WGS84](http://en.wikipedia.org/wiki/WGS84) reference ellipsoid of the Earth
- DistanceUnits now refer to a more precise definition of the Earth's circumference (see the constants sketch below)
- DistanceUnits for inch, yard and meter have been defined
- The default levels in GeoShapeFieldMapper are set to a 50m precision
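
For reference, a minimal sketch of the constants this involves (the actual GeoUtils/DistanceUnit code may organize them differently); the WGS84 axes and the equatorial circumference derived from them are standard values:

```java
// Sketch only: WGS84 reference values and metric conversions as described above.
// Not the actual Elasticsearch GeoUtils/DistanceUnit source.
public final class GeoConstants {
    public static final double WGS84_SEMI_MAJOR_AXIS = 6_378_137.0;        // meters
    public static final double WGS84_SEMI_MINOR_AXIS = 6_356_752.314_245;  // meters
    // More precise equatorial circumference, roughly 40,075,017 m
    public static final double EARTH_EQUATOR_CIRCUMFERENCE = 2 * Math.PI * WGS84_SEMI_MAJOR_AXIS;

    // Meters per unit, used when parsing distances such as "50m", "3in" or "2yd"
    public static final double METERS_PER_INCH = 0.0254;
    public static final double METERS_PER_YARD = 0.9144;
    public static final double METERS_PER_MILE = 1_609.344;

    private GeoConstants() {}
}
```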

Closes elastic#2803
@s1monw
Contributor

s1monw commented Mar 20, 2013

Cool stuff - this would also resolve #2756, right?

@chilling
Contributor Author

@s1monw yes - by setting the default precision to 50m, quadtrees will have a depth of 21 levels and the geohash tree a depth of 9 levels. This will reduce the index size notably.
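
A rough back-of-the-envelope check of these numbers (not the actual GeoUtils code): assuming an equatorial circumference of about 40,075 km and counting the root "world" cell as the first level, 50m works out to 21 quadtree levels and 9 geohash characters:

```java
// Sketch only: estimate tree depth from a precision in meters.
public class PrecisionToLevels {
    static final double EARTH_CIRCUMFERENCE_M = 40_075_016.69;

    // Quadtree: every additional level halves the cell width.
    static int quadTreeLevels(double precisionMeters) {
        int splits = (int) Math.ceil(Math.log(EARTH_CIRCUMFERENCE_M / precisionMeters) / Math.log(2));
        return splits + 1; // assumption: the root cell counts as level 1
    }

    // Geohash: each character adds 5 bits, of which ceil(5n/2) go to longitude,
    // so after n characters the cell width is circumference / 2^ceil(5n/2).
    static int geoHashLevels(double precisionMeters) {
        int chars = 0;
        double width = EARTH_CIRCUMFERENCE_M;
        while (width > precisionMeters) {
            chars++;
            width = EARTH_CIRCUMFERENCE_M / Math.pow(2, Math.ceil(5.0 * chars / 2));
        }
        return chars + 1; // assumption: the root cell counts as level 1
    }

    public static void main(String[] args) {
        System.out.println(quadTreeLevels(50)); // 21
        System.out.println(geoHashLevels(50));  // 9
    }
}
```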

@jillesvangurp
Contributor

For geo hashes, it is possible to calculate a sensible geohash length given a latitude/longitude. There's a bit of variation depending on the latitude, though, since the distance covered by a degree of longitude varies with the latitude and gets smaller the closer you get to the poles.

I think the quad tree has the same issue. The rest of this post is specific to geohashes but should mostly apply to the quad tree implementation as well.

I have a little library that includes a function to calculate the appropriate geohash length given a distance in meters and a coordinate: https://github.com/jillesvangurp/geogeometry I think spatial4j has similar functionality.

I used it to generate an overview of how the accuracy (in meters) of a given geohash length changes with the latitude:

Each cell is an accuracy in meters; the column shows the geohash length required to achieve it at that latitude:

| latitude | length 10 | length 9 | length 8 | length 7 | length 6 | length 5 | length 4 |
|---------:|----------:|---------:|---------:|---------:|---------:|---------:|---------:|
| 0°  | 1, 5 | 39 | 153 | 1222 | 4887 | 39092 | - |
| 10° | 1, 5 | 38 | 151 | 1204 | 4813 | 38517 | - |
| 20° | 1, 5 | 36 | 144 | 1148 | 4592 | 36767 | - |
| 30° | 1, 5 | 34 | 133 | 1058 | 4234 | 33895 | - |
| 40° | 1, 5 | 30 | 117 | 936  | 3744 | 29989 | - |
| 50° | 1, 5 | 25 | 99  | 786  | 3144 | 25169 | - |
| 60° | 1, 5 | 20 | 77  | 611  | 2445 | 19581 | 80388 |
| 70° | 1, 5 | 14 | 53  | 418  | 1675 | 13396 | 56275 |
| 80° | 1, 5 | 7  | 27  | 213  | 851  | 6802  | 30506 |
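
For illustration, a rough approximation of how such a length can be derived (this is not the geogeometry API; it only models the east-west shrinkage of geohash cells with cos(latitude)):

```java
// Sketch only: pick the shortest geohash whose cell width at the given latitude
// is below the requested accuracy. Geohash length n spends ceil(5n/2) bits on longitude.
public class GeoHashLength {
    static final double EARTH_CIRCUMFERENCE_M = 40_075_017.0;

    static int suitableHashLength(double accuracyMeters, double latitude) {
        double cosLat = Math.cos(Math.toRadians(latitude));
        for (int length = 1; length <= 12; length++) {
            double lonBits = Math.ceil(5.0 * length / 2);
            double cellWidthMeters = cosLat * EARTH_CIRCUMFERENCE_M / Math.pow(2, lonBits);
            if (cellWidthMeters < accuracyMeters) {
                return length;
            }
        }
        return 12; // finest length considered in this sketch
    }

    public static void main(String[] args) {
        System.out.println(suitableHashLength(100, 0));  // 8 at the equator
        System.out.println(suitableHashLength(100, 60)); // 7 at 60 degrees latitude
    }
}
```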

One of the problems with the current implementation is that it doesn't do any kind of distance filtering on the results. This means that `tree_levels` is the only way of controlling the accuracy, and it is fixed as part of the mapping. Typically, increasing the levels to improve accuracy is far more expensive than simply choosing a coarse enough level to get decent accuracy and then doing some post-processing.

I have another project, https://github.com/jillesvangurp/geokv, that I use for batch processing geospatial data. It implements geospatial search using my geohash library. Typically, I break things down into blocks of a few hundred meters (i.e. geohash length 7) and use simple distance-based filtering to throw out things I'm not interested in. So: load everything within the length-7 hashes that cover the area of interest (circle or polygon) and throw out anything that falls outside the shape.

So you control result set size with the hash length and accuracy with the filtering. If you want to do less work filtering, use a larger hash length. If you want to do less work moving around data, use a smaller hash length.
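
A minimal sketch of that "coarse grid plus distance post-filter" pattern (not the geokv API; the candidate list stands in for whatever is stored under the covering geohash cells):

```java
// Sketch only: candidates come from the cells covering the query area,
// then a haversine check discards everything outside the radius.
import java.util.ArrayList;
import java.util.List;

public class GridThenFilter {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    static class Point {
        final double lat, lon;
        Point(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    static double haversineMeters(Point a, Point b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
    }

    // 'candidates' would be everything indexed under the length-7 hashes covering the circle.
    static List<Point> within(List<Point> candidates, Point center, double radiusMeters) {
        List<Point> result = new ArrayList<>();
        for (Point p : candidates) {
            if (haversineMeters(center, p) <= radiusMeters) {
                result.add(p);
            }
        }
        return result;
    }
}
```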

For large data sets that span the globe with tens to hundreds of thousands of objects in cities like Berlin, specifying an accuracy of 1m would translate into a hash length of 10 (I think this is the default in ES currently). Even very modest data sets result in extremely large indices with this. I've done some experimentation with OpenStreetMap data: https://www.dropbox.com/sh/ulr6s08km87o17o/dCNsRgiyTT. Indexing this 35 MB file with the default settings results in nearly a GB of index data.

So, having just an accuracy setting might be kind of deceptive. What is really needed is a way to control the number of terms generated at indexing and querying time, and to enable/disable any post-processing. For most users a length of 7 or 8 with post-processing would probably be vastly superior to indexing everything at length 10 (or worse).

@s1monw
Contributor

s1monw commented Mar 20, 2013

I agree @jillesvangurp, there is a lot to improve here. I think we will open follow-up issues to address your concerns. For now I think having a precision in meters at the equator is a net win for lots of users. We will go off and explore the ability to dynamically adjust the levels of the prefix tree based on the latitude of the incoming object / search request / shape etc., and go even further and allow precision to be set based on (lat, distance_precision) pairs so you can adjust it as you go north or south. Does this make sense?
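
Purely as illustration of the (lat, distance_precision) idea - nothing like this exists in Elasticsearch - a resolver could pick the precision for an incoming shape from the latitude band it falls into:

```java
// Sketch only: hypothetical latitude-band -> precision lookup; the band bounds
// and precision values below are made up for illustration.
import java.util.NavigableMap;
import java.util.TreeMap;

public class LatitudeBandPrecision {
    // lower bound of the band in absolute degrees of latitude -> precision in meters
    private final NavigableMap<Double, Double> bands = new TreeMap<>();

    LatitudeBandPrecision() {
        bands.put(0.0, 50.0);   // near the equator: 50m
        bands.put(45.0, 70.0);  // mid latitudes: coarser cells give similar ground accuracy
        bands.put(70.0, 140.0); // high latitudes
    }

    double precisionMetersFor(double latitude) {
        return bands.floorEntry(Math.abs(latitude)).getValue();
    }
}
```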

@jillesvangurp
Contributor

That would work.

@s1monw
Contributor

s1monw commented Mar 20, 2013

cool stuff...

@chilling
Contributor Author

Thanks a lot for your comments. You are absolutely right that the definition of precision might be deceptive, since shorter geohashes / fewer tree levels provide the same accuracy at latitudes away from the equator. My implementation so far guarantees that the cell size will not exceed the distance defined by precision. But to take advantage of this fact and reduce the index size, we need support in Lucene. I'm going to work on this.

@s1monw closed this as completed in f08d458 on Mar 20, 2013
@s1monw
Contributor

s1monw commented Mar 20, 2013

I marked this as breaking since we changed the defaults to 50m at the equator for both quadtree and geohash.

@dsmiley
Contributor

dsmiley commented Feb 19, 2014

FWIW on the Solr side, you specify precision as a distance, and then under the hood, Lucene-spatial's SpatialPrefixTree (geohash or quad impl) figures out how many levels are needed to satisfy that for any location on the globe, be it the equator or pole. Sure, it might be more levels than needed for stuff at the pole but I think it's not worth bothering trying to optimize that.

@jillesvangurp had some interesting comments on spatial information-retrieval algorithms that use a combination of a grid and, at a certain detail level, a simple iteration/scan to check the distance. But guess what? Lucene-spatial's RecursivePrefixTreeStrategy's Intersects & IsWithin already do this! It's been there in various forms since 2011 -- a long time. What you can configure is the "prefixGridScanLevel", which is the level at which the algorithm switches from recursive grid decomposition to scanning. Does ES expose it? It should. By default it's 4 up from the bottom, which was about right for geohashes and the benchmark I ran long ago. I have plans to optimize this to use the docFreq as a better guide rather than a fixed number. And what I'd like to do is make a better tree encoding that won't generate intermediate cells at the bottom few levels, which usually won't be seek()'ed to anyway due to the scanning algorithm kicking in above that.
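
For reference, roughly how the scan level is configured when using Lucene spatial directly (Lucene 4.x-era class names; exact packages and signatures may differ between versions):

```java
// Sketch only: build a geohash prefix tree and set the level at which
// RecursivePrefixTreeStrategy switches from grid decomposition to term scanning.
import com.spatial4j.core.context.SpatialContext;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;

public class ScanLevelExample {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;
        GeohashPrefixTree grid = new GeohashPrefixTree(ctx, 9); // ~50m cells at the equator
        RecursivePrefixTreeStrategy strategy = new RecursivePrefixTreeStrategy(grid, "location");
        // Mirrors the default of "4 up from the bottom" mentioned above.
        strategy.setPrefixGridScanLevel(grid.getMaxLevels() - 4);
    }
}
```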

My closing remark is that people should please submit issues to Lucene's JIRA flagged with the spatial module for improving Lucene-spatial.

@jillesvangurp
Contributor

+1 for @dsmiley

Right now you have to choose between accuracy or performance. We should be able to get both (within reason).

@dsmiley
Contributor

dsmiley commented Feb 19, 2014

I feel that for indexed point data, accuracy & performance have been very good for a long time. Performance can be improved, sure, but IMO it's darned good today. The only knock on accuracy is LUCENE-4978, but there are work-arounds: either index a micro-circle/box instead of a point, or slightly buffer the query shape.

For indexing non-point data (e.g. polygons), it's another story -- there's a big trade-off on accuracy & precision. I plan to address that in ES on issue #2361 with Lucene 4.7's new SerializedDVStrategy. Subscribe to that issue to stay informed.
