
GeoShape precision #2803

Closed
chilling opened this issue Mar 20, 2013 · 11 comments

Comments

@chilling
Contributor

GeoShape precision

The `geo_shape` precision can currently only be set via the `tree_levels` option. Since setting this option requires detailed knowledge of the underlying data structures, a new option should be defined that sets the levels of these structures by precision. Such a precision should in turn be expressed as a distance, e.g. `50m`.

chilling added a commit to chilling/elasticsearch that referenced this issue Mar 20, 2013
So far the `geo_shape` precision could only be set via `tree_levels`. A new option `precision` now allows the levels of the underlying tree structures to be set by a distance such as `50m`. The default unit for this option is meters. If both `tree_levels` and `precision` are set, the greater of the two resulting depths is used.

## Example
```sh
curl -XPUT 'http://127.0.0.1:9200/myindex/' -d '{
    "mappings" : {
        "type1": {
            "dynamic": "false",
            "properties": {
                "location" : {
                    "type" : "geo_shape",
                    "geohash" : "true",
                    "store" : "yes",
                    "precision":"50m"
                }
            }
        }
    }
}'
```

## Changes
- GeoUtils defines the [WGS84](http://en.wikipedia.org/wiki/WGS84) reference ellipsoid of the Earth
- DistanceUnits now refer to a more precise definition of the Earth's circumference (see the constants sketch below)
- DistanceUnits for inch, yard and meter have been defined
- The default levels in GeoShapeFieldMapper are set to a 50m precision
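
For reference, a minimal sketch of the constants this involves (the actual GeoUtils/DistanceUnit code may organize them differently); the WGS84 axes and the equatorial circumference derived from them are standard values:

```java
// Sketch only: WGS84 reference values and metric conversions as described above.
// Not the actual Elasticsearch GeoUtils/DistanceUnit source.
public final class GeoConstants {
    public static final double WGS84_SEMI_MAJOR_AXIS = 6_378_137.0;        // meters
    public static final double WGS84_SEMI_MINOR_AXIS = 6_356_752.314_245;  // meters
    // More precise equatorial circumference, roughly 40,075,017 m
    public static final double EARTH_EQUATOR_CIRCUMFERENCE = 2 * Math.PI * WGS84_SEMI_MAJOR_AXIS;

    // Meters per unit, used when parsing distances such as "50m", "3in" or "2yd"
    public static final double METERS_PER_INCH = 0.0254;
    public static final double METERS_PER_YARD = 0.9144;
    public static final double METERS_PER_MILE = 1_609.344;

    private GeoConstants() {}
}
```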

Closes elastic#2803
@s1monw
Contributor

s1monw commented Mar 20, 2013

Cool stuff - this would also resolve #2756, right?

@chilling
Contributor Author

@s1monw yes - by setting the default precision to 50m, quadtrees will have a depth of 21 levels and the geohash tree a depth of 9 levels. This will reduce the index size notably.
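
A rough back-of-the-envelope check of these numbers (not the actual GeoUtils code): assuming an equatorial circumference of about 40,075 km and counting the root "world" cell as the first level, 50m works out to 21 quadtree levels and 9 geohash characters:

```java
// Sketch only: estimate tree depth from a precision in meters.
public class PrecisionToLevels {
    static final double EARTH_CIRCUMFERENCE_M = 40_075_016.69;

    // Quadtree: every additional level halves the cell width.
    static int quadTreeLevels(double precisionMeters) {
        int splits = (int) Math.ceil(Math.log(EARTH_CIRCUMFERENCE_M / precisionMeters) / Math.log(2));
        return splits + 1; // assumption: the root cell counts as level 1
    }

    // Geohash: each character adds 5 bits, of which ceil(5n/2) go to longitude,
    // so after n characters the cell width is circumference / 2^ceil(5n/2).
    static int geoHashLevels(double precisionMeters) {
        int chars = 0;
        double width = EARTH_CIRCUMFERENCE_M;
        while (width > precisionMeters) {
            chars++;
            width = EARTH_CIRCUMFERENCE_M / Math.pow(2, Math.ceil(5.0 * chars / 2));
        }
        return chars + 1; // assumption: the root cell counts as level 1
    }

    public static void main(String[] args) {
        System.out.println(quadTreeLevels(50)); // 21
        System.out.println(geoHashLevels(50));  // 9
    }
}
```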

@jillesvangurp
Contributor

For geo hashes, it is possible to calculate a sensible geohash length given a latitude/longitude. There's a bit of variation depending on the latitude, though, since the distance covered by a degree of longitude varies with the latitude and gets smaller the closer you get to the poles.

I think the quad tree has the same issue. The rest of this post is specific to geohashes but should mostly apply to the quad tree implementation as well.

I have a little library that includes a function to calculate the appropriate geohash length given a distance in meters and a coordinate: https://github.com/jillesvangurp/geogeometry I think spatial4j has similar functionality.

I used it to generate an overview of how the accuracy (in meters) of a given geohash length changes with the latitude:

Each cell is an accuracy in meters; the column shows the geohash length required to achieve it at that latitude:

| latitude | length 10 | length 9 | length 8 | length 7 | length 6 | length 5 | length 4 |
|---------:|----------:|---------:|---------:|---------:|---------:|---------:|---------:|
| 0°  | 1, 5 | 39 | 153 | 1222 | 4887 | 39092 | - |
| 10° | 1, 5 | 38 | 151 | 1204 | 4813 | 38517 | - |
| 20° | 1, 5 | 36 | 144 | 1148 | 4592 | 36767 | - |
| 30° | 1, 5 | 34 | 133 | 1058 | 4234 | 33895 | - |
| 40° | 1, 5 | 30 | 117 | 936  | 3744 | 29989 | - |
| 50° | 1, 5 | 25 | 99  | 786  | 3144 | 25169 | - |
| 60° | 1, 5 | 20 | 77  | 611  | 2445 | 19581 | 80388 |
| 70° | 1, 5 | 14 | 53  | 418  | 1675 | 13396 | 56275 |
| 80° | 1, 5 | 7  | 27  | 213  | 851  | 6802  | 30506 |
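
For illustration, a rough approximation of how such a length can be derived (this is not the geogeometry API; it only models the east-west shrinkage of geohash cells with cos(latitude)):

```java
// Sketch only: pick the shortest geohash whose cell width at the given latitude
// is below the requested accuracy. Geohash length n spends ceil(5n/2) bits on longitude.
public class GeoHashLength {
    static final double EARTH_CIRCUMFERENCE_M = 40_075_017.0;

    static int suitableHashLength(double accuracyMeters, double latitude) {
        double cosLat = Math.cos(Math.toRadians(latitude));
        for (int length = 1; length <= 12; length++) {
            double lonBits = Math.ceil(5.0 * length / 2);
            double cellWidthMeters = cosLat * EARTH_CIRCUMFERENCE_M / Math.pow(2, lonBits);
            if (cellWidthMeters < accuracyMeters) {
                return length;
            }
        }
        return 12; // finest length considered in this sketch
    }

    public static void main(String[] args) {
        System.out.println(suitableHashLength(100, 0));  // 8 at the equator
        System.out.println(suitableHashLength(100, 60)); // 7 at 60 degrees latitude
    }
}
```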

One of the problems with the current implementation is that it doesn't do any kind of distance filtering on the results. This means that `tree_levels` is the only way of controlling the accuracy, and it is fixed as part of the mapping. Typically, increasing the levels to improve accuracy is far more expensive than simply choosing a coarse enough level to get decent accuracy and then doing some post-processing.

I have another project, https://github.com/jillesvangurp/geokv, that I use for batch processing geospatial data. It implements geospatial search using my geohash library. Typically, I break things down into blocks of a few hundred meters (i.e. geohash length 7) and use simple distance-based filtering to throw out things I'm not interested in. So: load everything within the length-7 hashes that cover the area of interest (circle or polygon) and throw out anything that falls outside the shape.

So you control result set size with the hash length and accuracy with the filtering. If you want to do less work filtering, use a larger hash length. If you want to do less work moving around data, use a smaller hash length.
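
A minimal sketch of that "coarse grid plus distance post-filter" pattern (not the geokv API; the candidate list stands in for whatever is stored under the covering geohash cells):

```java
// Sketch only: candidates come from the cells covering the query area,
// then a haversine check discards everything outside the radius.
import java.util.ArrayList;
import java.util.List;

public class GridThenFilter {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    static class Point {
        final double lat, lon;
        Point(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    static double haversineMeters(Point a, Point b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
    }

    // 'candidates' would be everything indexed under the length-7 hashes covering the circle.
    static List<Point> within(List<Point> candidates, Point center, double radiusMeters) {
        List<Point> result = new ArrayList<>();
        for (Point p : candidates) {
            if (haversineMeters(center, p) <= radiusMeters) {
                result.add(p);
            }
        }
        return result;
    }
}
```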

For large data sets that span the globe with tens to hundreds of thousands of objects in cities like Berlin, specifying an accuracy of 1m would translate into a hash length of 10 (I think this is the default in ES currently). Even very modest data sets result in extremely large indices with this. I've done some experimentation with OpenStreetMap data: https://www.dropbox.com/sh/ulr6s08km87o17o/dCNsRgiyTT. Indexing this 35 MB file with the default settings results in nearly a GB of index data.

So, having just an accuracy setting might be kind of deceptive. What is really needed is a way to control the number of terms generated at indexing and querying time, and to enable/disable any post-processing. For most users a length of 7 or 8 with post-processing would probably be vastly superior to indexing everything at length 10 (or worse).

@s1monw
Contributor

s1monw commented Mar 20, 2013

I agree @jillesvangurp, there is a lot to improve here. I think we will open follow-up issues to address your concerns. For now I think having a precision in meters at the equator is a net win for lots of users. We will go off and explore the ability to dynamically adjust the levels of the prefix tree based on the latitude of the incoming object / search request / shape etc., and go even further and allow precision to be set based on (lat, distance_precision) pairs so you can adjust it as you go north or south. Does this make sense?
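
Purely as illustration of the (lat, distance_precision) idea - nothing like this exists in Elasticsearch - a resolver could pick the precision for an incoming shape from the latitude band it falls into:

```java
// Sketch only: hypothetical latitude-band -> precision lookup; the band bounds
// and precision values below are made up for illustration.
import java.util.NavigableMap;
import java.util.TreeMap;

public class LatitudeBandPrecision {
    // lower bound of the band in absolute degrees of latitude -> precision in meters
    private final NavigableMap<Double, Double> bands = new TreeMap<>();

    LatitudeBandPrecision() {
        bands.put(0.0, 50.0);   // near the equator: 50m
        bands.put(45.0, 70.0);  // mid latitudes: coarser cells give similar ground accuracy
        bands.put(70.0, 140.0); // high latitudes
    }

    double precisionMetersFor(double latitude) {
        return bands.floorEntry(Math.abs(latitude)).getValue();
    }
}
```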

@jillesvangurp
Contributor

That would work.

@s1monw
Contributor

s1monw commented Mar 20, 2013

cool stuff...

@chilling
Contributor Author

Thanks a lot for your comments. You are absolutely right that the definition of precision might be deceptive, since shorter geohashes / fewer tree levels provide the same accuracy at latitudes away from the equator. My implementation so far guarantees that the cell size will not exceed the distance defined by precision. But to take advantage of this fact and reduce the index size, we need support in Lucene. I'm going to work on this.

@s1monw closed this as completed in f08d458 on Mar 20, 2013
@s1monw
Contributor

s1monw commented Mar 20, 2013

I marked this as breaking since we changed the defaults to 50m at the equator for both quadtree and geohash.

@dsmiley
Contributor

dsmiley commented Feb 19, 2014

FWIW on the Solr side, you specify precision as a distance, and then under the hood, Lucene-spatial's SpatialPrefixTree (geohash or quad impl) figures out how many levels are needed to satisfy that for any location on the globe, be it the equator or pole. Sure, it might be more levels than needed for stuff at the pole but I think it's not worth bothering trying to optimize that.

@jillesvangurp had some interesting comments on spatial information-retrieval algorithms that use a combination of a grid and, at a certain detail level, a simple iteration/scan to check the distance. But guess what? Lucene-spatial's RecursivePrefixTreeStrategy's Intersects & IsWithin already do this! It's been there in various forms since 2011 -- a long time. What you can configure is the "prefixGridScanLevel", which is the level at which the algorithm switches from recursive grid decomposition to scanning. Does ES expose it? It should. By default it's 4 up from the bottom, which was about right for geohashes and the benchmark I ran long ago. I have plans to optimize this to use the docFreq as a better guide rather than a fixed number. And what I'd like to do is make a better tree encoding that won't generate intermediate cells at the bottom few levels, which usually won't be seek()'ed to anyway due to the scanning algorithm kicking in above that.
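
For reference, roughly how the scan level is configured when using Lucene spatial directly (Lucene 4.x-era class names; exact packages and signatures may differ between versions):

```java
// Sketch only: build a geohash prefix tree and set the level at which
// RecursivePrefixTreeStrategy switches from grid decomposition to term scanning.
import com.spatial4j.core.context.SpatialContext;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;

public class ScanLevelExample {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;
        GeohashPrefixTree grid = new GeohashPrefixTree(ctx, 9); // ~50m cells at the equator
        RecursivePrefixTreeStrategy strategy = new RecursivePrefixTreeStrategy(grid, "location");
        // Mirrors the default of "4 up from the bottom" mentioned above.
        strategy.setPrefixGridScanLevel(grid.getMaxLevels() - 4);
    }
}
```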

My closing remark is that people should please submit issues to Lucene's JIRA flagged with the spatial module for improving Lucene-spatial.

@jillesvangurp
Contributor

+1 for @dsmiley

Right now you have to choose between accuracy or performance. We should be able to get both (within reason).

@dsmiley
Contributor

dsmiley commented Feb 19, 2014

I feel that for indexed point data, accuracy & performance have been very good for a long time. Performance can be improved, sure, but IMO it's darned good today. The only knock on accuracy is LUCENE-4978, but there are work-arounds: either index a micro-circle/box instead of a point, or slightly buffer the query shape.

For indexing non-point data (e.g. polygons), it's another story -- there's a big trade-off on accuracy & precision. I plan to address that in ES on issue #2361 with Lucene 4.7's new SerializedDVStrategy. Subscribe to that issue to stay informed.
