Performance
Geo::Index is intended for stand-alone applications that need a way to quickly perform proximity searches on relatively small datasets (at most a few million points). Typical search speeds are three to five orders of magnitude faster than a linear search. For larger datasets, and for applications running in a server environment, something like PostGIS is more appropriate.
Indexing speed is about 50,000 points per second when `levels` is 20. Search
speeds are highly dependent on the data indexed and on search parameters but are
typically in the neighborhood of a few thousand searches per second (see
Benchmark results, below).
Memory usage tends to be rather high; for 1,000,000 points the index is ~3.2 GB for tightly clustered points or ~4.6 GB when the points are spread evenly world-wide. The size of an index grows linearly with each added point at a rate of about 4 kB per point. When a point is first encountered during a search its in-index size increases by about 100 bytes (this happens only once per point).
Since performance is so dependent on data and usage, the user is encouraged
to test all available options while developing their application and choose
the one that works fastest. The `examples/benchmark.pl` script included with
this module may be helpful for measuring this module's performance.
Here are some guidelines for best results:
- Requesting results as a list reference is faster than asking for a plain list.
  That is, `$results = Search(...);` is faster than `@results = Search(...);`.
- Post-conditions are faster than pre-conditions.
  Benchmarking has shown that the cost of the Perl function call is higher than
  that of the distance-related code. Thus there is probably no reason to use
  pre-conditions. Put concisely,
  `$results = $index->Search( $point, { post_condition => $code_ref } );`
  is faster than
  `$results = $index->Search( $point, { pre_condition => $code_ref } );`
- Choose an appropriate value for `levels` when creating the index.
  The `Search(…)` method performs best when the most detailed level of the
  index has a smaller physical size than the radius of a typical search. For
  example, if your searches are typically for points within 100 meters then
  `levels` should be set to at least 18 (~75 meters at the equator) to yield
  best results; if typical searches have a 10 meter radius then `levels`
  should be 22. The `Closest(…)` method works best when the most detailed
  level of the index contains a single point per tile and search points lie
  close to potential result points. To help tune the `levels` value, the
  `GetConfiguration( )` method can be used to find out the physical size of
  the most detailed level along with statistics on the number of points per
  index tile. A script showing the use of this method can be found in
  `examples/show_configuration.pl`.
- Use the `quick_results` option when possible.
  Filtering points and combining them into a single, flat list can be very
  expensive. Many applications can tolerate getting additional points beyond
  those matching the search criteria. An example of this is drawing points on
  a map; if points are clipped to the visible area when they are drawn it may
  not matter that some of them lie outside of it.
- Use `Search(…)` instead of `Closest(…)` when you have a search radius.
  The `Closest(…)` method is most efficient when no search radius is specified
  or when result points lie very close to the search point. Closeness is
  relative to the tile size of the most detailed index level; for the default
  index depth (20), "very close" is roughly within about 100 meters. When
  clipping results to a maximal radius it is typically much faster to use
  `Search(…)` with the `sort_results` and `max_results` options*. For example,
  to find the closest `$n` points within distance `$d` of a point `$p` it is
  usually much faster to use

      %options = (
          max_results    => $n,
          radius         => $d,
          sort_results   => 1,
          post_condition => sub { return $_[0] != $_[1]; }
      );
      $results = $index->Search( $p, \%options );

  instead of

      $results = $index->Closest( $p, $n, { radius => $d } );
* The `post_condition` shown in the example omits the search point from the
results and is needed to fully emulate the behavior of `Closest(…)`.
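The guidelines above can be combined into a short sketch. This is illustrative only: the exact shape of the `new(…)` options call, the `lat`/`lon` point keys, and the contents of the hash returned by `GetConfiguration( )` are assumptions based on the module's documentation; see `examples/show_configuration.pl` for authoritative usage.

```perl
use strict;
use warnings;
use Geo::Index;

# Tile sizes at level 18 are ~75 m at the equator, so this depth suits
# searches with roughly 100 m radii (per the guidelines above).
my @points = (
    { lat => 51.5074, lon =>  -0.1278 },   # London
    { lat => 48.8566, lon =>   2.3522 },   # Paris
    { lat => 40.7128, lon => -74.0060 },   # New York
);
my $index = Geo::Index->new( { levels => 18 } );
$index->IndexPoints( \@points );

# Inspect the most detailed tile size and points-per-tile statistics
# before settling on a 'levels' value.
my $config = $index->GetConfiguration();
printf "%-24s %s\n", $_, $config->{$_} for sort keys %$config;

# A radius search using quick_results: filtering and flattening are
# skipped, so points outside the radius may also be returned.
my $results = $index->Search(
    { lat => 51.5, lon => -0.12 },
    { radius => 100_000, quick_results => 1 }
);
```

Whether the unfiltered results from `quick_results` are acceptable depends on the application; drawing code that clips to a viewport, for instance, can usually tolerate them.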
Both `Search(…)` and `SearchByBounds(…)` are very fast since
they can find the relevant index tiles in linear time. Since the time needed to
filter the results is directly proportional to the number of points retrieved
from the index, best performance occurs when the size of the most detailed tiles
is smaller than that of the typical search radius or search bounds.
Searches run using `Closest(…)` start from the most detailed level and work
upwards. Best performance occurs when a result is found in the first few
iterations. If the first iteration that finds points yields a large number of
points then performance will suffer, since the distance to each of these points
will need to be measured to find the closest. For similar reasons, requesting a
large number of closest points in a single call will also impact performance.
The `Farthest(…)` method is largely a wrapper for `Closest(…)` and thus
exhibits similar behavior.
Some functions within Geo::Index have optional implementations written in C. If these are active (by default they are whenever possible) searches typically run 25% to 50% faster.
Whenever possible Geo::Index uses numeric index keys. Compared to text index
keys, numeric keys give about 30% faster searches and about 50% smaller index
memory usage. The downside to numeric keys is that they are less legible to
humans while debugging. (Whether numeric or text keys are used can be changed
by setting the appropriate value at the top of `Geo/Index.pm`.)
Typical benchmark results run on a modern workstation using numeric keys and double-precision C low-level code with the index containing 1,000,000 points are as follows:
- Points can be added to an index at the rate of about 50,000 per second.
- Typical searches returning values run at about 25,000 to 50,000 searches per
  second. Worst-case performance is under 50 searches per second, and searches
  returning no results run at over 100,000 searches per second. The overhead
  of traversing the results is fairly negligible.
  Quick searches run at 120,000 to 150,000 searches per second. Actually doing
  anything with the results slows things down a lot. Including traversal of
  the results, a typical quick search runs at 40,000 to 100,000 searches per
  second with the worst case being about 80 searches per second.
  If distances to the result points are not needed, quick searches are
  typically about 75% faster than normal ones, albeit with about 5 times as
  many results being returned.
- For the `SearchByBounds(…)` method run time correlates with the size of the
  bounding box, with smaller bounding boxes typically yielding faster run
  times. A fairly typical search yielding about 50 results runs at about
  10,000 searches per second in normal mode and about 30,000 searches per
  second in quick mode. A nearly worst-case example is a search returning
  100,000 points; this will run at about 5 searches per second in normal mode
  or about 8,000 searches per second in quick mode.
- For the `Closest(…)` method the highest performance is seen when there are
  result points close to the search point. Search speeds for the single
  closest point are typically in excess of 20,000 per second for close-by
  results or about 8,000 per second when results are far away. Worst-case
  speeds of about 1,000 searches per second occur when all indexed points are
  in the hemisphere opposite the search point.
- For the `Farthest(…)` method the highest performance is seen when there are
  result points nearly antipodal to the search point. Search speeds for the
  single farthest point are typically in excess of 20,000 per second when
  nearly-antipodal points exist. Worst-case speeds of about 1,000 searches per
  second occur when all indexed points are in the same hemisphere as the
  search point.
Note that the numbers above are approximate and are highly dependent on the data being searched, the type of search being run, and on the number of results returned. Actual searches may be an order of magnitude or more slower.
A sample benchmark run can be found in `examples/sample_benchmark.txt`. To run
the benchmarks yourself you can run `examples/benchmark.pl`. It needs the
Devel::Size and Time::HiRes modules installed, and a single run takes about 8
minutes.
Since Perl constants cannot be changed from the command line you will need to
edit `Geo/Index.pm` to force the use of numeric keys, packed numeric keys,
or text keys. This can be done by uncommenting the appropriate lines at the
head of the file (look for `USE_NUMERIC_KEYS` and `USE_PACKED_KEYS`). When
running `benchmark.pl`, the various other options can be found at the top of
the script. When writing your own programs you can switch between the Perl and
C single- or double-precision low-level code by using the `function_type`
option when calling `new(…)`.
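That last option is set at index creation. A minimal sketch follows; the accepted `function_type` values shown here ('perl', 'float', 'double') and the configuration key used to confirm the active implementation are assumptions drawn from the module's documentation, so verify them against your installed version.

```perl
use strict;
use warnings;
use Geo::Index;

# Request the double-precision C low-level code ('perl' and 'float' being
# the other documented choices).
my $index = Geo::Index->new( { levels => 20, function_type => 'double' } );

# GetConfiguration( ) can be used to confirm which implementation is
# actually active ('code_type' is an assumed key name).
my $config = $index->GetConfiguration();
print 'code type: ', ( $config->{code_type} // 'unknown' ), "\n";
```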
The high cost of constructing and traversing the results seems inherent to the Perl language and there does not seem to be any way to avoid it. There is some potential for optimization though:
- The `pre_condition` and `post_condition` function calls might be sped up by
  assigning them to function handles (much as is done with
  `HaversineDistance`, etc.) instead of making the calls by dereferencing the
  variables.
- Performance could potentially be improved by splitting the current combined
  index into individual indices for each level. Having smaller keys and
  indices should result in higher performance but the additional layer of
  indirection could slow things down in some circumstances.
- Improvements might be possible to the performance of `Closest( n, ... )`
  where `n` > 1 and per-point distances are not needed by the caller. At each
  iteration of the algorithm the previously-used search radius gives the
  maximal distance to all points already found, obviating the need to
  calculate every point's distance. Only points in the final level of the
  search would need to have their distances calculated. The downside to this
  method is that the point distances would not be available for all points in
  a result set (only for those found in the final search level).
- A number of alternative data structures were explored for the point index
  but benchmarking showed plain Perl hashes to be by far the most efficient.
  It is possible, though in my opinion unlikely, that a faster data structure
  choice exists that is suitable for use in this module.
- It might be worthwhile to make hash key names shorter. A toy test using the
  keys used for points showed about a 5% improvement when using
  single-character keys (e.g. `d`) instead of long keys (e.g.
  `search_result_distance`). Further improvements might be seen by switching
  from hashes to lists for the stored points. Whether the loss in legibility
  is worth the improvement in speed is unclear.