Performance

Overview

Geo::Index is intended for stand-alone applications that need a way to quickly perform proximity searches on relatively small datasets (at most a few million points). Typical search speeds are three to five orders of magnitude faster than a linear search. For larger datasets, and for applications running in a server environment, using something like PostGIS is more appropriate.

Indexing speed is about 50,000 points per second when levels is 20. Search speeds are highly dependent on the data indexed and on the search parameters but are typically in the neighborhood of a few thousand searches per second (see Benchmark results, below).

Memory usage tends to be rather high; for 1,000,000 points the index is ~3.2 GB for tightly clustered points or ~4.6 GB when spread evenly world-wide. The size of an index grows linearly with each added point at a rate of about 4 kB per point. When a point is first encountered whilst searching, the point's size will increase by about 100 bytes (this only happens once per point).

Since performance is so dependent on data and usage, the user is encouraged to test all available options while developing their application before choosing the one that works fastest. The examples/benchmark.pl script included with this module may be helpful for measuring this module's performance.

General tips

Here are some guidelines for best results:

  • Requesting results as a list reference is faster than asking for a plain list.

    That is, e.g., $results = Search(...); is faster than @results = Search(...);

  • Post-conditions are faster than pre-conditions.

    Benchmarking has shown that the cost of the Perl function call is higher than that of the distance-related code. Thus there is probably no reason to use pre-conditions. Put concisely,

    $results = $index->Search( $point, { post_condition => $code_ref } );
    

    is faster than

    $results = $index->Search( $point, { pre_condition => $code_ref } );
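
    For example, a post-condition can filter on arbitrary fields stored in the result points after the distance checks have already been applied. In the sketch below the condition callback is assumed to receive the candidate point followed by the search point (as in the Closest(…) emulation further down this page), and the type field is a hypothetical user-supplied attribute:

    # Hypothetical post-condition: keep only points whose user-supplied
    # 'type' field is 'cafe' (point fields are entirely up to the caller).
    my $only_cafes = sub {
        my ( $point, $search_point ) = @_;
        return defined $point->{type} && $point->{type} eq 'cafe';
    };
    $results = $index->Search( $point, { radius => 500, post_condition => $only_cafes } );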
    
  • Choose an appropriate value for levels when creating the index

    The Search(…) method performs best when the most detailed level of the index has a smaller physical tile size than the radius of a typical search. For example, if your searches are typically for points within 100 meters then levels should be set to at least 18 (~75 meter tiles at the equator) to yield best results; if typical searches have a 10 meter radius then levels should be 22.

    The Closest(…) method works best when the most detailed level of the index contains a single point per tile and search points lie close to potential result points.

    To help tune the levels value, the GetConfiguration(…) method can be used to find out the physical size of the most detailed level along with statistics on the number of points per index tile. A script showing the use of this method can be found in examples/show_configuration.pl.
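
    As a minimal sketch (assuming, as in the module's synopsis, that the constructor accepts a reference to a list of point hashes plus an options hash), an index tuned for ~10 meter searches could be created and inspected like this:

    use Geo::Index;
    use Data::Dumper;

    # @points is a previously-built list of point hash references (lat, lon, ...).
    # Request a deeper index (levels => 22) to suit ~10 meter search radii.
    my $index = Geo::Index->new( \@points, { levels => 22 } );

    # Dump the configuration to inspect the most detailed level's tile size
    # and the per-tile point statistics.
    print Dumper( $index->GetConfiguration() );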

  • Use the quick_results option when possible.

    Filtering points and combining them into a single, flat list can be very expensive. Many applications can tolerate getting additional points beyond those matching the search criteria. An example of this is drawing points on a map; if points are clipped to the visible area when they are drawn it may not matter if some of them lie outside of it.
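
    As a sketch of the map-drawing case: the nested structure assumed here (a reference to a list of references to per-tile point lists) should be checked against the module's documentation, and draw_point(…) is a hypothetical routine that clips to the visible area itself:

    # Quick search: skip Geo::Index's filtering and flattening and let the
    # drawing code discard out-of-range points instead.
    my $quick = $index->Search( $point, { radius => 1000, quick_results => 1 } );
    for my $tile_points ( @$quick ) {
        for my $p ( @$tile_points ) {
            draw_point( $p );   # hypothetical; performs its own clipping
        }
    }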

  • Use Search(…) instead of Closest(…) when you have a search radius.

    The Closest(…) function is most efficient when no search radius is specified or when result points lie very close to the search point. Closeness is relative to the tile size of the most detailed index level; for the default index depth (20), "very close" means roughly within 100 meters.

    When clipping results to a maximal radius it is typically much faster to use Search(…) with the sort_results and max_results options*.

    For example, to find the closest $n points within distance $d of a point $p it is usually much faster to use

    %options = (
                 max_results    => $n, 
                 radius         => $d, 
                 sort_results   => 1,
                 post_condition => sub { return $_[0] != $_[1]; }
               );
    $results = $index->Search( $p, \%options );
    

    instead of

    $results = $index->Closest( $p, $n, { radius => $d } );
    

    * The post_condition shown in the example omits the search point from the results and is needed to fully emulate the behavior of Closest(…).

Technical discussion

Both Search(…) and SearchByBounds(…) are very fast since they can find the relevant index tiles in linear time. Since the time needed to filter the results is directly proportional to the number of points retrieved from the index, best performance occurs when the size of the most detailed tiles is smaller than that of the typical search radius or search bounds.
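
A bounds search is a single call; as a rough sketch (the bounds are assumed here to be given as [ west, south, east, north ] in degrees; check the module's documentation for the exact format):

    # Find all indexed points inside a small bounding box.
    # Bounds assumed to be [ west, south, east, north ] in degrees.
    my $in_box = $index->SearchByBounds( [ -122.52, 37.70, -122.35, 37.83 ] );
    printf "%d points in bounds\n", scalar @$in_box;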

Searches run using Closest(…) start at the most detailed level and work upwards. Best performance occurs when a result is found in the first few iterations. If the first iteration that finds points yields a large number of points then performance will suffer since the distance to each of these points will need to be measured to find the closest. For similar reasons, requesting a large number of closest points in a single call will also impact performance. The Farthest(…) method is largely a wrapper for Closest(…) and thus exhibits similar behavior.
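
To illustrate the usage patterns discussed above (a sketch; results are assumed to come back as list references, as in the Closest(…) example earlier on this page):

    # Fast path: the single closest point with no radius limit; this is
    # usually resolved within the first few (most detailed) index levels.
    my $nearest = $index->Closest( $p, 1 );

    # Slower: requesting many points forces a distance measurement for every
    # candidate found at the first level that yields results.
    my $nearest_100 = $index->Closest( $p, 100 );

    # Farthest(…) behaves analogously, working outward from the antipode.
    my $farthest = $index->Farthest( $p, 1 );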

Some functions within Geo::Index have optional implementations written in C. If these are active (by default they are whenever possible) searches typically run 25% to 50% faster.

Whenever possible Geo::Index uses numeric index keys. Compared to text index keys, numeric keys give roughly 30% faster searches and roughly 50% lower index memory usage. The downside to numeric keys is that they are less legible to humans while debugging. (Whether numeric or text keys are used can be changed by setting the appropriate value at the top of Geo/Index.pm.)

Benchmark results

Typical benchmark results run on a modern workstation using numeric keys and double-precision C low-level code with the index containing 1,000,000 points are as follows:

  • IndexPoints(…)

    Points can be added to an index at the rate of about 50,000 per second.

  • Search(…)

    Typical searches returning values run at about 25,000 to 50,000 searches per second. Worst-case performance is under 50 searches per second and searches returning no results run at over 100,000 searches per second. The overhead of traversing the results is fairly negligible.

    Quick searches run at 120,000 to 150,000 searches per second. Actually doing anything with the results slows things down a lot. Including traversal of the results, a typical quick search runs at 40,000 to 100,000 searches per second with the worst-case being about 80 searches per second.

    If distances to the result points are not needed, quick searches are typically about 75% faster than normal ones albeit with about 5 times as many results being returned.

  • SearchByBounds(…)

    For the SearchByBounds(…) method, run time correlates with the size of the bounding box, with smaller bounding boxes typically yielding faster run times.

    A fairly typical search yielding about 50 results runs at about 10,000 searches per second in normal mode and about 30,000 searches per second in quick mode. A nearly worst case example is a search returning 100,000 points; this will run at about 5 searches per second in normal mode or about 8,000 searches per second in quick mode.

  • Closest(…)

    For the Closest(…) method, the highest performance is seen when there are result points close to the search point. Search speeds for the single closest point are typically in excess of 20,000 per second for close-by results or about 8,000 per second when results are far away. Worst case speeds of about 1,000 searches per second occur when all indexed points are in the hemisphere opposite the search point.

  • Farthest(…)

    For the Farthest(…) method, the highest performance is seen when there are result points nearly antipodal to the search point. Search speeds for the single farthest point are typically in excess of 20,000 per second when nearly-antipodal points exist. Worst case speeds of about 1,000 searches per second occur when all indexed points are in the same hemisphere as the search point.

Note that the numbers above are approximate and are highly dependent on the data being searched, the type of search being run, and on the number of results returned. Actual searches may be an order of magnitude or more slower.

A sample benchmark run can be found in examples/sample_benchmark.txt. To run the benchmarks yourself, run examples/benchmark.pl. It requires the Devel::Size and Time::HiRes modules to be installed and a single run takes about 8 minutes.

Since Perl constants cannot be changed from the command line, you will need to edit Geo/Index.pm to force the use of numeric keys, packed numeric keys, or text keys. This can be done by uncommenting the appropriate lines at the head of the file (look for USE_NUMERIC_KEYS and USE_PACKED_KEYS). When running benchmark.pl, the various other options can be found at the top of the script. When writing your own programs you can switch between the Perl and C single- or double-precision low-level code by using the function_type option when calling new(…).
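
For example, a sketch of selecting the low-level code at index creation (the option values shown, 'double', 'float', and 'perl', are assumptions; check the function_type documentation for the exact strings):

    use Geo::Index;

    # Assumed values: 'double' and 'float' select the C double- and
    # single-precision code; 'perl' forces the pure-Perl implementation.
    my $index = Geo::Index->new( { levels => 20, function_type => 'double' } );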

Potential optimizations

The high cost of constructing and traversing the results seems inherent to the Perl language and there does not seem to be any way to avoid it. There is some potential for optimization though:

  • The pre_condition and post_condition function calls might be sped up by assigning them to function handles (much as is done with HaversineDistance, etc.) instead of making the calls by dereferencing the variables.
  • Performance could potentially be improved by splitting the current combined index into individual indices for each level. Having smaller keys and indices should result in higher performance but the additional layer of indirection could slow things down in some circumstances.
  • Improvements might be possible to the performance of Closest( n, ... ) with n > 1 when per-point distances are not needed by the caller. At each iteration of the algorithm the previously-used search radius gives the maximal distance to all points already found, obviating the need to calculate every point's distance. Only points in the final level of the search would need to have their distances calculated. The downside to this method is that the point distances would not be available for all points in a result set (only for those found in the final search level).
  • A number of alternative data structures were explored for the point index but benchmarking showed plain Perl hashes to be by far the most efficient. It is possible, though in my opinion unlikely, that a faster data structure choice exists that is suitable for use in this module.
  • It might be worthwhile to make hash key names shorter. A toy test using the keys used for points showed about a 5% improvement when using single-character keys (e.g. d) instead of long keys (e.g. search_result_distance). Further improvements might be seen by switching from hashes to lists for the stored points. Whether the loss in legibility is worth the improvement in speed is unclear.