Smarter search radius formula for map matching #3184
This is a long commit message, sorry! The gist is that I found a better formula for the map match search radius that gives good results but is still small enough to be performant.
OSRM’s /match endpoint can take a long time to finish a request, even on a powerful, modern server, which has led to a lot of additional latency as we deploy map matching in more places.
The running time of the map-matching algorithm is approximately quartic (O(r^4)) in the candidate point search radius r: the number of candidate points grows with the search area (radius squared), so a 2x increase in the search radius for each input point tends to produce a 4x increase in the number of candidate points, and because each Viterbi step considers every transition between consecutive candidate sets, a 4x increase in candidate points (states) leads to roughly a 16x increase in the number of operations the Viterbi algorithm performs.
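To make that concrete, here's a back-of-envelope sketch (my own illustration, not OSRM code), assuming candidates are uniformly distributed so the candidate count scales with the search area:

```python
# Candidates per input point scale with the search area (~radius^2), and each
# Viterbi step considers all transitions between consecutive candidate sets
# (~candidates^2), so total work scales with radius^4.
def relative_viterbi_cost(radius, base_radius):
    candidate_ratio = (radius / base_radius) ** 2   # area ratio
    return candidate_ratio ** 2                     # transition count ratio

print(relative_viterbi_cost(2.0, 1.0))  # 2x radius -> 16.0x the work
```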
So decreasing the search radius can dramatically improve the running time. However, the formula OSRM v5 uses, `search_radius = 3 * gps_accuracy`, fails to find the correct point in many cases. We internally patched OSRM to use the same formula as OSRM v4, `search_radius = 10 * gps_accuracy`, which gives good results.
I suspected the optimal search radius isn’t linear in the gps_accuracy, and so I pulled ~1 million data points from Lyft drivers and compared the distance from the raw point to the map-matched point output by our current system (data file here). I bucketed the points by rounding their phone-reported accuracy (Location.getAccuracy() on Android or CLLocation.horizontalAccuracy on iOS) down to the nearest integer. To handle sparsity and to smooth out the data I also added each point to the two neighboring buckets in each direction. For example, a point with accuracy 3.7 would be included in the buckets 1, 2, 3, 4, and 5. Finally, I computed the 99.9th percentile raw->map-match distance for each bucket. See this script for the specific computation: https://drive.google.com/file/d/0B30B6-L__QYKbXZwUl9DbkVsbDA/view
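For reference, here's a minimal sketch of that bucketing in Python (the linked script is the authoritative version; the file name and column names below are made up):

```python
import csv
from collections import defaultdict

import numpy as np

buckets = defaultdict(list)
with open("match_distances.csv") as f:            # hypothetical export of the data
    for row in csv.DictReader(f):
        accuracy = float(row["accuracy"])         # phone-reported accuracy (meters)
        distance = float(row["match_distance"])   # raw -> map-matched distance (meters)
        center = int(accuracy)                    # round down to the nearest integer
        # Smooth sparse buckets by also adding the point to the two
        # neighboring buckets in each direction (3.7 -> buckets 1..5).
        for bucket in range(max(center - 2, 0), center + 3):
            buckets[bucket].append(distance)

# 99.9th percentile raw -> map-match distance per bucket
p999 = {b: np.percentile(d, 99.9) for b, d in sorted(buckets.items())}
```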
Here’s a graph of the results: https://drive.google.com/file/d/0B30B6-L__QYKU0RmZjI0ZGxYZ1E/view
The upward trend stops at around bucket = 47. Removing buckets >= 48, we see a clear linear trend: https://drive.google.com/file/d/0B30B6-L__QYKNTF5bm1YWGNaOGc/view?usp=sharing
We can fit a trendline to this data: https://drive.google.com/file/d/0B30B6-L__QYKeF8yb3ZiTkhjV1E/view?usp=sharing
This gives the formula `search_radius = 3.45 * gps_accuracy + 44.4`. Since the P99.9 distance essentially never exceeded 200 meters, we cap the search radius at 200 meters and round the coefficients up, giving the final formula `search_radius = min(3.5 * gps_accuracy + 45, 200)`. This formula should yield a search radius that contains the correct point 99.9% of the time.
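In code, the fit and the final formula look roughly like this (continuing from the `p999` buckets in the sketch above):

```python
import numpy as np

xs = np.array([b for b in p999 if b < 48])          # drop buckets >= 48
ys = np.array([p999[b] for b in p999 if b < 48])
slope, intercept = np.polyfit(xs, ys, 1)            # ~3.45 and ~44.4 on our data

def search_radius(gps_accuracy, coefficient=3.5, offset=45.0, cap=200.0):
    """Rounded-up coefficients, capped at 200 meters."""
    return min(coefficient * gps_accuracy + offset, cap)
```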
We allow the caller to configure these parameters so they can tweak the performance/accuracy tradeoff. The 200-meter cap might be a quirk of the data I processed, but the caller can change it if they want.
As a result of this change, the latency of our map match calls dropped significantly (vs. our patched OSRM that used `10 * gps_accuracy`) without any degradation in accuracy on our test dataset. Mainline OSRM currently uses `3 * gps_accuracy`, so for mainline OSRM this change means a modest increase in latency, but the map match results should be more accurate. The new formula provides a good tradeoff between latency and accuracy.
Thanks for this great analysis!
I need some time to look at this in more detail, but I think we can use this as a basis to not only fix the search radius, but actually provide a real empirical correction over the original measurements done in the Newson and Krumm paper.
Right off the bat I'm skeptical about exposing this as a query-time parameter. Basically what we are defining here are some properties of an empirical distribution. Hard-coding that would be fine with me, if we can ensure that the measurements are good. The 200m limit is also something that was enforced in the Newson and Krumm paper.
My main concern is about documenting this behavior. A user might expect increasing the value for the reported GPS accuracy to grow the search radius proportionally, which the offset-and-cap formula no longer does, so the docs should spell that out.
BTW, if you are looking for some more performance improvements, there are a few things you can do. One would be to incorporate a bearing filter (or, the more advanced version, modifying the emission probability in the hidden Markov model to use the bearing value). Speeding up the Viterbi algorithm by using a less naive version that limits the amount of memory used would also be possible. Hit me up by email if you have questions.
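To sketch what I mean by modifying the emission probability (a rough illustration, not anything OSRM ships; `sigma_bearing` is a made-up placeholder, while `sigma_z = 4.07` is the GPS noise value from Newson and Krumm):

```python
# Illustrative only: combine the standard Gaussian distance emission from
# Newson & Krumm with a penalty on the difference between the phone-reported
# bearing and the candidate edge's bearing. sigma_bearing would need to be
# fit empirically, just like the search radius formula above.
def emission_log_prob(dist_m, bearing_deg, edge_bearing_deg,
                      sigma_z=4.07, sigma_bearing=30.0):
    # Smallest angular difference, in [0, 180] degrees
    delta = abs((bearing_deg - edge_bearing_deg + 180.0) % 360.0 - 180.0)
    log_p_dist = -0.5 * (dist_m / sigma_z) ** 2
    log_p_bearing = -0.5 * (delta / sigma_bearing) ** 2
    return log_p_dist + log_p_bearing
```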
I'm not sure why the Appveyor build is failing, could you help me debug that?
Thanks for your response :) I'm working on getting the build issues sorted out in the meantime.
I'd strongly prefer to keep the parameters configurable: it lets callers tune the performance/accuracy tradeoff for their own data, as described above.
I'll update the docs to indicate the new default search radius behavior so it doesn't surprise anyone.
We've actually experimented with using the phone-reported bearing to modify the emission probability! The distribution of phone-reported bearing errors is something I can share more details about over email.
We're definitely interested in any performance gains we can get. I'll start an email thread so we can discuss there.
I understand this is great for experimenting and debugging. But when it comes to including something in our API, my primary concern is always how it will be understood and how it will be used.
From experience, once you include something in your API that is not completely obvious, people will misunderstand and misuse it.
I know this is unsatisfying, but let's keep production and prototyping code separate for now.
That would be great.
There were a number of problems with Windows unrelated to your PR; there is a good chance that a rebase will fix it.
Great! I'm currently trying to find a test data set that includes device bearings for some experiments of my own.