Smarter search radius formula for map matching #3184

Merged
merged 5 commits into from Nov 17, 2016

Projects

None yet

3 participants

@kerrick-lyft
Contributor
kerrick-lyft commented Oct 26, 2016 edited

This is a long commit message, sorry! The gist is that I found a better formula for the map match search radius that gives good results but is still small enough to be performant.

OSRM’s /match endpoint can take a long time to finish a request, even on a powerful, modern server, which has led to a lot of additional latency as we deploy map matching in more places.

The running time of the map-matching algorithm is approximately quartic (O(n^4)) in the candidate point search radius; a 2x increase in the search radius for each input point will tend to lead to a 4x increase in the number of candidate points, and a 4x increase in candidate points (states) will lead to a 16x increase in the number of operations the Viterbi algorithm performs.

So decreasing the search radius can dramatically improve the running time. However, the formula OSRM v5 uses, which is search_radius = 3 * gps_accuracy, doesn’t find the correct point in many cases. We internally patched OSRM to use the same formula as OSRM v4, search_radius = 10 * gps_accuracy, which gives good results.

I suspected the optimal search radius isn’t linear in the gps_accuracy, and so I pulled ~1 million data points from Lyft drivers and compared the distance from the raw point to the map-matched point output by our current system (data file here). I bucketed the points by rounding their phone-reported accuracy (Location.getAccuracy() on Android or CLLocation.horizontalAccuracy on iOS) down to the nearest integer. To handle sparsity and to smooth out the data I also added each point to the two neighboring buckets in each direction. For example, a point with accuracy 3.7 would be included in the buckets 1, 2, 3, 4, and 5. Finally, I computed the 99.9th percentile raw->map-match distance for each bucket. See this script for the specific computation: https://drive.google.com/file/d/0B30B6-L__QYKbXZwUl9DbkVsbDA/view

Here’s a graph of the results: https://drive.google.com/file/d/0B30B6-L__QYKU0RmZjI0ZGxYZ1E/view

The upward trend stops at around bucket = 47. Removing buckets >= 48, we see a clear linear trend: https://drive.google.com/file/d/0B30B6-L__QYKNTF5bm1YWGNaOGc/view?usp=sharing

We can fit a trendline to this data: https://drive.google.com/file/d/0B30B6-L__QYKeF8yb3ZiTkhjV1E/view?usp=sharing

This gives the formula search_radius = 3.45 * gps_accuracy + 44.4. Since P99.9 radius was essentially never more than 200 meters, we cap the search radius at 200 meters and round up the coefficients, giving the final formula search_radius = min(3.5 * gps_accuracy + 45, 200). This formula should yield a search radius that contains the correct point 99.9% of the time.

We allow the caller to configure these parameters so they can tweak the performance / accuracy tradeoff. The 200 meter cap might be a quirk of the data I processed, but the caller can change it if they want.

As a result of this change, the latency of our map match calls dropped significantly (vs our patched OSRM that used 10 * gps_accuracy) without any degradation in accuracy on our test dataset. OSRM currently uses the formula 3 * gps_accuracy, so for mainline OSRM this change means a modest increase in latency, but the map match results should be more accurate. The new formula provides a good tradeoff between latency and accuracy.

@TheMarex
Member

Thanks for this great analysis! 👍

I need some time to look at this in more detail, but I think we can use this as a basis to not only fix the search radius, but actually provide a real empirical correction over the original measurements done in the Newson and Krumm paper.

Right off the bat I'm skeptical about exposing this as a querty-time parameter. Basically what we are defining here are some properties of an empirical distribution. Hard-coding that would be fine with me, if we can ensure that the measurements are good. The 200m limit is also something that was enforce in the Newson and Krumm paper.

My main concern is about documenting this behavior. A user might expect increasing the value for radius will also continuously increase the search radius.

BTW. If you are looking at some more performance improvements, there are a few things you can do. One would be to incorporate a bearing filter (or the more advanced version, modify the emission probability in the hidden markov model to use the bearing value). Speeding up the Viterbi algorithm by using a less naive version that limits the amount of memory used would also be possible. Hit me up by email if you have questions.

@MoKob MoKob added the Review label Oct 27, 2016
@kerrick-lyft
Contributor

I'm not sure why the Appveyor build is failing, could you help me debug that?

Thanks for your response :) I'm working on getting the accuracy_to_distance.csv file released so you all can take a look and maybe improve the emission probability function. Our Legal team wanted me to run it by them but I don't think there will be anything objectionable about releasing it.

I'd strongly prefer to keep the search_radius_* values as parameters because it makes it really easy to debug what's happening and answer questions like "would this case have behaved differently if the parameters were set higher?". I have noticed cases where the values of these parameters matter, it's just not different enough to materially affect our overall metric. Otherwise you need to do a compile (and in our case, a deploy, since the United States data file doesn't fit in RAM on our development machines) in order to test a new combination of the parameters.

I'll update the docs to indicate the radius param doesn't directly correspond to the search radius, and also add the search_radius_* params to the docs.

We've actually experimented with using the phone-reported bearing to modify the emission probability! The distribution of street_bearing - reported_bearing looked pretty tight but when we tried incorporating it into our model we didn't really see gains. We were pretty time-crunched and didn't have much time to spend on it though, so it may be worth revisiting. Will let you know if we make progress in this direction.

We're definitely interested in any performance gains we can get. I'll start an email thread so we can discuss there.

@TheMarex
Member
TheMarex commented Nov 4, 2016

I'd strongly prefer to keep the search_radius_* values as parameters because it makes it really easy to debug what's happening and answer questions like "would this case have behaved differently if the parameters were set higher?".

I understand this is great for experimenting and debugging. But when it comes to including something in our API my primary concern is always two factors:

  1. Limit surface area of API to minimize the chance of needing to do breaking changes
  2. Offer an API that is as simple as possible

From experience, once you include something in your API that is not completely obvious people will misunderstand and misuse it (👋 bearings and radiuses). Since this is sadly not an internal API I can't add these parameters without fear that we will need to change/remove/extend them soon.

I know this is unsatisfying, but let's keep production and prototyping code separate for now.

Thanks for your response :) I'm working on getting the accuracy_to_distance.csv file released so you all can take a look and maybe improve the emission probability function.

That would be great.

I'm not sure why the Appveyor build is failing, could you help me debug that?

There were a number of problems with Windows unrelated to your PR, there is a good chance that a rebase will fix it.

We've actually experimented with using the phone-reported bearing to modify the emission probability!

Great! I'm currently trying to find a test data set that includes device bearings for some own experiments.

Proposal:

  • Remove API parameters
  • Use current calibration value as constant
  • Extend docs to capture the delicate matter of the radius <-> GPS precision relationship
@kerrick-lyft
Contributor
kerrick-lyft commented Nov 8, 2016 edited

Here is the accuracy_to_distance.csv data file. Keep in mind that this is just based off our best estimate of the driver's position, using OSRM 5.2.6 with the OSRM v4 radius formula.

Sounds good, I'll make those changes and then ping this PR.

kerrick-lyft added some commits Sep 20, 2016
@kerrick-lyft @kerrick-lyft kerrick-lyft Smarter search radius formula for map matching 463da4a
@kerrick-lyft kerrick-lyft Fix capture ebd82e9
@kerrick-lyft kerrick-lyft Hardcode search radius parameters 21882cf
@kerrick-lyft kerrick-lyft Update docs
be3d3c2
@kerrick-lyft
Contributor

I've implemented the changes you suggested and rebased, PR is ready for review again.

@kerrick-lyft kerrick-lyft Fix call to std::min
989bb72
@TheMarex
Member

@kerrick-lyft looks good to me. 👍 Merging.

@TheMarex TheMarex merged commit 8bb183b into Project-OSRM:master Nov 17, 2016

2 of 4 checks passed

codecov/patch 0.00% of diff hit (target 85.01%)
Details
codecov/project 85.00% (-0.01%) compared to 17c32f5
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@TheMarex TheMarex added Ready To Merge and removed Review labels Nov 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment