Potential false positives for equal_shape_distance_diff_coordinates #1258
Comments
Thank you for reporting a bug. The issue has been placed in triage; the MobilityData team will follow up on it. |
Posting on behalf of Marcy Jaffe with the National RTAP. I'd like to offer training and recommend your MD Schedule Validator tools if it might be possible to reconsider as warning (vs. error) for |
I am tempted to downgrade this notice, and propose a modification to the spec from:
to:
@bdferris-v2 thoughts? |
Is it possible to configure the rules' severities when using it? (Without rebuilding the For my use cases, I'd like to have the option to treat If this is possible already, why not let people who focus on small/rural GTFS providers re-configure |
@isabelle-dr do you have a GTFS data set that exhibits this issue? |
@KClough I have requested it!
We are working on it 🙃 I have the impression that some vendor tools create this issue in a systemic way. It might be worth digging into how |
Edit on what I've just said: I am not entirely sure error is the right severity, and I'd like to take a data-driven approach to figure this out. |
@KClough here are three datasets that trigger this rule and also the Interior Alaska This feed also has warning While this agency has related warning & error And here is Marcy's answer to my question: how do you create the shape files?
|
A comment from our slack channel on this issue
|
It looks like a reasonable next step is to see how this user is generating shapes and whether the file can be cleaned up. I can commit to doing this in the next few weeks. |
@isabelle-dr Currently the notice data does not include the lat/long for the affected row. Adding that would be one way of getting all the data needed to test the Mobility Database data in the compiled reports that are run as part of the GitHub Actions, and we could then potentially use some JSON querying tools to check stats on that output. This could be done as a test PR if it's not desirable to add those fields into the official output. Alternatively or additionally, the Cal-ITP project could potentially query this information for the feeds in their database, if we want to make a request to them, but that would be limited to California data. |
After a discussion with @qcdyx, the strategy to solve this issue is:
|
Do we have agreement on downgrading equal_shape_distance_diff_coordinates from error to warning? Based on my observation, equal_shape_distance_diff_coordinates happens when two consecutive points are very close. For example, the actualDistanceBetweenShapePoints for previous point (lat 48.36919, long -124.63073) and current point (lat 48.36919, long -124.63074) is 0, so these two consecutive points have equal I prefer downgrading equal_shape_distance_diff_coordinates to a warning over searching for other substitute geometry libraries. @isabelle-dr @emmambd |
@qcdyx Hey Jingsi! The goal right now is to conduct analytics on when equal_shape_distance_diff_coordinates is triggered in order to make a decision. It's too soon to decide about severity at the moment without doing an evaluation of the Mobility Database feeds that we use in acceptance tests, as specified in the next step heading here
This is interesting! I believe up to this point we thought that the validator was using the actual shape_dist_traveled points defined by the feeds, not doing any additional interpretation. Let's talk about this more offline and then I'll circle back here to document next steps. |
An update on our approach on this PR: #1675 (comment) The current approach is to implement a threshold of 1.11 m on distances between shape point pairs for the ERROR (to capture any "same" values that result from precision/rounding issues at 5 decimal places for lat/long values), and create a WARNING for any distances that are less than that. We plan to include this in the upcoming release, and will only take next steps to make this threshold more permissive if we receive user feedback on it. |
I'm going to close this issue based on #1675 and re-open it if there is new community feedback after this release that indicates we should make the threshold more permissive. |
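For readers who want to see what such a threshold means in practice, here is a minimal sketch in Python. It is not the validator's implementation: the haversine formula, the Earth radius constant, the function names, and the choice of treating a distance exactly at the threshold as an ERROR are all assumptions made for illustration. Only the 1.11 m figure and the ERROR/WARNING split come from the comment above. Note that 0.00001 degrees of latitude is roughly 1.11 m, which is why a threshold of this size absorbs differences in the fifth decimal place.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch
THRESHOLD_M = 1.11            # threshold quoted in the comment above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def severity_for_equal_dist(prev, curr, threshold_m=THRESHOLD_M):
    """Severity for two consecutive shape points with equal shape_dist_traveled
    and distinct coordinates. prev and curr are (lat, lon) tuples.
    Hypothetical helper: below the threshold is a WARNING, otherwise an ERROR."""
    distance = haversine_m(prev[0], prev[1], curr[0], curr[1])
    return "ERROR" if distance >= threshold_m else "WARNING"


# The pair of points quoted earlier in this thread differs only in the
# fifth decimal place of the longitude.
print(severity_for_equal_dist((48.36919, -124.63073), (48.36919, -124.63074)))
```

Under these assumptions, the pair of points quoted earlier in the thread is well under the threshold, so it would be reported as a WARNING rather than an ERROR.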
Problem
We hear from users that equal_shape_distance_diff_coordinates (which is currently an error) is often present in datasets that contain shapes, and the work needed to fix this issue in the datasets gives an incentive for users not to use shapes at all.
This rule was initially implemented in PR #1083, alongside two others:
decreasing_shape_distance: error, and
equal_shape_distance_same_coordinates: warning,
with the intention of validating the shapes.txt Reference:
What to do
Re-visit if the conditions that trigger equal_shape_distance_diff_coordinates should really be an error: talk to the community, and analyze production data.
Consider lowering the severity to a warning and opening a discussion in the specification to make it clearer.
Next Steps from Most Recent Comment
After a discussion with @qcdyx, the strategy to solve this issue is:
we are assuming that a portion of these notices come from a precision issue of the software creating shape files: there are two very close shape points that have distinct lat/lon values, but the shape_dist_traveled field is the same.
pull the actualDistanceBetweenShapePoints field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates.
plot it on a histogram with frequency on the y and latitude and longitude diff value on the x. Then, assess based on what we see:
Spreadsheet values (example from @cka-y's past analytics work in https://github.com/MobilityData/mobility-database-catalogs/pull/275/files)
Once this spreadsheet is created, we can see if there's a common threshold for actualDistanceBetweenShapePoints (how far apart are they typically for feeds generating this error?)
do we have a clear threshold that has the majority of the values below it?
if so, would it be reasonable to consider values below the threshold as equal_shape_distance_same_coordinates (which is a warning)?
if so: does this need a spec amendment?
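The analysis steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it takes a plain list of actualDistanceBetweenShapePoints values in meters as input (how they are extracted from the validation reports is not shown), and the function names, the bin width, and the "share of values" parameter are hypothetical choices, not something decided in this issue.

```python
import math
from collections import Counter


def histogram(distances_m, bin_width_m=0.5):
    """Count distances per bin; each key is the lower edge of a bin in meters."""
    counts = Counter(int(d // bin_width_m) for d in distances_m)
    return {index * bin_width_m: counts[index] for index in sorted(counts)}


def threshold_covering(distances_m, share=0.9):
    """Smallest observed distance such that at least `share` of all values
    are less than or equal to it."""
    if not distances_m:
        raise ValueError("no distances to analyze")
    ordered = sorted(distances_m)
    # Index of the last value needed to reach the requested share.
    index = max(0, math.ceil(share * len(ordered)) - 1)
    return ordered[index]
```

If the histogram shows most values piled up near zero with a long, sparse tail, the value returned by threshold_covering gives a first candidate for the "clear threshold" the questions above ask about.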
Additional Context
These three rules were initially created to replace the decreasing_or_equal_shape_distance notice because this rule was triggered by two things that deserved to be treated differently:
1. shape_dist_traveled decreases between two consecutive shape points (which is a clear violation of the spec)
2. shape_dist_traveled is equal between two consecutive shape points (also a violation but is not as big of a problem)
By digging deeper into number 2 above, we noticed that we were seeing two cases in production data:
2.1 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are equal (which seems fine)
2.2 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are not equal (which seems like a problem, but it could be caused by the scheduling software that rounds shape_dist_traveled when the two shape points are really close)
We went ahead and made our own interpretation of the specification based on what we saw in the production data: condition 2.1 would be a warning, whereas conditions 1 & 2.2 would be errors, which is slightly less strict than the spec that strictly mentions "must increase".
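The three conditions described above can be summarized in a short sketch. This is an illustration of the decision logic as described in this issue, not the validator's actual code: the ShapePoint class and the classify function are hypothetical names, and coordinates are compared for exact equality, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ShapePoint:
    lat: float
    lon: float
    dist_traveled: float  # value of shape_dist_traveled


def classify(prev: ShapePoint, curr: ShapePoint) -> Optional[Tuple[str, str]]:
    """Return (notice code, severity) for two consecutive shape points,
    or None when shape_dist_traveled increases as expected."""
    if curr.dist_traveled < prev.dist_traveled:
        # Condition 1: distance decreases.
        return ("decreasing_shape_distance", "ERROR")
    if curr.dist_traveled == prev.dist_traveled:
        if (curr.lat, curr.lon) == (prev.lat, prev.lon):
            # Condition 2.1: same distance, same coordinates.
            return ("equal_shape_distance_same_coordinates", "WARNING")
        # Condition 2.2: same distance, different coordinates.
        return ("equal_shape_distance_diff_coordinates", "ERROR")
    return None
```

Condition 2.2 is the branch this issue is about: it is the only case where the severity depends on how far apart two distinct points really are.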