Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential false positives for equal_shape_distance_diff_coordinates #1258

Closed
isabelle-dr opened this issue Sep 19, 2022 · 16 comments · Fixed by #1675
Closed

Potential false positives for equal_shape_distance_diff_coordinates #1258

isabelle-dr opened this issue Sep 19, 2022 · 16 comments · Fixed by #1675
Assignees
Labels
bug Something isn't working (crash, a rule has a problem) enhancement New feature request or improvement on an existing feature GTFS Reference Used for Adding or changing rules that belong in the GTFS reference

Comments

@isabelle-dr
Copy link
Contributor

isabelle-dr commented Sep 19, 2022

Problem
We hear from users that equal_shape_distance_diff_coordinates (which is currently an error) is often present in datasets that contain shapes, and the work needed to fix this issue in the datasets gives an incentive for users not to use shapes at all.

This rule was initially implemented in PR #1083, alongside two others:

  • decreasing_shape_distance: error, and
  • equal_shape_distance_same_coordinates: warning,

with the intention of validating the shapes.xt Reference:

Values must increase along with shape_pt_sequence; they must not be used to show reverse travel along a route.

What to do
Re-visit if the conditions that trigger equal_shape_distance_diff_coordinates should really be an error: talk to the community, and analyze production data.
Consider lowering the severity to a warning and opening a discussion in the specification to make it clearer.

Next Steps from Most Recent Comment

After a discussion with @qcdyx, the strategy to solve this issue is:

we are assuming that a portion of these notices come from a precision issue of the software creating shape files: there are two very close shape points that have distinct lat/lon values, but the shape_dist_traveled field is the same.
pull the actualDistanceBetweenShapePoints field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates.
plot it on a histogram with frequency on the y and latitude and longitude diff value on the x. Then, assess based on what we see:
Spreadsheet values (example from @cka-y's past analytics work in https://github.com/MobilityData/mobility-database-catalogs/pull/275/files)

  • ID of each feed
  • URL of each feed
  • csvRowNumber
  • shapeDistTraveled
  • shapePtLat
  • shapePtLon
  • prevCsvRowNumber
  • prevShapeDistTraveled
  • prevshapePtLat
  • prevshapeptLon
  • actualDistanceBetweenShapePoints

Once this spreadsheet is created, we can see if there's a common threshold for actualDistanceBetweenShapePoints (how far apart are they typically for feeds generating this error?)

do we have a clear threshold that has the majority of the values below it?
if so, would it be reasonable to consider values before the threshold as equal_shape_distance_same_coordinates (which is a warning)
if so: does this need a spec amendment?

Additional Context
These three rules were initially created to replace the decreasing_or_equal_shape_distance notice because this rule was triggered by two things that deserved to be treated differently:

  1. shape_dist_traveled decreases between two consecutive shape points (which is a clear violation of the spec)
  2. shape_dist_traveled is equal between two consecutive shape points (also a violation but is not as big of a problem)

By digging deeper into number 2 above, we noticed that we were seeing two cases in production data:
2.1 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are equal (which seems fine)
2.2 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are not equal (which seems like a problem, but it could be caused by the scheduling software that rounds shape_dist_traveled when the two shape points are really close)

We went ahead and made our own interpretation of the specification based on what we saw in the production data: condition 2.1 would be a warning, whereas conditions 1 & 2.2 would be errors, which is slightly less strict than the spec that strictly mentions "must increase".

@isabelle-dr isabelle-dr added enhancement New feature request or improvement on an existing feature GTFS Reference Used for Adding or changing rules that belong in the GTFS reference status: Needs discussion We need a discussion on requirements before calling this issue ready labels Sep 19, 2022
@isabelle-dr isabelle-dr added this to the Rules improvements milestone Oct 3, 2022
@isabelle-dr isabelle-dr added the need spec clarification Needs a modification in the specification label Oct 3, 2022
@isabelle-dr isabelle-dr removed this from the Rules improvements milestone Oct 3, 2022
@isabelle-dr isabelle-dr added bug Something isn't working (crash, a rule has a problem) and removed bug Something isn't working (crash, a rule has a problem) labels Oct 3, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Oct 3, 2022

Thank you for your reporting a bug. The issue has been placed in triage, the MobilityData team will follow-up on it.

@isabelle-dr isabelle-dr added the bug Something isn't working (crash, a rule has a problem) label Feb 2, 2023
@isabelle-dr
Copy link
Contributor Author

Posting on the behalf of Marcy Jaffe with the National RTAP.

I'd like to offer training and recommend your MD Schedule Validator tools if it might be possible to reconsider as warning (vs. error) for equal_shape_distance_diff_coordinates
When Google Validates it is a warning

Screenshot 2023-01-24 at 9 29 53 AM

For Mobility Data it is an error - which I will need to advise in my trainings that they can ignore which might mean they want to ignore other errors - which I do not want

Screenshot 2023-02-02 at 4 33 01 PM

It will be nearly impossible and offer very little benefit to go in for literally 100 rows of data and delete the nearby points. Riders will not have a much better experience and I believe some agencies will not want to manage their GTFS! Might you please consider revising this advisory to a warning?

@isabelle-dr
Copy link
Contributor Author

isabelle-dr commented Feb 2, 2023

I am tempted to downgrade this notice, and propose a modification to the spec from:

Values must increase along with shape_pt_sequence

to:

Values must not decrease along with shape_pt_sequence

@bdferris-v2 thoughts?

@derhuerst
Copy link

derhuerst commented Feb 3, 2023

Is it possible to configure the rules' severities when using it? (Without rebuilding the .jar/Docker image.)

For my use cases, I'd like to have the option to treat equal_shape_distance_diff_coordinates as an error, even if it is decided here that it should be a warning.

If this is possible already, why not let people who focus on small/rural GTFS providers re-configure equal_shape_distance_diff_coordinates to warning? I'm not saying that their perspective isn't relevant, but I think there is a trade-off to be made, between small providers – who might not have the technical resources to produce high-quality GTFS feeds – and big metropolitan or even national providers – who should encouraged to follow the spec (and its implicit intentions) rather strictly. Because I see this range of sophistication as inevitable, I'd rather opt for more strict defaults.

@isabelle-dr isabelle-dr added this to the Q1 2023 milestone Feb 14, 2023
@KClough
Copy link
Collaborator

KClough commented Feb 18, 2023

@isabelle-dr do you have a GTFS data set that exhibits this issue?

@isabelle-dr
Copy link
Contributor Author

isabelle-dr commented Mar 3, 2023

@KClough I have requested it!
I agree with @derhuerst's point of view, this should stay an error in this validator (unless we change the spec).

Is it possible to configure the rules' severities when using it?

We are working on it 🙃

I have the impression that some vendor tools create this issue in a systemic way. It might be worth digging into how shapes.txt is created...
An immediate action item might be to update the documentation to explain to users when it's acceptable to ignore this issue.

@isabelle-dr
Copy link
Contributor Author

Edit on what I've just said: I am not entirely sure error is the right severity, and I'd like to take a data-driven approach to figure this out.
@KClough: can you get the list of all the datasets in the Mobility Database that trigger this notice? We can attempt to draw patterns based on what we see in production.

@isabelle-dr
Copy link
Contributor Author

@KClough here are three datasets that trigger this rule and also the equal_shape_distance_same_coordinates.

Interior Alaska
https://www.dropbox.com/s/5es8jexp0qpmsxd/interiorak_google_transit.zip?dl=0

This feed also has warning
https://www.dropbox.com/s/h16ny11hlln3k9h/centraltransit_google_transit.zip?dl=0

While this agency has related warning & error
https://www.dropbox.com/s/8c8zjbp89fff0do/makah_google_transit.zip?dl=0

And here is Marcy's answer to my question: how to you create the shape files.

Rural agencies without a GIS staff product higher quality GTFS with shapes generated using MyMaps - guided by the stops along the way - in the direction of travel >> see this file

https://www.google.com/maps/d/edit?mid=1WuMlxgYa-NCZLZAuyQdwFhDnq6gvsGk&usp=sharing

and then export the shape as KML , name the shape_id

At times a point is near another point along the route & voila - an error
I've tried to search for duplicate values and delete - while with multiple routes there are too many values & this was not a quick fix

@isabelle-dr
Copy link
Contributor Author

A comment from our slack channel on this issue

Because the shape is a GPS trace, is it possible to quantize the coordinates to the 1m level, ie remove the second coordinate with the same dist

@isabelle-dr
Copy link
Contributor Author

It looks like a reasonable next step is to see how this user is generating shapes and if the file can be cleaned-up. I can commit to doing this in the next few weeks.
@KClough, I'd still be interested to get the list of datasets that trigger this notice to have a closer look

@briandonahue
Copy link
Collaborator

briandonahue commented May 30, 2023

@isabelle-dr Currently the notice data does not include the lat, long for the affected row. Adding that would be one way of getting all the data needed to test the Mobility Database data in the compiled reports that are run as part of the github actions and we could then potentially use some JSON querying tools to check stats on that ouput. This could be done as a test PR if it's not desirable to add those fields into the official output.

Alternatively or additionally, the Cal-ITP project could potentially query this information for the feeds in their database, if we want to make a request to them, but would be limited to California data.

@isabelle-dr isabelle-dr added status: Work in progress A PR that would close this issue has been opened. and removed status: Needs discussion We need a discussion on requirements before calling this issue ready labels Jun 10, 2023
@isabelle-dr
Copy link
Contributor Author

isabelle-dr commented Aug 2, 2023

After a discussion with @qcdyx, the strategy to solve this issue is:

  • we are assuming that a portion of these notices come from a precision issue of the software creating shape files: there are two very close shape points that have distinct lat/lon values, but the shape_dist_traveled field is the same.
  • pull the actualDistanceBetweenShapePoints field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates.
  • plot it on a histogram with frequency on the y and shape_dist_traveled value on the x. Then, assess based on what we see:
    • do we have a clear threshold that has the majority of the values below it?
    • if so, would it be reasonable to consider values before the threshold as equal_shape_distance_same_coordinates (which is a warning)
    • if so: does this need a spec amendment?

@qcdyx qcdyx self-assigned this Sep 20, 2023
@qcdyx
Copy link
Contributor

qcdyx commented Sep 25, 2023

Do we have agreement on downgrading equal_shape_distance_diff_coordinates from error to warning?

Based on my observation, equal_shape_distance_diff_coordinates happens when two consecutive points are very close. For example, the actualDistanceBetweenShapePoints for previous point (lat 48.36919, long -124.63073) and current point (lat 48.36919, long -124.63074) is 0, so these two consecutive points have equal shape_dist_traveled and but different lat/lon coordinates inshapes.txt. Based on the Haversine formula, the distance between these two points is approximately 0.5702 meters, which is very close to 0. The getDistance method that used by GTFS validator is from com.google.common.geometry.S2LatLng. It does some internal rounding or precision limitations and might not handle very close points accurately.

I prefer downgrading equal_shape_distance_diff_coordinates to a warning than searching for other substitute geometry libraries. @isabelle-dr @emmambd

@emmambd
Copy link
Contributor

emmambd commented Sep 25, 2023

@qcdyx Hey Jingsi! The goal right now is to conduct analytics on when equal_shape_distance_diff_coordinates to make a decision — it's too soon to decide about severity at the moment without doing an evaluation of the Mobility Database feeds that we use in acceptance tests, as specified in the next step heading here

The getDistance method that used by GTFS validator is from com.google.common.geometry.S2LatLng. It does some internal rounding or precision limitations and might not handle very close points accurately.

This is interesting! I believe up to this point we thought that the validator was using the actual shape_dist_traveled points defined by the feeds, not doing any additional interpretation. Let's talk about this more offline and then I'll circle back here to document next steps.

@emmambd
Copy link
Contributor

emmambd commented Feb 20, 2024

An update on our approach on this PR: #1675 (comment)

Current approach is to implement a threshold of 1.11m on distances between shape point pairs for the ERROR (to capture any "same" values that result from precision/rounding issues at 5 decimal places for lat long values) , and create a WARNING for any distances that are less than that.

We plan to include this in the upcoming release, and will only take next steps to make this threshold more permissive if we receive user feedback on it.

@emmambd
Copy link
Contributor

emmambd commented Mar 8, 2024

I'm going to close issue based on #1675 and re-open it if there is new community feedback after this release that indicates we should make the threshold more permissive.

@emmambd emmambd closed this as completed Mar 8, 2024
@emmambd emmambd linked a pull request Mar 8, 2024 that will close this issue
5 tasks
@emmambd emmambd removed status: Work in progress A PR that would close this issue has been opened. need spec clarification Needs a modification in the specification labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working (crash, a rule has a problem) enhancement New feature request or improvement on an existing feature GTFS Reference Used for Adding or changing rules that belong in the GTFS reference
Projects
6 participants