Potential false positives for equal_shape_distance_diff_coordinates #1258

isabelle-dr · 2022-09-19T16:28:30Z

Problem
We hear from users that equal_shape_distance_diff_coordinates (which is currently an error) is often present in datasets that contain shapes, and the work needed to fix this issue in the datasets gives an incentive for users not to use shapes at all.

This rule was initially implemented in PR #1083, alongside two others:

decreasing_shape_distance: error, and
equal_shape_distance_same_coordinates: warning,

with the intention of validating the shapes.txt Reference:

Values must increase along with shape_pt_sequence; they must not be used to show reverse travel along a route.

What to do
Revisit whether the conditions that trigger equal_shape_distance_diff_coordinates should really be an error: talk to the community and analyze production data.
Consider lowering the severity to a warning and opening a discussion in the specification to make it clearer.

Next Steps from Most Recent Comment

After a discussion with @qcdyx, the strategy to solve this issue is:

we are assuming that a portion of these notices come from a precision issue of the software creating shape files: there are two very close shape points that have distinct lat/lon values, but the shape_dist_traveled field is the same.
pull the actualDistanceBetweenShapePoints field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates.
plot it on a histogram with frequency on the y and latitude and longitude diff value on the x. Then, assess based on what we see:
Spreadsheet values (example from @cka-y's past analytics work in https://github.com/MobilityData/mobility-database-catalogs/pull/275/files)

ID of each feed
URL of each feed
csvRowNumber
shapeDistTraveled
shapePtLat
shapePtLon
prevCsvRowNumber
prevShapeDistTraveled
prevShapePtLat
prevShapePtLon
actualDistanceBetweenShapePoints

Once this spreadsheet is created, we can see if there's a common threshold for actualDistanceBetweenShapePoints (how far apart are they typically for feeds generating this error?)

do we have a clear threshold that has the majority of the values below it?
if so, would it be reasonable to consider values below the threshold as equal_shape_distance_same_coordinates (which is a warning)?
if so: does this need a spec amendment?

Additional Context
These three rules were initially created to replace the decreasing_or_equal_shape_distance notice because this rule was triggered by two things that deserved to be treated differently:

shape_dist_traveled decreases between two consecutive shape points (which is a clear violation of the spec)
shape_dist_traveled is equal between two consecutive shape points (also a violation but is not as big of a problem)

By digging deeper into number 2 above, we noticed that we were seeing two cases in production data:
2.1 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are equal (which seems fine)
2.2 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are not equal (which seems like a problem, but it could be caused by the scheduling software that rounds shape_dist_traveled when the two shape points are really close)

We went ahead and made our own interpretation of the specification based on what we saw in the production data: condition 2.1 would be a warning, whereas conditions 1 & 2.2 would be errors, which is slightly less strict than the spec that strictly mentions "must increase".
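The interpretation described above can be sketched as a small decision function. This is only an illustration of the three cases, not the validator's actual implementation: the `ShapePoint` type and the `classify_pair` helper are hypothetical names, and exact equality on coordinates is an assumption.

```python
from typing import NamedTuple, Optional, Tuple


class ShapePoint(NamedTuple):
    lat: float   # shape_pt_lat
    lon: float   # shape_pt_lon
    dist: float  # shape_dist_traveled


def classify_pair(prev: ShapePoint, curr: ShapePoint) -> Optional[Tuple[str, str]]:
    """Return (notice code, severity) for two consecutive shape points, or None."""
    if curr.dist < prev.dist:
        # Case 1: the distance decreases, a clear violation of the spec.
        return ("decreasing_shape_distance", "ERROR")
    if curr.dist == prev.dist:
        if (curr.lat, curr.lon) == (prev.lat, prev.lon):
            # Case 2.1: same distance, same coordinates.
            return ("equal_shape_distance_same_coordinates", "WARNING")
        # Case 2.2: same distance, different coordinates.
        return ("equal_shape_distance_diff_coordinates", "ERROR")
    return None  # the distance strictly increases: nothing to report


a = ShapePoint(48.36919, -124.63073, 10.0)
print(classify_pair(a, ShapePoint(48.36919, -124.63074, 10.0)))
```

The last line uses the pair of points quoted later in this thread, which falls under case 2.2.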


github-actions · 2022-10-03T16:51:55Z

Thank you for reporting a bug. The issue has been placed in triage, and the MobilityData team will follow up on it.

isabelle-dr · 2023-02-02T21:39:09Z

Posting on behalf of Marcy Jaffe with the National RTAP.

I'd like to offer training and recommend your MD Schedule Validator tools if it might be possible to reconsider as warning (vs. error) for equal_shape_distance_diff_coordinates
When Google Validates it is a warning

For Mobility Data it is an error - which I will need to advise in my trainings that they can ignore which might mean they want to ignore other errors - which I do not want

It will be nearly impossible and offer very little benefit to go in for literally 100 rows of data and delete the nearby points. Riders will not have a much better experience and I believe some agencies will not want to manage their GTFS! Might you please consider revising this advisory to a warning?

isabelle-dr · 2023-02-02T21:39:45Z

I am tempted to downgrade this notice, and propose a modification to the spec from:

Values must increase along with shape_pt_sequence

to:

Values must not decrease along with shape_pt_sequence

@bdferris-v2 thoughts?

derhuerst · 2023-02-03T20:10:01Z

Is it possible to configure the rules' severities when using it? (Without rebuilding the .jar/Docker image.)

For my use cases, I'd like to have the option to treat equal_shape_distance_diff_coordinates as an error, even if it is decided here that it should be a warning.

If this is possible already, why not let people who focus on small/rural GTFS providers re-configure equal_shape_distance_diff_coordinates to warning? I'm not saying that their perspective isn't relevant, but I think there is a trade-off to be made between small providers, who might not have the technical resources to produce high-quality GTFS feeds, and big metropolitan or even national providers, who should be encouraged to follow the spec (and its implicit intentions) rather strictly. Because I see this range of sophistication as inevitable, I'd rather opt for more strict defaults.

KClough · 2023-02-18T17:01:27Z

@isabelle-dr do you have a GTFS data set that exhibits this issue?

isabelle-dr · 2023-03-03T15:52:55Z

@KClough I have requested it!
I agree with @derhuerst's point of view, this should stay an error in this validator (unless we change the spec).

Is it possible to configure the rules' severities when using it?

We are working on it 🙃

I have the impression that some vendor tools create this issue in a systemic way. It might be worth digging into how shapes.txt is created...
An immediate action item might be to update the documentation to explain to users when it's acceptable to ignore this issue.

isabelle-dr · 2023-03-03T15:57:09Z

Edit on what I've just said: I am not entirely sure error is the right severity, and I'd like to take a data-driven approach to figure this out.
@KClough: can you get the list of all the datasets in the Mobility Database that trigger this notice? We can attempt to draw patterns based on what we see in production.

isabelle-dr · 2023-03-03T21:45:40Z

@KClough here are three datasets that trigger this rule and also the equal_shape_distance_same_coordinates.

Interior Alaska
https://www.dropbox.com/s/5es8jexp0qpmsxd/interiorak_google_transit.zip?dl=0

This feed also has a warning
https://www.dropbox.com/s/h16ny11hlln3k9h/centraltransit_google_transit.zip?dl=0

While this agency has a related warning & error
https://www.dropbox.com/s/8c8zjbp89fff0do/makah_google_transit.zip?dl=0

And here is Marcy's answer to my question: how do you create the shape files?

Rural agencies without a GIS staff produce higher quality GTFS with shapes generated using MyMaps - guided by the stops along the way - in the direction of travel >> see this file

https://www.google.com/maps/d/edit?mid=1WuMlxgYa-NCZLZAuyQdwFhDnq6gvsGk&usp=sharing

and then export the shape as KML , name the shape_id

At times a point is near another point along the route & voila - an error
I've tried to search for duplicate values and delete - while with multiple routes there are too many values & this was not a quick fix

isabelle-dr · 2023-03-24T18:10:04Z

A comment from our slack channel on this issue

Because the shape is a GPS trace, is it possible to quantize the coordinates to the 1m level, i.e. remove the second coordinate with the same dist?
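A minimal sketch of the clean-up suggested in this comment, assuming the shape is already sorted by shape_pt_sequence and each point is a plain `(lat, lon, shape_dist_traveled)` tuple. The `drop_repeated_distances` helper is a hypothetical name, and whether dropping points is acceptable for a given feed is a judgment call for the producer.

```python
def drop_repeated_distances(points):
    """Keep only the first of consecutive points sharing a shape_dist_traveled.

    `points` is a list of (lat, lon, shape_dist_traveled) tuples in
    shape_pt_sequence order.
    """
    kept = []
    for point in points:
        if kept and point[2] == kept[-1][2]:
            continue  # same distance as the previous kept point: skip it
        kept.append(point)
    return kept


shape = [
    (48.36919, -124.63073, 10.0),
    (48.36919, -124.63074, 10.0),  # same distance as the previous point
    (48.36925, -124.63090, 12.5),
]
cleaned = drop_repeated_distances(shape)
print(len(cleaned))
```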

isabelle-dr · 2023-03-24T18:12:09Z

It looks like a reasonable next step is to see how this user is generating shapes and if the file can be cleaned-up. I can commit to doing this in the next few weeks.
@KClough, I'd still be interested to get the list of datasets that trigger this notice to have a closer look

briandonahue · 2023-05-30T21:02:18Z

@isabelle-dr Currently the notice data does not include the lat, long for the affected row. Adding that would be one way of getting all the data needed to test the Mobility Database data in the compiled reports that are run as part of the GitHub Actions, and we could then potentially use some JSON querying tools to check stats on that output. This could be done as a test PR if it's not desirable to add those fields into the official output.

Alternatively or additionally, the Cal-ITP project could potentially query this information for the feeds in their database, if we want to make a request to them, but would be limited to California data.
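As a rough illustration of the JSON querying idea, the sketch below pulls the sample notices for one notice code out of a report. The report layout (a top-level `notices` list whose entries carry `code` and `sampleNotices`) is an assumption made for this example, the `samples_for_code` helper is hypothetical, and the report fragment is made up.

```python
import json

NOTICE_CODE = "equal_shape_distance_diff_coordinates"


def samples_for_code(report_text, code=NOTICE_CODE):
    """Collect the sample notices reported under one notice code."""
    report = json.loads(report_text)
    samples = []
    for notice in report.get("notices", []):  # assumed top-level key
        if notice.get("code") == code:
            samples.extend(notice.get("sampleNotices", []))  # assumed key
    return samples


# Made-up report fragment, for illustration only.
example_report = json.dumps({
    "notices": [
        {"code": NOTICE_CODE, "severity": "ERROR", "sampleNotices": [
            {"csvRowNumber": 12, "shapeDistTraveled": 3.2,
             "prevCsvRowNumber": 11, "prevShapeDistTraveled": 3.2,
             "actualDistanceBetweenShapePoints": 0.7},
        ]},
        {"code": "decreasing_shape_distance", "severity": "ERROR",
         "sampleNotices": [{"csvRowNumber": 40}]},
    ]
})

distances = [s["actualDistanceBetweenShapePoints"]
             for s in samples_for_code(example_report)]
print(distances)
```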

isabelle-dr · 2023-08-02T13:54:05Z

After a discussion with @qcdyx, the strategy to solve this issue is:

we are assuming that a portion of these notices come from a precision issue of the software creating shape files: there are two very close shape points that have distinct lat/lon values, but the shape_dist_traveled field is the same.
pull the actualDistanceBetweenShapePoints field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates.
plot it on a histogram with frequency on the y and shape_dist_traveled value on the x. Then, assess based on what we see:
- do we have a clear threshold that has the majority of the values below it?
- if so, would it be reasonable to consider values below the threshold as equal_shape_distance_same_coordinates (which is a warning)?
- if so: does this need a spec amendment?
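The assessment step could be sketched as follows, assuming the actualDistanceBetweenShapePoints values have already been collected into a list of meters. The `histogram` and `share_below` helpers are hypothetical, and the sample values are made up purely to show the mechanics.

```python
from collections import Counter


def histogram(distances_m, bin_width_m):
    """Count values per bin; keys are the lower edge of each bin in meters."""
    counts = Counter(int(d // bin_width_m) for d in distances_m)
    return {b * bin_width_m: counts[b] for b in sorted(counts)}


def share_below(distances_m, threshold_m):
    """Fraction of values strictly below a candidate threshold."""
    if not distances_m:
        return 0.0
    return sum(d < threshold_m for d in distances_m) / len(distances_m)


# Made-up distances in meters, for illustration only.
sample = [0.1, 0.3, 0.6, 0.9, 1.0, 5.0, 20.0]
bins = histogram(sample, 0.5)
print(bins)
print(share_below(sample, 1.0))
```

If most values fall below one candidate threshold across many feeds, that would support treating those pairs as a precision issue rather than a data error.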

qcdyx · 2023-09-25T18:52:41Z

Do we have agreement on downgrading equal_shape_distance_diff_coordinates from error to warning?

Based on my observation, equal_shape_distance_diff_coordinates happens when two consecutive points are very close. For example, the actualDistanceBetweenShapePoints for previous point (lat 48.36919, long -124.63073) and current point (lat 48.36919, long -124.63074) is 0, so these two consecutive points have equal shape_dist_traveled but different lat/lon coordinates in shapes.txt. Based on the Haversine formula, the distance between these two points is approximately 0.5702 meters, which is very close to 0. The getDistance method used by the GTFS validator is from com.google.common.geometry.S2LatLng. It may apply some internal rounding or have precision limitations, and might not handle very close points accurately.

I prefer downgrading equal_shape_distance_diff_coordinates to a warning over searching for other substitute geometry libraries. @isabelle-dr @emmambd

emmambd · 2023-09-25T18:55:52Z

@qcdyx Hey Jingsi! The goal right now is to conduct analytics on when equal_shape_distance_diff_coordinates is triggered in order to make a decision. It's too soon to decide about severity at the moment without doing an evaluation of the Mobility Database feeds that we use in acceptance tests, as specified in the next step heading here

The getDistance method used by the GTFS validator is from com.google.common.geometry.S2LatLng. It may apply some internal rounding or have precision limitations, and might not handle very close points accurately.

This is interesting! I believe up to this point we thought that the validator was using the actual shape_dist_traveled points defined by the feeds, not doing any additional interpretation. Let's talk about this more offline and then I'll circle back here to document next steps.

emmambd · 2024-02-20T23:01:22Z

An update on our approach on this PR: #1675 (comment)

Current approach is to implement a threshold of 1.11m on distances between shape point pairs for the ERROR (to capture any "same" values that result from precision/rounding issues at 5 decimal places for lat/long values), and create a WARNING for any distances that are less than that.

We plan to include this in the upcoming release, and will only take next steps to make this threshold more permissive if we receive user feedback on it.
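To show where a figure like 1.11m comes from, here is a minimal sketch assuming a spherical Earth with a mean radius of 6,371 km and the Haversine formula. The validator itself relies on S2LatLng.getDistance as noted earlier in this thread, so this is a swapped-in approximation. The `severity_for_equal_distance_pair` helper is hypothetical, and treating a distance exactly at the threshold as an ERROR is an assumption.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch
THRESHOLD_M = 1.11            # threshold quoted in the comment above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def severity_for_equal_distance_pair(prev, curr):
    """Severity for two consecutive points with equal shape_dist_traveled
    but different coordinates; prev and curr are (lat, lon) tuples."""
    distance = haversine_m(prev[0], prev[1], curr[0], curr[1])
    return "ERROR" if distance >= THRESHOLD_M else "WARNING"


# One step in the fifth decimal place of latitude is roughly 1.11 m.
step_m = haversine_m(0.0, 0.0, 0.00001, 0.0)
print(round(step_m, 2))

# The pair of points quoted earlier in this thread falls below the threshold.
print(severity_for_equal_distance_pair((48.36919, -124.63073), (48.36919, -124.63074)))
```

A step in longitude covers less ground than a step in latitude away from the equator, so a pair differing only in the fifth decimal of longitude also stays under the threshold.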

emmambd · 2024-03-08T20:56:12Z

I'm going to close this issue based on #1675 and re-open it if there is new community feedback after this release that indicates we should make the threshold more permissive.

isabelle-dr added enhancement New feature request or improvement on an existing feature GTFS Reference Used for Adding or changing rules that belong in the GTFS reference status: Needs discussion We need a discussion on requirements before calling this issue ready labels Sep 19, 2022

isabelle-dr added this to the Rules improvements milestone Oct 3, 2022

isabelle-dr added the need spec clarification Needs a modification in the specification label Oct 3, 2022

isabelle-dr removed this from the Rules improvements milestone Oct 3, 2022

isabelle-dr added bug Something isn't working (crash, a rule has a problem) and removed bug Something isn't working (crash, a rule has a problem) labels Oct 3, 2022

isabelle-dr added the bug Something isn't working (crash, a rule has a problem) label Feb 2, 2023

isabelle-dr added this to the Q1 2023 milestone Feb 14, 2023

isabelle-dr assigned KClough Mar 17, 2023

isabelle-dr modified the milestones: Q1 2023, Next Mar 24, 2023

holly-g assigned briandonahue Apr 25, 2023

briandonahue mentioned this issue May 8, 2023

False positive mixed_case_recommended_field for numeric values #1402

Closed

isabelle-dr unassigned briandonahue and KClough Jun 6, 2023

briandonahue mentioned this issue Jun 6, 2023

feat: add lat/lon to shape distance notices" #1480

Closed

4 tasks

isabelle-dr linked a pull request Jun 10, 2023 that will close this issue

feat: add lat/lon to shape distance notices" #1480

Closed

4 tasks

isabelle-dr added status: Work in progress A PR that would close this issue has been opened. and removed status: Needs discussion We need a discussion on requirements before calling this issue ready labels Jun 10, 2023

isabelle-dr modified the milestones: MobilityData Next, MobilityData Now Jun 12, 2023

qcdyx self-assigned this Sep 20, 2023

emmambd mentioned this issue Jan 10, 2024

Test setting threshold of 1.11m on equal_shape_dist_diff_coordinates #1638

Closed

emmambd removed this from the Now milestone Feb 12, 2024

emmambd closed this as completed Mar 8, 2024

emmambd linked a pull request Mar 8, 2024 that will close this issue

feat: threshold of 1.11m on equal_shape_distance_diff_coordinates #1675

Merged

5 tasks

emmambd removed status: Work in progress A PR that would close this issue has been opened. need spec clarification Needs a modification in the specification labels Mar 13, 2024
