Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ShapeIncreasingDistanceValidator #1083

Merged
merged 24 commits into from
Jan 25, 2022

Conversation

lionel-nj
Copy link
Contributor

@lionel-nj lionel-nj commented Dec 22, 2021

closes #1070

Summary:

This PR provides support to: make warning-level notice when an equal shape_dist_traveled value is encountered and the previous row in the sequence had the same GPS coordinate.

Expected behavior:

  • DecreasingShapeDistanceNotice as an ERROR: should be issued when shape_dist_traveled decreases between two consecutive shape points (based on shape_pt_sequence);
    EqualShapeDistanceDiffCoordinatesNotice as an ERROR: should be issued when shape_dist_traveled is equal between two consecutives shape points that have different GPS coordinates;
    EqualShapeDistanceSameCoordinatesNotice as a WARNING: should be issued when shape_dist_traveled is equal between two consecutives shape points that have the same GPS coordinates.

@evansiroky @botanize your comments and feedback are more than welcome :)

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification(https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • [ ] Include screenshot(s) showing how this pull request works and fixes the issue(s)

lionel-nj added 2 commits December 22, 2021 12:05
- additional unit test
- update documentation to match latest development
Comment on lines 84 to 85
return shape.shapePtLon() == otherShape.shapePtLon()
&& shape.shapePtLat() == otherShape.shapePtLat();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the errors we've been seeing a lot is when there are two shape points that are very close to each other (<1 meter apart) and the shape distance was equal. It could've been that the shape distance units in a particular feed were in kilometers and that due to a rounding process the distance was equal and therefore probably an OK value. It would be great if this check here could return true if the coordinates are less than 1 meter apart.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question is why 1 meter, why not 80cm or 1.20m?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 meter is arbitrary honestly. Maybe @barbeau can provide some more educated insights into GPS precision and measurement.

RULES.md Outdated

<a name="EqualShapeDistanceNotice"/>

When sorted by `shape.shape_pt_sequence`, two consecutive shape points must not have equal values for `shape_dist_traveled` if their GPS coordinates are the same.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should read "if their GPS coordinates are not the same". It's an error to have equal distances with different coordinates, it's a warning to have equal distances and equal coordinates.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only other concern is that at least for our feed, which generates equal distances every stop pair, we get 10s of thousands of these notices per validation. I'd love to have the ability to output an abbreviated json validation report that reported the number of times each notices happened, but only the first X occurrences.

Copy link
Contributor Author

@lionel-nj lionel-nj Dec 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should read "if their GPS coordinates are not the same".

👍🏾

It's an error to have equal distances with different coordinates, it's a warning to have equal distances and equal coordinates.

Or if said coordinates are actually less more than 1 meter away according to #1083 (comment).

My only other concern is that at least for our feed, which generates equal distances every stop pair, we get 10s of thousands of these notices per validation. I'd love to have the ability to output an abbreviated json validation report that reported the number of times each notices happened, but only the first X occurrences.

As of now a maximum of 1000 of notices are exported by type and severity, and the validation report gives information about the total number of notices per type and severity.

private static final int MAX_EXPORTS_PER_NOTICE_TYPE_AND_SEVERITY = 1_000;

if (i > maxExportsPerNoticeTypeAndSeverity) {
// Do not export too many notices for this type.
break;
}

I'd love to have the ability to output an abbreviated json validation report that reported the number of times each notices happened, but only the first X occurrences.

I see this as separate from the bug fixed here. We could allow users to reduce this limit in a separate PR by making the limit user configurable. cc @isabelle-dr.

- additional unit test
- update documentation to match latest development
@lionel-nj
Copy link
Contributor Author

lionel-nj commented Dec 23, 2021

Some stats:

As of now:

  • 108 datasets trigger EqualShapeDistanceNotice as an ERROR
  • 272 datasets trigger EqualShapeDistanceNotice as aWARNING
  • 2 datasets trigger DecreasingShapeDistanceNotice
  • 713 datasets are valid (no error)

Before the fix:

  • 277 datasets triggered DecreasingOrEqualShapeDistanceNotice as an ERROR
  • 608 datasets were valid (no error)

Also, with this fix no additional error type is introduced (i.e. no dataset go from valid to invalid).

cc @isabelle-dr

@isabelle-dr
Copy link
Contributor

isabelle-dr commented Dec 23, 2021

Hello,

A few thoughts/questions about the original issue and the modification of DecreasingOrEqualShapeDistanceNotice in this PR.

  1. Getting those stats is amazing! Seing that among all the datasets that have equal values for shape_dist_traveled, there is almost a 30-70% split between the ones that have the same coordinates and the ones that have different coordinates confirms that it's worth looking into what is actually a severe problem here. Thanks @botanize and @evansiroky for highlighting this issue and @lionel-nj for getting these numbers!

  2. Would EqualShapeDistanceNotice be it the first rule that can appear as an error or a warning, depending on the logic? How would this work with our plans with the custom validation feature? (especially, if we go towards our second option using extended base validation class.)
    Would it be more clear for our users if we had two distinct rules for EqualShapeDistanceNotice for the two cases we have here?
    Again, keeping in mind that soon, users will be able to upgrade or downgrade a notice. My initial feeling is that it might be more clear to have two distinct notices for different severity levels, even if part of the code is the same. I'd love to hear your thoughts here @evansiroky and @botanize.

Here are other options we could look at:

  • option 1:

    • DecreasingOrEqualShapeDistanceNotice (ERROR): when shape_dist_traveled values are decreasing OR (when they are equal AND coordinates are different)
    • EqualShapeDistanceNotice (WARNING): when shape_dist_traveled values are equal AND coordinates are equal
  • option 2:

    • DecreasingShapeDistanceNotice (ERROR)
    • EqualShapeDistanceDiffCoordinatesNotice (ERROR)
    • EqualShapeDistanceSameCoordinatesNotice (WARNING)
  1. Do we have the same problem for DecreasingOrEqualStopTimeDistanceNotice?

  2. About what we consider as "same coordinates"

It would be great if this check here could return true if the coordinates are less than 1 meter apart.

This makes a lot of sense. As @lionel-nj mentioned, I'd be interested to look at what would be the best threshold to use.
Would it be possible to make a 1-D plot showing values of distances between coordinates for all pairs of shape points that have the same shape_dist_traveled (and same shape_id) in order to see what threshold would capture all those rounded values?
We can't play too much with the datasets yet at MobilityData because we don't have an API yet 😭

  1. Are there datasets that contain two shape points with the same coordinates and different values for shape_dist_traveled in shapes.txt? Should we also flag this?

@evansiroky
Copy link

Would it be more clear for our users if we had two distinct rules for EqualShapeDistanceNotice for the two cases we have here?

Again, keeping in mind that soon, users will be able to upgrade or downgrade a notice. My initial feeling is that it might be more clear to have two distinct notices for different severity levels, even if part of the code is the same. I'd love to hear your thoughts here @evansiroky and @botanize.

I'm in favor of having two distinct rules with different error levels.

Here are other options we could look at:

  • option 1:

    • DecreasingOrEqualShapeDistanceNotice (ERROR): when shape_dist_traveled values are decreasing OR (when they are equal AND coordinates are different)
    • EqualShapeDistanceNotice (WARNING): when shape_dist_traveled values are equal AND coordinates are equal
  • option 2:

    • DecreasingShapeDistanceNotice (ERROR)
    • EqualShapeDistanceDiffCoordinatesNotice (ERROR)
    • EqualShapeDistanceSameCoordinatesNotice (WARNING)

I like option 2.

  1. Do we have the same problem for DecreasingOrEqualStopTimeDistanceNotice?

I suppose this could technically be OK in a 3-dimensional sense if there was a transit vehicle that traveled along a pattern that involved an elevator with different "stops" at different levels. However, I don't think that currently exists anywhere yet, so it may be best to keep this as an error.

  1. About what we consider as "same coordinates"

It would be great if this check here could return true if the coordinates are less than 1 meter apart.

This makes a lot of sense. Like @lionel-nj mentioned, I'd be interested to look at what would be the best threshold to use. Would it be possible to make a 1-D plot showing values of distances between coordinates for all pairs of shape points that have the same shape_dist_traveled (and same shape_id) in order to see what threshold would capture all those rounded values? We can't play too much with the datasets yet at MobilityData because we don't have an API yet 😭

I'll defer to @barbeau's expertise here.

  1. Are they datasets that contain two shape points with the same coordinates and different values for shape_dist_traveled in shapes.txt? Should we also flag this?

Yes, this should be an error, but I'd be surprised if it occurred in the wild.

@barbeau
Copy link
Member

barbeau commented Jan 3, 2022

Would it be possible to make a 1-D plot showing values of distances between coordinates for all pairs of shape points that have the same shape_dist_traveled (and same shape_id) in order to see what threshold would capture all those rounded values?

This would be ideal (i.e., create the threshold based on a percentile based on real data), but to my knowledge not easy to calculate at the moment without a lot of manual effort.

One way to approach this is based on precision of decimal places for latitude and longitude. From http://wiki.gis.com/wiki/index.php/Decimal_degrees:

decimal places degrees distance
0 1.0 111 km
1 0.1 11.1 km
2 0.01 1.11 km
3 0.001 111 m
4 0.0001 11.1 m
5 0.00001 1.11 m
6 0.000001 0.111 m
7 0.0000001 1.11 cm
8 0.00000001 1.11 mm

So, if you cap the number of decimal places to 5 the smallest difference between values you can represent is 1.11 meters. Therefore, it would be possible for two values less than 1.11 meters apart to be capped/rounded to the same latitude and longitude value.

So perhaps 1.11 meters is a good value? That should capture any "same" values that result from precision/rounding issues at 5 decimal places. That's also better accuracy that you'd typically see with an off-the-shelf consumer-grade GNSS unit capturing a vehicle path in motion.

@botanize
Copy link

botanize commented Jan 4, 2022

Here are other options we could look at:

  • option 2:

    • DecreasingShapeDistanceNotice (ERROR)
    • EqualShapeDistanceDiffCoordinatesNotice (ERROR)
    • EqualShapeDistanceSameCoordinatesNotice (WARNING)

I prefer option 2 here.

  1. Do we have the same problem for DecreasingOrEqualStopTimeDistanceNotice?

We have historically scheduled holds at stops as two consecutive departures from the same stop, so those would have shown up as equal distance in stop_times. I believe we now schedule those holds as actual holds with arrival and departure times or post-process them to merge the pair of stops into an arrival and departure time for one stop. The spec seems to allow our old practice, so we'd expect that equal stop time distances would be allowed, if not encouraged.

  1. About what we consider as "same coordinates"

1.11 m is fine, but realistically, anything less that 13 m (length of our regular buses) is basically the same from an operations perspective.

  1. Are they datasets that contain two shape points with the same coordinates and different values for shape_dist_traveled in shapes.txt? Should we also flag this?

If the shape contains a loop then you could definitely have two different shape points with the same coordinates and different shape_dist_traveled values. I don't think you'd want to flag that, unless you had some way of detecting invalid looping.

@isabelle-dr
Copy link
Contributor

isabelle-dr commented Jan 6, 2022

Thanks for the comments on this PR 🙏

If the shape contains a loop then you could definitely have two different shape points with the same coordinates and different shape_dist_traveled values. I don't think you'd want to flag that, unless you had some way of detecting invalid looping.

Good point. I was thinking of consecutive stops there, which would be an error I think. Let's keep this as a potential candidate for later if we see a need.

We have historically scheduled holds at stops as two consecutive departures from the same stop, so those would have shown up as equal distance in stop_times. I believe we now schedule those holds as actual holds with arrival and departure times or post-process them to merge the pair of stops into an arrival and departure time for one stop. The spec seems to allow our old practice, so we'd expect that equal stop time distances would be allowed, if not encouraged.

Opened #1084 to follow up.

So in conclusion, we agree on the option 2:

  • DecreasingShapeDistanceNotice as an ERROR
  • EqualShapeDistanceDiffCoordinatesNotice as an ERROR
    • consider as "different coordinates" pairs of lat/long that are more than 1,11 meters apart and see how many datasets would trigger this notice after the addition of this threshold (currently 108 datasets with no threshold). Do a quick inspection if needed, decide from there if the threshold seems acceptable).
  • EqualShapeDistanceSameCoordinatesNotice as a WARNING
    • @botanize @evansiroky, I'm in favour of not using the threshold so the rule flags a "truly" duplicative record (i. e. pairs of lat/long that are less than 1,11 meters apart would NOT be considered equal in this context). Are you OK with this?

lionel-nj added 2 commits January 10, 2022 09:24
- create EqualShapeDistanceSameCoordinatesNotice
- create EqualShapeDistanceDiffCoordinatesNotice
- update docstrings
@github-actions
Copy link
Contributor

Thank you for this contribution! 🍰✨🦄

Information about source corruption

0 out of 1248 sources are corrupted.

Acceptance test details

Due to changes in this pull request, the following validation rules trigger new errors:

Download the full acceptance test report for commit f8519a2 here (report will disappear after 90 days).

@github-actions
Copy link
Contributor

Thank you for this contribution! 🍰✨🦄

Information about source corruption

0 out of 1248 sources are corrupted.

Acceptance test details

Due to changes in this pull request, the following validation rules trigger new errors:

Download the full acceptance test report for commit a636f19 here (report will disappear after 90 days).

- update unit tests
Comment on lines 99 to 100
return shape.shapePtLon() != otherShape.shapePtLon()
&& shape.shapePtLat() != otherShape.shapePtLat()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have different coordinates while either lon XOR lat is identical. Shouldn't this be !(shape.shapePtLon() == otherShape.shapePtLon() && shape.shapePtLat() == otherShape.shapePtLat())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be !(shape.shapePtLon() == otherShape.shapePtLon() && shape.shapePtLat() == otherShape.shapePtLat())

Thanks for flagging that! Modified in 574cd7b.

Comment on lines 73 to 82
if (MAX_DISTANCE_SHAPEPOINTS_METERS
< getDistanceMeters(curr.shapePtLatLon(), prev.shapePtLatLon())) {
noticeContainer.addValidationNotice(
new EqualShapeDistanceDiffCoordinatesNotice(prev, curr));
}
} else if (curr.shapePtLat() == prev.shapePtLat()
&& curr.shapePtLon() == prev.shapePtLon()) {
// equal shape_dist_traveled and same coordinates
noticeContainer.addValidationNotice(
new DecreasingOrEqualShapeDistanceNotice(
curr.shapeId(),
curr.csvRowNumber(),
curr.shapeDistTraveled(),
curr.shapePtSequence(),
prev.csvRowNumber(),
prev.shapeDistTraveled(),
prev.shapePtSequence()));
new EqualShapeDistanceSameCoordinatesNotice(prev, curr));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if coordinates are different and getDistanceMeters(curr.shapePtLatLon(), prev.shapePtLatLon()) <= MAX_DISTANCE_SHAPEPOINTS_METERS ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, no notice should be triggered, see:

public void shapeWithEqualDistance_closeGpsCoordinates_shouldNotGenerateNotice() {

@github-actions
Copy link
Contributor

Thank you for this contribution! 🍰✨🦄

Information about source corruption

0 out of 1248 sources are corrupted.

Acceptance test details

Due to changes in this pull request, the following validation rules trigger new errors:

Download the full acceptance test report for commit e03811f here (report will disappear after 90 days).

@github-actions
Copy link
Contributor

Thank you for this contribution! 🍰✨🦄

Information about source corruption

0 out of 1248 sources are corrupted.

Acceptance test details

Due to changes in this pull request, the following validation rules trigger new errors:

Download the full acceptance test report for commit 9f0b8c0 here (report will disappear after 90 days).

@github-actions
Copy link
Contributor

This contribution does not follow the conventions set by the Google Java style guide. Please run the following command line at the root of the project to fix formatting errors: ./gradlew goJF.

@github-actions
Copy link
Contributor

Thank you for this contribution! 🍰✨🦄

Information about source corruption

0 out of 1248 sources are corrupted.

Acceptance test details

Due to changes in this pull request, the following validation rules trigger new errors:

Download the full acceptance test report for commit 5348ed0 here (report will disappear after 90 days).

Copy link
Contributor

@maximearmstrong maximearmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and comments in-line before approval :)

lionel-nj and others added 6 commits January 21, 2022 10:28
@github-actions
Copy link
Contributor

This contribution does not follow the conventions set by the Google Java style guide. Please run the following command line at the root of the project to fix formatting errors: ./gradlew goJF.

@MobilityData MobilityData deleted a comment from github-actions bot Jan 21, 2022
@github-actions
Copy link
Contributor

Thank you for this contribution! 🍰✨🦄

Information about source corruption

0 out of 1248 sources are corrupted.

Acceptance test details

Due to changes in this pull request, the following validation rules trigger new errors:

Download the full acceptance test report for commit e80cee0 here (report will disappear after 90 days).

RULES.md Outdated Show resolved Hide resolved
RULES.md Outdated Show resolved Hide resolved
RULES.md Outdated Show resolved Hide resolved
RULES.md Outdated Show resolved Hide resolved
Copy link
Contributor

@maximearmstrong maximearmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

}

@Test
public void noShapeDistTravelled_shouldNotGenerateNotice() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition 🚀

@lionel-nj
Copy link
Contributor Author

Thank you for your contribution ! @evansiroky @botanize @maximearmstrong @isabelle-dr

@lionel-nj lionel-nj merged commit 1a9ce3a into master Jan 25, 2022
@lionel-nj lionel-nj deleted the fix-ShapeIncreasingDistanceValidator branch January 25, 2022 19:54
@@ -34,7 +36,7 @@ public static GtfsShape createShapePoint(
double shapePtLat,
double shapePtLon,
int shapePtSequence,
double shapeDistTraveled) {
Double shapeDistTraveled) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for coming in late here, but is there a reason this was changed from double to Double?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done in order to support testing agains null value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants