fix: False positives for StopTimeTimepointWithoutTimesNotice #1044

Merged: lionel-nj merged 27 commits into master from fix/timepoint on Nov 16, 2021

Contributor

@lionel-nj lionel-nj commented Oct 26, 2021

Summary:

This PR stops TimepointValidator from raising false positives for StopTimeTimepointWithoutTimesNotice.

Expected behavior:

sample data 1 (no timepoint column)

| stop_sequence | arrival_time | departure_time |
|---|---|---|
| 1 | 00:00:00 | 00:00:00 |
| 2 | | |
| 3 | | |
| 4 | 00:10:00 | 00:10:00 |

MissingTimepointColumnNotice should be issued once.

sample data 2 (timepoint column + no empty values)

| stop_sequence | arrival_time | departure_time | timepoint |
|---|---|---|---|
| 1 | 00:00:00 | 00:00:00 | 1 |
| 2 | 00:02:00 | 00:02:00 | 0 |
| 3 | 00:08:00 | 00:08:00 | 0 |
| 4 | 00:10:00 | 00:10:00 | 1 |

No notice

sample data 3 (timepoint with no times)

| stop_sequence | arrival_time | departure_time | timepoint |
|---|---|---|---|
| 1 | 00:00:00 | 00:00:00 | 1 |
| 2 | | | 1 |
| 3 | 00:08:00 | 00:08:00 | 0 |
| 4 | 00:10:00 | 00:10:00 | 1 |

StopTimeTimepointWithoutTimesNotice should be issued for the 2nd row.

sample data 4 (approximate, with times)

| stop_sequence | arrival_time | departure_time | timepoint |
|---|---|---|---|
| 1 | 00:00:00 | 00:00:00 | 1 |
| 2 | 00:02:00 | 00:02:00 | |
| 3 | 00:08:00 | 00:08:00 | |
| 4 | 00:10:00 | 00:10:00 | 1 |

MissingTimepointValueNotice should be issued for the 2nd and 3rd rows.

sample data 5 (approximate, with no times)

| stop_sequence | arrival_time | departure_time | timepoint |
|---|---|---|---|
| 1 | 00:00:00 | 00:00:00 | 1 |
| 2 | | | |
| 3 | | | |
| 4 | 00:10:00 | 00:10:00 | 1 |

MissingTimepointValueNotice should be issued for the 2nd and 3rd rows.
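
The expected behavior for the five samples can be summarized as a small decision procedure. The following is a minimal sketch in plain Java, not the validator's actual implementation: the `Row` class, the `expectedNotices` method, and the `NoticeName:stop_sequence` string format are made up for illustration. It also assumes that a row counts as "without times" when either time is empty.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: none of these names exist in gtfs-validator. */
final class TimepointExpectationSketch {

  /** One stop_times.txt row; a null field means the cell was empty. */
  static final class Row {
    final int stopSequence;
    final String arrivalTime;
    final String departureTime;
    final Integer timepoint;

    Row(int stopSequence, String arrivalTime, String departureTime, Integer timepoint) {
      this.stopSequence = stopSequence;
      this.arrivalTime = arrivalTime;
      this.departureTime = departureTime;
      this.timepoint = timepoint;
    }

    boolean hasBothTimes() {
      return arrivalTime != null && departureTime != null;
    }
  }

  /** Returns the notices the samples above expect, in row order. */
  static List<String> expectedNotices(boolean hasTimepointColumn, List<Row> rows) {
    List<String> notices = new ArrayList<>();
    if (!hasTimepointColumn) {
      // File-level problem: reported once, individual rows are not inspected.
      notices.add("MissingTimepointColumnNotice");
      return notices;
    }
    for (Row row : rows) {
      if (row.timepoint == null) {
        // Column present but cell empty (samples 4 and 5).
        notices.add("MissingTimepointValueNotice:" + row.stopSequence);
      } else if (row.timepoint == 1 && !row.hasBothTimes()) {
        // Explicit timepoint=1 without times (sample 3). Assumption: either
        // missing time is enough to flag the row.
        notices.add("StopTimeTimepointWithoutTimesNotice:" + row.stopSequence);
      }
    }
    return notices;
  }
}
```

With the rows of sample data 3, `expectedNotices(true, rows)` returns a single `StopTimeTimepointWithoutTimesNotice:2` entry. With sample data 1, passing `false` returns only `MissingTimepointColumnNotice`.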

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • [ ] Include screenshot(s) showing how this pull request works and fixes the issue(s)

@lionel-nj lionel-nj self-assigned this Oct 26, 2021
@lionel-nj lionel-nj linked an issue Oct 26, 2021 that may be closed by this pull request
@isabelle-dr isabelle-dr added this to the v3.0.0 milestone Oct 27, 2021
Contributor

@maximearmstrong maximearmstrong left a comment

Few questions and comments in-line before approval :)

 * </ul>
 */
@GtfsValidator
public class TimepointTimeValidator extends SingleEntityValidator<GtfsStopTime> {

  @Override
  public void validate(GtfsStopTime stopTime, NoticeContainer noticeContainer) {
    if (!stopTime.hasTimepoint()) {
Contributor

Does the validator fill the column with a default value if it is not in the dataset, or are we not entering this TimepointTimeValidator?

Contributor Author

> Does the validator fill the column with a default value if it is not in the dataset,

Yes, default is 1.

lionel-nj and others added 3 commits November 1, 2021 09:18
Co-authored-by: Maxime Armstrong <46797220+maximearmstrong@users.noreply.github.com>
@isabelle-dr
Contributor

One question here:
In sample data 1, it's written:

> MissingTimepointValueNotice should be issued for the 2nd and 3rd row.

That's not exactly what is mentioned here. Is this a typo in the description of this PR? We talked about throwing MissingTimepointValueNotice only if the column is present.

Also: do you need real datasets that fall in those different categories to test this PR?

@lionel-nj
Contributor Author

lionel-nj commented Nov 1, 2021

> It's not exactly what is mentioned here, is this a typo in the description of this PR? We talked about throwing the MissingTimepointValueNotice only if the column is present.

Indeed, this will be fixed in the next commits. Thanks for flagging that.

> Also: do you need real datasets that fall in those different categories to test this PR?

I'll use the one you flagged in this document, thanks.

@lionel-nj
Contributor Author

lionel-nj commented Nov 1, 2021

@maximearmstrong PTAL - the rule logic has changed a bit: MissingTimepointColumnNotice is now generated when stop_times.txt has no timepoint column.

At present, out of ~1200 datasets on the Mobility Database:

  • 44 datasets trigger StopTimeTimepointWithoutTimesNotice;
  • 560 datasets trigger MissingTimepointColumnNotice (legacy datasets);
  • 125 datasets trigger MissingTimepointValueNotice.

@isabelle-dr
Contributor

That StopTimeTimepointWithoutTimesNotice number makes a lot more sense!

The number of MissingTimepointColumnNotice is surprisingly high.
After looking at a few datasets that trigger it, I found a false positive 😩, here is the dataset.

Contributor

@isabelle-dr isabelle-dr left a comment

Could you rename the notice names? :)

  • MissingTimepointColumn -> MissingTimepointColumnNotice
  • MissingTimepointValue -> MissingTimepointValueNotice

}
}

private boolean isTimepoint(GtfsStopTime stopTime) {
- return stopTime.timepoint().equals(GtfsStopTimeTimepoint.EXACT);
+ return stopTime.hasTimepoint() && stopTime.timepoint().equals(GtfsStopTimeTimepoint.EXACT);
Collaborator

I see, in this case should we remove default value instead?
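
To see why the `hasTimepoint()` guard matters, here is a minimal sketch of the default-value pitfall. `StopTimeSketch` is a hypothetical stand-in for the generated `GtfsStopTime` class, whose real internals may differ. It only assumes, as stated earlier in this thread, that a missing timepoint falls back to a default of 1.

```java
/** Hypothetical stand-in for a generated stop time entity. */
final class StopTimeSketch {
  static final int EXACT = 1; // GTFS: timepoint=1 means the times are exact
  static final int DEFAULT_TIMEPOINT = EXACT; // per this thread, the default is 1

  // null when the cell (or the whole column) is absent from stop_times.txt
  private final Integer rawTimepoint;

  StopTimeSketch(Integer rawTimepoint) {
    this.rawTimepoint = rawTimepoint;
  }

  boolean hasTimepoint() {
    return rawTimepoint != null;
  }

  /** Returns the parsed value, or the default when nothing was provided. */
  int timepoint() {
    return hasTimepoint() ? rawTimepoint : DEFAULT_TIMEPOINT;
  }

  /** Old check: an absent value looks EXACT because of the default. */
  static boolean isTimepointOld(StopTimeSketch stopTime) {
    return stopTime.timepoint() == EXACT;
  }

  /** New check: only an explicitly provided value of 1 counts. */
  static boolean isTimepointNew(StopTimeSketch stopTime) {
    return stopTime.hasTimepoint() && stopTime.timepoint() == EXACT;
  }
}
```

For `new StopTimeSketch(null)`, the old check returns `true` and the new one returns `false`. That is the kind of record that would otherwise be reported as a timepoint without times.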

.setDepartureTime(null)
.setStopId("stop id 0")
.setStopSequence(2)
.setTimepoint((Integer) null)
Collaborator

Thanks for the explanation!

It looks like we may need a feature request for specifying test entities in CSV format. Otherwise the tests have to replicate the logic of the auto-generated code, which may change and leave us with incorrect tests.
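
As a rough illustration of what such a feature could look like, the sketch below parses a small CSV snippet into rows keyed by column name. `CsvTestRows` and `parse` are hypothetical and not part of gtfs-validator. The parser handles only unquoted, comma-separated cells, which is enough to show how CSV input separates a missing column from an empty value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical test helper; not an existing gtfs-validator API. */
final class CsvTestRows {

  /** Parses simple CSV text (no quoting) into one map per data row. */
  static List<Map<String, String>> parse(String csv) {
    String[] lines = csv.trim().split("\\R");
    String[] header = lines[0].split(",", -1);
    List<Map<String, String>> rows = new ArrayList<>();
    for (int i = 1; i < lines.length; i++) {
      // The -1 limit keeps trailing empty cells such as "2,,,".
      String[] cells = lines[i].split(",", -1);
      Map<String, String> row = new LinkedHashMap<>();
      for (int c = 0; c < header.length; c++) {
        String cell = c < cells.length ? cells[c].trim() : "";
        // Empty cell -> key present with a null value ("no value").
        // Column not in the header -> key absent ("no column").
        row.put(header[c].trim(), cell.isEmpty() ? null : cell);
      }
      rows.add(row);
    }
    return rows;
  }
}
```

A test could then check `row.containsKey("timepoint")` for the "no column" case and `row.get("timepoint") == null` for the "no value" case, without hand-building entities through setters.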

- clarify docstrings
- remove use of Strings.format()
@isabelle-dr
Contributor

@asvechnikov2 can we get a review so we can merge this PR?
See this comment for what we are proposing to do.

.setDepartureTime(GtfsTime.fromSecondsSinceMidnight(580))
.setStopId("stop id 2")
.setStopSequence(2)
.setTimepoint(1)
Member

@barbeau barbeau Nov 16, 2021

Is it right to set timepoint equal to 1 for this test? From the method name I expected both to be set to null. If it's intended, IMHO it's a bit awkward as it's not real data - you'd never have two records for stop times where one has the column and the other does not.

Member

@barbeau barbeau Nov 16, 2021

Follow up question after looking at other tests - how do you differentiate between "no column" and "no value" in the unit tests?

Is the missing timepoint value in CsvHeader header what triggers MissingTimepointColumnNotice (even if you call .setTimepoint(1) on one of the StopTime records)?

Is setting .setTimepoint(null) what is used to define a missing value (even if the timepoint value is included in header)?

I'd suggest clearly documenting this behavior in the unit tests as right now that's not 100% clear (to me at least), even following the discussion thread in https://github.com/MobilityData/gtfs-validator/pull/1044/files#r742464479. And I'd avoid any confusing data that won't exist in the real world, like a missing column but a record with a timepoint value (i.e., this test currently).

Contributor Author

> Follow up question after looking at other tests - how do you differentiate between "no column" and "no value" in the unit tests?

  • no column -> these tests use the legacy header
  • no value -> these tests use the regular header, with a null timepoint value.

> Is the missing timepoint value in CsvHeader header what triggers MissingTimepointColumnNotice (even if you call .setTimepoint(1) on one of the StopTime records)?

Yes.

> Is setting .setTimepoint(null) what is used to define a missing value (even if the timepoint value is included in header)?

Exactly.

> I'd suggest clearly documenting this behavior in the unit tests as right now that's not 100% clear (to me at least), even following the discussion thread in https://github.com/MobilityData/gtfs-validator/pull/1044/files#r742464479. And I'd avoid any confusing data that won't exist in the real world, like a missing column but a record with a timepoint value (i.e., this test currently).

👍🏾 I removed the unnecessary (and misleading) records in e33c960, and added documentation in 84926f4.

@github-actions
Contributor

This contribution does not follow the conventions set by the Google Java style guide. Please run the following command line at the root of the project to fix formatting errors: ./gradlew goJF.

Member

@barbeau barbeau left a comment

Thanks @lionel-nj, a suggestion for docs and a few more comments in-line.

Member

@barbeau barbeau left a comment

👍 LGTM

@lionel-nj lionel-nj merged commit 57689d7 into master Nov 16, 2021
@lionel-nj lionel-nj deleted the fix/timepoint branch November 16, 2021 21:23
Collaborator

@asvechnikov2 asvechnikov2 left a comment

> @asvechnikov2 can we get a review so we can merge this PR? See this comment for what we are proposing to do.

LGTM! Sorry, I was sure that I'd removed "requested changes" review and the PR had been merged already.

}
}

private boolean isTimepoint(GtfsStopTime stopTime) {
- return stopTime.timepoint().equals(GtfsStopTimeTimepoint.EXACT);
+ return stopTime.hasTimepoint() && stopTime.timepoint().equals(GtfsStopTimeTimepoint.EXACT);
Collaborator

nit: I'd add a comment to clarify the behaviour, here and probably in the class docstring as well.

Successfully merging this pull request may close these issues.

False positives for StopTimeTimepointWithoutTimesNotice
5 participants