Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False positives for StopTimeTimepointWithoutTimesNotice #954

Closed
mcplanner-zz opened this issue Aug 12, 2021 · 13 comments · Fixed by #1044
Closed

False positives for StopTimeTimepointWithoutTimesNotice #954

mcplanner-zz opened this issue Aug 12, 2021 · 13 comments · Fixed by #1044
Assignees
Labels
bug Something isn't working (crash, a rule has a problem) GTFS Reference Used for Adding or changing rules that belong in the GTFS reference

Comments

@mcplanner-zz
Copy link

Bug report

Describe the bug
The StopTimeTimepointWithoutTimesNotice error is displayed when the stop_times.timepoint field is blank. This is unexpected. I would expect this to be interpreted as having no timepoints.

How we reproduce the bug
Steps to reproduce the behaviour:
This happened with this field: http://iportal.sacrt.com/GTFS/Unitrans/google_transit.zip

Expected behaviour
If the timepoint field isn't being used, I expect an error about it's being missing.

Observed behaviour
I was alerted that every timepoint was lacking an arrival and departure time.

Screenshots:

Environment versions
Java 11, v2.0.0 validator, https://github.com/cal-itp/gtfs-validator

@mcplanner-zz mcplanner-zz added the bug Something isn't working (crash, a rule has a problem) label Aug 12, 2021
@isabelle-dr isabelle-dr added the GTFS Reference Used for Adding or changing rules that belong in the GTFS reference label Aug 31, 2021
@lionel-nj
Copy link
Contributor

lionel-nj commented Sep 20, 2021

Thanks @mcplanner for opening.

The StopTimeTimepointWithoutTimesNotice error is displayed when the stop_times.timepoint field is blank. This is unexpected. I would expect this to be interpreted as having no timepoints.

The GTFS specification says that if stop_times.timepoint = 1 or empty then the times are considered exact. From my understanding a stop time with no value for timepoint and no value for stop_times.arrival_time or stop_times.departure_time should issue a StopTimeTimepointWithoutTimesNotice.

See stop_times.txt reference.

If the timepoint field isn't being used, I expect an error about it's being missing.

I am not sure if a notice should be generated here, the specification handles the case where no value is attributed to stop_times.timepoint: default value is 1.

Please feel free to close this issue, if that answers your interrogations.

@barbeau
Copy link
Member

barbeau commented Sep 20, 2021

Here's example data from the above GTFS file:

trip_id arrival_time departure_time stop_id stop_sequence stop_headsign pickup_type drop_off_type shape_dist_traveled
784678 07:25:00 07:25:00 22256 1
784678 22239 2
784678 07:28:00 07:28:00 22361 3
784678 22225 4

tl;dr - This data is fine and the validator shouldn't be logging any errors. The reason is that this data conforms to the original GTFS spec prior to the timepoint field existing (timepoint is optional). The original GTFS spec said that you should only provide times for records that were timepoints. So in the above data stop_sequence 1 and 3 are timepoints, and 2 and 4 are not.

IMHO the timepoint definition really needs to be clarified in the spec as it currently doesn't communicate this concept clearly - see google/transit#61.

@MobilityData/transit-specs

@barbeau
Copy link
Member

barbeau commented Sep 20, 2021

If the timepoint field isn't being used, I expect an error about it's being missing.

The GTFS Best Practices (http://gtfs.org/best-practices/#stop_timestxt) say:

The timepoint field should be provided. It specifies which stop_times the operator will attempt to strictly adhere to (timepoint=1), and that other stop times are estimates (timepoint=0).

Sticking to the canonical validator definition, timepoint field missing couldn't be an error, but could certainly be a warning.

@lionel-nj
Copy link
Contributor

lionel-nj commented Sep 20, 2021

Thanks for the clarification @barbeau! Following your suggestion, I agree that the spec is not clear on this particular point. I'll keep this issue open so that we can fix this bug.

Does the following seems reasonable?

  1. Identify if stop_times.txt comes from a legacy dataset: does the header contain column timepoint or not? This means that we'll have to modify the core module to include the boolean field hasTimepointColumn (or similar) boolean to GtfsStopTimeTableContainer.
  2. Remove defaut value 1 from GtfsStopTimeSchema for field timepoint:
  3. Issue a notice (warning) if a row from stop_times.txt has value timepoint=1 and no value for stop_times.arrival_time and/or stop_times.departure_time.

@barbeau
Copy link
Member

barbeau commented Sep 20, 2021

Here are the cases to cover.

For all stop_times.txt files:

  • First and last record for each trip must have arrival and departure time, else ERROR
  • If arrival time or departure time is populated for a record, then both arrival time AND departure time must be populated for that record, else ERROR

If timepoint column does not exist:

  • Emit WARNING that timepoint column should be added

If timepoint column exists:

  • All records must have 0 or 1 value in timepoint column, else ERROR
  • If timepoint==1, arrival and departure time must both be populated, else ERROR

There is also the case where a stop_times.txt file does not have the timepoint column, but all records have arrival and departure times populated. Technically this was against the original GTFS spec, but it became a common practice, and therefore we shouldn't consider it an error because it will invalidate a lot of old datasets. In this case we'll emit a warning anyway that timepoint should be added.

Does that make sense?

@lionel-nj
Copy link
Contributor

lionel-nj commented Sep 27, 2021

Here are the cases to cover.

For all stop_times.txt files:

  • First and last record for each trip must have arrival and departure time, else ERROR

✅ Covered in MissingTripEdgeValidator.

  • If arrival time or departure time is populated for a record, then both arrival time AND departure time must be populated for that record, else ERROR

✅ Covered in StopTimeArrivalAndDepartureTimeValidator.

If timepoint column does not exist:

  • Emit WARNING that timepoint column should be added

If timepoint column exists:

  • All records must have 0 or 1 value in timepoint column, else ERROR
  • If timepoint==1, arrival and departure time must both be populated, else ERROR

There is also the case where a stop_times.txt file does not have the timepoint column, but all records have arrival and departure times populated. Technically this was against the original GTFS spec, but it became a common practice, and therefore we shouldn't consider it an error because it will invalidate a lot of old datasets. In this case we'll emit a warning anyway that timepoint should be added.

TimepointTimeValidator should be modified to this extent.

Does that make sense?

Yes, @isabelle-dr should the specification be clarified first or could these changes be implemented before?

@isabelle-dr isabelle-dr added this to the v3.0.0 milestone Oct 18, 2021
@isabelle-dr
Copy link
Contributor

isabelle-dr commented Oct 20, 2021

Hey all, thanks for flagging and looking into this.

The case of legacy data with no timepoint column (prior to the timepoint field existing) was flagged when we initially introduced that rule.

I agree that the root problem is that the specification contains two contradictory statements about the meaning of an "empty" value in stop_times.timepoint

  1. in stop_times.timepoint:

1 or empty - Times are considered exact.

  1. in stop_times.departure_time and stop_times.arrival_time:

Conditionally Required:
Required for timepoint=1.
Optional otherwise.

The former could mean thatstop_times.timepoint == 1 is equivalent than stop_times.timepoint == "".
The latter could mean that the times are required when stop_times.timepoint == 1 and are not required when stop_times.timepoint == "".
If we consider that an empty timepoint column is equivalent to empty values in all stops_times.timepoint fields, all legacy datasets become invalid, which is how it was coded in the validator and what @mcplanner-zz is flagging here. She also points out how counter-intuitive it is that having no values for timepoint is considered a timepoint.

This is extremely well described in google/transit/issues/61.
The "empty" word in the first statement seems to actually refer to an empty timepoint column which maintains the backwards compatibility.

How should the validator work?

If timepoint column does not exist:

Emit WARNING that timepoint column should be added

Absolutely, and this will be a new notice based on the Best Practices, that we should add at some point.

If timepoint column exists:

All records must have 0 or 1 value in timepoint column, else ERROR
If timepoint==1, arrival and departure time must both be populated, else ERROR

I would only implement the second statement proposed here (at least for now) because the first is stricter than the specification: it would make the following data invalid, and there is nothing in spec currently that would justify that.

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2
3
4 00:10:00 00:10:00 1

@lionel-nj I'd say we're good to make the changes in the validator to maintain the backwards compatibility (second statement proposed by Sean, and your initial idea), but not upgrading this rule to error, this will be done in the PR we already have.

@isabelle-dr
Copy link
Contributor

Here are some examples of what would change, @barbeau does this seem right?

Case 1: legacy dataset

Current behaviour: The notice is emitted for stops 2 & 3
Expected behaviour: No notice

stop_sequence arrival_time departure_time
1 00:00:00 00:00:00
2    
3    
4 00:10:00 00:10:00

Case 2: dataset with timepoint column, filled

Current behaviour: No notice
Expected behaviour: No notice

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00 0
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

Case 3: invalid data

Current behaviour: The StopTimeTimepointWithoutTimesNotice is emitted for stop 2
Expected behaviour: The StopTimeTimepointWithoutTimesNotice is emitted for stop 2

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 1
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

Case 4 (edge case): dataset with timepoint columns, only 1's, all times populated

Current behaviour: No notice
Expected behaviour: No notice

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00
3 00:08:00 00:08:00
4 00:10:00 00:10:00 1

Case 5 (edge case): dataset with timepoint column, one 1's and some empty times

Current behaviour: The StopTimeTimepointWithoutTimesNotice is emitted for stops 2 and 3
Expected behaviour: No notice

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2
3
4 00:10:00 00:10:00 1

@barbeau
Copy link
Member

barbeau commented Oct 20, 2021

@isabelle-dr Cases 1, 2, and 3 LGTM 👍

I would only implement the second statement proposed here (at least for now) because the first is stricter than the specification: it would make the following data invalid, and there is nothing in spec currently that would justify that.

So Cases 4 and 5 fall into this category. I agree that the current language in the spec is fuzzy around this, although the original intent when introducing the timepoint column was to specify all or no values - obviously in hindsight it would have been better if it said that explicitly.

Do we know if there are real datasets in the wild that are partially populating timepoint like Case 4 and 5? If so, then I agree that the spec probably needs to be clarified before classifying this as an error. If not, IMHO it could go either way, and IMHO it should be an error.

@isabelle-dr
Copy link
Contributor

isabelle-dr commented Oct 22, 2021

@barbeau here are some interesting edge cases with partial implementations, with some real data samples.

stop_sequence arrival_time departure_time timepoint
0 13:15:00 13:15:00
1 13:30:00 13:30:00 0
2 13:40:00 13:40:00 0
3 13:50:00 13:50:00
4 14:00:00 14:00:00 0

They seem to have assumed that since 1 or empty - Times are considered exact. (stop_times.timepoint description), the empty values mean the times are exact. This data doesn't contain any 1's in the timepoint column.

stop_sequence arrival_time departure_time timepoint
1 8:30:00 8:30:00 1
2 8:31:01 8:31:01 1
3
4
5
6
7
8
9
10 8:45:00 8:45:00 1
11 8:55:00 8:55:00 1

Here, they seem to have assumed that since times are Conditionally Required: Required for timepoint=1. Optional otherwise.(stop_times.departure_time and stop_times.arrival_time description), the empty values mean the times aren't exact. This data doesn't contain any 0's in the timepoint column.

trip_id stop_sequence arrival_time departure_time timepoint
1167963915753602:T1218:39:00 1 18:39:00 18:39:00 1
1167963915753602:T1218:39:00 2 18:42:27 18:42:27 0
1167963915753602:T1218:39:00 3 18:43:21 18:43:21 0
1167963915753602:T1218:39:00 4 18:44:36 18:44:36 0
1167963915753602:T1218:39:00 5 18:45:38 18:45:38 0
1167963915753602:T1218:39:00 6 18:47:34 18:47:34 0
1167963915753602:T1218:39:00 7 18:48:00 18:48:00 1
1167963915753602:T1218:39:00 8 18:50:37 18:50:37 0
1167963915753602:T1218:39:00 9 18:52:00 18:52:00 1
kcc_toCleElum_weekday_T01 1 5:30:00 5:30:00
kcc_toCleElum_weekday_T01 2
kcc_toCleElum_weekday_T01 3
kcc_toCleElum_weekday_T01 4
kcc_toCleElum_weekday_T01 5
kcc_toCleElum_weekday_T01 6
kcc_toCleElum_weekday_T01 7 6:10:00 6:15:00
kcc_toCleElum_weekday_T01 8
kcc_toCleElum_weekday_T01 9 6:20:00 6:20:00

This part of the data is at the end of the dataset. The timepoint values have been added for most of the lines, but after this point, all timepoint values are blank. Does this mean all times are exact? Or approximate? I believe neither, the values just haven't been added.

There are also datasets that added the timepoint column but all values are blank, and it seems obvious that it doesn't mean all values are timepoints, see the Via Rail Canada dataset below:

stop_sequence arrival_time departure_time timepoint
1 3:30:00 3:30:00
2
3 4:07:00 4:07:00
4
5
6
7 5:01:00 5:01:00
8 5:12:00 5:12:00
9 5:20:00 5:20:00
10

@barbeau
Copy link
Member

barbeau commented Oct 22, 2021

😱

Yikes - so there is real-world data all over the place. It would be nice to get this clarified in the spec ASAP so we can implement the original intent in the validator and get these datasets cleaned up. There seemed to be clear agreement on how the data was intended to be represented in the original proposal IIRC, the wording just apparently wasn't clear enough.

@lionel-nj
Copy link
Contributor

lionel-nj commented Oct 26, 2021

As agreed with @isabelle-dr and @barbeau, we will implement the following solutions:

  1. Emit a notice when timepoint = 1 AND times are missing.

This rule will consider legacy datasets as valid, and the validator will behave as follows:

sample data 1 (no timepoint column)

stop_sequence arrival_time departure_time
1 00:00:00 00:00:00
2    
3    
4 00:10:00 00:10:00

No notice

sample data 2 (timepoint column + no empty values)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00 0
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

No notice

sample data 3 (timepoint with no times)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 1
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

A notice will be generated for the 2nd row.

*sample data 4 (approximate, with times)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00
3 00:08:00 00:08:00
4 00:10:00 00:10:00 1

No notice

sample data 5 (approximate, with no times)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2
3
4 00:10:00 00:10:00 1

No notice

⚠️ Note that the notice will be considered as a WARNING, the severity will be changed in a separate PR to better assess the potential breaking changes.
2. Emit a notice (WARNING) when stop_times.timepoint value is missing for a single record.
3. Emit a notice (WARNING) when column stop_times.timepoint is missing (based on best practices)

@lionel-nj
Copy link
Contributor

@mcplanner-zz could you provide an URL to download the dataset you mentionned in the issue description please? I tried download the dataset linked to the URL you provided but could not: the operation timed out with error code ERR_SOCKET_NOT_CONNECTED.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working (crash, a rule has a problem) GTFS Reference Used for Adding or changing rules that belong in the GTFS reference
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants