Privacy issues with rider_trip.txt, recommend removing #34

Open
westontrillium opened this issue May 12, 2022 · 1 comment

Comments

@westontrillium

Problem statements

  1. In our opinion, the rider_trip.txt file goes against the Mobility Data Privacy Principles, of which Trillium is an endorsing organization. Specifically, it violates principles #1, #5, and #6, and in its current state is at risk of violating principles #3 and #7.
  2. The use cases for this file are not apparent, and the ones I can think of do not justify the undue surveillance of riders’ travel patterns. In general, we would like to hear a case for this file’s inclusion that outweighs its privacy issues.
  3. How feasible is it to implement this feature of the spec? How would alights be recorded? How would information about a rider (e.g. rider_type) be generated?
  4. MDS has similar components that deal with the collection of rider trip data. These components have caused some very public controversy resulting in a blow to the spec’s reputation. There are valuable lessons to be learned from that history. For a discussion on rider trip data generated by GBFS and MDS and the surrounding privacy concerns, see this article.

Solutions considered

  • (Recommended) Remove rider_trip.txt file from the spec altogether.
    • Pros
      • Solves all of the above problems.
      • Simplifies the spec.
      • Puts the spec in a better position for adoption (removes the practical problem of implementing this file, fewer components to debate, less controversy).
      • rider_trip.txt is not currently in use, so there’s not much work lost if we were to remove it.
    • Cons
      • Would lose the ability for specific rider trip analysis (but as mentioned above, this is not necessarily desirable).
  • (Also considered, but not recommended) Require a separate unique rider_id for each boarding and each alighting, and allow only a board or an alight in a single record, so that the start and end points of a single rider’s trip cannot be linked. This is similar to a solution regarding vehicle ids that GBFS implemented as a response to privacy concerns.
    • Pros
      • Retains the file while somewhat reducing the impact to rider privacy (would still include boarding and alighting info of a rider, but those data would be disconnected from one another)
    • Cons
      • Amount of data collected about riders is still superfluous.
      • Re-identifying a rider would be more difficult but still reasonably easy, because the separated board and alight records could still share matching fields (e.g. rider_type, transaction_type, fare_media, etc.)
      • Would lose the ability for analysis of start-to-finish individual rider trips.
  • (Also considered, but not recommended) Remove all of the alight fields, rename file to rider_boardings.txt
    • Pros
      • Retains the file while somewhat reducing the impact to rider privacy.
    • Cons
      • Amount of data collected about riders is still superfluous.
      • Would lose the ability for analysis of start-to-finish individual rider trips (would only show boarding information per rider).
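
To make the second option above more concrete, here is a minimal Python sketch of splitting one combined record into a board record and an alight record, each with its own random rider_id. The function name `split_rider_trip`, the `event` and `stop_id` output fields, and the input field names `boarding_stop_id` and `alighting_stop_id` are assumptions for illustration, not part of the spec. The sketch also shows the con listed for that option: the fields carried over to both halves can still be used to re-link them.

```python
import secrets

# Fields named in this issue as likely to appear on both halves of a trip.
LINKABLE_FIELDS = ("rider_type", "transaction_type", "fare_media")


def split_rider_trip(record):
    """Split one combined record into disconnected board and alight records.

    Each half gets a fresh random rider_id, so the IDs alone cannot be
    used to join a rider's start and end points.
    """
    # These fields are copied to both halves and remain a re-linking risk.
    shared = {k: record[k] for k in LINKABLE_FIELDS if k in record}
    board = {
        "rider_id": secrets.token_hex(8),
        "event": "board",  # hypothetical field
        "stop_id": record["boarding_stop_id"],  # assumed input field name
        **shared,
    }
    alight = {
        "rider_id": secrets.token_hex(8),
        "event": "alight",  # hypothetical field
        "stop_id": record["alighting_stop_id"],  # assumed input field name
        **shared,
    }
    return board, alight
```

Calling `split_rider_trip` on a record returns two dictionaries with different rider_id values but identical rider_type and fare_media values, which is exactly the matching-field weakness described above.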

Looking forward to discussing further!

@lrosenfield-uta

I do think that the use cases for the information that can be found in this dataset do not justify the privacy implications should such data be made publicly available. However, part of the benefit that I would hope to get out of GTFS-ride should it be adopted by my agency is the ability to use and develop tools to work with non-public datasets that can, because of the common format, be shared with other agencies. I am more hesitant to say rider_trip is of no use whatsoever for intra-agency purposes. If it were eliminated, I would hope to see some kind of standardized point-to-point trip propensity dataset.

A use case example: we are planning to split up a longer, regional route within our service area into two routes so that we can increase the frequency of the more heavily used northern portion of the route and split up our blocking to reduce operator travel time. The planned split point is at a commuter rail station about halfway along the route.

I mock up the new service and run it through a publicly available comparison tool. The comparison tool uses the EFC and Point-in-Time survey data to identify that a substantial number of riders I'd assumed were transferring from commuter rail were actually traveling through the planned terminus, getting off at a transfer point three stops later, transferring to a cross-town route, and all arriving at a single employment center.

Instead of simply deviating the crosstown route to the commuter rail station, I propose extending the southern route to terminate at the employment center, and reduce the frequency of the crosstown route.

The way I see it there are two ways the information in this scenario would be usable:

  • a rider_trip dataset, anonymized, but with all the risks you've described,
    • Pros:
      • Hews closer in format to the raw data from which aggregation is likely derived, enabling a variety of different use cases, which seems to be a goal of this specification
      • Is more GTFS-ish than a crosstable, which would be unlike anything else in the specification or its extensions
    • Cons:
      • Has all the aforementioned privacy concerns. Even in terms of internal use, this form would require the retention of extremely sensitive information
  • an aggregated stop-to-stop propensity crosstable, which shows, for each boarding stop_id, the likelihood that a person will alight at each other stop_id.
    • Pros:
      • Would fulfill most use-cases for this data in a human-comprehensible way
      • Could be shared with the public so long as appropriate data-fuzzing methods were used
      • Would allow for the deletion of EFC & Point-in-Time data upon completion of the aggregation process
      • Is a more truthful way to represent trip data that is, in most cases, highly incomplete (in the case of our agency, that data could include point-in-time validated trips and EFC trips that have both tapped on and tapped off, but not cash trips, transfer trips, and EFC riders who don't tap off)
    • Cons:
      • Would break the chain where information from each file within GTFS-ride can be used to derive the information provided in the less detailed files
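
As a rough illustration of the crosstable option above, here is a minimal Python sketch that turns a list of (boarding stop_id, alighting stop_id) pairs into per-boarding-stop alighting shares. The function name `stop_propensity` and the `min_count` threshold are assumptions for illustration. Dropping small cells is only one simple stand-in for the "appropriate data-fuzzing methods" mentioned above, not a recommendation of a specific method.

```python
from collections import Counter, defaultdict


def stop_propensity(trips, min_count=5):
    """Build a boarding-stop -> alighting-stop propensity table.

    trips: iterable of (boarding_stop_id, alighting_stop_id) pairs.
    min_count: cells observed fewer times than this are dropped
        (an assumed, simplistic privacy safeguard).
    """
    counts = Counter(trips)

    # Total observed trips per boarding stop, including suppressed cells.
    totals = defaultdict(int)
    for (board, _alight), n in counts.items():
        totals[board] += n

    table = defaultdict(dict)
    for (board, alight), n in counts.items():
        if n >= min_count:
            table[board][alight] = n / totals[board]
    return dict(table)
```

Once the table is built, the individual pairs are no longer needed, which matches the point above about deleting the EFC and Point-in-Time data after aggregation. Because suppressed cells still count toward each boarding stop's total, the shares in a row can sum to less than 1.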

I'd love to hear your thoughts.
