Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow static file data storage for Provider /status_changes and /trips #357

Merged
merged 16 commits into from Oct 25, 2019

Conversation

hunterowens
Copy link
Collaborator

@hunterowens hunterowens commented Sep 17, 2019

Explain pull request

In #268, @johnpena discusses the need to make MDS queries more predictable. @babldev attempted to address the issue in #354, but during the discussion on 9-4, it sounded like folks wanted a different type of solution.

This builds on #354 and specifically @hyperknot's suggestion that we split the possible response with different query parameters.

It introduces static_end_time to both /trips and /status_changes. This would be an ISO month, day, or hour that could be cached on S3 or a similar static file store that could be returned as a single page or set of pages.

TBH, adding this complexity makes me want to also finish the openAPI work so that this can be made more clearly and testable by providers and clients alike. I'll take a stab at that shortly, but would love folks thoughts on this.

Is this a breaking change

A breaking change would require consumers or implementors of the API to modify their code for it to continue to function (ex: renaming of a required field or the change in data type of an existing field). A non-breaking change would allow existing code to continue to function (ex: addition of an optional field or the creation of a new optional endpoint).

  • Yes, breaking

Provider or agency

Which API(s) will this pull request impact:

  • provider

Additional context

Fixes #268.
Fixes #350.
Fixes #385.

@hunterowens hunterowens requested review from thekaveman and a team as code owners September 17, 2019 23:19
@hyperknot
Copy link
Contributor

Thanks for this PR, it looks really simple and clean to me! One tiny addition I'd include for trips is something like "month's worth of trips (filtered by end_time)", just to make it clear that it's the exact same filtering as in min-max end time, not start time.

@ascherkus
Copy link

Thanks for writing this up!

My three highest bits of feedback surround:

  1. Whether it's possible to reduce the different time periods (month/day/hour) into a single time period for requesting static data to lower specification complexity and implementation burden
  2. Whether the static_end_time response requires / assumes fully complete sets of immutable data (vs. how should we handle mutable data)
  3. Whether it's possible to fold {min,max}_end_time into static_end_time by combining ideas from (1) and (2)

For (1):

  • The argument (IIRC) for allowing different time periods is to simplify backfill consumption by requesting larger periods of time and thus reducing number of requests
  • That being said... as long as response pagination is allowed there's no guarantee the provider implementation will honor a single request/response for a particular time period
  • For example, if a provider implements the "month" response by paginating by day (which they're permitted to do), we haven't reduced the number of requests compared to the "day" method and thus added more complexity to the specification and each implementation with zero added benefit

In short, I believe pagination is in conflict with the desire to request larger time periods / lower the number of requests because the end result will always be subject to how different providers paginate their data, not the duration of the time period being requested.

That leads me to believe that either:

  • We pick a single sensible time period (since responses will get arbitrarily paginated anyway)
  • We disallow pagination on these responses (to respect the use case for minimizing # of requests)

Would love to hear what folks think and whether I missed different use cases for having different time periods!

For (2):

  • The argument for requiring complete and immutable sets of data would clarify expectations for returning errors (e.g., an error is guaranteed to be returned for the month "2019-09" until 2019-10-01 00:00:00 UTC at the earliest)
  • Conceptually, once the data "goes live" a consumer could feel more confident that the data set is complete
  • That being said... I'm highly skeptical that providers can guarantee historical data would never change once being published (e.g., a bug gets fixed, erroneous data gets corrected)
  • That means consumers need to check for changes in static data when backfilling to ensure they have the latest copy
  • Which leads me to believe we'd want to have requests and responses leverage If-Match / ETag / If-Modified-Since / Last-Modified to efficiently determine if changes have been made when backfilling

So I guess it boils down to how confident we all feel whether historical data can/should ever change, and if our collective confidence is not 100% how do we want providers and consumers to handle that scenario? How important is it to ensure we get this right?

For (3):

  • If we're in agreement on (1) and (2) that we only need to specify a single time period and that static data can change over time... then I see an opportunity to support "real-ish time use cases" by replacing {min,max}_end_time by having clients simply poll the most current static_end_time, using ETag / Last-Modified to determine when they've been updated
  • This would imply allowing incomplete data sets for the latest time period for static_end_time (e.g., requesting "2019-09" before 2019-10-01 00:00:00 UTC is permissible but the data / ETag / Last-Modified may keep changing until 2019-10-01 00:00:00 UTC at the earliest)
  • For example, a provider could re-generate the latest static_end_time period of data every N minutes, with real-time consumers checking every M minutes and fetching the latest data when needed
  • This would ease specification complexity for consumers and providers by folding the three different request/response methods (static_end_time, {min,max}_end_time < 1 month, {min,max}_end_time >= 1 month) into a single method

I understand this is a larger (and likely more controversial!) change, but if we're going to make a breaking change I'd like to entertain the idea of just how large of a breaking change we can make so that provider continues to scale for both consumers and implementors alike :)

@kheraankit
Copy link

What should be behavior if none of the date params are specified? I would expect it to return an error and not have any implicit logic.

Can the spec describe the default behavior for this condition?

@johnpena
Copy link

Overall this looks really good and I think is a huge step in the right direction, thanks for putting this together @hunterowens!

To expand on what @kheraankit said, I think we should take the opportunity for the next release to specify whether the time-bounding query parameters for the endpoints (min_end_time/max_end_time and start_time/end_time) are optional/required and what happens if they're not provided. Lime's feeds work such that when none of the parameters are provided, we return all of the data we have available. I think changing this behavior to return an error would be better, but we should indicate in the documentation that this is the case.

Copy link
Contributor

@jrheard jrheard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring to other folks on what the exact correct approach is, just some feedback on the diff itself, hope this helps!

provider/README.md Outdated Show resolved Hide resolved
provider/README.md Outdated Show resolved Hide resolved
@hyperknot
Copy link
Contributor

hyperknot commented Oct 1, 2019

Based on my experience with APIs and how providers implement them, I think it'd be a good idea:

  • To name these two endpoints differently, for example static_trips or processed_trips. Architecturally these will be very different implementations, the normal endpoints will do a query in a database while these ones will proxy a static file from an S3 bucket for example.
  • Explicitly state that these endpoints contain "final" or processed data. Providers usually take 1-2 days till they finalize their data even today and this is a very buggy area currently: there is simply no way to know if a provider's data is "final" in an endpoint or not.
  • Also explicitly state an error message, like {message: "not_yet_processed"} or similar to tell clients when a data has not yet been processed.

About time buckets, I also believe it'd be better to ask only one time range, not 3. With 3, I believe it'll be very common that they'll be out-of-sync, something which will just introduce bugs again. Daily splits do not really work, as they are split by UTC days, whereas every processing happens in a region's local time, so it's not really useful. Monthly buckets will probably be too big to return them in a single request.

As a conclusion => I recommend only implementing hourly static endpoints, named /static_trips and /static_status_changes.

@hunterowens
Copy link
Collaborator Author

Hi all:

I've pushed a new draft, that has some pretty major changes.

  1. the current trips and status_changes become static file only backed, on hourly buckets only, per @hyperknot and others suggests.

  2. there is a new API endpoint, per @thekaveman others suggestion, called events, that represents an as real time as possible endpoint, and only contains the last 2 weeks worth of data. Currently, it inherits the status_changes schema, but could change in future versions.

This allows us to handle a lot of the GBFS like-use cases that @HeidiMG has been (correctly, IMO) talking about moving away from GBFS to MDS.

Thoughts? @johnpena @babldev @rf- @hyperknot @thekaveman @barbeau / anybody else I forgot to tag (apologies!)

Copy link

@ascherkus ascherkus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

Potentially silly question... is there any guidance or preference as to what providers should return when end_time is not part of the request?

Off the top of my head it seems like it'd fall back to the existing paginated view of all historical trips. Are there use cases better served by keeping this path around? Is it worth mandating end_time and returning an error if not provided?

Not sure how much of an issue the above would be in practice but wanted to flag for the group!

provider/README.md Outdated Show resolved Hide resolved
provider/README.md Outdated Show resolved Hide resolved
provider/README.md Outdated Show resolved Hide resolved
provider/README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@thekaveman thekaveman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally agree with @ascherkus comments that we should be more clear that the time parameters here mean we are asking for trips that ended within that hour or status_changes that occurred within that hour, vs. before that hour / within the prior hour.

I also wonder if we can maintain the existing timestamp querying format (integer epoch milliseconds) vs. introducing ISO 8601 - this might make the change less impactful for clients using the existing query format. We could even go so far as saying if a query comes in that is not cutoff at the hour mark (e.g. end_time=1571250098290) then Providers should interpret this as a query for that hour (e.g. end_time=1571248800000)

I am also in favor of requiring end_time or whatever the time parameter is to avoid having to deal with ambiguously defined windows.

@hunterowens
Copy link
Collaborator Author

@thekaveman @ascherkus

I went with requiring end_time, otherwise error, since if we are making breaking changes, we might as well make it the ones that are the clearest to implement.

@ascherkus, I added language to describe what should be in the response payload that should clear up those comments / fixed the various typos you caught. LMK if that clears things up.

@thekaveman: I think I'd rather keep the time definition explicit, as introducing a snap to hour via any timestamps means there is the possibility of something going wrong on the server side as we make an implicit conversion, I think this keeps it cleaner for implementation and debugging between teams.

provider/README.md Outdated Show resolved Hide resolved
@hyperknot
Copy link
Contributor

Epoch seconds / milliseconds is a very highly bug prone part of the current specs. Each provider have their own implementation about which one to use, which is especially confusing as they need to support 0.2 and 0.3 clients concurrently. Some providers do an auto-detection and accept both formats (it can be done by saying comparing to year 2100's timestamp in seconds).

Also, unix seconds / milliseconds are not fit for describing a time "block", like an UTC hour, as it can be both placed in the beginning of the hour, somewhere in the middle or at the very end, all raising possible bug cases.

13 byte ISO 8601 strings ("2019-10-21T19") solve these issues and define an hourly block explicitly.

@hyperknot
Copy link
Contributor

hyperknot commented Oct 21, 2019

So to conclude, this PR is aiming to replace the current DB based trips and status changes endpoints with the static ones, instead of adding 2 more new endpoints? If so, I agree with the design, it's a clean choice!

+1 new endpoint for real-time-ish events (/events), whose future we can discuss in other PRs?

@hunterowens
Copy link
Collaborator Author

@hyperknot Yes, the idea is that /trips, /status_changes become backed by static files, and /events is what status_changes is today.

@hyperknot
Copy link
Contributor

@hyperknot Yes, the idea is that /trips, /status_changes become backed by static files, and /events is what status_changes is today.

I understand, that's a great change.

@hyperknot
Copy link
Contributor

hyperknot commented Oct 21, 2019

One more important point I'd like to raise is the removal of pagination from static-backed endpoints.

1. Pagination currently is the most "broken" part of the MDS specs. I work on MDS acquisition at Populus, we work with 10+ operators' MDS API, and pagination code is by far the most complex part of our MDS acquisition infrastructure. Most operators have slightly different behavior, and errors are not uncommon.

Some operators do not support pagination at all but require us to query data in hourly blocks (exactly like the new static endpoints), some do not add links to the last page, some require custom query parameters and the list goes on.

2. Since on the new endpoints, the time-period is fixed, I went ahead and analyzed all our historical data.

=> The average size for an hourly block of trips data, gzipped is 100 kB.
=> The maximum, extreme value for an hourly block of trips data is 7 MB, but this is very rare.

Since we are talking about servers in datacenters communicating with each other, even sending 7 MB in a single HTTPS request should be a non-issue today. Status changes are tiny in comparison, there is no point in analyzing them.

=> As a conclusion, I recommend removing all pagination and "links" related parts of the specs from the static-backed endpoints.

@ascherkus
Copy link

Clarity on hour interval text = LGTM! Thanks! Still find use of end_time for status_changes to be slightly confusing but can live with it given the clarity improvements.

Generally speaking I agree with @hyperknot ... removing pagination would make implementation more consistent across providers and the response sizes seem reasonable enough to warrant removal. On the flip side, I recall @bhandzo (maybe?) saying they'd still prefer to have pagination within the hour intervals so would be good to get some input as to what the use cases are there.

@thekaveman
Copy link
Collaborator

thekaveman commented Oct 25, 2019

Rebased this PR on the latest dev changes, and made a few language cleanups around query parameters. These should address comments from @ascherkus.

@thekaveman thekaveman added the Provider Specific to the Provider API label Oct 25, 2019
@thekaveman thekaveman added this to the 0.4.0 milestone Oct 25, 2019
@thekaveman thekaveman changed the title Allow for a Static File Store Backed Query Parameter Allow for static file backed data store for Provider /status_changes and /trips Oct 25, 2019
@thekaveman thekaveman changed the title Allow for static file backed data store for Provider /status_changes and /trips Allow static file data storage for Provider /status_changes and /trips Oct 25, 2019
@thekaveman thekaveman merged commit e60b003 into dev Oct 25, 2019
@thekaveman thekaveman deleted the provider-historical-endpoint branch February 13, 2020 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Provider Specific to the Provider API
Projects
None yet
7 participants