
Discussion: make MDS queries more predictable #268

Closed
johnpena opened this issue Mar 20, 2019 · 28 comments · Fixed by #357

@johnpena

I've been maintaining the MDS provider API at Lime since earlier this year. We've seen the latency across our API endpoints creep up on us as more agencies have adopted MDS and as more trips have been taken and added to our trips and status changes datasets.

We'd like to decrease latency as much as possible, but we've had trouble doing so because the datasets returned by the API are difficult to precompute and cache. In particular, allowing queries across arbitrary start and end times with down-to-the-second values means we can't reliably cache an entire trip or status change dataset ahead of time. Users can request a minute's or a month's worth of data, and we have to generate the results on the fly.

I'd like to brainstorm ways we can make MDS queries more predictable. In particular, if we could present a way for a user to ask for a specific day or hour of data, it would allow us to resolve the query ahead of time and return the result to them much faster.

@hunterowens hunterowens added this to the 0.3.1 milestone Mar 21, 2019
@hunterowens
Collaborator

@johnpena +1 to this idea!

@johnpena
Author

Writing up some of the proposals from the webinar:

  1. Reduce the granularity of the query parameters to hourly instead of by second/minute. This would allow providers to properly pre-compute data for the queried hour(s). Depending on implementation, this approach could constitute a breaking change. (A rough sketch of this appears below the list.)
  2. Advise provider clients to use query arguments that are bounded by hour or day, and explicitly call out that querying outside of these bounds may be slower. This approach is "light touch" and would require agencies/providers to make minimal changes.
  3. On the provider side, advise providers to partition their datasets into smaller "hot" and larger "cold" partitions, and route queries accordingly. This approach is more technically advanced but can be adopted by providers without an API change.
  4. Introduce additional query parameters beyond start_time/end_time/etc. that use stricter time bounds.
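
To make option 1 concrete, here is a rough sketch (Python, purely illustrative; the function names and in-memory cache are assumptions, not spec language) of how a provider might serve a query by stitching together precomputed hourly buckets:

```python
from typing import Callable, Dict, List

HOUR_MS = 3600 * 1000

def hour_buckets(start_time_ms: int, end_time_ms: int):
    """Yield the UTC hour buckets (as epoch-ms bounds) covering a query range."""
    start = (start_time_ms // HOUR_MS) * HOUR_MS               # snap start down to the hour
    end = ((end_time_ms + HOUR_MS - 1) // HOUR_MS) * HOUR_MS   # snap end up to the hour
    for bucket_start in range(start, end, HOUR_MS):
        yield bucket_start, bucket_start + HOUR_MS

def fetch_trips(start_time_ms: int, end_time_ms: int,
                cache: Dict[int, List[dict]],
                compute_bucket: Callable[[int, int], List[dict]]) -> List[dict]:
    """Serve a query from precomputed hourly buckets, falling back to the database
    (compute_bucket) only for hours that haven't been cached yet."""
    trips: List[dict] = []
    for bucket_start, bucket_end in hour_buckets(start_time_ms, end_time_ms):
        if bucket_start not in cache:
            cache[bucket_start] = compute_bucket(bucket_start, bucket_end)
        trips.extend(cache[bucket_start])
    return trips
```

The sketch assumes the requested bounds are hour-aligned (as in options 1 and 2); arbitrary second-granular bounds would additionally require the first and last partial hours to be filtered or computed on the fly.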

@lionel-panhaleux @hunterowens @ambarpansari please let me know if I missed anything important. I'll work on coming up with a proposal that takes all of these ideas into consideration. Thanks all for the brainstorm!

@johnpena
Author

Unless this is a burning issue for anyone, I'd suggest we move this out to 0.4.0

@billdirks
Contributor

Our customers rely on timely data to understand where scooters are in their city "right now". For example, there are a few use cases around where scooters are parked. Having query parameters be on hour boundaries would mean we would have to wait too long to get some events for these use cases to be meaningful. We would want to retain the ability to query for recent time periods and would want these to be low latency but could tolerate higher latencies for older data. We do not need millisecond resolution as currently specified.

@johnpena
Author

johnpena commented Apr 8, 2019

What I want to suggest is that for queries dealing with data older than the current day (or maybe week, depending on use cases), the provider API would round the query timestamps to the nearest hour. This would be the default behavior for your typical provider. We could then introduce an additional parameter that lets the client turn off this behavior and have their timestamps used as provided, e.g. &respect_exact_timestamps=true.

This would preserve the integrity of 'live' data (the use case @billdirks is talking about), and allow providers to opt-in to caching older data. It would also be (for the most part) non-breaking or minimally-breaking.
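
A rough sketch of what this could look like on the provider side (illustrative only; the parameter name is the one proposed above, everything else is made up, and "nearest hour" is interpreted as rounding outward so results map onto cached buckets):

```python
from datetime import datetime, timezone

HOUR_MS = 3600 * 1000
DAY_MS = 24 * HOUR_MS

def normalize_query_ms(start_ms: int, end_ms: int,
                       respect_exact_timestamps: bool = False,
                       now_ms=None):
    """Round query bounds to hour boundaries for data older than the current day.

    Recent data keeps exact bounds, so 'live-ish' use cases are unaffected;
    older data gets hour-aligned bounds that map onto precomputed/cached buckets.
    """
    if now_ms is None:
        now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
    if respect_exact_timestamps or end_ms >= now_ms - DAY_MS:
        return start_ms, end_ms
    start_rounded = (start_ms // HOUR_MS) * HOUR_MS               # round down
    end_rounded = ((end_ms + HOUR_MS - 1) // HOUR_MS) * HOUR_MS   # round up
    return start_rounded, end_rounded
```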

@dyakovlev
Contributor

Depending on the size of datasets involved here, the simplest thing to do may be an alternate query parameter that specifies the desired day of data to retrieve.

One thing I've seen used effectively for problems like this is the ordinal date, which is an unambiguous way to refer to a specific day via a single integer.

An additional thing worth thinking about here is timezones; a city's data feed is hopefully in a specific timezone, and if that city queries for a specific day of data, the data should be cut on timezone boundaries. Timezone math is notoriously complicated to reason about, so an API that avoids having to do it for every request may be in the spec's interest. Ordinal dates (or any other non-time-specific day format) would let timezone conversion be handled entirely by provider implementations, which would hopefully make those implementations more consistent.
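
For illustration, here's a small sketch of how a provider might resolve an ordinal-date query into UTC bounds cut on the city's timezone (the parameter shape, the timezone, and the use of zoneinfo are all assumptions):

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+; pytz offers the same capability

def ordinal_day_bounds_ms(year: int, day_of_year: int,
                          city_tz: str = "America/Los_Angeles"):
    """Convert an ordinal date (e.g. 2019-205) into UTC epoch-ms bounds for that
    local calendar day, keeping all timezone math inside the provider."""
    local_day = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    tz = ZoneInfo(city_tz)
    start_local = datetime(local_day.year, local_day.month, local_day.day, tzinfo=tz)
    end_local = start_local + timedelta(days=1)  # ignores DST edge days for brevity
    return int(start_local.timestamp() * 1000), int(end_local.timestamp() * 1000)

# Day 205 of 2019 is 2019-07-24 in the city's local calendar.
start_ms, end_ms = ordinal_day_bounds_ms(2019, 205)
```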

@morganherlocker
Contributor

> Our customers rely on timely data to understand where scooters are in their city "right now". For example, there are a few use cases around where scooters are parked. Having query parameters be on hour boundaries would mean we would have to wait too long to get some events for these use cases to be meaningful. We would want to retain the ability to query for recent time periods and would want these to be low latency but could tolerate higher latencies for older data. We do not need millisecond resolution as currently specified.

How does this square with the new 24 hour telemetry delay? Real time info does not seem predictable or reliable with the current design, since it is impossible to know if you are seeing a complete picture until the 24 hour threshold is hit.

@johnpena
Author

@dyakovlev the problem with introducing an alternate query parameter is that clients would need to switch to using this parameter, without an added benefit to them. I'm skeptical that old clients would make the switch. And ultimately this doesn't leave providers with a way to optimize existing queries.

@billdirks
Contributor

@morganherlocker Are you referring to the telemetry delay in the agency specification? My use case revolves around querying the provider API endpoints (i.e. pulling the data) vs. waiting for a push from the provider to an agency endpoint.

@hunterowens hunterowens modified the milestones: 0.3.1, 0.3.2 Apr 29, 2019
@hunterowens
Collaborator

Moving this to 0.4.0

@hunterowens hunterowens modified the milestones: 0.3.2, 0.4.0 Jun 7, 2019
@morganherlocker
Contributor

> @morganherlocker Are you referring to the telemetry delay in the agency specification? My use case revolves around querying the provider API endpoints (i.e. pulling the data) vs. waiting for a push from the provider to an agency endpoint.

Correct, my understanding is that the delay in the agency specification is there to account for inevitable latency in data processing, which means the data presumably would not exist in the provider store either. Without this allowance, the agency delay does not seem to have a purpose besides addressing the ~200-2000ms it takes to send the data to the agency. Given the wording in agency, I have been assuming that anything on the historical endpoints (trips and status_changes) will be incomplete unless it is more than 24h in the past.

@bhandzo
Contributor

bhandzo commented Jul 23, 2019

Now that we're really kicking off into 0.4.0 I'd love to see us make some progress on this topic. We're definitely experiencing the pain of trying to deliver large amounts of data, consistently, across different agency requirements/capabilities, so would love to see pagination for historical (and real time) data codified in a way that best gives agencies the combination of real-time and historical access they need, and allows providers to scale across many cities.

@geobir
Contributor

geobir commented Jul 23, 2019

What if we do something like &aggregation=hour (allowed values: hour, day, week, month, etc.), and if it isn't specified, exact timestamps are respected (i.e. the aggregation defaults to second)?
That way it would be a non-breaking change: we would just add this "feature" to the query to make it more predictable.
We would also have to define all the aggregation types.

@ascherkus

My understanding of the issues thus far:

  • Given there is a processing delay, provider can't really be used for real-time/interactive applications
  • There isn't much support for sub-hour resolution
  • There is no way to definitively tell when a data set for a particular day is "complete"
  • There is no way to definitively tell when a previous data set has been updated/changed
  • Querying across arbitrary date ranges makes it difficult to cache

My $0.02 is that provider at its core is a bulk historical data fetching API with agency fulfilling the need for real-time use cases.

If there's consensus around that notion, I wonder if a breaking change for 0.4.0 is warranted to align the provider API closer to its core use case since as it stands today:

  • The API requires significant effort to build a long-term scalable implementation
  • Existing clients can't be broken by changing API semantics and will likely not adapt unless forced to

As a thought experiment, I find it helpful to imagine provider being built on top of static file serving architecture using UTC ordinal dates (e.g., /los_angeles/2019/07/24/trips.json). Existing HTTP semantics could be leveraged to determine version (setting a MIME type for Content-Type), when a day's dataset is complete (polling for HTTP 200 vs 404 not found) as well as updated (Last-Modified header). It then becomes the consumer's responsibility to do caching, indexing, and filtering on sub-day date ranges and vehicle/device IDs based on their use cases.

Circling back to the existing API, a possible breaking change proposal for 0.4.0 would be to drop support for vehicle_id/device_id/min_end_time/max_end_time/start_time/end_time query parameters and simply have a date=YYYY-MM-DD parameter. The provider frontend could then map the result to a static file serving architecture as described above. HTTP semantics around 404 not found and use of Last-Modified can be introduced to aid with detecting when data is complete/modified.

Until 0.3.x is fully deprecated, existing clients can be upgraded to the new "fast path" without any code change by altering their requests to align to single UTC day boundaries (e.g., ?min_end_time=1563926400000&max_end_time=1564012800000 to represent 2019-07-24). I assume provider implementors know all individual clients based on issued JWT tokens and can deal with them on a case-by-case basis.
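
For example, a consumer could adopt that fast path against today's API with something like the following (the requests usage, URL shape, and auth handling are placeholders, not part of the spec):

```python
from datetime import datetime, timedelta, timezone

import requests  # any HTTP client works; used here for brevity

def utc_day_bounds_ms(day: str):
    """Map a UTC date like '2019-07-24' to the 0.3.x min_end_time/max_end_time pair."""
    start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

def fetch_day_of_trips(base_url: str, token: str, day: str):
    min_end, max_end = utc_day_bounds_ms(day)  # 2019-07-24 -> 1563926400000 / 1564012800000
    resp = requests.get(
        f"{base_url}/trips",
        params={"min_end_time": min_end, "max_end_time": max_end},
        headers={"Authorization": f"Bearer {token}"},
    )
    if resp.status_code == 404:
        return None  # the day's dataset isn't published/complete yet
    resp.raise_for_status()
    # If the provider sends Last-Modified, the consumer knows when to re-fetch.
    return resp.headers.get("Last-Modified"), resp.json()
```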

I understand this is a big change... but I feel we're at a point where we can leverage the collective experience to tweak the API in a way that makes it easier to implement and maintain for the long haul.

Could provider consumers weigh in with specific use cases/concerns? Could provider implementors comment on what would work given the traffic they're seeing today?

@billdirks
Contributor

billdirks commented Jul 24, 2019

As a consumer, my experience is very different from @ascherkus:

  • We use the data for a realtime-ish application (minutes delayed)
  • Most operators provide the data within minutes. Cities want to have this type of insight.

It seems like the provider spec has been adopted for this purpose by more than just us. Also, I feel that there are some issues that may prevent the agency spec from becoming as widely adopted as the provider spec:

  • It takes a lot more infrastructure for a city to operate
  • There have been privacy concerns expressed that I haven’t seen a satisfactory response to.

@hyperknot
Contributor

hyperknot commented Jul 24, 2019

I believe there are two separate issues here.

  1. The performance issue, I believe, is unrelated to the MDS specs; it's entirely up to how the provider implements pagination. At the moment, every provider has a very different idea about pagination: some use tiny time intervals like 5 minutes, some allow days, some offer server-side pagination, and some require clients to "paginate" the query across time windows. As a consumer, it's a mess to work with all the custom pagination solutions, but right now providers are free to implement whatever pagination is performant for them, so I don't think this is a problem with the specs.

  2. The bigger point is to split status changes into "historical" and "live" endpoints. I believe this is very important. The problem with the current status changes is that we have no idea how much processing time a provider needs before their status changes data can be called "reliable". For example, if a provider needs 30 minutes of processing for status changes, then a live map with a 2-minute delay would definitely show false information.

Separate historical and live status_changes endpoints would solve this, as long as it is clearly written down how much processing time is allowed for each.

@ascherkus

Thanks for sharing your feedback @billdirks! It seems like we have a classic tradeoff between latency and aggregation/scalability of a system :)

I'm assuming this is a similar use case to the one you mentioned on April 4 -- but could you elaborate on the minimum latency (e.g., 5 minutes, 15 minutes, 30 minutes, an hour, a day) that is tolerable for this use case?

My goal is to tease out the different use cases for different consumers to see whether it's possible to determine a pre-defined level of aggregation/latency that works both for consumers and for query predictability.

Failing that, as @hyperknot mentions it does seem like there are two major separate use cases (historical vs. live) that might be better served with separate solutions vs. attempting to shoehorn both use cases into a single solution. The previous meeting notes [1,2] suggest there may need to be a distinction between historical vs. live as well. At a minimum, something like @johnpena's respect_exact_timestamps=true proposal on April 8 could fit the bill. As a general principle I do feel it's important to have the default behavior of the API be the "fast path".

[1] https://github.com/CityOfLosAngeles/mobility-data-specification/wiki/Web-conference-notes,-2019.03.28
[2] https://github.com/CityOfLosAngeles/mobility-data-specification/wiki/Web-conference-notes,-2019.07.11

@billdirks
Contributor

Of course less latency is better for my use case, but I can deal with minutes. 5 minutes seems long though, and 10 is definitely too long. @johnpena's suggestion of an extra parameter would work for me. In terms of latency of the endpoint itself, I care more about response speed for recent time points than for historical data.

There are some other tickets about timeliness of data and a live endpoint. I'd invite us to have those discussions there.

@nselikoff

Our clients are utilizing MDS data for both historical and near real-time purposes. The Provider API is simpler to get started with in terms of infrastructure and engineering effort, so we're utilizing it for both at the moment. For near real-time, I agree with @billdirks (5 minutes seems long, 10 is definitely too long).

The main thing I'm looking for is clear expectations about the timeliness of the data available in the Provider API, as discussed by @hyperknot, so I can decide when and how often to ping it, ensure I'm not missing anything, and clearly communicate the limitations to our clients.

Not to detract from the main goal of this ticket, but I also wanted to tag additional tickets I've seen with related conversation on these tradeoffs: #307 (when are status_changes added), #341 (when are status_changes removed), #282 (privacy and possibility of forcing a minimum 24 hour delay on telemetry data).

@babldev
Contributor

babldev commented Aug 7, 2019

Can we make the Provider API a little stricter without breaking the real-time tooling case in 0.4.0?

We could allow the Provider API to specify a "supported" time window. Let's say 1 hour. Then we require all requests to be on UTC hour boundaries.

So for Aug 1 12pm - 1pm UTC you'd have:
?start_time=1564660800000&end_time=1564664400000

If it's the current hour, providers would serve the realtime data. If the data is old, it could be cached.

Think of it as a logfile where you stream to the active one and then "rotate" logs on a regular basis.
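
A rough sketch of the provider-side routing this implies (names and storage layout are invented; the point is just that the active hour hits the live store and everything else comes from a prebuilt file):

```python
from datetime import datetime, timezone

HOUR_MS = 3600 * 1000

def handle_status_changes(start_time_ms: int, end_time_ms: int,
                          query_live, read_cached_hour):
    """Require hour-aligned bounds; serve the active hour live, older hours from cache."""
    if start_time_ms % HOUR_MS or end_time_ms % HOUR_MS:
        raise ValueError("start_time and end_time must fall on UTC hour boundaries")
    now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
    active_hour_ms = (now_ms // HOUR_MS) * HOUR_MS
    results = []
    for hour_ms in range(start_time_ms, end_time_ms, HOUR_MS):
        if hour_ms >= active_hour_ms:
            results.extend(query_live(hour_ms, hour_ms + HOUR_MS))  # the "active" log
        else:
            results.extend(read_cached_hour(hour_ms))               # a "rotated" hour
    return results
```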

@schnuerle
Member

schnuerle commented Aug 8, 2019

Can you keep everything the same as it is now (e.g., still support arbitrary start/end times), but add a new API parameter like 'timeframe_period' where you pass in pre-defined text like hour, day, week, month, etc., and a second parameter like 'timeframe_moment' where you pass in an epoch timestamp?

Then you would return the pre-cached data for the hour/day/week that the timeframe_moment falls within.

For example ?timeframe_period=day&timeframe_moment=1562610085 would return all data for July 8, 2019 UTC. This would be for historic backfilling mostly - if the period is not complete yet, you would get partial data.

The provider would then generate flat files for every hour into the past, every day into the past, every week, and every month for each city. That file would be served when the API call is made, instead of hitting the database.
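
As an illustration, resolving the parameter pair into one of those flat files could look roughly like this (the file layout and function name are invented):

```python
from datetime import datetime, timedelta, timezone

def bucket_path(timeframe_period: str, timeframe_moment: int, feed: str = "trips") -> str:
    """Map (period, epoch-second moment) to the pre-generated file covering that period."""
    t = datetime.fromtimestamp(timeframe_moment, tz=timezone.utc)
    if timeframe_period == "hour":
        key = t.strftime("%Y/%m/%d/%H")
    elif timeframe_period == "day":
        key = t.strftime("%Y/%m/%d")
    elif timeframe_period == "week":
        monday = t.date() - timedelta(days=t.weekday())  # weeks keyed by their Monday
        key = monday.strftime("%Y/week-%W")
    elif timeframe_period == "month":
        key = t.strftime("%Y/%m")
    else:
        raise ValueError(f"unsupported timeframe_period: {timeframe_period}")
    return f"/data/{feed}/{timeframe_period}/{key}.json"

# ?timeframe_period=day&timeframe_moment=1562610085 -> /data/trips/day/2019/07/08.json
```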

One provider kinda does this now in their online dashboard to export data in bulk - you pick a day and get a month's worth of data as a file that the day falls within.

Note I would recommend that this new pair of parameters not work in conjunction with other query parameters that filter. This is for getting all bulk historic data for a chunk of time.

@nselikoff

Some additional comments from the discussion on the MDS call to add to this:

In terms of understanding consumer use cases, I'm wondering how many consumers use the Provider API as essentially a data interchange format, to ingest it on a regular basis into some other pipeline? Versus an on-demand API that gets called in varied situations in response to actually interacting with the user-facing application?

With the log rotation model, there remains the open question of when the data can be considered "complete" (i.e. there's an expectation that no additional elements will be added or removed).

@fscottfoti

fscottfoti commented Aug 8, 2019 via email

@thekaveman
Collaborator

> I'm wondering how many consumers use the Provider API as essentially a data interchange format, to ingest it on a regular basis into some other pipeline

Exactly the use case here in Santa Monica. Also happy to limit our queries to pre-defined windows.

@geobir
Contributor

geobir commented Aug 8, 2019

Same here, for the data interchange use case.

I think timeframe_period, start_time, and end_time are the only parameters needed to implement cached data.
To @nselikoff's point, if people want to make on-demand calls, we should make the timeframe_period parameter optional, e.g. load the first historical minute right away while loading the full day in the background, instead of waiting to load and show the full day.

@babldev
Contributor

babldev commented Aug 8, 2019

Will draft a pull request next week to continue the conversation around support for fixed time intervals. By scoping Provider API to data interchange, we should be able to build a more reliable solution that continues to work for historical and real-time use cases.

Some things to consider in advance:

  • Will fixed intervals be an optional feature that clients can leverage, or a required feature?
  • How will time intervals be defined on the provider side and understood by clients?
  • How will clients know when the "active" bucket is done being updated (there may be some data lag)? Will we use some sort of Last-Modified header in the response?
  • What is the client impact of possibly larger file sizes?

@hyperknot
Contributor

hyperknot commented Aug 9, 2019

I think we should keep it as simple as possible; the specs are already overcomplicated and the majority of providers cannot even implement the current specs.

Here are my recommendations.

  1. For historical data, we should settle on hourly intervals (medium-sized preprocessed files), for both trips and status changes. To make things very clear, these could be two new endpoints, historical_status_changes and historical_trips. Later on we could deprecate the current trips endpoint and rename status_changes to live_status_changes.
  2. The "processing latency" could be a value in the JSON, so that we'd know when a given file becomes available for each provider. For example, it might take 4 hours before a given hour appears at the historical_status_changes endpoint.
  3. We should query based on the ISO time format trimmed at the hour, for example 2019-08-09T12, in UTC. That's human readable and uniquely defines an hour.
  4. Technically, these hourly preprocessed files don't need to be externally accessible buckets; there is no way that all currently used authentication options could be ported over to AWS S3 or Google Storage, for example. Instead, these could be static files/buckets internally, and the APIs could just proxy them (see the sketch at the end of this comment). Technically this is a few lines of code and would take all the load off the DB servers.
  5. The live_status_changes endpoint could be discussed in a separate thread.

The only problem I don't see being solved is the registration of lost vehicles. If it takes 3 days to register such a lost vehicle, I don't know how we should handle it. Should historical data be updated again, after 7 days for example, to contain the lost vehicles? If so, we might need to add a "lost_vehicles_processed" true/false boolean as well.

Also, as an added side effect, we could get rid of pagination of these endpoints, something which is again very chaotically implemented across providers. We could simply just download preprocessed hourly files, making life easy both for providers and for clients.
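
To illustrate points 3 and 4, a minimal sketch of the "API proxies an internal hourly file" idea (framework left out; the directory layout, hour format, and error handling are assumptions):

```python
import re
from pathlib import Path

HOUR_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}$")  # e.g. 2019-08-09T12, UTC
DATA_ROOT = Path("/var/mds/historical_status_changes")  # preprocessed hourly files

def historical_status_changes(hour: str) -> bytes:
    """Serve one preprocessed hour. The API layer keeps doing auth and rate limiting;
    only the payload comes from a static file instead of the database."""
    if not HOUR_RE.match(hour):
        raise ValueError("hour must look like YYYY-MM-DDTHH (UTC)")
    path = DATA_ROOT / f"{hour}.json"
    if not path.exists():
        # Not processed yet -- see the per-provider "processing latency" value above.
        raise FileNotFoundError(hour)
    return path.read_bytes()
```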

Zsolt (Populus)

@babldev
Contributor

babldev commented Aug 22, 2019

Proposal here! #354

hunterowens added a commit that referenced this issue Sep 17, 2019
thekaveman pushed a commit that referenced this issue Oct 25, 2019
thekaveman pushed a commit that referenced this issue Oct 25, 2019
(#357)

* modify /status_changes and /trips to require a single UTC hour for time querying

* introduce /events endpoint, returning status_changes in a required range no older than 2 weeks (for the real-time use case)

Fixes #268.

Fixes #350.

Fixes #385.