Handle cases where Prefect is deployed to multiple environments with differently-calibrated clocks #1402

jlowin · 2019-08-24T22:15:41Z

If a Prefect flow is deployed in an environment with an incorrect clock, it may not run when it receives states from an external system like Prefect Cloud.

For example, say the clock in a cluster is running 5 minutes slow, and a flow is supposed to start at 9:00am UTC. At 9:00am UTC, the agent in that cluster will receive a payload including a Scheduled state with a start time of 9:00:00 UTC. Note that in the cluster, however, the time is 8:55am UTC. The agent will take that Scheduled state and submit it for execution, but the Runner classes will refuse to run it because it hasn't reached its start time yet -- and won't for 5 minutes.
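To make the arithmetic concrete, here is a minimal sketch of the scenario above. The `is_ready` helper is hypothetical and only stands in for the start-time check the Runner classes perform; it is not Prefect's actual code.

```python
from datetime import datetime, timedelta, timezone


def is_ready(scheduled_start: datetime, local_now: datetime) -> bool:
    """Hypothetical stand-in for the runner's check: run only once the
    local clock has reached the scheduled start time."""
    return local_now >= scheduled_start


# True UTC time at the moment the agent receives the Scheduled state.
true_now = datetime(2019, 8, 24, 9, 0, 0, tzinfo=timezone.utc)

# The cluster clock runs 5 minutes slow, so it reads 08:55 UTC.
clock_skew = timedelta(minutes=-5)
local_now = true_now + clock_skew

scheduled_start = datetime(2019, 8, 24, 9, 0, 0, tzinfo=timezone.utc)

ready = is_ready(scheduled_start, local_now)
wait = scheduled_start - local_now

print(ready)  # False
print(wait)   # 0:05:00
```

The run is submitted on time from Cloud's point of view, yet the local check keeps it waiting for the full amount of the skew.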

We have two possible remedies for this situation:

1. A global time_adjust_ms config. This could be set by querying a known/trusted UTC time source and comparing the result to the local time. The difference (in milliseconds) could be used by the runner whenever a scheduled state is examined for execution. If this were exposed as a configuration, it could be set by the agent as an env var whenever it deployed a flow (the agent could do the checking, then set the variable appropriately). The downside of this approach is that it might create bad behavior for any state generated locally. For example, a local retry wouldn't want this adjustment to be applied. If the adjustment were positive (meaning it adds time to state start times because the local clock is running fast), then the local retry wouldn't run on time because the local clock wouldn't yet have reached its adjusted start time! However, if that's the only case that's problematic, we could handle it by only applying negative adjustments.
2. An alternative is for the agent to "hold" any runs that haven't reached their start time yet, rather than deploying them for execution immediately. (Note: this dovetails with an idea @joshmeek and I had to give scheduled runs to the agent slightly ahead of schedule but "hold" them so they started exactly on time, with no poll delay). The downside of this approach is that any run which involved a dynamic task, like a mapped task that gets its state directly from Cloud and consequently bypasses the agent, would fail to run.
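A rough sketch of both remedies, using only the standard library. The function names (`compute_adjust_ms`, `effective_start`, `hold_seconds`) and the `negative_only` flag are assumptions made for illustration; only the `time_adjust_ms` idea itself comes from the proposal above.

```python
from datetime import datetime, timedelta, timezone


def compute_adjust_ms(trusted_now: datetime, local_now: datetime) -> int:
    """Remedy 1: adjustment in milliseconds (local minus trusted).
    Negative when the local clock is slow, positive when it is fast."""
    return (local_now - trusted_now) // timedelta(milliseconds=1)


def effective_start(start_time: datetime, adjust_ms: int,
                    negative_only: bool = True) -> datetime:
    """Shift an externally supplied start time into local clock terms.
    With negative_only=True, positive adjustments are ignored so that
    locally generated states (e.g. retries) are never pushed later."""
    if negative_only:
        adjust_ms = min(adjust_ms, 0)
    return start_time + timedelta(milliseconds=adjust_ms)


def hold_seconds(start_time: datetime, local_now: datetime) -> float:
    """Remedy 2: how long an agent would hold a run before deploying it."""
    return max((start_time - local_now).total_seconds(), 0.0)


trusted_now = datetime(2019, 8, 24, 9, 0, tzinfo=timezone.utc)
slow_local = trusted_now - timedelta(minutes=5)   # clock 5 minutes slow
fast_local = trusted_now + timedelta(minutes=5)   # clock 5 minutes fast
start = datetime(2019, 8, 24, 9, 0, tzinfo=timezone.utc)

slow_adjust = compute_adjust_ms(trusted_now, slow_local)
fast_adjust = compute_adjust_ms(trusted_now, fast_local)

slow_start = effective_start(start, slow_adjust)  # moved back to 08:55
fast_start = effective_start(start, fast_adjust)  # left at 09:00
hold = hold_seconds(start, slow_local)            # agent waits out the skew

print(slow_adjust, fast_adjust, slow_start.time(), fast_start.time(), hold)
```

Under these assumptions a slow clock yields a negative adjustment that lets the run start immediately, while a fast clock is left unadjusted. Holding in the agent avoids touching start times at all, but it only helps for states that actually pass through the agent.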

Perhaps a combination of these approaches is possible? Perhaps there's a third?

cc @cicdw @joshmeek @zdhughes


cicdw · 2019-08-24T22:46:40Z

This is a super interesting problem - I have a few initial thoughts:

- Truly solving this problem (where we decide we can't trust machine clocks, of which several could be involved in the execution of a single flow, and each of which could be either too fast or too slow) is going to be a can of worms, and I'm worried that we would be taking on a complete solution to a notoriously difficult problem.
- Perhaps there is a way to hack an offset into all datetime / pendulum calls based on a computed offset from Prefect Cloud, although we might still need to control for variable network latency (if the goal is to be correct on the order of seconds). Perhaps we could create dynamic "timezones" that account for the offset and perform all computation in them.
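One common way to compute such an offset while controlling for latency is the NTP-style midpoint estimate, sketched below. This is an illustration, not something Prefect is known to implement: the function names are hypothetical, and the estimate assumes the network delay is roughly the same in both directions.

```python
from datetime import datetime, timedelta, timezone


def estimate_offset(sent_at: datetime, server_time: datetime,
                    received_at: datetime) -> timedelta:
    """Estimate (server clock - local clock), assuming symmetric delay.
    sent_at and received_at are read from the local clock; server_time is
    the timestamp the server reported while handling the request."""
    round_trip = received_at - sent_at
    local_midpoint = sent_at + round_trip / 2
    return server_time - local_midpoint


def corrected_now(local_now: datetime, offset: timedelta) -> datetime:
    """Local time shifted onto the server's clock."""
    return local_now + offset


# Local clock is 5 minutes slow; the request takes 200 ms round trip.
sent_at = datetime(2019, 8, 24, 8, 55, 0, 0, tzinfo=timezone.utc)
server_time = datetime(2019, 8, 24, 9, 0, 0, 100000, tzinfo=timezone.utc)
received_at = datetime(2019, 8, 24, 8, 55, 0, 200000, tzinfo=timezone.utc)

offset = estimate_offset(sent_at, server_time, received_at)
print(offset)                              # 0:05:00
print(corrected_now(received_at, offset))  # 2019-08-24 09:00:00.200000+00:00
```

If the delay is asymmetric, the error of this estimate is bounded by half the round-trip time, which is why variable latency matters when the goal is accuracy on the order of seconds.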

jlowin · 2019-08-26T15:08:53Z

The complication in all situations is that states from an outside source (like Cloud) need to be handled; states generated internally (like retries < 1 minute) do not.

cicdw · 2019-09-06T21:35:14Z

Update on this issue for anyone watching: our getRunsInQueue GraphQL mutation now accepts an optional datetime input called before; when provided, it only returns work scheduled prior to the before time (which defaults to Cloud's "now"). This endpoint is what the Prefect Agent calls when looking for work. After the next Cloud production release, we will update the Agents to pass this input with whatever each Agent thinks "now" is, so that the work an Agent requests reflects its own local clock.
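The effect of the before input can be illustrated with an in-memory stand-in. The `runs_in_queue` function and its arguments below are hypothetical and do not reflect the real GraphQL schema; the sketch only mirrors the described semantics (return work scheduled prior to before, defaulting to Cloud's "now").

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def runs_in_queue(scheduled_runs: Dict[str, datetime],
                  cloud_now: datetime,
                  before: Optional[datetime] = None) -> List[str]:
    """Hypothetical stand-in for the query: names of runs scheduled
    prior to `before`, which defaults to Cloud's own notion of now."""
    cutoff = before if before is not None else cloud_now
    return [name for name, start in scheduled_runs.items() if start < cutoff]


cloud_now = datetime(2019, 8, 24, 9, 0, 30, tzinfo=timezone.utc)
agent_now = cloud_now - timedelta(minutes=5)  # agent clock is 5 minutes slow

scheduled_runs = {
    "flow-a": datetime(2019, 8, 24, 9, 0, 0, tzinfo=timezone.utc),
    "flow-b": datetime(2019, 8, 24, 8, 50, 0, tzinfo=timezone.utc),
}

by_cloud_clock = runs_in_queue(scheduled_runs, cloud_now)
by_agent_clock = runs_in_queue(scheduled_runs, cloud_now, before=agent_now)

print(by_cloud_clock)  # ['flow-a', 'flow-b']
print(by_agent_clock)  # ['flow-b']
```

In this sketch the slow agent is not handed flow-a until its own clock passes 09:00, so the run is only submitted once the local start-time check would accept it.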


cicdw added this to the v0.6.4 milestone Sep 6, 2019

cicdw self-assigned this Sep 6, 2019

cicdw modified the milestones: v0.6.4, Future Sep 6, 2019

cicdw mentioned this issue Sep 13, 2019

Update Agents to provide their local time #1502

Merged

3 tasks

cicdw closed this as completed in #1502 Sep 13, 2019

cicdw reopened this Sep 13, 2019

cicdw closed this as completed Sep 18, 2019

abrookins added a commit that referenced this issue Mar 15, 2022

2.0b1 release notes (#1402)

82f17e5

* Update RELEASE-NOTES.md Co-authored-by: Michael Adkins <madkinszane@gmail.com> Co-authored-by: Andrew Brookins <andrew.b@prefect.io>
