Handle cases where Prefect is deployed to multiple environments with differently-calibrated clocks #1402

Closed
jlowin opened this issue Aug 24, 2019 · 3 comments · Fixed by #1502

@jlowin
Member

jlowin commented Aug 24, 2019

It's possible that if a Prefect flow is deployed in an environment with an incorrect clock, then it may not run (if it receives states from an external system like Prefect Cloud).

For example, say the clock in a cluster is running 5 minutes slow, and a flow is supposed to start at 9:00am UTC. At 9:00am UTC, the agent in that cluster will receive a payload including a Scheduled state with a start time of 9:00:00 UTC. Note that in the cluster, however, the time is 8:55am UTC. The agent will take that Scheduled state and submit it for execution, but the Runner classes will refuse to run it because it hasn't reached its start time yet -- and won't for 5 minutes.
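
The scenario above can be reproduced with a few lines of Python. This is only a sketch: `is_ready_to_run` is a hypothetical stand-in for the start-time check performed by the Runner classes, not Prefect's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def is_ready_to_run(scheduled_start: datetime, local_now: datetime) -> bool:
    """Hypothetical stand-in for the runner's check: run only once the
    local clock has reached the scheduled start time."""
    return local_now >= scheduled_start


# True UTC time and the start time carried by the Scheduled state.
true_now = datetime(2019, 8, 24, 9, 0, tzinfo=timezone.utc)
scheduled_start = datetime(2019, 8, 24, 9, 0, tzinfo=timezone.utc)

# The cluster clock runs 5 minutes slow, so it reads 8:55am UTC.
cluster_now = true_now - timedelta(minutes=5)

ready = is_ready_to_run(scheduled_start, cluster_now)
wait = scheduled_start - cluster_now  # how long the run stays blocked

print(ready, wait)
```

With these values the run is submitted on time by Cloud's clock but is refused locally until the cluster clock catches up.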

We have two possible remedies for this situation:

  • A global time_adjust_ms config. This could be set by querying a known/trusted UTC time source and comparing the result to the local time. The difference (in milliseconds) could be used by the runner whenever a scheduled state was examined for execution. If this were exposed as a configuration, it could be set by the agent as an env var whenever it deployed a flow (the agent could do the checking, then set the variable appropriately). The downside of this approach is that it might create bad behavior for any state generated locally. For example, a local retry wouldn't want this adjustment to be applied. If the adjustment were positive (meaning it adds time to state start times because the local clock is running fast), then the local retry wouldn't happen at all because it wouldn't have reached its presumed start time! However, if that's the only case that's problematic, we could handle it by only applying negative adjustments.

  • An alternative is for the agent to "hold" any runs that haven't reached their start time yet, rather than deploying them for execution immediately. (Note: this dovetails with an idea @joshmeek and I had to give scheduled runs to the agent slightly ahead of schedule but "hold" them so they started exactly on time, with no poll delay). The downside of this approach is that any run which involved a dynamic task, like a mapped task that gets its state directly from Cloud and consequently bypasses the agent, would fail to run.
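
A minimal sketch of both remedies, assuming the sign convention described above (the adjustment is local time minus trusted time, so it is positive when the local clock runs fast). The function names are hypothetical and are not part of Prefect; only `time_adjust_ms` comes from the proposal itself.

```python
from datetime import datetime, timedelta, timezone


def compute_time_adjust_ms(trusted_now: datetime, local_now: datetime) -> int:
    """Offset in milliseconds: positive if the local clock is fast,
    negative if it is slow (hypothetical helper)."""
    return round((local_now - trusted_now).total_seconds() * 1000)


def adjusted_start_time(
    start_time: datetime, time_adjust_ms: int, negative_only: bool = True
) -> datetime:
    """Shift an externally supplied start time into local clock terms.
    With negative_only=True, positive adjustments are ignored so that
    locally generated states such as retries are never pushed later."""
    if negative_only and time_adjust_ms > 0:
        return start_time
    return start_time + timedelta(milliseconds=time_adjust_ms)


def hold_seconds(start_time: datetime, local_now: datetime) -> float:
    """Second remedy: how long an agent would hold a run before
    deploying it, judged by the agent's own clock."""
    return max((start_time - local_now).total_seconds(), 0.0)


trusted_now = datetime(2019, 8, 24, 9, 0, tzinfo=timezone.utc)
local_now = trusted_now - timedelta(minutes=5)  # local clock is 5 minutes slow
start_time = trusted_now

adjust_ms = compute_time_adjust_ms(trusted_now, local_now)
print(adjust_ms)                                   # offset for a slow clock
print(adjusted_start_time(start_time, adjust_ms))  # start time in local terms
print(hold_seconds(start_time, local_now))         # agent-side hold instead
```

The adjustment lets the run start immediately on the slow cluster, while the hold delays it until the local clock reaches the start time. Neither sketch addresses states that bypass the agent.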

Perhaps a combination of these approaches is possible? Perhaps there's a third?

cc @cicdw @joshmeek @zdhughes

@cicdw
Member

cicdw commented Aug 24, 2019

This is a super interesting problem - I have a few initial thoughts:

  • truly solving this problem (where we decide we can't trust machine clocks, of which there could be multiple involved in the execution of a single flow, and which could be either too fast or too slow) is going to be a can of worms, and I'm worried that we'd be taking on a complete solution to a notoriously difficult problem
  • perhaps there is a way to hack an offset into all datetime / pendulum calls based on a computed offset from Prefect Cloud, although we might still need to control for variable network latency (if the goal is to be correct on the order of seconds). Perhaps we could create dynamic "timezones" that all computation is performed in that account for the offset
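
One common way to control for network latency when computing such an offset is to timestamp the request on both sides and compare the server's time against the midpoint, as NTP-style estimates do. The sketch below assumes roughly symmetric latency; `estimate_offset` and the injected `fetch_server_time` callable are hypothetical, and no real Prefect Cloud endpoint is implied.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def estimate_offset(
    fetch_server_time: Callable[[], datetime],
    local_clock: Callable[[], datetime],
) -> timedelta:
    """Estimate (server time - local time), assuming the request and the
    response each take about half of the round trip."""
    sent = local_clock()
    server_time = fetch_server_time()
    received = local_clock()
    midpoint = sent + (received - sent) / 2
    return server_time - midpoint


# Simulated exchange: the local clock is 5 minutes slow and the round
# trip takes 200 ms, with the server answering at the halfway point.
ticks = iter([
    datetime(2019, 8, 24, 8, 55, 0, 0, tzinfo=timezone.utc),
    datetime(2019, 8, 24, 8, 55, 0, 200000, tzinfo=timezone.utc),
])
offset = estimate_offset(
    fetch_server_time=lambda: datetime(2019, 8, 24, 9, 0, 0, 100000, tzinfo=timezone.utc),
    local_clock=lambda: next(ticks),
)
print(offset)
```

If latency is asymmetric, the estimate is off by up to half the round-trip time, which is one reason second-level accuracy is harder than it looks.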

@jlowin
Member Author

jlowin commented Aug 26, 2019

The complication in all situations is that states from an outside source (like Cloud) need to be handled; states generated internally (like retries < 1 minute) do not.

@cicdw cicdw added this to the v0.6.4 milestone Sep 6, 2019
@cicdw cicdw self-assigned this Sep 6, 2019
@cicdw
Member

cicdw commented Sep 6, 2019

Update on this issue for anyone watching: our getRunsInQueue GraphQL mutation now accepts an optional datetime input called before; when provided, this will only return work which is scheduled prior to the before time (which defaults to Cloud's "now"). This endpoint is what the Prefect Agent calls when looking for work -- after the next Cloud production release, we will update the Agents to pass this keyword with whatever each Agent thinks "now" is --> this way, the work an Agent requests reflects its own local clock.
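
The effect of the before input can be illustrated with an in-memory simulation. This is not the Cloud API: `QueuedRun` and `runs_in_queue` are hypothetical names, and treating the cutoff as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple, Optional


class QueuedRun(NamedTuple):
    run_id: str
    scheduled_start: datetime


def runs_in_queue(
    queue: List[QueuedRun],
    server_now: datetime,
    before: Optional[datetime] = None,
) -> List[QueuedRun]:
    """Return runs scheduled up to the cutoff. The cutoff is `before`
    when given, otherwise the server's own "now" (assumed inclusive)."""
    cutoff = before if before is not None else server_now
    return [run for run in queue if run.scheduled_start <= cutoff]


server_now = datetime(2019, 8, 24, 9, 0, tzinfo=timezone.utc)
agent_now = server_now - timedelta(minutes=5)  # agent clock is 5 minutes slow

queue = [
    QueuedRun("a", datetime(2019, 8, 24, 8, 50, tzinfo=timezone.utc)),
    QueuedRun("b", datetime(2019, 8, 24, 9, 0, tzinfo=timezone.utc)),
]

default_ids = [run.run_id for run in runs_in_queue(queue, server_now)]
agent_ids = [run.run_id for run in runs_in_queue(queue, server_now, before=agent_now)]
print(default_ids, agent_ids)
```

Passing the agent's own "now" means the agent only receives work that its local runner would also consider due.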

@cicdw cicdw modified the milestones: v0.6.4, Future Sep 6, 2019
@cicdw cicdw reopened this Sep 13, 2019
@cicdw cicdw closed this as completed Sep 18, 2019
abrookins added a commit that referenced this issue Mar 15, 2022
* Update RELEASE-NOTES.md

Co-authored-by: Michael Adkins <madkinszane@gmail.com>
Co-authored-by: Andrew Brookins <andrew.b@prefect.io>