Handle cases where Prefect is deployed to multiple environments with differently-calibrated clocks #1402
It's possible that if a Prefect flow is deployed in an environment with an incorrect clock, it may not run (if it receives states from an external system like Prefect Cloud).

For example, say the clock in a cluster is running 5 minutes slow, and a flow is supposed to start at 9:00am UTC. At 9:00am UTC, the agent in that cluster will receive a payload including a `Scheduled` state with a start time of `9:00:00 UTC`. In the cluster, however, the local time is 8:55am UTC. The agent will take that `Scheduled` state and submit it for execution, but the `Runner` classes will refuse to run it because it hasn't reached its start time yet -- and won't for 5 minutes.

We have two possible remedies for this situation:
1. A global `time_adjust_ms` config. This could be set by querying a known/trusted UTC time source and comparing the result to the local clock. The difference (in milliseconds) could be applied by the runner whenever a scheduled state was examined for execution. If this were exposed as a configuration, it could be set by the agent as an env var whenever it deployed a flow (the agent could do the checking, then set the variable appropriately). The downside of this approach is that it might create bad behavior for states generated locally. For example, a local retry wouldn't want this adjustment applied: if the adjustment were positive (meaning it adds time to state start times because the local clock is running fast), the local retry wouldn't happen at all because it wouldn't have reached its presumed start time! However, if that's the only problematic case, we could handle it by applying only negative adjustments.

2. An alternative is for the agent to "hold" any runs that haven't reached their start time yet, rather than deploying them for execution immediately. (Note: this dovetails with an idea @joshmeek and I had to give scheduled runs to the agent slightly ahead of schedule but "hold" them so they started exactly on time, with no poll delay.) The downside of this approach is that any run involving a dynamic task -- like a mapped task that gets its state directly from Cloud and consequently bypasses the agent -- would fail to run.
Perhaps a combination of these approaches is possible? Perhaps there's a third?
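The second remedy (the agent "holding" runs until their start time) could be sketched as a small polling loop like the one below; `hold_until_start` and the `deploy` callback are assumed names for illustration, not part of any agent API.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def hold_until_start(
    start_time: datetime,
    deploy: Callable[[], None],
    poll_interval: float = 0.5,
) -> None:
    """Hold a scheduled run at the agent, deploying it only once its start
    time has passed according to the agent's local clock."""
    while datetime.now(timezone.utc) < start_time:
        time.sleep(poll_interval)
    deploy()
```

In this sketch the hold happens at the agent, so it only helps runs that pass through the agent; as noted above, dynamic work that fetches its state directly from Cloud would bypass the hold entirely.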
cc @cicdw @joshmeek @zdhughes