The Janitor uses this query to determine which runs are lost and marks them as such:
@doc""" Return all runs that have been claimed by a worker before the earliest acceptable start time (determined by the longest acceptable run time) but are still incomplete. This indicates that we may have lost contact with the worker that was responsible for executing the run. """@speclost(DateTime.t())::Ecto.Queryable.t()deflost(%DateTime{}=now)domax_run_duration_seconds=Application.get_env(:lightning,:max_run_duration_seconds)grace_period=Lightning.Config.grace_period()oldest_valid_claim=now|>DateTime.add(-max_run_duration_seconds,:second)|>DateTime.add(-grace_period,:second)final_states=Run.final_states()from(attinRun,where: is_nil(att.finished_at),where: att.statenot in^final_states,where: att.claimed_at<^oldest_valid_claim)end
Now that we have dynamic max_run_duration being passed to the worker and we allow some runs to run longer than others, we need to refactor this query to account for the variable time limits.
Possible solutions:
Is the allowed run duration for a run stored on the run itself as metadata? If so, we could rewrite this query to check the run "against itself" --> i.e., `where runs.metadata['max_duration'] + runs.started_at > time.now()` (a rough sketch of this follows below)
..?
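For illustration, here's a rough sketch of what that "check the run against itself" version could look like. It assumes each run row stores its own limit in an `options` JSONB column written at enqueue time, which (as noted in the comment below) is not the case today; the column name and shape are assumptions, not the actual schema:

```elixir
import Ecto.Query

# Sketch only: assumes runs carry an `options` JSONB column like
# %{"max_run_duration_seconds" => 300}, which is NOT the case today.
def lost(%DateTime{} = now) do
  grace_period = Lightning.Config.grace_period()
  final_states = Run.final_states()

  from(r in Run,
    where: is_nil(r.finished_at),
    where: r.state not in ^final_states,
    # "lost" = the run's own deadline (claim time + its max duration + grace
    # period) has already passed without the run reaching a final state
    where:
      fragment(
        "? + make_interval(secs => (?->>'max_run_duration_seconds')::int + ?) < ?",
        r.claimed_at,
        r.options,
        ^grace_period,
        ^now
      )
  )
end
```

With per-run limits stored on the row, the Janitor could keep calling a single query with `DateTime.utc_now/0` and let the database compare each run against its own deadline.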
@stuartc, interesting one here. We do not store the options on the runs table. Instead, they're generated on the fly by the worker when it claims a run. This has a couple of interesting impacts:
Nice: If I have 100 runs in the queue that are failing because of timeouts, I can change the limits (new ENV if it's my deployment, upgrade plan if on a hosted deployment, etc.) and then they'll start to succeed.
Naughty: If I set disable_console_log: true as one of the options for my workflow (note this still hasn't been ported from v1, but will be coming soon) and then execute a run that gets stuck in the queue, someone else might come along and enable console.log statements. Even though it was disabled when I created the run, the worker would discover (at claim time) that it's totally allowed to use console.log.
A little frustrating, but neither inherently good nor bad: If there are a bunch of unfinished runs, there's no way to check the runs table to tell which are actually lost and which simply have extended durations compared to the instance default. @elias-ba points out that we could rewrite Runs.Query.lost/1 as Runs.Query.lost/2 and query per project, or at least per set of projects on the same plan (i.e., a set of projects with identical run timeout limits); that's probably the best near-term fix (rough sketch below).
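A rough sketch of what that lost/2 could look like, assuming the caller passes the project ids and the timeout that applies to that group; the run -> work order -> workflow -> project association path is an assumption about the schema, not a confirmed one:

```elixir
import Ecto.Query

# Sketch only: `project_ids` and `max_run_duration_seconds` come from the
# caller, which groups projects by plan / identical run timeout limits.
def lost(%DateTime{} = now, %{
      project_ids: project_ids,
      max_run_duration_seconds: max_run_duration_seconds
    }) do
  grace_period = Lightning.Config.grace_period()

  oldest_valid_claim =
    now
    |> DateTime.add(-max_run_duration_seconds, :second)
    |> DateTime.add(-grace_period, :second)

  final_states = Run.final_states()

  from(r in Run,
    # assumed association path from a run to its project
    join: wo in assoc(r, :work_order),
    join: wf in assoc(wo, :workflow),
    where: wf.project_id in ^project_ids,
    where: is_nil(r.finished_at),
    where: r.state not in ^final_states,
    where: r.claimed_at < ^oldest_valid_claim
  )
end
```

The Janitor would then run this once per timeout group rather than once for the whole instance.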