Handle multi-leaf state in cron workflows #1375

@josephjclark

The Problem

When a cron job runs, it takes the state from the previous run and passes it into the next run.

We recently changed the cron behaviour in Lightning: instead of using the state of the first job, we now give users the option to pick which step's state to use, defaulting to the final state.

So the workflow runs, returns {}, and then {} is passed as the input to the next run.

Final state works a bit differently for workflows with multiple leaf nodes. Instead of returning a single state object, it returns an object keyed by step, holding the state for each leaf, like: { 'step-a': { x: 1 }, 'step-b': { x: 2 } }. This is also true for an empty workflow.

It's a bit hard to wrap your head around, but I think what happens is this:

first run input: {}
first run output: { a: {}, b: {} }
second run input: { a: {}, b: {} }
second run output:

{
  a: { a: {}, b: {} },
  b: { a: {}, b: {} }
}

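This happens because each step just returns its input, and we return one state object per leaf. So every run wraps the previous final state in another layer, once per leaf, and the serialized state roughly multiplies in size by the number of leaves each time.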
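To make the growth concrete, here is a minimal sketch that simulates the loop described above. It is not the runtime's code: `runWorkflow`, `simulateCron` and `depth` are hypothetical helpers, and the only behaviour assumed is what this issue describes (each empty leaf returns its input, and the final state is keyed by leaf).

```typescript
// State is modelled loosely as a nested plain object.
type State = { [key: string]: State };

// Hypothetical stand-in for one run of a workflow whose leaf steps contain
// no code: every leaf returns its input unchanged, and the final state is
// an object keyed by leaf id.
function runWorkflow(input: State, leafIds: string[]): State {
  const finalState: State = {};
  for (const id of leafIds) {
    finalState[id] = input; // an empty step just passes its input through
  }
  return finalState;
}

// Nesting depth of a state object ({} has depth 0).
function depth(state: State): number {
  let max = 0;
  for (const key of Object.keys(state)) {
    max = Math.max(max, 1 + depth(state[key]));
  }
  return max;
}

// Simulate the cron trigger: each run's final state is the next run's input.
function simulateCron(runs: number, leafIds: string[]): State {
  let state: State = {};
  for (let i = 0; i < runs; i++) {
    state = runWorkflow(state, leafIds);
  }
  return state;
}

const afterTwo = simulateCron(2, ["a", "b"]);
console.log(JSON.stringify(afterTwo));
// {"a":{"a":{},"b":{}},"b":{"a":{},"b":{}}}

// Print run count, nesting depth and serialized size for a few run counts.
for (const runs of [1, 2, 5, 10]) {
  const s = simulateCron(runs, ["a", "b"]);
  console.log(runs, depth(s), JSON.stringify(s).length);
}
```

With two leaves the serialized size a little more than doubles on each run, which is the size that matters when the dataclip is shipped between the worker and the app.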

That leads to this sort of thing in production:

[Image]

Eventually, the state object gets huge, and it costs a ton of memory and processing to ship the dataclip between the worker and app, resulting in performance degradation and even server crashes.

An aggravating factor is that AI-generated cron workflows with no code and multiple leaves (which is common!) will run indefinitely on the platform, slowly building up bigger and bigger state objects until things start blowing up.

The temporary fix

In #1371 I added a rough fix which identifies empty state returned by a leaf node, and removes it from the final state object. So an empty workflow returns {} as its final state. This should neutralise the problem.
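The idea behind this kind of fix can be sketched roughly as below. This is not the code from #1371: `isEmptyState` and `pruneEmptyLeaves` are hypothetical names, and treating both {} and { data: {} } as "empty" is an assumption made for illustration.

```typescript
type StateObject = Record<string, unknown>;

function isPlainObject(value: unknown): value is StateObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Assumption: a leaf state counts as empty if it is {} or { data: {} }.
function isEmptyState(state: unknown): boolean {
  if (!isPlainObject(state)) return false;
  const keys = Object.keys(state);
  if (keys.length === 0) return true;
  if (keys.length === 1 && keys[0] === "data") {
    const data = state["data"];
    return isPlainObject(data) && Object.keys(data).length === 0;
  }
  return false;
}

// Drop empty leaf states from a final state object keyed by leaf id.
function pruneEmptyLeaves(leafStates: StateObject): StateObject {
  const result: StateObject = {};
  for (const id of Object.keys(leafStates)) {
    if (!isEmptyState(leafStates[id])) {
      result[id] = leafStates[id];
    }
  }
  return result;
}

// An empty workflow now yields {} rather than { a: {}, b: {} }
console.log(JSON.stringify(pruneEmptyLeaves({ a: {}, b: { data: {} } })));
// Non-empty leaves are kept as they are
console.log(JSON.stringify(pruneEmptyLeaves({ a: { x: 1 }, b: {} })));
```

Note that this only breaks the loop when every leaf is empty. A leaf that passes a non-empty input straight through would still be nested on the next run.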

But it's not a great fix really.

Solutions

  1. The runtime will always return, at a minimum, { data: {} } from a step; that is, the default state object includes a data key. This was a decision made very early on in the new runtime. I think we should drop this. Steps would then naturally return an empty state object, which makes it a bit easier to return no leaf state.
  2. We need to think holistically about final state for leaf nodes and how this affects cron workflows. What I think we really need is a thing called a state reconciler: this takes multiple objects and merges them together (the simplest being a basic spread, or probably a deep spread actually). You then attach a reconciler to the workflow and that gives you a single final state, which sort of removes this whole problem.
  3. Can we do something to detect state recursion generally? This feels like a problem that might affect users, even with single-exit workflows, who happen to be building state objects poorly.
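As a rough illustration of the reconciler idea in item 2, here is a minimal sketch of a shallow and a deep spread over leaf states. Nothing here exists in the runtime: the `Reconciler` type, `shallowSpread`, `deepSpread` and `deepMerge` are all hypothetical names, and "later leaves win on conflict, arrays are replaced rather than concatenated" is just one possible policy.

```typescript
type StateObject = Record<string, unknown>;

// Hypothetical shape: a reconciler turns many leaf states into one state.
type Reconciler = (leafStates: StateObject[]) => StateObject;

function isPlainObject(value: unknown): value is StateObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively merge source into a copy of target. Nested plain objects are
// merged key by key; anything else (including arrays) is overwritten.
function deepMerge(target: StateObject, source: StateObject): StateObject {
  const out: StateObject = { ...target };
  for (const key of Object.keys(source)) {
    const existing = out[key];
    const incoming = source[key];
    out[key] =
      isPlainObject(existing) && isPlainObject(incoming)
        ? deepMerge(existing, incoming)
        : incoming;
  }
  return out;
}

// Simplest reconciler: a basic spread, where later leaves overwrite
// top-level keys of earlier ones.
const shallowSpread: Reconciler = (leafStates) =>
  Object.assign({}, ...leafStates);

// Deep spread: nested objects are merged rather than overwritten.
const deepSpread: Reconciler = (leafStates) =>
  leafStates.reduce<StateObject>((acc, s) => deepMerge(acc, s), {});

const leaves: StateObject[] = [{ data: { x: 1 } }, { data: { y: 2 } }];
console.log(JSON.stringify(shallowSpread(leaves))); // {"data":{"y":2}}
console.log(JSON.stringify(deepSpread(leaves))); // {"data":{"x":1,"y":2}}

// Empty leaves reconcile to a single empty state, so nothing accumulates
// between cron runs.
console.log(JSON.stringify(deepSpread([{}, {}]))); // {}
```

Because the reconciled result is a single state object rather than an object keyed by leaf, feeding it back into the next cron run does not add a layer of nesting.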


    Status

    Product Backlog
