YAML Based Broker SDK workflows #436

Merged
khaliqgant merged 27 commits into main from sdk-workflows on Feb 18, 2026
Conversation

@khaliqgant (Collaborator) commented on Feb 18, 2026

khaliqgant and others added 17 commits February 17, 2026 21:59
…low definitions

Three major additions to the workflows spec:

1. Reflection Protocol — event-driven reflection inspired by the Generative
   Agents paper (Park et al., 2023). Importance-weighted message accumulation
   triggers focal point generation, synthesis, and course correction.
   Includes ReflectionEngine implementation, REFLECT message protocol,
   and per-pattern reflection behavior.

2. Trajectory Integration — formal integration with the agent-trajectories
   SDK (v0.4.0). Workflows auto-record messages, reflections, and decisions
   as trajectory events. Auto-generates retrospectives on completion.
   Enables cross-workflow learning and compliance/attribution.

3. YAML Workflow Definitions — portable YAML schema for defining workflows,
   compatible with relay-cloud's relay.yaml (PR #94). Supports template
   variables, DAG-based step parallelism, built-in templates, and
   progressive configuration (one-liner to full custom).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
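The PR description doesn't show the YAML schema itself; as a rough TypeScript sketch of the kind of definition and template-variable interpolation described above (all field names are illustrative, not the SDK's actual schema):

```typescript
// Illustrative sketch only: the real schema lives in the SDK and
// relay-cloud's relay.yaml, and may use different field names.
interface WorkflowStepDef {
  name: string;
  dependsOn?: string[];          // DAG edges enabling step parallelism
  prompt: string;                // may contain {{variable}} placeholders
}

interface WorkflowDef {
  name: string;
  template?: string;             // built-in template to extend
  vars?: Record<string, string>; // template variables
  steps: WorkflowStepDef[];
}

// Resolve {{variable}} placeholders; unknown variables are left intact
// so a later configuration layer can still fill them in.
function interpolate(text: string, vars: Record<string, string>): string {
  return text.replace(/\{\{(\w+)\}\}/g, (_, key) => vars[key] ?? `{{${key}}}`);
}
```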
New patterns (6-10):
- handoff: dynamic routing with circuit breaker (max hops)
- cascade: cost-aware LLM escalation (cheap → capable)
- dag: directed acyclic graph with parallel execution
- debate: adversarial refinement with structured rounds + judge
- hierarchical: multi-level delegation tree (lead → coordinators → workers)

New primitives required:
- DAG Scheduler (topological sort, parallel dispatch, join tracking)
- Handoff Controller (active agent tracking, context transfer)
- Round Manager (debate rounds, turn order, convergence detection)
- Confidence Parser (extract [confidence=X.X] from DONE messages)
- Tree Validator (structural validation, sub-team computation)

New message protocol signals:
- HANDOFF, CONFIDENCE, ARGUMENT, CONCEDE, VERDICT, TEAM_DONE

Includes pattern × primitive matrix showing what each pattern needs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
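Of the primitives listed above, the DAG Scheduler is the most load-bearing. A minimal sketch of its topological batching, assuming a simple node/deps shape (this is not the SDK's actual implementation):

```typescript
// Group DAG nodes into batches: each batch contains every node whose
// dependencies have all completed, so a batch can be dispatched in parallel.
type DagNode = { id: string; deps: string[] };

function executionBatches(nodes: DagNode[]): string[][] {
  const done = new Set<string>();
  const remaining = new Map(nodes.map(n => [n.id, n]));
  const batches: string[][] = [];
  while (remaining.size > 0) {
    const ready = [...remaining.values()]
      .filter(n => n.deps.every(d => done.has(d)))
      .map(n => n.id);
    // No ready nodes but work remains: the graph is not acyclic.
    if (ready.length === 0) throw new Error('cycle detected in DAG');
    for (const id of ready) { done.add(id); remaining.delete(id); }
    batches.push(ready);
  }
  return batches;
}
```

Join tracking then amounts to awaiting each batch before computing the next one.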
Decision framework and reference for fan-out, pipeline, hub-spoke,
consensus, mesh, handoff, cascade, dag, debate, and hierarchical
patterns. Includes reflection protocol, YAML workflow definitions,
and common mistakes guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DAG-based execution plan with 9 nodes covering shared types, DB migration,
workflow runner, swarm coordinator, templates, API endpoints, CLI commands,
dashboard panel, and integration tests. Uses broker SDK for agent lifecycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tor script

Adds stigmergic state store, agent pool manager, auction engine, branch
pruner, and gossip disseminator to WORKFLOWS_SPEC.md (Phase 5). These
bring coverage from 67% to 88% of the 42 swarm techniques catalogued
from multi-agent orchestration literature.

Also adds executable broker SDK script (scripts/run-swarm-implementation.ts)
that uses a DAG pattern to coordinate 9 work nodes implementing relay-cloud
PR #94, with dependency-aware parallel execution and convention injection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Script fixes:
- Use Promise.allSettled instead of Promise.race for batch execution
- Add --resume support with state persistence to .relay/swarm-impl-state.json
- Propagate failures to downstream nodes immediately (mark as "blocked")
- Add readFirst field to DAG nodes so agents read existing code first
- Require detailed DONE messages with type signatures and file paths
- Add resolved guard to prevent double-resolution in polling loop
- Add "blocked" status to NodeResult for better reporting

Skill updates:
- Add "DAG Executor Pitfalls" section with 6 common implementation mistakes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
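The first two script fixes can be sketched together: run each batch with Promise.allSettled so one failure doesn't abandon its siblings, then immediately mark downstream nodes as blocked. The function shapes here are simplified stand-ins for the script's actual types:

```typescript
// Execute one DAG batch; allSettled (unlike race) waits for every node,
// and failures propagate to downstream nodes as "blocked".
async function runBatch(
  batch: string[],
  run: (id: string) => Promise<void>,
  downstream: (id: string) => string[],
  status: Map<string, 'done' | 'failed' | 'blocked'>,
): Promise<void> {
  const results = await Promise.allSettled(batch.map(id => run(id)));
  results.forEach((r, i) => {
    const id = batch[i];
    if (r.status === 'fulfilled') {
      status.set(id, 'done');
    } else {
      status.set(id, 'failed');
      for (const dep of downstream(id)) status.set(dep, 'blocked');
    }
  });
}
```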
Updated spawn/send/release/logs commands to match actual CLI syntax
(positional args, not --flag format). Verified with --dry-run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Import AgentRelayClient, getLogs, and BrokerEvent directly from the
broker SDK sub-paths (client, logs, protocol) which avoid the
@relaycast/sdk transitive dependency. Replaces all execSync calls
with proper SDK methods: spawnPty, release, listAgents, onEvent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The AgentRelayClient expects the Rust broker binary which has
init --name --channels for protocol mode. The Node.js CLI binary
has a different init command (setup wizard). Built Rust binary with
cargo build and pointed binaryPath to target/debug/agent-relay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Relaycast API returns 409 when creating a workspace with a name
that already exists. Without cached credentials the broker can't
recover. Use a timestamped broker name to ensure uniqueness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…olling

The Rust broker doesn't write worker-logs/ files — that's a Node.js
CLI feature. Switch watchForDone to use broker events:
- worker_stream: accumulate PTY output chunks, scan for DONE/ERROR
- relay_inbound: relay messages from agents
- agent_exited: detect agent termination

Remove unused getLogs import.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of parsing PTY output for DONE signals (which matched the
prompt template text), agents now:
1. Do their work
2. Send a relay message with "DONE: <summary>" to the workflow channel
3. Exit naturally

The orchestrator watches for:
- relay_inbound: captures DONE/ERROR summaries for downstream deps
- agent_exited: definitive completion signal (code 0 = success)

Removed all "DONE: <detailed summary>" template text from task
prompts to prevent false positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
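The watcher described here might look like the following sketch. The event names (relay_inbound, agent_exited) follow the commit text, but the payload shapes are assumptions, and a later commit revises this approach because PTY agents can't send relay messages:

```typescript
// Assumed event shapes; the real broker SDK payloads may differ.
type BrokerEvent =
  | { type: 'relay_inbound'; from: string; body: string }
  | { type: 'agent_exited'; agent: string; code: number };

// DONE messages carry the summary; agent_exited is the definitive signal.
function makeWatcher(
  onDone: (agent: string, summary?: string) => void,
  onFail: (agent: string) => void,
): (ev: BrokerEvent) => void {
  const summaries = new Map<string, string>();
  return ev => {
    if (ev.type === 'relay_inbound') {
      const m = ev.body.match(/^DONE:\s*(.*)/s);
      if (m) summaries.set(ev.from, m[1]);
    } else if (ev.type === 'agent_exited') {
      if (ev.code === 0) onDone(ev.agent, summaries.get(ev.agent));
      else onFail(ev.agent);
    }
  };
}
```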
Spawned PTY agents don't have MCP relay tools, so they can't send
relay messages. Instead, agents now write their summary to
.relay/summaries/{nodeId}.md before exiting. The orchestrator waits
for agent_exited, then reads the summary file for downstream deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
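The summary-file handoff reduces to a small read after the exit event fires. A sketch, assuming the `.relay/summaries/{nodeId}.md` layout described above:

```typescript
import { readFile } from 'node:fs/promises';
import * as path from 'node:path';

// After agent_exited fires for a node, read the summary the agent wrote
// before exiting; downstream dependencies receive this text as context.
async function readNodeSummary(
  nodeId: string,
  root = '.relay/summaries',
): Promise<string | undefined> {
  try {
    return await readFile(path.join(root, `${nodeId}.md`), 'utf8');
  } catch {
    return undefined; // agent exited without writing a summary
  }
}
```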
New pitfalls from running the swarm implementation script:
- PTY prompt echo matching signal keywords (false DONE completion)
- Assuming agent capabilities (PTY agents lack MCP tools)
- Rust broker vs Node.js CLI binary confusion
- Log polling assumes Node.js daemon (Rust broker doesn't write logs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- runner.ts: executeStep now throws after marking step failed, enabling
  fail-fast/continue error strategies to trigger via Promise.allSettled
- cli/index.ts: runScriptFile now only catches ENOENT errors, properly
  propagating script execution failures instead of trying next runner

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Similar to Wrangler's telemetry.md, this document explains:
- What data is collected and why
- What is explicitly NOT collected
- How to opt out (CLI, env var, config file)
- How to view telemetry events for debugging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@devin-ai-integration bot (Contributor) left a comment:
Devin Review found 1 new potential issue.

View 25 additional findings in Devin Review.

Comment on lines +604 to +608
if (strategy === 'fail-fast') {
  // Mark all pending downstream steps as skipped
  await this.markDownstreamSkipped(step.name, workflow.steps, stepStates, runId);
  throw new Error(`Step "${step.name}" failed: ${error}`);
}
🟡 fail-fast with parallel step failures leaves downstream steps of subsequent failures in 'pending' state

When multiple steps run in parallel and more than one fails under the fail-fast strategy, only the first failed step's downstream dependents are marked as skipped. The loop throws immediately after processing the first failure, so downstream steps of the second (and subsequent) failed steps remain in pending state instead of skipped.

Root Cause and Impact

In executeSteps at packages/sdk-ts/src/workflows/runner.ts:593-614, the results of Promise.allSettled are iterated. When the first rejected result is encountered with fail-fast strategy, markDownstreamSkipped is called for that step, and then an error is thrown at line 607. This means subsequent rejected results in the same batch are never processed — their downstream steps are never marked as skipped.

For example, if steps A and B run in parallel and both fail:

  1. Step A's failure is processed: A's downstream steps are marked skipped, then throw
  2. Step B's failure is never processed in this loop (B itself is already marked failed by executeStep)
  3. Step B's downstream steps remain in pending state in the DB

The run correctly ends in failed status (via the catch block at line 437), but the step state in the database is inconsistent — some steps that should be skipped are left as pending. This affects any UI or API that reads step states to show workflow progress, and it affects resume() which would attempt to re-run those pending steps even though their upstream dependency failed.

Prompt for agents

In packages/sdk-ts/src/workflows/runner.ts, in the executeSteps method around lines 593-614, the fail-fast strategy throws immediately after the first rejected result, skipping processing of subsequent rejected results in the same batch. To fix this, process ALL rejected results before throwing. Specifically, change the loop so that it:

  1. Iterates through all results and marks each failed step and its downstream as skipped
  2. Collects the first error
  3. After the loop, throws the collected error

This ensures all downstream steps of all failed parallel steps are properly marked as skipped before the throw.
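The loop shape that suggestion describes can be sketched as follows; markDownstreamSkipped and the step/result shapes are simplified stand-ins for the runner's actual types:

```typescript
// Process every rejected result in the batch before throwing, so all
// failed parallel steps get their downstream dependents marked skipped.
async function handleBatchResults(
  steps: { name: string }[],
  results: PromiseSettledResult<void>[],
  markDownstreamSkipped: (step: string) => Promise<void>,
): Promise<void> {
  let firstError: unknown;
  for (let i = 0; i < results.length; i++) {
    const r = results[i];
    if (r.status === 'rejected') {
      await markDownstreamSkipped(steps[i].name);
      // Remember only the first error; keep processing the rest.
      firstError ??= new Error(`Step "${steps[i].name}" failed: ${r.reason}`);
    }
  }
  if (firstError) throw firstError;
}
```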

@khaliqgant khaliqgant merged commit cf26336 into main Feb 18, 2026
37 checks passed
@khaliqgant khaliqgant deleted the sdk-workflows branch February 18, 2026 18:56
