
feat(gastown): Workflows — visual DAG builder with scheduled and event-triggered execution #1943


Summary

Add a workflow system to Gastown that lets users define multi-step automation pipelines. Workflows can run on a schedule (cron), on events (bead closed, convoy landed, PR merged), on webhooks, or manually. Each node in the workflow is an action — sling a bead, run a convoy, call an API, send a notification, run a script, or talk to the Mayor.

This system is built around the canonical Gastown/Gascity mental model: formula (authored reusable definition) → molecule (durable runtime instance) → step runs (per-step tracked work). The implementation treats the existing in-repo molecule internals (createMolecule, advanceMoleculeStep, rig_molecules table) as disposable legacy and rebuilds from a clean definition/run model.

Architecture Review: This spec addresses all 15 findings from the Architecture Review comment. See the Critique Findings Tracker at the bottom for a mapping of each finding to its resolution.


Motivation

Convoys are great for one-shot batches of related work. But many real-world patterns are recurring or event-driven:

  • "Every Monday, scan for dependency updates and create PRs for outdated packages"
  • "When a PR is merged to main, run integration tests and deploy to staging"
  • "Every night, generate a changelog from the last 24h of merged PRs"
  • "When an issue is labeled gastown, sling it as a bead automatically"
  • "Weekly: audit code for security vulnerabilities, create beads for anything critical"

Workflows turn Gastown from a task executor into an automation platform.


Naming / Canonical Compatibility

| Canonical Term | Cloud Product Label | Data Model Name | Description |
| --- | --- | --- | --- |
| Formula | Workflow Definition | workflow_definition | Authored, reusable, versioned graph definition |
| Molecule | Workflow Run | workflow_run | Instantiated durable execution of a definition revision |
| Step Run | Step / Task | step_run | Per-step runtime state within a run |
| Wisp | Ephemeral Run (future) | — | Lightweight run that doesn't persist long-term |
| Digest | Run Summary | workflow_run.summary | Final output/result of a completed run |

Important: Avoid bare workflow as the sole object name in APIs where we actually mean definition or run — that ambiguity compounds quickly. Use definition vs run explicitly in the data model and API names.

Dual-label in UI: e.g. "Workflow Definitions (Formulas)", "Run Details (Molecule)". Preserve Gastown aliases in docs/tooltips/API descriptions.


Architecture Decision: One WorkflowDO Per Town

[Addresses P0-1] — Revised from "one WorkflowDO per definition" to "one WorkflowDO per town."

A new WorkflowDO Durable Object namespace. One WorkflowDO instance per town, keyed by townId. All workflow definitions, runs, and step states for a town live in a single DO's SQLite database.

Routing:  getWorkflowDOStub(env, townId)  // trivial — same as TownDO lookup

Why one-per-town, not one-per-definition?

  • Event matching is local: When TownDO writes a durable notification (bead closed, convoy landed), WorkflowDO reads it on the next alarm tick via a local SQLite query over all definitions' trigger configs. No fan-out RPC to N definition-scoped DOs.
  • Trivial routing: tRPC handlers just need townId to find the right WorkflowDO. One-per-definition would require a lookup table or definition→town mapping on every request.
  • SQLite is more than enough: Even a busy town will have dozens of definitions and thousands of run records — well within DO SQLite capacity.
  • Single alarm loop: One alarm manages all cron evaluations, event matching, and step progression for the town. No per-definition alarm overhead.

Why not TownDO? The TownDO alarm is already doing too much (#1855 — 47+ ops/tick). Workflows have fundamentally different scheduling needs (cron-based, potentially long-running waits). A dedicated DO keeps concerns separated and lets workflows scale independently.


Cross-DO Communication: Durable Outbox Pattern

[Addresses P0-2] — Replaced lossy cross-DO RPC with a durable outbox, mirroring the town_events pattern.

Cross-DO RPC is fire-and-forget. If the call fails, or either DO is evicted between the event and the notification, the workflow step stays waiting forever. We do not use direct RPC for event delivery.

Instead, we use a durable outbox pattern — the same pattern used by town_events for the TownDO reconciler (see events.ts: insertEvent → drainEvents → markProcessed).

How it works

TownDO side — when a workflow-relevant event fires (bead closed, convoy landed, PR merged), TownDO writes a row into a workflow_notifications table in its own SQLite:

INSERT INTO workflow_notifications (
  notification_id, event_type, town_id, bead_id, convoy_id, payload, created_at, processed_at
) VALUES (?, ?, ?, ?, ?, ?, ?, NULL)

This is a synchronous SQL write inside the existing alarm tick — zero latency, zero failure risk. TownDO also re-arms the WorkflowDO alarm to ensure timely processing.

WorkflowDO side — on each alarm tick, the engine drains unprocessed notifications:

SELECT * FROM workflow_notifications WHERE processed_at IS NULL ORDER BY created_at ASC

For each notification, it checks all enabled definitions' trigger configs for a match. On match, it starts a run. After processing, it marks the notification as processed (markProcessed pattern).
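The matching step can be sketched as a pure function. The shapes below (TriggerConfig, Notification, matchNotification) are illustrative, not the final API:

```typescript
// Hypothetical sketch of the per-tick trigger matching. A trigger matches when
// its event_type equals the notification's and every filter key matches the payload.
type TriggerConfig = {
  event_type: string;                  // e.g. 'bead_closed'
  filter?: Record<string, string>;     // optional payload equality filter
};

type Notification = {
  event_type: string;
  payload: Record<string, string>;
};

type Definition = { definition_id: string; triggers: TriggerConfig[] };

// Returns the ids of definitions whose triggers match the notification.
function matchNotification(n: Notification, defs: Definition[]): string[] {
  return defs
    .filter(d =>
      d.triggers.some(t =>
        t.event_type === n.event_type &&
        Object.entries(t.filter ?? {}).every(([k, v]) => n.payload[k] === v),
      ),
    )
    .map(d => d.definition_id);
}
```

Because the scan is a local in-memory (or local SQLite) operation over the town's definitions, there is no fan-out RPC on the matching path.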

Sequence

TownDO                                    WorkflowDO
  │                                          │
  ├── bead.closed fires                      │
  ├── INSERT INTO workflow_notifications     │
  ├── workflowStub.ping() ─────────────────→│  (best-effort, just re-arms alarm)
  │                                          │
  │                                          ├── alarm tick
  │                                          ├── drain workflow_notifications
  │                                          ├── match event → definition triggers
  │                                          ├── start run(s)
  │                                          ├── mark notification processed
  │                                          └── re-arm alarm

The ping() RPC is purely an optimization to reduce latency — it re-arms the WorkflowDO alarm. If it fails, the WorkflowDO's own alarm (ticking every 60s when idle) will pick up the notification on the next cycle. No data is lost.

Notification table (in TownDO's SQLite)

CREATE TABLE IF NOT EXISTS workflow_notifications (
  notification_id TEXT PRIMARY KEY,
  event_type TEXT NOT NULL,            -- 'bead_closed', 'convoy_landed', 'pr_merged', etc.
  town_id TEXT NOT NULL,
  bead_id TEXT,
  convoy_id TEXT,
  payload TEXT NOT NULL DEFAULT '{}',  -- JSON: additional context
  created_at TEXT NOT NULL,
  processed_at TEXT                     -- NULL = unprocessed
);
CREATE INDEX idx_wf_notif_unprocessed ON workflow_notifications (processed_at) WHERE processed_at IS NULL;

Pruning: Processed notifications older than 7 days are cleaned up in TownDO's housekeeping phase (same pattern as pruneOldEvents).


Alarm Loop: Async Execution Model

[Addresses P0-3] — Node execution is async, not blocking. Mirrors TownDO's Phase 1 (SQL) → Phase 2 (side effects) separation.

Cloudflare DOs have a 30-second execution limit per alarm invocation. Node execution (Mayor Prompts, HTTP calls, script runs) must not run synchronously in the alarm tick.

The WorkflowDO alarm follows the same proven pattern as TownDO's reconciler:

Phase 0: Drain notifications

Drain unprocessed notifications written by TownDO (via the workflow_notifications drain described above). Match each against enabled definitions' triggers. For matches, create new workflow_run records and seed initial step states.

Phase 1: Reconcile — compute ready nodes, apply SQL mutations

Walk all active runs. For each run, evaluate the DAG:

  • Identify nodes whose dependencies are satisfied (all upstream nodes completed)
  • Transition ready nodes to dispatched status (SQL mutation)
  • Transition completed wait nodes based on received events
  • Collect async side effects as deferred functions (same as applyAction returning (() => Promise<void>) | null)

All SQL mutations happen synchronously in this phase. No I/O.

Phase 2: Execute side effects (async, best-effort)

// Identical to TownDO's Phase 2 pattern
if (sideEffects.length > 0) {
  const results = await Promise.allSettled(sideEffects.map(fn => fn()));
  for (const r of results) {
    if (r.status === 'fulfilled') metrics.sideEffectsSucceeded++;
    else metrics.sideEffectsFailed++;
  }
}

Side effects include:

  • townStub.slingBead(...) — create a bead on TownDO for action_sling_bead nodes
  • fetch(url, ...) — execute HTTP calls for action_http nodes
  • mayorStub.prompt(...) — invoke Mayor for action_mayor_prompt nodes

Phase 3: Housekeeping

Prune old completed runs, prune processed notifications, emit metrics.

Result delivery: self-notification pattern

When a side effect completes, it writes the result back as a workflow event (into WorkflowDO's own workflow_events table) and re-arms the alarm:

// Inside a side effect for an action_http node:
async () => {
  try {
    const response = await fetch(url, options);
    const body = await response.text();
    workflowEvents.insertEvent(sql, 'step_completed', {
      run_id: runId,
      step_run_id: stepRunId,
      payload: { status: response.status, body },
    });
  } catch (err) {
    workflowEvents.insertEvent(sql, 'step_failed', {
      run_id: runId,
      step_run_id: stepRunId,
      payload: { error: String(err) },
    });
  }
  // Re-arm alarm so next tick processes the result
  ctx.storage.setAlarm(Date.now() + 100);
}

The next alarm tick drains these events, updates step status, and continues DAG traversal. This means:

  • No blocking: The alarm tick returns quickly. Side effects run concurrently.
  • Durable: Results are written to SQLite. If the DO is evicted mid-execution, the next alarm retries any dispatched steps that never wrote a result (with exponential backoff).
  • Observable: Every state transition is recorded.

Step status lifecycle

pending → ready → dispatched → completed
                            → failed
                            → timed_out
         ready → skipped    (upstream failure with on_failure: 'skip')
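The lifecycle above can be encoded as an explicit transition table, which lets the engine reject illegal transitions. Names are illustrative; per the run-cancellation rules later in this spec, this sketch also allows pending/ready/dispatched → skipped:

```typescript
// Illustrative encoding of the step status lifecycle; not the final engine code.
type StepStatus =
  | 'pending' | 'ready' | 'dispatched'
  | 'completed' | 'failed' | 'timed_out' | 'skipped';

const TRANSITIONS: Record<StepStatus, StepStatus[]> = {
  pending: ['ready', 'skipped'],                           // skipped via cancellation
  ready: ['dispatched', 'skipped'],                        // skipped via upstream failure or cancellation
  dispatched: ['completed', 'failed', 'timed_out', 'skipped'], // skipped via cancellation
  completed: [],                                           // terminal
  failed: [],                                              // terminal
  timed_out: [],                                           // terminal
  skipped: [],                                             // terminal
};

function canTransition(from: StepStatus, to: StepStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```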

Failure Handling: on_failure Semantics

[Addresses P0-4] — Diamond pattern and configurable failure propagation.

Each edge in the graph can specify an on_failure policy (default: 'skip'):

type WorkflowEdge = {
  id: string;
  source: string;       // source node ID
  target: string;       // target node ID
  condition?: EdgeCondition | null;
  on_failure: 'skip' | 'continue' | 'fail_run';
};

Semantics:

| Policy | Behavior |
| --- | --- |
| skip (default) | If the source node failed, downstream nodes reachable only through this edge are marked skipped. Other paths to the same node are unaffected. |
| continue | Downstream nodes execute regardless. The failed step's output is null. |
| fail_run | The entire run is marked failed immediately. All pending/ready/dispatched steps are cancelled. |

Diamond resolution: In a diamond (A→B, A→C, B→D, C→D), if B completes and C fails:

  • Edge C→D has on_failure: 'skip' → D is skipped (both paths must succeed)
  • Edge C→D has on_failure: 'continue' → D executes with C's output as null
  • Edge C→D has on_failure: 'fail_run' → entire run fails

The engine evaluates all incoming edges to a node. A node is ready when:

  • All incoming edges with on_failure: 'skip' or 'continue' have their source completed or failed
  • No incoming edge with on_failure: 'fail_run' has a failed source
  • At least one incoming edge has a completed (not failed) source, OR all failed sources have on_failure: 'continue'
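These rules can be sketched as a per-node evaluation over incoming edges. Shapes are illustrative; this sketch treats a skipped source like a failed one for propagation purposes:

```typescript
// Hypothetical per-node readiness evaluation under the on_failure rules above.
type SourceState = 'completed' | 'failed' | 'skipped' | 'running';
type OnFailure = 'skip' | 'continue' | 'fail_run';
type IncomingEdge = { on_failure: OnFailure; sourceState: SourceState };

type Readiness = 'wait' | 'ready' | 'skipped' | 'fail_run';

function evaluateNode(incoming: IncomingEdge[]): Readiness {
  const bad = (s: SourceState) => s === 'failed' || s === 'skipped';

  // fail_run wins immediately, even if other sources are still running
  if (incoming.some(e => e.on_failure === 'fail_run' && bad(e.sourceState)))
    return 'fail_run';

  // wait until every upstream source reaches a terminal state
  if (incoming.some(e => e.sourceState === 'running')) return 'wait';

  // a failed source on a 'skip' edge skips the node (all such paths must succeed)
  if (incoming.some(e => e.on_failure === 'skip' && bad(e.sourceState)))
    return 'skipped';

  // remaining failures are all on 'continue' edges → run with null outputs
  return 'ready';
}
```

Running the diamond example through this sketch: with B completed and C failed, a skip edge from C yields skipped, a continue edge yields ready, and a fail_run edge fails the run.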

Wrangler Configuration

Add to wrangler.jsonc:

// In durable_objects.bindings:
{ "name": "WORKFLOW", "class_name": "WorkflowDO" }

// In migrations (new tag):
{ "tag": "v3", "new_sqlite_classes": ["WorkflowDO"] }

File: src/dos/Workflow.do.ts with sub-modules in src/dos/workflow/:

dos/
  Workflow.do.ts              # Class definition, RPC surface, alarm loop
  workflow/
    definitions.ts            # Definition CRUD, versioning
    runs.ts                   # Run lifecycle, step state management
    engine.ts                 # DAG traversal, node readiness, side effect dispatch
    scheduling.ts             # Cron evaluation, trigger matching
    nodes.ts                  # Node type implementations (side effect factories)
    events.ts                 # Workflow-scoped event recording (outbox pattern)

SQLite Schema (WorkflowDO)

workflow_definitions

CREATE TABLE IF NOT EXISTS workflow_definitions (
  definition_id TEXT PRIMARY KEY,
  town_id TEXT NOT NULL,
  name TEXT NOT NULL,
  description TEXT DEFAULT '',
  revision INTEGER NOT NULL DEFAULT 1,
  enabled INTEGER NOT NULL DEFAULT 0,           -- 0 = disabled, 1 = enabled
  graph TEXT NOT NULL DEFAULT '{}',              -- JSON: nodes (without positions), edges (with on_failure)
  layout TEXT NOT NULL DEFAULT '{}',             -- JSON: React Flow positions only (visual, non-execution)
  variables_schema TEXT NOT NULL DEFAULT '{}',   -- JSON Schema for run-time variables
  triggers TEXT NOT NULL DEFAULT '[]',           -- JSON array of trigger configs
  schedule TEXT,                                 -- Cron expression (null = no schedule)
  timezone TEXT NOT NULL DEFAULT 'UTC',          -- IANA timezone for cron evaluation (SINGLE source of truth)
  durability TEXT NOT NULL DEFAULT 'standard' CHECK(durability IN ('standard', 'ephemeral')),
  max_concurrent_runs INTEGER NOT NULL DEFAULT 5,
  last_run_id TEXT,
  last_run_status TEXT,
  last_run_at TEXT,                              -- [P1-5] Timestamp of last run start
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL
);
CREATE INDEX idx_wf_def_town ON workflow_definitions (town_id);
CREATE INDEX idx_wf_def_enabled ON workflow_definitions (enabled) WHERE enabled = 1;
CREATE INDEX idx_wf_def_schedule ON workflow_definitions (schedule) WHERE schedule IS NOT NULL;

[P1-5]: last_run_at column added.
[P1-7]: graph and layout are separate columns. graph contains execution-relevant data (nodes without positions, edges with on_failure config). layout contains React Flow position data. Layout-only changes don't bump the revision.
[P2-15]: timezone is defined at the definition level only (single source of truth). Removed from trigger config.

workflow_definition_revisions — deferred

[Addresses P1-6]: Revision history is deferred to Phase 3 (not MVP). For now, only the current graph/layout is stored. The revision field tracks the version number. When we add revision history, it will be:

-- DEFERRED: not in MVP or Phase 2
CREATE TABLE IF NOT EXISTS workflow_definition_revisions (
  revision_id TEXT PRIMARY KEY,
  definition_id TEXT NOT NULL,
  revision INTEGER NOT NULL,
  graph TEXT NOT NULL,
  layout TEXT NOT NULL,
  variables_schema TEXT NOT NULL,
  triggers TEXT NOT NULL,
  schedule TEXT,
  created_at TEXT NOT NULL,
  created_by TEXT,                              -- user/agent who made the change
  UNIQUE(definition_id, revision)
);

workflow_runs

CREATE TABLE IF NOT EXISTS workflow_runs (
  run_id TEXT PRIMARY KEY,
  definition_id TEXT NOT NULL,
  definition_revision INTEGER NOT NULL,         -- pinned at run creation
  graph_snapshot TEXT NOT NULL,                  -- full graph at time of run (immutable)
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK(status IN ('pending', 'running', 'completed', 'failed', 'cancelled', 'timed_out')),
  trigger_type TEXT NOT NULL,                   -- 'manual', 'schedule', 'event', 'webhook'
  trigger_data TEXT NOT NULL DEFAULT '{}',       -- JSON: trigger context
  variables TEXT NOT NULL DEFAULT '{}',          -- JSON: resolved runtime variables
  summary TEXT,                                  -- Final output/digest
  error TEXT,                                    -- Error message if failed
  started_at TEXT NOT NULL,
  completed_at TEXT
);
CREATE INDEX idx_wf_run_def ON workflow_runs (definition_id);
CREATE INDEX idx_wf_run_status ON workflow_runs (status) WHERE status IN ('pending', 'running');

workflow_step_runs

CREATE TABLE IF NOT EXISTS workflow_step_runs (
  step_run_id TEXT PRIMARY KEY,
  run_id TEXT NOT NULL,
  node_id TEXT NOT NULL,                        -- reference to graph node
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK(status IN ('pending', 'ready', 'dispatched', 'completed', 'failed', 'skipped', 'timed_out')),
  output TEXT,                                   -- JSON: step output
  error TEXT,
  bead_id TEXT,                                  -- if step created a bead
  dispatch_attempts INTEGER NOT NULL DEFAULT 0,
  last_dispatch_at TEXT,
  started_at TEXT,
  completed_at TEXT
);
CREATE INDEX idx_wf_step_run ON workflow_step_runs (run_id);
CREATE INDEX idx_wf_step_bead ON workflow_step_runs (bead_id) WHERE bead_id IS NOT NULL;
CREATE INDEX idx_wf_step_status ON workflow_step_runs (run_id, status);

workflow_events (WorkflowDO internal outbox)

CREATE TABLE IF NOT EXISTS workflow_events (
  event_id TEXT PRIMARY KEY,
  event_type TEXT NOT NULL,                     -- 'step_completed', 'step_failed', 'notification_received', etc.
  run_id TEXT,
  step_run_id TEXT,
  payload TEXT NOT NULL DEFAULT '{}',
  created_at TEXT NOT NULL,
  processed_at TEXT
);
CREATE INDEX idx_wf_evt_unprocessed ON workflow_events (processed_at) WHERE processed_at IS NULL;

Graph Definition Model

Separate graph from layout

[Addresses P1-7]

The graph column stores execution-relevant structure:

type WorkflowGraph = {
  nodes: WorkflowNode[];    // type, config — NO position data
  edges: WorkflowEdge[];    // source, target, condition, on_failure
};

The layout column stores visual-only data:

type WorkflowLayout = {
  positions: Record<string, { x: number; y: number }>;  // nodeId → position
  viewport?: { x: number; y: number; zoom: number };
};

Layout changes (dragging nodes, zooming) update layout without bumping revision. Graph changes (adding/removing nodes, changing edges, editing node config) bump revision.

Node types

Phase 1 (MVP): 3 node types

  • trigger_manual — manual run trigger
  • action_sling_bead — create a bead on TownDO
  • control_condition — if/else branching

Phase 2: +5 node types

  • trigger_schedule — cron-based trigger
  • trigger_event — event-based trigger (bead closed, convoy landed)
  • action_sling_convoy — create a convoy
  • control_wait_bead — wait for a bead to reach a status
  • control_delay — wait for a duration

Phase 3: +6 node types

  • trigger_webhook — HTTP webhook trigger
  • action_mayor_prompt — LLM inference via Mayor
  • action_http — arbitrary HTTP request
  • control_wait_convoy — wait for convoy to land
  • output_notify — send a notification
  • output_update_bead — update an existing bead

Phase 4: +1 node type

  • action_script — run a script in its own sandboxed container (see security note below)

Node Type Config Shapes

Each node type has a specific config schema validated at save time:

  • trigger_schedule: { cron: string } (timezone is definition-level)
  • trigger_event: { event_type: 'bead_closed' | 'convoy_landed' | 'pr_merged', filter?: Record<string, string> }
  • trigger_webhook: { secret?: string } (URL is auto-generated)
  • trigger_manual: { input_schema?: Record<string, VariableSpec> }
  • action_sling_bead: { rig_id: string, title: string, body?: string, priority?: string, labels?: string[] }
  • action_sling_convoy: { rig_id: string, title: string, beads: ConvoyBeadSpec[] }
  • action_mayor_prompt: { prompt_template: string } — supports {{step.<nodeId>.output}} interpolation
  • action_http: { url: string, method: string, headers?: Record<string,string>, body_template?: string, allowed_domains?: string[] }
  • action_script: { script: string, timeout_ms?: number } — runs in its own container
  • control_wait_bead: { bead_ref: string, target_status: 'closed' | 'failed', timeout_ms?: number }
  • control_wait_convoy: { convoy_ref: string, timeout_ms?: number }
  • control_condition: { expression: string } — evaluates to a boolean, branches via edge conditions
  • control_delay: { duration_ms: number }
  • output_notify: { channel: 'slack' | 'webhook', target: string, message_template: string }
  • output_update_bead: { bead_ref: string, fields: Record<string, unknown> }

Template interpolation

Use {{step.<nodeId>.output.<path>}} syntax for referencing outputs from upstream steps.

[Addresses P2-13]: Template interpolation in HTTP URL and body templates could exfiltrate data. Mitigation (deferred to implementation):

  • action_http node config includes an optional allowed_domains list. If set, the resolved URL must match.
  • Template values are sanitized (no control characters, length limits).
  • Full sandboxing of template interpolation is tracked as a follow-up item.
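A minimal sketch of the interpolation plus the sanitization rules above. The regex, helper name, and the 4 KB cap are assumptions for illustration:

```typescript
// Illustrative interpolation of {{step.<nodeId>.output.<path>}} references.
// Resolved values are stripped of control characters and length-capped,
// per the mitigation notes above.
function interpolate(
  template: string,
  outputs: Record<string, unknown>,   // nodeId → parsed output value
  maxLen = 4096,                      // assumed per-value length limit
): string {
  return template.replace(
    /\{\{step\.([\w-]+)\.output((?:\.[\w-]+)*)\}\}/g,
    (_, nodeId: string, path: string) => {
      let value: unknown = outputs[nodeId];
      for (const key of path.split('.').filter(Boolean)) {
        value = (value as Record<string, unknown> | undefined)?.[key];
      }
      const text = value == null ? '' : String(value);
      // strip control characters and enforce the length cap
      return text.replace(/[\x00-\x1f\x7f]/g, '').slice(0, maxLen);
    },
  );
}
```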

action_script security

[Addresses P2-14]: action_script nodes run in their own short-lived container invocation, NOT in an agent's container. They get a fresh, isolated environment with:

  • Read-only access to the repo
  • No access to agent state or other containers
  • Strict timeout (30s default, configurable up to 5min)
  • Resource limits (memory, CPU)

This node type is deferred to Phase 4. The container strategy will be finalized during implementation.


DAG Execution Engine

Readiness evaluation

On each alarm tick, for each active run:

  1. Query all step_run records for the run
  2. For each pending step, check if all incoming edges' source nodes are in a terminal state (completed, failed, skipped)
  3. Apply on_failure semantics per edge to determine if the step should be ready, skipped, or if the run should fail
  4. Transition ready steps to dispatched (SQL mutation)
  5. Return side effect functions for dispatched steps

Parallel execution

Steps with no dependency between them execute in parallel — they're all transitioned to dispatched in the same alarm tick, and their side effects run concurrently via Promise.allSettled.

Phase 1 (MVP): Linear execution only (no parallel branches). Simplifies the engine significantly.

Phase 2+: Full parallel branch support.

Cycle prevention

  • At save time: Topological sort validation. If the graph has a cycle, the save is rejected with a validation error.
  • At runtime: Max step count guard (1000 steps per run). If exceeded, run is marked timed_out.
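The save-time validation can be a standard Kahn's-algorithm topological sort; if the sort cannot reach every node, the graph has a cycle. Node and edge shapes are illustrative:

```typescript
// Save-time cycle detection via Kahn's algorithm (topological sort).
type Edge = { source: string; target: string };

function hasCycle(nodeIds: string[], edges: Edge[]): boolean {
  const indegree = new Map(nodeIds.map(id => [id, 0]));
  for (const e of edges) indegree.set(e.target, (indegree.get(e.target) ?? 0) + 1);

  // start from nodes with no incoming edges
  const queue = nodeIds.filter(id => indegree.get(id) === 0);
  let visited = 0;
  while (queue.length > 0) {
    const id = queue.shift()!;
    visited++;
    for (const e of edges) {
      if (e.source !== id) continue;
      const d = indegree.get(e.target)! - 1;
      indegree.set(e.target, d);
      if (d === 0) queue.push(e.target);
    }
  }
  // if the topological order didn't reach every node, a cycle exists
  return visited !== nodeIds.length;
}
```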

Run cancellation

When a run is cancelled:

  1. All pending, ready, and dispatched steps are marked skipped
  2. In-flight side effects (already dispatched) will complete but their results are ignored (step is already skipped)
  3. Run status → cancelled

Concurrent run limits

max_concurrent_runs (default: 5) on the definition. If a trigger fires while at the limit, the run is queued with pending status and starts when a slot opens.


Cron Scheduling

Evaluation

On each alarm tick, the engine queries all enabled definitions with a non-null schedule:

SELECT * FROM workflow_definitions
WHERE enabled = 1 AND schedule IS NOT NULL

For each, evaluate the cron expression against the definition's timezone. If the current time matches the cron and last_run_at is before the previous matching tick, start a new run.

Catch-up policy

[Addresses P2-11]: Skip missed ticks. On DO re-wake after eviction, fire at most one catch-up run: only if last_run_at predates the most recent matching tick. Do not backfill every missed tick. This prevents a DO waking up after 4 hours of eviction from firing 240 runs for a per-minute schedule.

function shouldFireCron(schedule: string, timezone: string, lastRunAt: string | null): boolean {
  const now = Date.now();
  const prevTick = getPreviousCronTick(schedule, timezone, now);
  const nextTick = getNextCronTick(schedule, timezone, now);

  // Only fire if we haven't already run for this tick window
  if (lastRunAt && new Date(lastRunAt).getTime() >= prevTick) return false;

  // Only fire if the next tick is in the future (we're not waking up to ancient backlog)
  return nextTick > now;
}

Permissions

[Addresses P1-9]

Follows the resolveTownOwnership pattern:

| Action | Who |
| --- | --- |
| Create / edit / delete definitions | Town owner, org admin |
| Enable / disable definitions | Town owner, org admin |
| Trigger manual runs | Town owner, org admin, org members |
| View definitions and runs | All org members |
| Webhook triggers | Authenticated via webhook secret (per-definition), rate-limited |

Webhook secrets are generated per-definition and stored in the triggers JSON config. The webhook endpoint validates the secret before creating a run.


Billing & Rate Limiting

[Addresses P1-10]

| Limit | Free | Pro | Enterprise |
| --- | --- | --- | --- |
| Workflow definitions per town | 3 | 20 | Unlimited |
| Runs per month | 50 | 1,000 | 10,000 |
| Steps per run | 10 | 50 | 200 |
| LLM tokens per mayor_prompt step | 4K | 16K | 64K |
| LLM token budget per workflow per month | 100K | 1M | 10M |
| Min cron interval | 1 hour | 5 min | 1 min |
| Webhook triggers per minute | 5 | 30 | 100 |
| Concurrent runs per definition | 1 | 5 | 20 |

Rate limiting is enforced at:

  1. Run creation: Check monthly run count before starting
  2. Step dispatch: Check step count per run before transitioning to dispatched
  3. Webhook endpoint: Token bucket rate limiter per definition
  4. Cron evaluation: Min interval check against tier

Tier info is read from the town's billing context (same pattern as existing feature gates).
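Enforcement point 3 could use a token bucket along these lines. The class name and rate parameters are illustrative; real capacities would come from the tier table above:

```typescript
// Minimal token-bucket sketch for per-definition webhook rate limiting.
// Tokens refill continuously; each accepted webhook consumes one token.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private capacity: number,      // e.g. 30 for Pro webhook triggers/min
    private refillPerMs: number,   // tokens added per millisecond
    now = Date.now(),
  ) {
    this.tokens = capacity;
    this.last = now;
  }

  // Returns true if a request may proceed, consuming one token.
  tryTake(now = Date.now()): boolean {
    const elapsed = now - this.last;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.last = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```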


Auto-save & Draft Model

[Addresses P2-12]

The visual editor uses a draft model:

  • Auto-save writes to a draft_graph and draft_layout column (or a separate workflow_drafts table). No revision bump, no validation. The editor reads from the draft on load.
  • "Save & Publish" validates the graph (cycle detection, required fields, node config validation), copies draft → graph/layout, bumps revision, and optionally updates the enabled state.
  • If there's no draft (draft columns are null), the editor loads from the published graph/layout.

This prevents half-edited graphs from executing while still giving users a seamless editing experience.


tRPC API

Definition Management

workflows.definitions.list    // { townId } → WorkflowDefinition[]
workflows.definitions.get     // { townId, definitionId } → WorkflowDefinition
workflows.definitions.create  // { townId, name, graph, layout?, triggers?, schedule? } → WorkflowDefinition
workflows.definitions.update  // { townId, definitionId, ...partial } → WorkflowDefinition
workflows.definitions.delete  // { townId, definitionId } → void
workflows.definitions.setEnabled  // { townId, definitionId, enabled } → void
workflows.definitions.validate    // { townId, graph } → ValidationResult
workflows.definitions.saveDraft   // { townId, definitionId, graph, layout } → void
workflows.definitions.publishDraft // { townId, definitionId } → WorkflowDefinition (validates + bumps revision)

Run Management

workflows.runs.list     // { townId, definitionId?, status? } → WorkflowRun[]
workflows.runs.get      // { townId, runId } → WorkflowRun & { steps: StepRun[] }
workflows.runs.trigger  // { townId, definitionId, variables? } → WorkflowRun
workflows.runs.cancel   // { townId, runId } → void

Webhook HTTP endpoint

POST /api/webhooks/workflow/:definitionId
Headers: X-Webhook-Secret: <secret>
Body: JSON payload (passed as trigger_data)
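Validating the secret should avoid an early-exit string comparison. A constant-time sketch (helper name is illustrative; the length check still leaks length, which is acceptable for fixed-length random secrets):

```typescript
// Constant-time comparison of the provided webhook secret against the
// stored per-definition secret: XOR-accumulate so the loop does the same
// work regardless of where the strings differ.
function secretsEqual(provided: string, expected: string): boolean {
  if (provided.length !== expected.length) return false;
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= provided.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}
```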

UI: Phased Approach

Phase 1 (MVP): Form-based editor

4 routes:

  • /towns/:townId/workflows — list of workflow definitions with status badges
  • /towns/:townId/workflows/new — create form (name, description, linear step list)
  • /towns/:townId/workflows/:defId/edit — edit form (same as create, plus enable/disable toggle)
  • /towns/:townId/workflows/:defId/runs — run history list with status, duration, trigger type

The MVP editor is a form-based linear step editor, not React Flow. Users add steps in sequence. Each step has a type selector and config form. This is sufficient for linear workflows and dramatically reduces implementation complexity.

Phase 3: React Flow Visual Editor

Upgrade to a full visual DAG editor:

  • Three-panel layout: node palette (left), canvas (center), inspector (right)
  • Custom node components with status badges
  • Drag-and-drop from palette
  • Edge drawing with on_failure config in the inspector
  • Auto-save → draft, explicit publish
  • Run visualization overlay (step statuses on the graph)

Event Triggers / TownDO Integration

When Gastown events occur, TownDO writes them to the workflow_notifications outbox table (see Cross-DO Communication above). WorkflowDO drains these on each alarm tick and matches against trigger configs.

Event Types Available as Triggers

Using the existing TownEventType enum plus new workflow-specific events:

| Event | Payload | When |
| --- | --- | --- |
| bead_closed | { bead_id, bead_type, rig_id, title } | Any bead reaches closed status |
| bead_failed | { bead_id, bead_type, rig_id, title, error } | Any bead reaches failed status |
| convoy_landed | { convoy_id, title, bead_count } | All beads in a convoy complete |
| convoy_started | { convoy_id, title } | Convoy begins execution |
| pr_merged | { bead_id, rig_id, pr_url } | PR associated with a bead is merged |
| pr_created | { bead_id, rig_id, pr_url } | PR is created for a bead |

Agent Ergonomics / Compatibility Bridge

Preserve gt_mol_current / gt_mol_advance semantics, backed by the new run model:

  • gt_mol_current → query WorkflowDO for the current ready step(s) in the active run
  • gt_mol_advance → mark a step as completed and advance the DAG

These are compatibility shims maintained as plugin tools, NOT structural constraints on the data model. The new API is the primary interface; these exist for agents that already use the molecule protocol.


Implementation Plan

Phase 0: Legacy Cleanup (1 week)

Goal: Mark current molecule internals as disposable legacy.

  • Audit all references to rig_molecules table, createMolecule, advanceMoleculeStep, MoleculeStepper.tsx, FormulaLibrary
  • Classify each as: keep (conceptually), shim (temporarily), or delete
  • Fix the known formula shape mismatch bug: createMolecule() treats formula as array but API sends { steps: [...] }
  • Remove or mark stale schema/docs as superseded
  • Document the canonical mapping (formula→definition, molecule→run, step→step_run) in AGENTS.md or a spec doc

Phase 1: MVP (2 weeks)

[Addresses P1-8]: Scoped to a realistic 2-week deliverable.

Goal: End-to-end workflow execution with manual trigger, linear steps, form-based editor.

Backend:

  • WorkflowDO skeleton (one per town, keyed by townId)
  • workflow_definitions table + CRUD
  • workflow_runs + workflow_step_runs tables
  • workflow_events table (internal outbox)
  • Alarm loop with Phase 0/1/2 pattern
  • Linear DAG engine (no parallel branches)
  • Manual trigger only
  • 3 node types: trigger_manual, action_sling_bead, control_condition
  • tRPC API: definition CRUD + run trigger/list/get/cancel
  • workflow_notifications table in TownDO + write path for bead_closed events (wiring only — event triggers activate in Phase 2)

Frontend:

  • Workflow list page with create button
  • Form-based linear editor (add/remove/reorder steps)
  • Run list with status badges
  • Run detail view (step list with status, output, errors)

Not in Phase 1: Cron, event triggers, webhooks, parallel branches, React Flow, wait nodes, Mayor/HTTP/script nodes, revision history, draft model.

Phase 2: Event Triggers & Scheduling (2-3 weeks)

  • Event trigger support (bead_closed, convoy_landed, pr_merged)
  • Cron scheduling with timezone support + catch-up policy (skip missed ticks)
  • trigger_schedule and trigger_event node types
  • control_wait_bead, control_delay node types
  • action_sling_convoy node type
  • Parallel branch execution in the DAG engine
  • Concurrent run limiting
  • Billing/rate limiting enforcement
  • Permissions enforcement

Phase 3: Visual Editor & Advanced Nodes (3-4 weeks)

  • React Flow visual editor (three-panel layout)
  • Auto-save draft model + publish flow
  • trigger_webhook + webhook HTTP endpoint
  • action_mayor_prompt node type
  • action_http node type (with allowed_domains)
  • output_notify, output_update_bead node types
  • Revision history (workflow_definition_revisions table)
  • Run visualization overlay on the graph

Phase 4: Polish & Templates (1-2 weeks)

  • action_script node type (own container, sandboxed)
  • Workflow templates (pre-built definitions for common patterns)
  • Import/export definitions as JSON
  • Run summary/digest generation
  • Metrics dashboard (runs/day, success rate, avg duration)
  • gt_mol_current / gt_mol_advance compatibility shims

Total realistic estimate: 9-12 weeks (including Phase 0)

[Addresses P1-8]: Previous estimate of 5-6 weeks was too aggressive. The reconciler (simpler scope) took ~3 weeks. This system adds a new DO + DAG engine + editor + cron + cross-DO communication + 15 node types + tRPC API + run visualization.

Future scope (not in this issue)

  • Wisps (ephemeral runs that don't persist)
  • Sub-workflow composition (a node that triggers another workflow)
  • Approval gates (human-in-the-loop nodes)
  • Workflow marketplace / sharing



Critique Findings Tracker

All 15 findings from the Architecture Review comment have been addressed:

| # | Priority | Finding | Resolution | Section |
| --- | --- | --- | --- | --- |
| 1 | P0 | Instance model: one WorkflowDO per town, not per definition | ✅ Rewritten — one WorkflowDO per town, keyed by townId | Architecture Decision |
| 2 | P0 | Event delivery: durable outbox, not lossy RPC | ✅ Rewritten — workflow_notifications outbox in TownDO, drain pattern | Cross-DO Communication |
| 3 | P0 | Alarm loop: async execution, no blocking I/O | ✅ Rewritten — Phase 0/1/2 pattern, side effects via Promise.allSettled, self-notification | Alarm Loop |
| 4 | P0 | Diamond failure: on_failure semantics | ✅ Added — skip, continue, fail_run per edge | Failure Handling |
| 5 | P1 | Schema missing last_run_at | ✅ Added to workflow_definitions schema | SQLite Schema |
| 6 | P1 | No revision history storage | ✅ Acknowledged — deferred to Phase 3, schema sketched | SQLite Schema |
| 7 | P1 | Layout leaks into execution graph | ✅ Separated — graph and layout are distinct columns | Graph Definition Model |
| 8 | P1 | Effort estimates too aggressive | ✅ Revised — 9-12 weeks total, 2-week MVP scoped | Implementation Plan |
| 9 | P1 | Permissions model missing | ✅ Added — follows resolveTownOwnership pattern | Permissions |
| 10 | P1 | Billing/rate limiting missing | ✅ Added — per-tier limits table, enforcement points | Billing & Rate Limiting |
| 11 | P2 | Cron catch-up policy | ✅ Explicit "skip missed ticks" with pseudocode | Cron Scheduling |
| 12 | P2 | Auto-save + validation conflict | ✅ Draft model — auto-save writes draft, publish validates | Auto-save & Draft Model |
| 13 | P2 | Template interpolation security | ✅ Acknowledged — allowed_domains, deferred to impl | Graph Definition Model |
| 14 | P2 | action_script container strategy | ✅ Stated — own container, deferred to Phase 4 | Graph Definition Model |
| 15 | P2 | Timezone in two places | ✅ Fixed — definition-level only, removed from triggers | SQLite Schema |
