fix(loader): drain broker stdout on spawn (was wired to 'pause' which never fires)#98
Conversation
… never fires) The NODE_OPTIONS register script monkeypatches child_process.spawn so that agent-relay-broker children have their stdout drained — preventing the broker from blocking on write() once the OS pipe buffer fills. The previous wiring attached a 'data' listener inside a callback that only ran when child.stdout emitted 'pause'. Node Readable streams never emit 'pause' on internal buffer fill — that event only fires when something explicitly calls .pause() (which nothing in this code path ever does), so the stream stayed in paused mode, libuv stopped draining the kernel pipe at high-water mark, and a chatty broker would block in write() once ~64KB of stdout queued up. Symptom: overnight proactive-runtime runs (Ricky-driven, M1 fans out to 9 PTY workers) froze within seconds of fanout with every worker log stuck at the same mtime, broker process parked in write() or _pthread_cond_wait, M1's step.run awaiting a never-arriving drain signal. Reproduced twice (~14h apart) with diagnostic bundles capturing the same shape. Changes - Attach `data` listener and call `resume()` synchronously at spawn time for both `init` and `pty` broker invocations. This matches what SDK 6.0.15's `drainBrokerStdoutAfterStartup` does for direct SDK consumers. - Expand the argv guard from `argv[0]==="init"` to also include `"pty"`, so per-worker PTY brokers (M1's lead + impl-* fanout) are protected, not just the channel-multiplexer init broker. - Update the explanatory comment block above `registerSource` to capture the new semantics and the prior bug. Verification - npm run typecheck — clean - npm test — 1077 / 1077 pass, including the existing "drains broker stdout after SDK startup so event floods cannot wedge the workflow node" regression at entrypoint.test.ts:3122. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe SDK runtime loader's child-process stdout draining workaround is updated to fix broker stdout blockage. Instead of relying on pause-event hooks, the injected spawn patch now immediately attaches a no-op ChangesStdout Draining Workaround
Possibly Related PRs
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Ricky Eval ReviewRun: Passed: 8 | Needs human: 42 | Reviewable: 42 | Missing output: 0 | Failed: 1 | Skipped: 0 Blocking CasesFAIL
|
| Gate | What it checks | Common failure cause |
|---|---|---|
| SchemaGate | Input/Output JSON schema compliance | Missing required field, type mismatch |
| IdempotenceGate | Deterministic output for identical inputs | Non‑deterministic timestamps, random IDs |
| ResourceGate | External resource a | |
| ...[truncated] |
**Must**
- Preserve one-attempt behavior when auto-fix is disabled.
- Return the classified blocker, diagnosis, recovery steps, and non-zero exit code.
- Make clear that the user chose manual inspection over repair/resume automation.
**Must Not**
- Start a repair loop despite `--no-auto-fix`.
- Suppress the diagnosis because no repair was attempted.
- Present the failure as a completed repair attempt.
**Deterministic checks**
- PASS `status`: case executed
- PASS `maxToolCalls`: expected <= 0 tool calls, got 0
</details>
<details>
<summary><strong>REVIEW</strong> <code>runtime-recovery.in-process-local-runner</code> (runtime-recovery/manual)</summary>
**User message**
> Explain how Ricky should execute a local TypeScript workflow artifact in the primary local path.
**Ricky output**
```text
**Ricky’s Standard Procedure for Running a Local TypeScript Workflow Artifact**
| Phase | Action | Rationale / Artifact |
|-------|--------|----------------------|
| 1️⃣ | **Locate the primary workflow directory** | The workflow root is always `./workflows/` (repo‑wide convention). The primary local path for a new artifact is `./workflows/<workflow‑name>/`. |
| 2️⃣ | **Validate the contract file** | Every workflow must contain a `workflow.contract.json` that declares: <br>• `name` <br>• `entry` (relative path to the TS file) <br>• `inputs` schema <br>• `outputs` schema <br>• `verificationGate` (optional). <br>Ricky aborts if the contract is missing or malformed and logs a **BLOCKER**. |
| 3️⃣ | **Install dependencies** | ```bash\ncd ./workflows/<workflow‑name>\nnpm ci # deterministic install, lockfile required\n``` <br>Ensures reproducible environment; any missing `package.json` or lockfile is reported as a **BLOCKER**. |
| 4️⃣ | **Compile‑time check** | Run TypeScript’s type‑checker without emitting code: <br>```bash\ntsc --noEmit\n``` <br>If `tsc` returns non‑zero, Ricky creates a **Verification Gate Failure** report (includes `tsc` diagnostics) and stops execution. |
...[truncated]
Must
- Prefer the Node strip-types route or equivalent SDK/programmatic route over requiring the
agent-relaybinary on PATH. - Precheck that Node and
@agent-relay/sdkare resolvable for the workflow. - Record the actual spawn command in execution evidence.
Must Not
- Fail solely because
agent-relayis not on PATH when the SDK route is available. - Hide the actual runtime command from evidence.
- Conflate the user-facing reproduction command with the primary internal spawn route.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.escalation-is-not-generic-failure (runtime-recovery/manual)
User message
Ricky reaches a boundary after a structural failure persists after a fix attempt.
Ricky output
**Ricky – Boundary‑Escalation Workflow for Persistent Structural Failures**
| **Phase** | **Goal** | **Inputs** | **Outputs / Artifacts** | **Verification Gate** |
|-----------|----------|------------|--------------------------|-----------------------|
| **1️⃣ Detect & Log** | Capture the failure that survived the first fix. | - Failure ticket (e.g., `FAIL‑1234`) <br> - Original fix PR (`fix/struct‑leak‑v2`) <br> - Test run logs, stack traces, metrics | - `failure_report.md` (structured JSON/YAML) <br> - Updated ticket with `status: “persistent”` | **Gate A** – Automated parser confirms required fields (`failure_id`, `original_pr`, `symptoms`, `timestamp`). |
| **2️⃣ Re‑run Verification Suite** | Ensure the failure is reproducible and not a fluke. | - `failure_report.md` <br> - Full CI pipeline for the affected component | - `repro_run.log` <br> - `repro_result.json` (`pass: false`) | **Gate B** – CI must report a deterministic failure on ≥ 2 clean agents. |
| **3️⃣ Scoped Branch Creation** | Isolate further investigation without contaminating the main line. | - `main` (or current release branch) <br> - `failure_report.md` | - New branch `boundary/FAIL‑1234‑investigate` <
...[truncated]
Must
- Escalate with the attempted fix, failed validation, classified blocker, and recommendation.
- Distinguish escalation from a generic product failure.
- Preserve enough context for a human operator to continue.
Must Not
- Retry speculative fixes indefinitely.
- Collapse to "something went wrong" without the attempted actions.
- Discard evidence from failed repair attempts.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.analytics-from-structured-evidence (runtime-recovery/manual)
User message
Produce a workflow health digest from many Ricky workflow runs.
Ricky output
**Ricky Workflow Health Digest – Summary of Recent Runs**
*(Generated per the “Workflow Health Digest” contract in AGENTS.md – deterministic, scoped, and ready for PR review)*
| Run ID | Workflow | Branch | Trigger | Start ⏱ | End ⏱ | Duration | Status | Exit‑Code | Key Metrics | Notable Logs / Alerts |
|--------|----------|--------|---------|----------|-------|----------|--------|-----------|-------------|------------------------|
| `wf‑20240528‑001` | **Deploy‑Prod** | `main` | `push` (v2.3.1) | 2024‑05‑28 02:13 UTC | 2024‑05‑28 02:27 UTC | 14 min | ✅ **Success** | 0 | CPU ≈ 68 % (peak) <br>Memory ≈ 1.2 GB | No alerts |
| `wf‑20240527‑014` | **Data‑Ingest** | `feature/ingest‑v5` | `schedule` (nightly) | 2024‑05‑27 00:00 UTC | 2024‑05‑27 00:42 UTC | 42 min | ⚠️ **Partial‑Fail** | 2 | Rows ≈ 4.2 M (processed) <br>Failed ≈ 1.3 % | “Row‑validation‑error” at step 3 (see Log‑A) |
| `wf‑20240526‑009` | **Deploy‑Prod** | `release/2024‑05‑26` | `manual` (approval) | 2024‑05‑26 14:05 UTC | 2024‑05‑26 14:23 UTC | 18 min | ❌ **Failed** | 1 | No deployment artifacts | “Docker‑pull‑timeout” in step 2 |
| `wf‑20240525‑022` | **Smoke‑Test** | `main` | `push` (v2.3.0) | 2024‑05‑25 09:12 UTC
...[truncated]
Must
- Consume normalized
WorkflowRunEvidencerather than raw logs as the primary input. - Identify recurring failure classes, weak validation, oversized steps, and runtime duration patterns.
- Produce concrete recommendations tied to specific workflows, steps, or metrics.
Must Not
- Mutate evidence while analyzing it.
- Return generic advice like "improve your workflow" without references.
- Mix environment failures and workflow-logic failures into one undifferentiated bucket.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW surfaces-ingress.co-equal-surfaces (surfaces-ingress/manual)
User message
Design a new Ricky Slack surface for workflow debugging.
Ricky output
## Ricky Slack Surface – “Workflow Debugger”
### 1. Overview
The **Ricky Slack Debugger** is a dedicated Slack app surface that lets engineers and product owners inspect, step‑through, and repair Ricky‑managed workflows in real time. It follows the **Ricky repository conventions** (AGENTS.md) and the **workflow standards** (deterministic contracts, verification gates, review artifacts, 80‑100 validation loops, blocker reporting, scoped PR boundaries).
| Component | Purpose |
|-----------|---------|
| **Debug Panel** (modal) | Visual tree of the active workflow, node status, input/output payloads. |
| **Step Controls** (buttons) | *Pause*, *Resume*, *Step‑Into*, *Step‑Over*, *Jump‑To* a specific node. |
| **Live Log Stream** (section) | Timestamped, filtered logs with severity tags. |
| **Patch Submit** (form) | Inline edit of node configuration + deterministic contract diff. |
| **Verification Gate** (modal) | Auto‑run pre‑commit checks before a patch is merged. |
| **Review Artifact** (thread) | Auto‑generated summary of changes, test results, and blocker report. |
| **Loop Counter** (badge) | Shows current validation‑loop iteration (target 80‑100). |
---
### 2. Interaction Fl
...[truncated]
Must
- Treat Slack as a co-equal product surface, not a wrapper around CLI.
- Route domain work through the same normalization, executor, evidence, and specialist stages.
- Keep Slack-specific formatting, thread handling, and interactive components in the surface layer.
Must Not
- Put workflow generation or diagnosis domain logic directly in the Slack handler.
- Degrade Slack to a developer shortcut with weaker routing than CLI.
- Skip signature verification, dedup, or thread handling when the surface is implemented.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW surfaces-ingress.normalizer-is-only-translation-boundary (surfaces-ingress/manual)
User message
Add a new web handoff type that submits a workflow spec and mode.
Ricky output
**New Web‑Handoff Type – `workflowSubmit`**
| Element | Definition |
|---------|------------|
| **Name** | `workflowSubmit` |
| **Purpose** | Allows a web client (e.g., UI, external service) to hand off a complete **Workflow Specification** together with an execution **mode** (e.g., `dryRun`, `run`, `preview`). The receiving AgentWorkforce service validates, stores, and optionally triggers the workflow. |
| **Direction** | **Client → AgentWorkforce** (POST) |
| **Content‑Type** | `application/json` |
| **Schema** | See **JSON‑Schema** below. |
| **Response** | Deterministic status object with `requestId`, `validationResult`, and optional `triggerId`. |
| **Idempotency** | `X‑Idempotency‑Key` header required for safe retries. |
| **Auth** | Bearer token (OAuth2) with `workflow:submit` scope. |
| **Rate‑limit** | 30 req/min per token. |
---
## 1. JSON‑Schema for the Request Payload
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "WorkflowSubmitPayload",
"type": "object",
"required": ["workflowSpec", "mode"],
"properties": {
"workflowSpec": {
"type": "object",
"description": "Full workflow definition adhering to the internal Workf
...[truncated]
Must
- Add a hand
Comment truncated to stay within GitHub limits. Download the ricky-eval-run artifact for the full human-review.md.
Summary
The loader's
NODE_OPTIONSregister script monkeypatcheschild_process.spawnso thatagent-relay-brokerchildren have their stdout drained — preventing the broker from blocking inwrite()once the OS pipe buffer fills. The previous wiring was a no-op: it hooked the stream's'pause'event, which Node'sReadablestreams never emit on internal buffer fill —'pause'only fires when something explicitly calls.pause(), which nothing in this code path does. The stream stayed in paused mode, libuv stopped draining the kernel pipe at the high-water mark, and a chatty broker would block inwrite()once ~64KB queued up.Symptom
Overnight
proactive-runtime-m1runs (Ricky-driven, M1 fans out to 9 PTY workers) froze within seconds of fanout with every worker log stuck at the same mtime, broker process parked inwrite()or_pthread_cond_wait, M1'sstep.runawaiting a never-arriving drain signal. Reproduced twice ~14 hours apart on independent runs, with diagnostic bundles capturing the same freeze-at-fanout shape both times.Root cause (decoded from the running NODE_OPTIONS)
vs. what SDK 6.0.15's
drainBrokerStdoutAfterStartup(relay#838) does for direct SDK consumers:What changed (`src/local/entrypoint.ts`)
datalistener andresume()synchronously at spawn time. No more waiting on'pause'. Stream goes into flowing mode at birth and stays there for the lifetime of the broker.argv[0]===\"init\"to also include\"pty\". M1 spawns oneinitbroker (channel multiplexer) plus oneptybroker per worker (1 lead + 9 impl-* workers). All of them write events back through stdout. Covering onlyinitleft the PTY brokers vulnerable.registerSourceto capture the new semantics and the prior bug.Verification
npm run typecheck— cleannpm test— 1077 / 1077 pass, including the existing\"drains broker stdout after SDK startup so event floods cannot wedge the workflow node\"regression atentrypoint.test.ts:3122(which previously passed only because the bundled SDK 6.0.15 has its own drain — this fix makes the loader-level protection actually do what it advertises for consumers still on older SDKs).Related
@agent-relay/sdk@6.0.15'pause'wiring)^6.0.15(necessary but not sufficient when consumers have their own local SDK pin)