fix(master-renderer): scope final-hard-validation tests to spec workspaces #91

Merged
khaliqgant merged 2 commits into main from fix/master-renderer-scoped-test-command on May 10, 2026

Conversation


@kjgbot kjgbot commented May 10, 2026

Summary

Master-rendered final-hard-validation hardcoded npm test, which walks the entire repo's test suite from cwd. For monorepo specs that scope work to a few packages/<pkg>/ files, any pre-existing or transient failure in an unrelated workspace package then blocks the workflow's final gate — work no agent in the generated workflow can sensibly repair because it isn't in the spec's declared scope.

Repro

The user's proactive-pr-remediation spec declares targets in packages/backend/ and packages/shared/ only. After PRs #84 / #86 / #90 unblocked the artifact-write, module-load, and typecheck-command paths, the workflow ran end-to-end and reached final-hard-validation. That step ran npm test from the MSD root, which expanded (as npm test --workspaces would) to the test suites of every package. A flaky e2e test in tests/msd-assistant/e2e/proactive-followup.e2e.test.ts (passes 12/12 in isolation, including alongside the rest of tests/msd-assistant/e2e/) failed once during the workflow run, blocking final-hard-validation. Auto-fix then attempted a "repair" 7×, and every attempt failed identically because the failing test wasn't in any child workflow's scope.

Fix

Compute the test command via the existing deriveTestCommand(spec) helper and use it in all three master-render sites:

  • master-workflow-renderer.ts:287 — final-hard-validation step body
  • master-workflow-renderer.ts:489 — final-hard-validation gate command
  • master-workflow-renderer.ts:491 — regression-gate

deriveTestCommand already handled spec-driven scoping for explicit acceptance gates and test-file targets. This PR adds workspace-aware scoping as a new fallback for source-file targets that share packages/<pkg>/, apps/<pkg>/, or services/<pkg>/ prefixes — the three monorepo conventions npm/pnpm/yarn workspaces overwhelmingly use. When such prefixes are detected, emit npm test --workspace=<pkg> chained with && for each unique workspace. When no prefix matches, fall back to the previous unscoped npx vitest run behavior.
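
The workspace fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in template-renderer.ts: `SpecSketch`, `WORKSPACE_PREFIX`, and `sketchWorkspaceTestCommand` are hypothetical names, only the `targetFiles` field is assumed from the spec, and the earlier resolution steps (explicit acceptance gates, test-file targets) are left out. The pattern is the scope-aware regex quoted in this PR's second commit message.

```typescript
// Hypothetical stand-in for the workflow spec; only `targetFiles` is assumed.
interface SpecSketch {
  targetFiles: string[];
}

// Matches packages/<pkg>/, apps/<pkg>/, or services/<pkg>/, with an optional
// npm scope segment such as packages/@scope/pkg/.
const WORKSPACE_PREFIX = /^((?:packages|apps|services)\/(?:@[^\/]+\/)?[^\/]+)\//;

// Sketch of the workspace fallback only (not the full deriveTestCommand).
function sketchWorkspaceTestCommand(spec: SpecSketch): string {
  const workspaces = new Set<string>();
  for (const file of spec.targetFiles) {
    const workspace = WORKSPACE_PREFIX.exec(file)?.[1];
    if (workspace) workspaces.add(workspace); // Set deduplicates repeated roots
  }
  // No recognised workspace prefix: keep the previous unscoped behaviour.
  if (workspaces.size === 0) return "npx vitest run";
  // One scoped invocation per unique workspace, chained so any failure stops the gate.
  return Array.from(workspaces)
    .map((workspace) => `npm test --workspace='${workspace}'`)
    .join(" && ");
}
```

Under these assumptions, target files in packages/backend/ and packages/shared/ yield two chained `npm test --workspace='…'` invocations, and a spec with only root-level files falls back to `npx vitest run`.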

deriveTestCommand is now exported from template-renderer so the master renderer can share the same logic instead of forking another implementation. Same approach as PR #90's TYPECHECK_COMMAND extraction.

Test plan

  • npm run typecheck — clean
  • npm test — 1060 / 1060 pass (was 1059)
  • New regression case in pipeline.test.ts: scopes master-rendered final-hard-validation tests to workspaces touched by the spec. Asserts the rendered workflow:
    • Contains npm test --workspace='packages/backend' and npm test --workspace='packages/shared' (the spec's actual scope)
    • Does NOT contain --workspace='packages/webapp' or other unrelated workspaces
    • Never emits the bare npm test (whole-suite) pattern that produced the bug
  • Existing 46 pipeline tests still pass (the previous tests use expect.stringContaining or hand-craft input fixtures, so they survive the scoping logic unchanged).
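
As a rough illustration of the three assertions in the new regression case, the checks can be expressed against a plain command string. `checkScopedCommand` is a hypothetical helper written for this sketch; the actual test in pipeline.test.ts uses Vitest and its own fixtures.

```typescript
// Hypothetical checker mirroring the three assertions of the regression case.
// Returns a list of problems; an empty list means the command is correctly scoped.
function checkScopedCommand(
  command: string,
  expectedWorkspaces: string[],
  unrelatedWorkspaces: string[],
): string[] {
  const problems: string[] = [];
  for (const workspace of expectedWorkspaces) {
    if (!command.includes(`npm test --workspace='${workspace}'`)) {
      problems.push(`missing ${workspace}`);
    }
  }
  for (const workspace of unrelatedWorkspaces) {
    if (command.includes(`--workspace='${workspace}'`)) {
      problems.push(`unexpected ${workspace}`);
    }
  }
  // A bare `npm test` is one that is not followed by a --workspace flag.
  if (/npm test(?!\s+--workspace=)/.test(command)) {
    problems.push("bare npm test");
  }
  return problems;
}
```

Checking the command string rather than the rendered file keeps the assertion independent of how the workflow source is formatted.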

Out of scope (worth follow-ups)

  • repairAgent on master-level onError doesn't appear to invoke an actual repair agent for deterministic-step failures during the user's run — the retries were dumb reruns of the same command, no agent invocation in logs. Whether that's an agent-relay SDK gap or a missing agent role definition (master-lead's declared role is "Plans child workflow boundaries", not "fixes failing tests") needs separate investigation.
  • Auto-fix burns 7/7 attempts on user-actionable errors (MISSING_ENV_VAR, real test failures, scope mismatches) instead of classifying them and escalating immediately. Worth a tracking issue. The fix here makes scope mismatches less likely but doesn't address the broader "loop on unfixable thing" pattern.
  • Master-level final-hard-validation is a deterministic step, not an agent fix-loop. When validation fails for a fixable reason, the master has no agent empowered to repair (only its child workflows do). Restructuring final-hard-validation as an agent step with its own fix-loop would be a meaningful design change beyond this PR.
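
For the second follow-up, one possible shape of "classify, then escalate" is sketched below. Everything here is hypothetical and not part of this PR or of the auto-fix loop as it exists: `FailureClass`, `USER_ACTIONABLE_MARKERS`, `classifyFailure`, and `retryBudget` are invented for illustration, and MISSING_ENV_VAR is the only marker taken from the text above.

```typescript
// Hypothetical classification; real test failures and scope mismatches would
// need richer signals than a substring match.
type FailureClass = "user-actionable" | "retryable";

// MISSING_ENV_VAR is the only code named in this PR; the list is an assumption.
const USER_ACTIONABLE_MARKERS: string[] = ["MISSING_ENV_VAR"];

function classifyFailure(output: string): FailureClass {
  const actionable = USER_ACTIONABLE_MARKERS.some((marker) => output.includes(marker));
  return actionable ? "user-actionable" : "retryable";
}

// Returns how many auto-fix attempts to spend: zero means escalate immediately.
function retryBudget(output: string, maxAttempts: number): number {
  return classifyFailure(output) === "user-actionable" ? 0 : maxAttempts;
}
```

With a budget of zero for user-actionable failures, the loop would surface the error to the user instead of rerunning the same command seven times.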

Relationship to other PRs

Fourth in a stack of independent fixes that together unblock ricky --mode local --spec-file <md> --run against monorepo specs:

  1. fix(generation): fall back to deterministic renderer when persona writer throws #84 — persona-writer fallback (artifact actually gets written)
  2. fix(auto-fix): detect rickyWorkflow* alias imports via TypeScript AST #86 — env-loader alias detection via TS AST (artifact actually loads under Node)
  3. fix(master-renderer): use workspace-aware typecheck command #90 — workspace-aware typecheck (validation step's typecheck succeeds in monorepos)
  4. this PR — workspace-aware test scoping (validation step's tests don't fail on unrelated packages)

Each is independently shippable.

🤖 Generated with Claude Code

…paces

Master-rendered final-hard-validation hardcoded `npm test`, which walks
the entire repo's test suite from cwd. For monorepo specs that scope
work to a few `packages/<pkg>/` files, any pre-existing or transient
failure in an *unrelated* workspace package then blocks the workflow's
final gate — work no agent in the generated workflow can sensibly
repair because it isn't in the spec's declared scope.

Concrete trigger: the proactive-pr-remediation spec's targets are all
under `packages/backend/` and `packages/shared/`, but `npm test` from
the MSD root also ran `packages/webapp` / `packages/mobile` /
`packages/cli` / `packages/specialists-service` / `packages/api` test
suites. A flaky e2e test in `tests/msd-assistant/e2e/` (passes 12/12
in isolation) failed once during the workflow run, blocking
final-hard-validation. The auto-fix loop then "repaired" 7×, all
failing identically because the failing test wasn't in any child
workflow's scope.

Fix: compute the test command via the existing
`deriveTestCommand(spec)` helper and use it in all three master-render
sites (final-hard-validation step body at line 287, the same gate
command at line 489, and the regression-gate at line 491). The helper
already handled spec-driven scoping for explicit acceptance gates and
test-file targets; this commit adds workspace-aware scoping as a new
fallback for source-file targets that share `packages/<pkg>/`,
`apps/<pkg>/`, or `services/<pkg>/` prefixes (the three monorepo
conventions npm/pnpm/yarn workspaces overwhelmingly use). When such
prefixes are detected, emit `npm test --workspace=<pkg>` chained with
`&&` for each unique workspace. When no prefix matches, fall back to
the previous unscoped `npx vitest run` behavior.

`deriveTestCommand` is now exported from template-renderer so the
master renderer can share the same logic instead of forking another
implementation.

Tests:
- New regression case in `pipeline.test.ts`:
  "scopes master-rendered final-hard-validation tests to workspaces
  touched by the spec" — asserts the rendered workflow contains
  `npm test --workspace='packages/backend'` and
  `npm test --workspace='packages/shared'` (the spec's actual scope),
  does NOT contain `--workspace='packages/webapp'` or other unrelated
  workspaces, and never emits the bare-`npm test` whole-suite pattern.
- Existing 46 pipeline tests still pass; full ricky suite at 1060 / 1060.

Out of scope (worth follow-ups noted in earlier conversation):
- The `repairAgent` declared on master-level `onError` did not appear
  to invoke an actual repair agent during the user's failed run; the
  retries were dumb reruns. Whether that's an SDK gap or a missing
  agent role definition needs separate investigation.
- Auto-fix burning 7/7 attempts on user-actionable errors
  (MISSING_ENV_VAR, real test failures, scope mismatches) instead of
  classifying them and escalating immediately. Worth a tracking issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 18dd906a-ee61-4988-8110-5dcadf6f57ab

📥 Commits

Reviewing files that changed from the base of the PR and between aa40c54 and dafcb1a.

📒 Files selected for processing (2)
  • src/product/generation/pipeline.test.ts
  • src/product/generation/template-renderer.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/product/generation/template-renderer.ts

📝 Walkthrough

Walkthrough

Master workflow generation now derives workspace-scoped test commands from the workflow spec using an exported deriveTestCommand. The master gates and final-hard-validation step use the derived command instead of hardcoded npm test; a helper extracts unique workspace roots and tests validate scoped and unscoped behaviors.

Changes

Workspace-Aware Test Command Derivation

| Layer / File(s) | Summary |
|---|---|
| Test Command Derivation Logic (src/product/generation/template-renderer.ts) | Export deriveTestCommand and add uniqueWorkspacesFromTargetFiles helper to extract affected workspaces (packages/, apps/, services/, including scoped forms) from spec.targetFiles and generate npx vitest run or chained npm test --workspace=... commands; fallback to npx vitest run when no workspaces detected. |
| Master Workflow Integration (src/product/generation/master-workflow-renderer.ts) | Import deriveTestCommand, pass spec to buildMasterGates, compute testCommand from spec, and replace hardcoded npm test in final-hard-validation and regression-gate with the derived command. |
| Regression Test (src/product/generation/pipeline.test.ts) | Add Vitest tests asserting that final-hard-validation includes only workspace-scoped npm test --workspace='...' commands for affected packages (including scoped names), excludes unrelated packages and unscoped/truncated patterns. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • AgentWorkforce/ricky#90: Related modification to master workflow validation steps; PR #90 made TypeScript typecheck workspace-aware while this PR makes test execution workspace-aware within final-hard-validation.

Poem

🐰 I hop through changed file paths with cheer,
I gather workspaces that now appear.
No flat npm test to loudly shout—
Each scoped package runs its tests throughout.
nibbles a carrot and twitches whiskers

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: scoping final-hard-validation tests to spec workspaces instead of running unscoped npm test. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the problem, reproduction steps, implementation details, and test coverage. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/product/generation/pipeline.test.ts`:
- Around line 88-100: The test is brittle because it asserts on raw rendered
content (artifact(result).content) and regexes rather than the structured gate
command; change assertions to read the gate command via gate(artifact,
'final-hard-validation').command and assert that the command string contains the
expected workspace-specific npm test invocations for "packages/backend" and
"packages/shared" and does not contain "packages/webapp", "packages/mobile" or
the whole-suite pattern; update the positive checks to assert inclusion of "npm
test --workspace='packages/backend'" and "npm test
--workspace='packages/shared'" against gate(...).command, and replace the
negative checks and the regex not-to-match with direct string/not-string
assertions on gate(...).command to target semantics instead of rendered
formatting.

In `@src/product/generation/template-renderer.ts`:
- Around line 1263-1268: The function uniqueWorkspacesFromTargetFiles builds
workspace paths but its regex only handles unscoped names (e.g., packages/pkg)
and thus mis-parses scoped workspaces like packages/@scope/pkg; update the regex
in uniqueWorkspacesFromTargetFiles to allow an optional scope segment (e.g.,
change /^((?:packages|apps|services)\/[^\/]+)\// to
/^((?:packages|apps|services)\/(?:@[^\/]+\/)?[^\/]+)\//) so it captures both
packages/pkg and packages/@scope/pkg, ensure you add the final return
Array.from(workspaces) (or equivalent) so the function returns the deduplicated
workspace array.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 66f3010b-8e07-4284-8c63-cf5b7225278a

📥 Commits

Reviewing files that changed from the base of the PR and between 45ea489 and aa40c54.

📒 Files selected for processing (3)
  • src/product/generation/master-workflow-renderer.ts
  • src/product/generation/pipeline.test.ts
  • src/product/generation/template-renderer.ts

Comment thread src/product/generation/pipeline.test.ts Outdated
Comment thread src/product/generation/template-renderer.ts

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.


Two actionable comments from CodeRabbit's review of #91:

1. uniqueWorkspacesFromTargetFiles regex misparsed npm-scoped workspace
   paths. The previous pattern
       /^((?:packages|apps|services)\/[^\/]+)\//
   captured only one path segment after the root directory, so
   `packages/@agentworkforce/runtime/src/x.ts` yielded
   `packages/@agentworkforce`. Real npm workspaces with scope prefixes
   (overwhelmingly common in the AgentWorkforce repos this PR targets)
   would have produced a wrong --workspace=… arg and failed at runtime.
   Regex now allows an optional `@scope/` segment:
       /^((?:packages|apps|services)\/(?:@[^\/]+\/)?[^\/]+)\//
   New regression case asserts scoped workspaces resolve end-to-end and
   guards against the truncated-at-scope output.

2. The original regression test asserted on
   `artifact(result).content` with regex matching, which is brittle
   against rendered formatting changes (whitespace, JSON-string
   escaping inside `.step({ command: ... })` literals, etc.). Tests now
   use the structured helpers already present in this file
   (`gate(artifact, name).command` for the gate, `renderedStepCommand`
   for the parsed step body) and assert on the parsed command strings
   directly. Both surfaces are checked in a single loop so the
   step-body and gate-command stay in sync.

Plus: added JSDoc to `deriveTestCommand` and
`uniqueWorkspacesFromTargetFiles` so the docstring-coverage warning on
the PR clears, and so future readers see the resolution-order rules
without spelunking through the implementation.

Tests: pipeline 48/48 (was 47/47); full ricky suite 1061/1061. The
new "uniqueWorkspacesFromTargetFiles handles npm-scoped workspace
paths" case fails on the pre-fix regex with
`--workspace='packages/@agentworkforce'` instead of
`--workspace='packages/@agentworkforce/runtime'`, confirming the
regex bug was real and is now fixed.
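
A small side-by-side of the two patterns quoted above, as a sketch: the
two regex literals are copied from this commit message, while
`workspaceRoot` and the constant names are hypothetical helpers used
only for illustration.

```typescript
// Pattern before the fix: one path segment after the root directory.
const beforeFix = /^((?:packages|apps|services)\/[^\/]+)\//;
// Pattern after the fix: optional @scope/ segment before the package name.
const afterFix = /^((?:packages|apps|services)\/(?:@[^\/]+\/)?[^\/]+)\//;

// Hypothetical helper: returns the captured workspace root, or null if none.
function workspaceRoot(pattern: RegExp, file: string): string | null {
  return pattern.exec(file)?.[1] ?? null;
}

const scopedFile = "packages/@agentworkforce/runtime/src/x.ts";
const truncated = workspaceRoot(beforeFix, scopedFile); // "packages/@agentworkforce"
const resolved = workspaceRoot(afterFix, scopedFile); // "packages/@agentworkforce/runtime"

// Unscoped paths resolve identically under both patterns.
const unscoped = workspaceRoot(afterFix, "packages/backend/src/a.ts"); // "packages/backend"
```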

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 10, 2026

Ricky Eval Review

Run: .ricky/evals/runs/2026-05-10T15-27-23-207Z-HEAD-provider
Mode: provider
Git SHA: 3f77197

Passed: 8 | Needs human: 43 | Reviewable: 43 | Missing output: 0 | Failed: 0 | Skipped: 0

Human Review Cases

These cases passed deterministic checks and include captured Ricky output for a human verdict against their Must / Must Not rubric.

REVIEW agent-assistant-boundary.real-reuse-not-rhetorical (agent-assistant-boundary/manual)

User message

Update Ricky docs and code to say it uses Agent Assistant more deeply.

Ricky output

**Scope of the change**  
- Extend the *Ricky* documentation to state that Ricky now leverages the **Agent Assistant** library for deeper orchestration (state‑tracking, intent routing, and deterministic verification).  
- Add a thin wrapper in `ricky/assistant.py` that imports the shared `agent_assistant` package and exposes the key helper functions used throughout the repo.  
- Update the workflow contract (`ricky/workflow_contract.yaml`) to include the new `assistant_*` verification gates.  
- Add a unit‑test suite (`tests/test_assistant_integration.py`) that validates the deterministic behavior of the new wrapper.  

All changes are confined to the `ricky/` package and its test directory; no cross‑repo impact.

---

## 1. Documentation updates (`docs/README.md` & `docs/ARCHITECTURE.md`)

### README.md – “Core capabilities” section
```markdown
## Core capabilities

- **Deterministic workflow contracts** – YAML‑defined contracts with 80‑100 validation loops.
- **Agent Assistant integration** – Ricky now embeds the **Agent Assistant** library for:
  - State‑aware intent routing
  - Automatic verification gate generation (`assistant_verify_*`)
  - Centralized logging & telemetry
- *
...[truncated]

Must

  • Ground claims in real package imports and runtime paths.
  • Distinguish current implementation from target architecture.
  • Identify which Agent Assistant primitive is actually exercised.

Must Not

  • Rename local code to sound Agent Assistant aligned and count that as adoption.
  • Claim broad Agent Assistant native behavior from documentation-only alignment.
  • Blur target architecture with landed behavior.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW agent-assistant-boundary.turn-context-preserves-ricky-envelope (agent-assistant-boundary/manual)

User message

Evaluate the current Ricky @agent-assistant/turn-context adoption.

Ricky output

**Ricky @agent‑assistant/turn‑context Adoption Review (as of 2026‑05‑10)**  

| Metric | Current State | Target / SLA | Gap / Action |
|--------|---------------|--------------|--------------|
| **Workflow contracts defined** | 12 of 18 active workflows include a `turn‑context` contract (67 %). | 100 % of production workflows must declare a contract. | Identify the 6 missing workflows, add `turn‑context` schema, and open PRs. |
| **Deterministic verification gates** | 8 of 12 contracts have automated gate tests in CI (66 %). | Every contract must have a gate that validates `turn‑context` shape and required fields. | Write gate tests for the remaining 4 contracts; add them to `workflow‑verify.yml`. |
| **Review artifacts** | 4 of 12 PRs contain a “Turn‑Context Review” checklist item. | 100 % of PRs that modify a contract must include the checklist. | Update the PR template (`.github/PULL_REQUEST_TEMPLATE.md`) to enforce the checklist. |
| **Validation loops** | Average of 42 validation loops per contract (target 80‑100). | 80‑100 loops per release cycle to ensure stability across edge cases. | Expand test matrix (different user roles, language locales, async‑vs‑sync calls) to reach t
...[truncated]

Must

  • Preserve request id, source metadata, structured spec, invocation root, mode, stage mode, spec path, metadata, and spec text.
  • Record compact provenance through generation decisions or coordinator metadata.
  • Keep the shared turn context internal to the adapter boundary.

Must Not

  • Move LocalResponse, blocker taxonomy, recovery wording, or execution semantics into the shared turn-context package.
  • Drop Ricky-specific workflow metadata during envelope assembly.
  • Treat turn context as a product decision engine.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW agent-assistant-boundary.product-core-stays-ricky-owned (agent-assistant-boundary/manual)

User message

Decide whether workflow generation, validation, debugging, staged CLI UX, and blocker/evidence wording should move into Agent Assistant.

Ricky output

**Recommendation:** Move **all** of the listed capabilities—workflow generation, validation, debugging, staged CLI UX, and blocker/evidence wording—into the **Agent Assistant** component.

---

## Rationale (aligned with AGENTS.md and workflow standards)

| Capability | Why it belongs in Agent Assistant | Impact on reliability & coordination |
|------------|-----------------------------------|--------------------------------------|
| **Workflow Generation** | Agent Assistant already owns *authoring* responsibilities (template selection, contract scaffolding). Adding generation keeps the creation path end‑to‑end in a single service, eliminating hand‑off latency. | Guarantees deterministic contracts (JSON‑Schema + OpenAPI) are produced under a single verification gate. |
| **Workflow Validation** | Validation is the deterministic gate that follows generation. Embedding it in Agent Assistant allows immediate feedback loops (80‑100 validation cycles) without external round‑trips. | Reduces false‑positive passes; failures are reported as concrete blocker artifacts attached to the same PR. |
| **Debugging** | Debugging logic (trace replay, state diff, sandbox execution) requires intimate
...[truncated]

Must

  • Keep product-defining workflow generation, validation, debugging, local UX, and evidence wording Ricky-owned until proof says otherwise.
  • Reuse shared runtime primitives where they reduce duplication without weakening Ricky.
  • Make extraction follow typed, tested, live product proof.

Must Not

  • Generalize workflow-specific behavior prematurely.
  • Adopt moving shared seams merely for architectural purity.
  • Lose the precise local-first staged workflow UX.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW agent-assistant-boundary.one-slice-at-a-time (agent-assistant-boundary/manual)

User message

Plan the next Agent Assistant adoption slice for Ricky.

Ricky output

**Agent Assistant Adoption Slice – “Ricky v2.0 Roll‑out”**  
*Target: Core Customer‑Success (CS) teams, 3 months, 2 sprints per month (6 sprints total)*  

---

## 1. Slice Goal & Success Metrics
| Metric | Target (by end of slice) |
|--------|--------------------------|
| **Adoption** | ≥ 70 % of CS agents actively using Ricky in daily ticket triage |
| **Accuracy** | ≥ 92 % correct workflow recommendation (verified against ground‑truth test set) |
| **Latency** | ≤ 1.2 s average response time per Ricky query |
| **User Satisfaction** | ≥ 4.2 /5 (post‑interaction survey) |
| **Support Load** | ≤ 15 % reduction in escalations to Tier‑2 (baseline = 30 %) |

---

## 2. Scope & Boundaries
| In‑Scope | Out‑Of‑Scope |
|----------|--------------|
| • Pre‑defined CS ticket categories (Billing, Access, Product‑Usage) | • Custom, ad‑hoc queries outside defined taxonomy |
| • Integration with ServiceNow ticket view (read‑only) | • Write‑back actions (ticket updates) – deferred to v2.1 |
| • Inline chat widget inside ServiceNow UI | • Mobile‑only UI (planned for later slice) |
| • Automated “confidence‑threshold” fallback to human SME | • Full‑language model fine‑tuning (use base model) |

Al
...[truncated]

Must

  • Pick exactly one real shared seam to evaluate or adopt.
  • Define a live Ricky product path that will prove the adoption.
  • Include regression checks that product messaging, blocker output, and evidence remain truthful.

Must Not

  • Bundle sessions, memory, policy, proactive behavior, and execution extraction into one vague migration.
  • Skip the comparison/evaluation step for mature Ricky-local seams.
  • Treat adoption as successful without a live product-path proof.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW agent-assistant-boundary.future-surfaces-use-shared-runtime (agent-assistant-boundary/manual)

User message

Design future Slack or web support for Ricky using Agent Assistant packages.

Ricky output

## Ricky – Future Slack / Web Support Design  
*Agent Assistant Package (AAP)‑based, deterministic, review‑ready, and scoped for incremental delivery.*

---

### 1. High‑Level Architecture

| Component | Responsibility | AAP Package | Runtime |
|-----------|----------------|------------|---------|
| **Ricky‑Core** | Business‑logic for workflow reliability, coordination, and authoring. | `ricky-core` | Python 3.11 (FastAPI) |
| **Slack‑Adapter** | Bi‑directional Slack events → Ricky‑Core, command parsing, response formatting. | `ricky-slack` | Bolt for Python |
| **Web‑UI** | React SPA for ticket view, validation dashboards, and manual overrides. | `ricky-web` | React 18 + Vite |
| **Auth‑Gateway** | OAuth2 (Slack) + OIDC (Web) token validation, RBAC. | `ricky-auth` | Node 20 (Express) |
| **Persistence** | PostgreSQL (workflow state, audit logs) + Redis (locks, rate‑limit). | `ricky-db` | Docker‑compose |
| **Observability** | OpenTelemetry traces, Prometheus metrics, Loki logs. | `ricky-obs` | Side‑car containers |
| **CI/CD** | Deterministic build, contract tests, 80‑100 validation loops per PR. | `ricky-ci` | GitHub Actions |

All packages are versioned independently (semantic‑r
...[truncated]

Must

  • Prefer shared surfaces, webhook-runtime, sessions, and routing primitives for future non-CLI interaction where mature.
  • Keep local CLI behavior product-local unless shared adoption is proven harmless.
  • Explain which behavior is future/target architecture versus implemented today.

Must Not

  • Preemptively add memory, policy, or proactive packages without a real Ricky product requirement.
  • Let future surface abstractions distort the current CLI contract.
  • Duplicate a mature Agent Assistant capability locally without justification.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW cli.generation-default-not-execution (cli-behavior/manual)

User message

A user runs ricky --mode local --spec "generate a workflow for package checks" without --run.

Ricky output

**What happens when you run**

```bash
ricky --mode local --spec "generate a workflow for package checks"
```

without the --run flag

| Step | Description | Outcome |
|------|-------------|---------|
| 1️⃣ | Argument parsing – Ricky reads --mode local and the free‑form --spec string. | The spec is stored as the desired workflow contract for later processing. |
| 2️⃣ | Specification‑only mode – Because --run is absent, Ricky enters dry‑run / validation mode. | No workflow files are written, no containers are started, and no external actions are executed. |
| 3️⃣ | Contract generation – Ricky translates the natural‑language spec into a concrete workflow contract (YAML/JSON) that follows the Ricky workflow schema (see schemas/workflow.json). | You get a deterministic, version‑controlled representation of the workflow (e.g., generated_workflow.yaml). |
| 4️⃣ | Deterministic verification gates – The generated contract passes through the built‑in validation pipeline: Schema validation (JSON‑Schema), Naming conventions (snake_case, max 50 chars), Dependency sanity check (no circular imports), Resource limits
...[truncated]

Must

  • Say generation is the default and execution was not requested.
  • Print the generated artifact path, workflow id, spec digest, and next run command.
  • Avoid showing execution evidence for a generation-only request.

Must Not

  • Imply the workflow ran automatically.
  • Present a generation-only result as execution success.
  • Hide the opt-in commands for running the artifact.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0

REVIEW cli.first-run-copy-is-compact-and-truthful (cli-behavior/manual)

User message

Render Ricky's first-run CLI onboarding for a new user.

Ricky output

**Ricky – First‑Run CLI Onboarding**

Below is the complete, ready‑to‑run onboarding workflow for a brand‑new user of the **Ricky** AgentWorkforce CLI. Follow the steps in order; each step is a deterministic gate that either succeeds (green) or fails (red) with an explicit error message. If any gate fails, stop, fix the reported blocker, and re‑run the failed step before proceeding.

---

## 1️⃣ Prerequisites (Verification Gate)

| Check | Command | Expected Result |
|-------|---------|-----------------|
| **Node ≥18** | `node -v` | `v18.x` or higher |
| **npm ≥9** | `npm -v` | `9.x` or higher |
| **Git** | `git --version` | `git version 2.x` |
| **Docker** (optional, for containerised agents) | `docker version --format '{{.Server.Version}}'` | `>=20.10` |

*If any check is red, install/upgrade the missing tool before continuing.*

---

## 2️⃣ Install the Ricky CLI (Deterministic Installation)

```bash
# Global install (preferred)
npm i -g @ricky/agentworkforce-cli
# Verify installation
ricky --version
```

Success output: ricky-cli v1.0.0 (or later).
Failure: see the npm error log, resolve dependency conflicts, then re‑run.

## 3️⃣ Initialise a New Workspace (Contract
...[truncated]

Must

  • Show compact Ricky branding and clear Local / BYOH, Cloud, Both, and Just explore choices.
  • End every branch with a concrete next step.
  • Advertise only commands that are currently implemented.

Must Not

  • Sound like a launch page or documentation dump.
  • Claim Ricky runs workflows by default when generation is the default path.
  • Require web or Slack onboarding before CLI use.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0

REVIEW cli.recovery-guidance-no-stack-traces (cli-behavior/manual)

User message

A user gives Ricky an empty spec or a missing spec file.

Ricky output

**Ricky’s Standard Response to an Empty or Missing Spec**

1. **Immediate Detection & Blocker Report**  
   - **Detect**: As soon as the workflow‑authoring command is invoked, check for the presence of the spec file (`workflow.yaml` or the file declared in the command).  
   - **Blocker**: If the file is absent **or** its content is empty (zero bytes or only whitespace/comments), raise a deterministic blocker:  

     ```
     🚨 BLOCKER: Spec file “<path>” is missing or empty.
     ```

   - **Log**: Add the blocker to the workflow’s `ricky.log` with a timestamp and the invoking user ID.

2. **Create a Minimal Placeholder Spec**  
   - Generate a temporary placeholder in the same directory named `workflow.placeholder.yaml` with the required top‑level keys so the CI pipeline can continue past the detection stage:

     ```yaml
     # Placeholder – replace with real spec
     name: <YOUR_WORKFLOW_NAME>
     version: 0.0.1
     description: "Placeholder generated by Ricky – awaiting user‑provided spec."
     steps: []
     ```

   - Commit this placeholder on a **dedicated branch** `ricky/placeholder‑<timestamp>` and open a PR titled **“Placeholder spec – awaiting user input”**.  

...[truncated]

Must

  • Return a user-facing failure or guidance message with a real recovery command.
  • Distinguish generation failure from execution failure.
  • Show stack traces only when verbose diagnostic mode is requested.

Must Not

  • Crash with an uncaught exception in normal mode.
  • Suggest commands that do not exist.
  • Pretend a missing spec was accepted.
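
The Must / Must Not items above can be read as a small formatting rule. The sketch below is illustrative only: `CliFailure`, `formatCliFailure`, and the placeholder recovery command are hypothetical names, not taken from Ricky's source. It assumes the caller already knows whether the failure happened during generation or execution.

```typescript
// Hypothetical shapes for illustration; none of these names come from Ricky's source.
type FailureStage = "generation" | "execution";

interface CliFailure {
  stage: FailureStage;
  message: string;
  recoveryCommand: string; // should be a command the CLI actually implements
  stack?: string;
}

function formatCliFailure(failure: CliFailure, verbose: boolean): string {
  // Generation and execution failures get distinct headings.
  const heading =
    failure.stage === "generation"
      ? "Workflow generation failed"
      : "Workflow execution failed";
  const lines = [
    `${heading}: ${failure.message}`,
    `Next step: ${failure.recoveryCommand}`,
  ];
  // Stack traces are diagnostic detail, so they appear only on request.
  if (verbose && failure.stack) {
    lines.push("", failure.stack);
  }
  return lines.join("\n");
}

const missingSpec: CliFailure = {
  stage: "generation",
  message: "spec file is missing or empty",
  recoveryCommand: "<a real, implemented recovery command>",
  stack: "Error: ENOENT\n    at readSpec (spec.ts:12:9)",
};
```

Calling `formatCliFailure(missingSpec, false)` yields the two user-facing lines only; passing `true` appends the stack.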

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW cli.status-does-not-invent-provider-state (cli-behavior/manual)

User message

Render ricky status when no provider checks have proven Google or GitHub are connected.

Ricky output

**`ricky status` output (no provider checks have proven Google or GitHub are connected)**  

=== AgentWorkforce Status Report ===

🟡 Provider Connections
• Google : NOT CONNECTED (no successful health check)
• GitHub : NOT CONNECTED (no successful health check)

🔧 Workflow Engine
• Scheduler : RUNNING
• Queue : EMPTY
• Workers : 0 ACTIVE / 0 IDLE

📦 Repository State
• Branches : master (up‑to‑date)
• Pending PRs : none
• Staged Changes: none

⚙️ System Health
• CPU Usage : 12%
• Memory Usage: 38%
• Disk Space : 73% free

🚨 Alerts
• Provider connectivity missing for Google and GitHub.
→ Run ricky provider connect --google or ricky provider connect --github
→ Verify OAuth tokens / API keys are configured in ~/.ricky/config.yaml.

📝 Next Steps

  1. Authenticate the missing providers.
  2. Re‑run ricky status to confirm connections.
  3. Once connected, workflow validation gates will resume normal operation.

--- End of Report ---

Must

  • Report unknown or not-connected provider state honestly.
  • Update provider status only from explicit provider checks or Cloud status results.
  • Give concrete setup guidance for Cloud when relevant.

Must Not

  • Mark Google or GitHub connected because guidance text was shown.
  • Invent a provider connection URL or OAuth flow.
  • Show empty fields with no recovery guidance when config is missing.
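
One way to keep status rendering honest is to make "unknown" the default and let only explicit check results change it. The following is a minimal sketch under that assumption; `ProviderCheckResult`, `resolveProviderStates`, and `renderProviderLine` are hypothetical names, not Ricky's API.

```typescript
// Hypothetical types: provider state changes only when a check result says so.
type ProviderState = "connected" | "not-connected" | "unknown";

interface ProviderCheckResult {
  provider: string;
  ok: boolean;
}

function resolveProviderStates(
  providers: string[],
  checks: ProviderCheckResult[],
): Record<string, ProviderState> {
  const states: Record<string, ProviderState> = {};
  // Every provider starts as "unknown"; showing guidance text never changes this.
  for (const provider of providers) {
    states[provider] = "unknown";
  }
  // Only an explicit check result can move a provider out of "unknown".
  for (const check of checks) {
    if (check.provider in states) {
      states[check.provider] = check.ok ? "connected" : "not-connected";
    }
  }
  return states;
}

function renderProviderLine(provider: string, state: ProviderState): string {
  if (state === "connected") return `${provider}: connected`;
  if (state === "not-connected") return `${provider}: not connected (last check failed)`;
  return `${provider}: unknown (no check has run)`;
}
```

With no check results, both Google and GitHub render as unknown rather than as connected or as a definite failure.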

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW generation-quality.skill-matcher-registry-backed (generation-quality/manual)

User message

Generate a workflow for a GitHub primitive change and explain which skills Ricky selected.

Ricky output

**Workflow: “GitHub Primitive Change”**

| Step | Owner | Action | Input | Output | Verification Gate |
|------|-------|--------|-------|--------|--------------------|
| 1️⃣ | **Ricky (Coordination)** | **Create a scoped branch** | `main` (or target base) | New branch `feat/<short‑description>` | Branch name follows `feat|fix|chore/<ticket‑id>-<slug>` pattern; CI lint passes. |
| 2️⃣ | **Developer** | **Implement the primitive change** (e.g., edit a single file, bump a version, add a comment) | Branch from Step 1 | Modified files, unit‑test updates (if any) | Local `git diff` shows only intended files; `npm run lint` (or language‑specific linter) succeeds. |
| 3️⃣ | **Ricky (Verification)** | **Run deterministic CI checks** | Push to remote | CI pipeline results (unit tests, static analysis, security scan) | All jobs report **PASS**; any failure aborts the workflow and is reported as a blocker. |
| 4️⃣ | **Developer** | **Open a Pull Request (PR)** | Pushed branch | PR titled `feat: <short description>` with template filled (description, related issue, checklist) | PR passes *automated* “PR‑template completeness” gate (all checklist items checked). |
| 5️⃣ | **Ricky (Review Coordin
...[truncated]

Must

  • Select skills from the actual registry rather than a hardcoded static set.
  • Record ranked skill matches with confidence and trigger evidence.
  • Fall back gracefully when the registry is missing or no skill clears the threshold.

Must Not

  • Claim runtime agents embody skills when only generation-time guidance was applied.
  • Fail generation solely because optional skill files are missing.
  • Hide skill selection evidence from artifacts or JSON output.
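
The ranking and fallback behaviour described above could look roughly like this. It is a sketch with assumed shapes: `SkillMatch`, `SkillSelection`, `selectSkills`, and the numeric threshold are hypothetical, and the registry is represented as an already-scored array.

```typescript
// Hypothetical shapes; the real registry format is not shown in this report.
interface SkillMatch {
  skill: string;
  confidence: number; // assumed to be in the range 0..1
  evidence: string[]; // trigger evidence that led to the match
}

interface SkillSelection {
  selected: SkillMatch[];
  fallback: boolean;
  reason: string;
}

function selectSkills(
  registryMatches: SkillMatch[] | undefined,
  threshold: number,
): SkillSelection {
  // A missing registry is a graceful fallback, not a generation failure.
  if (!registryMatches || registryMatches.length === 0) {
    return { selected: [], fallback: true, reason: "registry missing or empty" };
  }
  // Rank on a copy so the caller's array is left untouched.
  const ranked = [...registryMatches].sort((a, b) => b.confidence - a.confidence);
  const selected = ranked.filter((match) => match.confidence >= threshold);
  if (selected.length === 0) {
    return { selected: [], fallback: true, reason: "no skill cleared the threshold" };
  }
  return {
    selected,
    fallback: false,
    reason: `${selected.length} skill(s) cleared threshold ${threshold}`,
  };
}
```

The returned object carries the ranked matches and their evidence, so it can be written to artifacts or JSON output as-is.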

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW generation-quality.tool-selector-honors-spec-hints (generation-quality/manual)

User message

Generate a workflow where the spec says "use Claude to review and Codex to implement".

Ricky output

**Workflow Title:** `review‑and‑implement‑claude‑codex`

**Purpose**  
Automate a two‑stage development cycle where Claude performs a design/code‑review of a feature request and Codex generates the implementation. The workflow enforces deterministic verification gates and produces review artifacts for auditability.

---

## 1. Contract (YAML)

```yaml
name: review-and-implement-claude-codex
description: |
  Takes a feature specification, has Claude produce a design/review,
  then passes the design to Codex for code generation. Includes
  deterministic verification, 80‑100 validation loops, and blocker reporting.
inputs:
  - name: spec
    type: string
    description: Feature specification (markdown)
  - name: language
    type: enum
    enum: [python, javascript, go, java]
    default: python
outputs:
  - name: design_report
    type: string
    description: Claude’s design and review (markdown)
  - name: implementation
    type: string
    description: Generated source code
  - name: test_suite
    type: string
    description: Auto‑generated pytest/mocha/etc. based on design
  - name: verification_log
    type: string
    description: Summary of all verification gate results
env
...[truncated]

Must

  • Assign per-step runner/model decisions from explicit spec hints where possible.
  • Let skill metadata or project defaults fill gaps when the spec is silent.
  • Write tool-selection.json or equivalent audit metadata.

Must Not

  • Use one generic runtime default for every agent despite explicit hints.
  • Let skill defaults override explicit user/spec runner hints.
  • Omit the reason each tool or model was chosen.
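
The precedence implied above (explicit spec hint first, then skill metadata, then project default) fits in a few lines. The names `RunnerDecision` and `chooseRunner` are hypothetical, and the runner strings are plain labels rather than real model identifiers.

```typescript
// Hypothetical decision record, in the shape an audit file such as tool-selection.json might hold.
type DecisionSource = "spec-hint" | "skill-metadata" | "project-default";

interface RunnerDecision {
  stepId: string;
  runner: string;
  source: DecisionSource;
  reason: string;
}

interface RunnerSources {
  specHint?: string;
  skillDefault?: string;
  projectDefault: string;
}

function chooseRunner(stepId: string, sources: RunnerSources): RunnerDecision {
  // An explicit hint from the spec always wins.
  if (sources.specHint) {
    return { stepId, runner: sources.specHint, source: "spec-hint", reason: "explicit runner hint in the spec" };
  }
  // Skill metadata fills the gap only when the spec is silent.
  if (sources.skillDefault) {
    return { stepId, runner: sources.skillDefault, source: "skill-metadata", reason: "spec is silent; skill metadata supplies a runner" };
  }
  return { stepId, runner: sources.projectDefault, source: "project-default", reason: "no hint or skill default; project default applies" };
}

// For the spec "use Claude to review and Codex to implement":
const decisions = [
  chooseRunner("review", { specHint: "claude", skillDefault: "codex", projectDefault: "codex" }),
  chooseRunner("implement", { specHint: "codex", projectDefault: "claude" }),
];
```

Each decision records its source and reason, which covers the audit requirement without a separate explanation pass.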

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW generation-quality.refine-is-opt-in-and-bounded (generation-quality/manual)

User message

Use --refine to sharpen a generated workflow's step task descriptions and acceptance gates.

Ricky output

## Using `--refine` to Sharpen Step Descriptions & Acceptance Gates  

`--refine` is a deterministic post‑processing flag that takes a **generated workflow** (JSON/YAML) and returns a **tighter, review‑ready version**. It focuses on two things:

| Target | What `--refine` does | Why it matters |
|--------|----------------------|----------------|
| **Task descriptions** | • Rewrites each `task.description` into an **imperative, testable sentence**.<br>• Adds explicit **inputs**, **outputs**, and **owner** fields if missing.<br>• Limits length to 120 chars (max) for readability. | Guarantees every step is unambiguous and can be traced to a concrete artifact. |
| **Acceptance gates** | • Converts free‑form “verify X” notes into a **structured `gate` object** (`type`, `criteria`, `timeout`).<br>• Inserts a **deterministic verification command** (`cmd`) that can be executed in CI.<br>• Adds a `fallback` policy (`retry`, `escalate`, `skip`). | Enables automated gating, reduces human guesswork, and surfaces blockers early. |

---

### 1. Invoke the Refiner

```bash
ricky generate workflow.yaml --model=gpt‑4o --refine > workflow.refined.yaml
```

- `workflow.yaml` – the raw output from a g
...[truncated]

**Must**

- Keep the deterministic unrefined path as the default.
- Bound the refinement pass by timeout, token budget, and editable regions.
- Re-run validation after refinement and fall back to the deterministic artifact on unsafe edits or provider failure.

**Must Not**

- Change the workflow graph, agent assignments, or side-effect scope during refinement.
- Fail the whole generation if optional refinement times out.
- Hide warnings when the deterministic artifact is returned unchanged.
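
A minimal sketch of the fallback rule above, assuming the refiner hands back either a candidate step list or `null` when it timed out or the provider failed. `WorkflowStep`, `sameStructure`, and `applyRefinement` are hypothetical names; the timeout and token budget themselves are outside this snippet.

```typescript
// Hypothetical step shape: only `task` and `gate` are treated as editable regions.
interface WorkflowStep {
  id: string;
  agent: string;
  dependsOn: string[];
  task: string;
  gate: string;
}

interface RefinementOutcome {
  steps: WorkflowStep[];
  refined: boolean;
  warnings: string[];
}

// True when ids, agent assignments, and dependency edges are all unchanged.
function sameStructure(a: WorkflowStep[], b: WorkflowStep[]): boolean {
  if (a.length !== b.length) return false;
  return a.every(
    (step, i) =>
      step.id === b[i].id &&
      step.agent === b[i].agent &&
      step.dependsOn.join(",") === b[i].dependsOn.join(","),
  );
}

function applyRefinement(
  original: WorkflowStep[],
  candidate: WorkflowStep[] | null,
): RefinementOutcome {
  if (candidate === null) {
    // Optional refinement never fails the whole generation.
    return { steps: original, refined: false, warnings: ["refinement unavailable (timeout or provider failure); deterministic artifact returned unchanged"] };
  }
  if (!sameStructure(original, candidate)) {
    return { steps: original, refined: false, warnings: ["refinement changed the workflow graph or agent assignments; deterministic artifact returned unchanged"] };
  }
  return { steps: candidate, refined: true, warnings: [] };
}
```

The warning list is never empty when the deterministic artifact is returned, so the fallback stays visible to the caller.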

**Deterministic checks**

- PASS `status`: case executed
- PASS `maxToolCalls`: expected <= 0 tool calls, got 0

</details>

<details>
<summary><strong>REVIEW</strong> <code>generation-quality.behavior-grounded-gates</code> (generation-quality/manual)</summary>

**User message**

> Generate a workflow for the `ricky --version` spec.

**Ricky output**

```text
**Workflow: `ricky --version` Specification & Delivery**  
*Repository:* `github.com/agentworkforce/ricky`  
*Target Branch:* `main` → `feature/ricky-version-spec`  

---

## 1. Scope & Acceptance Criteria  

| ID | Requirement | Success Metric |
|----|-------------|----------------|
| V‑001 | The CLI must output the semantic version string of the compiled binary (e.g., `ricky v2.3.1`). | Exact match to `ricky v<MAJOR>.<MINOR>.<PATCH>` on all supported platforms. |
| V‑002 | The version flag must be case‑insensitive (`--version`, `-V`, `-v`). | All three forms produce identical output. |
| V‑003 | The version string must be sourced from a single source of truth (`git tag` or `VERSION` file). | No hard‑coded strings in source; version is injected at build time. |
| V‑004 | The command must exit with status `0`. | `echo $?` after execution equals `0`. |
| V‑005 | The command must be discoverable via `ricky --help`. | Help text contains the line `--version, -V, -v   Show version information`. |
| V‑006 | The binary must embed build metadata (commit SHA, build date) accessible via `ricky --version --detail`. | Additional fields appear in output when `--detail` is supplied. |
| V‑007 | 
...[truncated]

Must

  • Build gates from the stated acceptance behavior, such as checking ricky --version output.
  • Avoid generic source-shape grep checks when the spec asks for CLI behavior.
  • Keep generated validation meaningful for the current repo shape.

Must Not

  • Treat grep -Eq 'export|function|class|workflow(' dist/ricky.js as proof of version behavior.
  • Claim the workflow is proven by source syntax alone.
  • Ignore the package-json version resolution order in the spec.
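
To make the contrast with a source-shape grep concrete, here is a sketch of a gate that judges captured command output instead of file contents. `CommandResult`, `GateResult`, and `checkVersionBehavior` are hypothetical names, and the assumed output format (the version appearing as a whitespace-separated token, optionally prefixed with `v`) is an illustration, not the spec's stated format.

```typescript
// Hypothetical: the result of actually running `ricky --version`, captured by the caller.
interface CommandResult {
  stdout: string;
  exitCode: number;
}

interface GateResult {
  pass: boolean;
  detail: string;
}

function checkVersionBehavior(result: CommandResult, expectedVersion: string): GateResult {
  if (result.exitCode !== 0) {
    return { pass: false, detail: `exit code ${result.exitCode}, expected 0` };
  }
  // Compare whole tokens so that "11.2.30" does not satisfy an expected "1.2.3".
  const tokens = result.stdout
    .trim()
    .split(/\s+/)
    .map((token) => token.replace(/^v/, ""));
  if (tokens.indexOf(expectedVersion) === -1) {
    return { pass: false, detail: `output "${result.stdout.trim()}" does not report ${expectedVersion}` };
  }
  return { pass: true, detail: `output reports ${expectedVersion}` };
}
```

The gate passes only when the observed behaviour matches; the shape of `dist/ricky.js` plays no part in the verdict.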

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW generation-quality.pattern-selection-deliberate (generation-quality/manual)

User message

Generate a workflow for many independent artifacts with a validation/fix/rerun loop.

Ricky output

**Workflow: “Batch‑Artifact Validation‑Fix‑Rerun Loop”**  
*Purpose*: Process a large collection of independent artifacts (e.g., data files, code modules, model checkpoints) through a deterministic validation pipeline, automatically apply fix‑scripts where possible, and rerun until all artifacts are either **PASS** or **BLOCKED**. The workflow is scoped to a single branch (`batch‑validation`) and produces a single PR that aggregates all results.

---

## 1. Repository Layout (contract)

/batch-validation/

├─ artifacts/ # source artifacts (read‑only)
│ ├─ artifact‑001.json
│ ├─ artifact‑002.json
│ └─ … # up to N files

├─ validators/ # deterministic validators (pure functions)
│ ├─ schema_validator.py
│ ├─ checksum_validator.py
│ └─ custom_rules.py

├─ fixers/ # idempotent fix scripts
│ ├─ normalize_dates.py
│ ├─ repair_checksum.py
│ └─ enrich_metadata.py

├─ workflow/
│ ├─ batch_validation.yml # GitHub Actions definition (this file)
│ ├─ contracts/
│ │ └─ artifact_contract.yaml # JSON‑Schema for a valid artifact
│ └─ scripts/
│ ├─ run_validators.py
│ ├─ app
...[truncated]


**Must**

- Choose `dag`, `supervisor`, or `pipeline` deliberately based on the work shape.
- Explain the pattern choice in artifact metadata or a rationale.
- Use `dag` for validation/fix/rerun loops when dependencies matter.

**Must Not**

- Default blindly to `dag` for every workflow.
- Collapse independent artifact work into one vague agent task.
- Omit verification gates because the chosen pattern seems obvious.
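
The pattern choice can be made explicit and recorded alongside a rationale. Only the `dag` rule below comes from the list above; the `supervisor` and `pipeline` rules are assumptions added for illustration, and `WorkShape` and `choosePattern` are hypothetical names.

```typescript
type WorkflowPattern = "dag" | "supervisor" | "pipeline";

// Hypothetical description of the work, derived from the spec by the caller.
interface WorkShape {
  hasFixRerunLoop: boolean;
  hasCrossStepDependencies: boolean;
  independentArtifacts: number;
}

interface PatternChoice {
  pattern: WorkflowPattern;
  rationale: string;
}

function choosePattern(shape: WorkShape): PatternChoice {
  // Stated in the rubric: dag for validation/fix/rerun loops when dependencies matter.
  if (shape.hasFixRerunLoop && shape.hasCrossStepDependencies) {
    return { pattern: "dag", rationale: "validation/fix/rerun loop with dependencies between steps" };
  }
  // Assumption: independent artifacts fan out to separate workers under a supervisor.
  if (shape.independentArtifacts > 1) {
    return { pattern: "supervisor", rationale: `${shape.independentArtifacts} independent artifacts can be worked separately` };
  }
  // Assumption: everything else is treated as a linear pipeline.
  return { pattern: "pipeline", rationale: "single linear sequence without a rerun loop" };
}
```

Writing the returned `rationale` into artifact metadata keeps the choice reviewable rather than implicit.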

**Deterministic checks**

- PASS `status`: case executed
- PASS `maxToolCalls`: expected <= 0 tool calls, got 0

</details>

<details>
<summary><strong>REVIEW</strong> <code>generation-quality.skill-boundary-copy</code> (generation-quality/manual)</summary>

**User message**

> Describe how selected workflow-writing skills affected a generated Ricky workflow.

**Ricky output**

```text
### How Selected Workflow‑Writing Skills Shaped the Generated Ricky Workflow  

| Skill | What the Skill Enforces | Direct Effect on the Generated Workflow |
|-------|------------------------|------------------------------------------|
| **Concrete Contract Definition** | Every step must declare its **input schema**, **output schema**, and **pre‑/post‑conditions** in a machine‑readable JSON‑Schema block. | The workflow begins with a `contract` block for the entry node, and each downstream node repeats this pattern. This makes the data flow explicit, enables automated type‑checking, and prevents “black‑box” steps that later cause runtime mismatches. |
| **Deterministic Verification Gates** | After each logical group of steps, insert a **verification gate** that runs a pure function (no side‑effects) to assert invariants. | The generated workflow contains `gate: verify‑<name>` nodes that call a `validate_<name>()` function. If the gate fails, the engine aborts the branch and surfaces a **blocker report** (see “Blocker Reporting”). This guarantees early detection of data corruption or policy violations. |
| **Review Artifacts as First‑Class Nodes** | Every major change must be accompa
...[truncated]

Must

  • Describe skills as generation-time selection, loading, and rendering inputs.
  • Point to metadata such as loaded-skills.txt and skill-application-boundary.json.
  • Say the workflow instructions were informed by selected skills.

Must Not

  • Say runtime agents are skill-embedded, skill-powered at runtime, or embody skills unless runtime skill loading is implemented and tested.
  • Treat metadata existence as proof that agents received skill bodies at runtime.
  • Overstate current tests beyond the generation boundary.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW generation-quality.no-pure-codegen-without-proof (generation-quality/manual)

User message

Generate a workflow from a vague product spec and return it to the user.

Ricky output

**Ricky – Workflow Authoring Assistant**

To turn a product specification into a reliable, review‑ready workflow we need a concrete set of inputs. With the current “vague” description we can’t generate a deterministic contract, verification gates, or review artifacts that meet our standards.

**Please provide the following details (feel free to answer as much as you have):**

| # | Required Information | Why It’s Needed |
|---|-----------------------|-----------------|
| 1 | **High‑level goal** – What problem does the product solve? | Defines the workflow’s primary outcome and success criteria. |
| 2 | **Key actors / roles** – Users, services, external systems involved. | Determines who/what triggers each step and who receives outputs. |
| 3 | **Core steps (rough order)** – Any known stages (e.g., “ingest data → validate → enrich → store”). | Gives us the skeleton to flesh out deterministic tasks. |
| 4 | **Inputs & outputs per step** – Data formats, APIs, files, messages. | Enables us to write precise contracts and verification gates. |
| 5 | **Non‑functional requirements** – Latency, throughput, reliability, security constraints. | Drives placement of retries, timeouts, and audit
...[truncated]

Must

  • Produce a Relay-native TypeScript workflow with explicit verification, review, and signoff.
  • Validate with dry-run or targeted structural checks where possible.
  • Return artifacts, warnings, and follow-up commands honestly.

Must Not

  • Act like Ricky is a pure code-generation bot that emits workflows without verification.
  • Stop at "code compiles" as the proof bar.
  • Skip skill-aware workflow authoring guidance for serious workflows.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.classify-before-retry (runtime-recovery/manual)

User message

A workflow failed after one step timed out and another worker stayed alive without producing artifacts. Explain what Ricky should do next.

Ricky output

**Ricky’s Immediate Action Plan**

| Phase | Action | Why | Deliverable |
|-------|--------|-----|-------------|
| 1️⃣ Detect & Log | • Pull the latest run‑metadata from the orchestration service (e.g., Airflow, Temporal, or internal scheduler). <br>• Identify the exact step that timed‑out (Step X) and the worker that remained alive (Worker Y). <br>• Capture logs, heartbeat timestamps, and any partial artifacts. | Provides a deterministic audit trail and isolates the root‑cause window. | `run‑<id>.json` + `step‑X‑timeout.log` + `worker‑Y‑heartbeat.log` |
| 2️⃣ Blocker Report | • Open a **Ricky‑Blocker** issue in the repo’s `#workflow‑incidents` channel. <br>• Title: *“[BLOCKER] Step X timeout & orphan Worker Y – run <id>”* <br>• Include the logs from Phase 1, a short TL;DR, and a **severity tag** (`high`). | Guarantees visibility to the product owner and the reliability team; prevents downstream merges that could hide the defect. | GitHub Issue #\<next‑num\> |
| 3️⃣ Automated Mitigation (deterministic gate) | • **Terminate** Worker Y via the runtime API (e.g., `kill_worker(Y)`). <br>• **Mark** Step X as *failed* in the run‑state DB. <br>• **Trigger** a *re‑run* of the failed step w
...[truncated]

Must

  • Classify the failure before deciding whether to retry.
  • Distinguish agent-runtime opacity, timeout, environment blockers, workflow-structure bugs, and validation-strategy mismatch.
  • Preserve raw evidence and uncertainty when the class is not known.

Must Not

  • Blindly rerun the whole workflow without a blocker classification.
  • Treat every failure as a broken workflow definition.
  • Claim Ricky fixed the workflow before rerun evidence proves it.
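
Classifying before retrying might be sketched as follows. The class names mirror the list above; the evidence fields, the two detection rules, and the "retry only on timeout" policy are assumptions made for the example, and `classifyStep` and `decideRecovery` are hypothetical names.

```typescript
type FailureClass =
  | "agent-runtime-opacity"
  | "timeout"
  | "environment-blocker"
  | "workflow-structure-bug"
  | "validation-strategy-mismatch"
  | "unknown";

// Hypothetical evidence record for one step of a failed run.
interface StepEvidence {
  stepId: string;
  timedOut: boolean;
  workerAlive: boolean;
  artifactsProduced: boolean;
  rawLogPath: string;
}

interface RecoveryDecision {
  stepId: string;
  failureClass: FailureClass;
  retry: boolean;
  evidence: string; // raw evidence is kept even when the class is unknown
}

function classifyStep(e: StepEvidence): FailureClass {
  if (e.timedOut) return "timeout";
  // A live worker with no artifacts says nothing about the workflow definition itself.
  if (e.workerAlive && !e.artifactsProduced) return "agent-runtime-opacity";
  return "unknown";
}

function decideRecovery(evidence: StepEvidence[]): RecoveryDecision[] {
  return evidence.map((e) => {
    const failureClass = classifyStep(e);
    // Assumed policy: only a classified timeout is retried automatically.
    return { stepId: e.stepId, failureClass, retry: failureClass === "timeout", evidence: e.rawLogPath };
  });
}
```

For the scenario in the user message, the timed-out step and the silent worker get different classes, and neither is treated as a broken workflow definition.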

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.stale-relay-state (runtime-recovery/manual)

User message

Ricky detects stale .agent-relay/, .relay/, and .trajectories/ state before launching a local workflow.

Ricky output

**Workflow: Detect and Guard Against Stale Relay State Before a Local Run**  

| **Stage** | **Purpose** | **Inputs** | **Outputs** | **Deterministic Gate** | **Owner** |
|-----------|-------------|------------|-------------|------------------------|-----------|
| 1️⃣ Pre‑flight Scan | Walk the workspace and flag any of the three stale directories. | `repo_root/` (working tree) | `scan_report.json` – list of stale paths, timestamps, size, and a boolean `stale_detected`. | **Gate‑A** – `scan_report.stale_detected === false` → proceed; otherwise block. | `ricky‑detect‑stale` |
| 2️⃣ User Prompt (optional) | If stale data is found, ask the operator whether to clean, archive, or abort. | `scan_report.json` | `user_decision.json` – `{action: "clean"|"archive"|"abort"}` | **Gate‑B** – `user_decision.action !== "abort"` → continue; else terminate with blocker report. | `ricky‑ui‑prompt` |
| 3️⃣ Cleanup / Archive | Execute the chosen remediation. | `user_decision.json` + stale paths | `cleanup_log.txt` – deterministic log of file removals or archive tarball. | **Gate‑C** – verify that all paths listed in `scan_report` are either removed or archived and that the filesystem now contains no `
...[truncated]

Must

  • Classify stale local runtime state as an environment contamination issue.
  • Recommend quarantine or isolated-run guidance before launch.
  • Record the observed paths and the action taken or recommended.

Must Not

  • Treat stale runtime state as a workflow logic failure.
  • Delete or overwrite state without an explicit safe path or user intent.
  • Continue into execution as if the workspace were clean.
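
A pre-launch check along these lines would report stale state without touching it. This is a sketch: the directory names come from the user message, while `PreflightReport`, `preflightStaleState`, and the injected `isStale` predicate (which stands in for whatever staleness test is actually used) are hypothetical.

```typescript
const RELAY_STATE_DIRS = [".agent-relay", ".relay", ".trajectories"];

interface PreflightReport {
  clean: boolean;
  blockerClass: "environment-contamination" | null;
  observedPaths: string[];
  recommendation: string;
}

// `isStale` is supplied by the caller; this function only reports and never deletes.
function preflightStaleState(isStale: (dir: string) => boolean): PreflightReport {
  const observedPaths = RELAY_STATE_DIRS.filter((dir) => isStale(dir));
  if (observedPaths.length === 0) {
    return {
      clean: true,
      blockerClass: null,
      observedPaths,
      recommendation: "workspace is clean; launch can proceed",
    };
  }
  return {
    clean: false,
    // Stale state is an environment problem, not a workflow logic failure.
    blockerClass: "environment-contamination",
    observedPaths,
    recommendation: "quarantine the listed directories or use an isolated run before launching",
  };
}
```

Because the function only returns a report, any cleanup remains an explicit, user-approved action.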

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.already-running-conflict (runtime-recovery/manual)

User message

A run marker says another Ricky or Relay run is already active in this workspace.

Ricky output

**How to resolve “another Ricky or Relay run is already active in this workspace”**

The marker file is a safety lock that prevents two concurrent executions of the same workflow. When it’s left behind (e.g., after a crash, forced termination, or a stray background process) you’ll see the error you described. Follow the deterministic steps below to clear the lock safely and get back to a clean state.

---

## 1️⃣ Verify the workspace state

| Action | Command | Expected result |
|--------|---------|-----------------|
| List active processes that may be holding the lock | `ps aux \| grep -E "(ricky|relay)"` | No `ricky`/`relay` processes **except** your current shell |
| Show the lock file | `ls -l .ricky/run.marker` | File exists, size > 0, timestamp ≈ time of the last run |

If a process is still running, **skip step 2** and terminate it first (see “Kill stray process” below).

---

## 2️⃣ Kill stray process (if any)

```bash
# Find the PID from the previous ps command
kill -SIGTERM <PID>
# If it does not exit within 5 s
kill -SIGKILL <PID>
```

After killing, re‑run `ps aux | grep -E "(ricky|relay)"` to confirm the process is gone.

---

## 3️⃣ Remove the stale lock file

...[truncated]

Must

  • Report the active marker, run id, or status path when available.
  • Ask the user to inspect, wait for, or explicitly clear the active run.
  • Avoid launching a competing run that could corrupt evidence.

Must Not

  • Silently start another run.
  • Hide the existing run marker from the user.
  • Treat the conflict as a generic failure with no recovery path.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.env-loader-injection-runtime-loadable (runtime-recovery/manual)

User message

A workflow artifact references a MISSING_ENV_VAR value. Ricky's deterministic auto-fix injects the .env.local / .env loader (loadRickyWorkflowEnv) and the optional assertRickyWorkflowEnv guard into the artifact before retry. The artifact may be a master-rendered workflow whose .step({ command: ... }) bodies embed node --input-type=module HEREDOCs containing literal import { ... } from 'node:fs' / from 'node:path' strings.

Ricky output

**Deterministic auto‑fix for missing environment variables**

Below is the exact, reproducible patch you can apply to any workflow artifact that currently fails because it references a `MISSING_ENV_VAR`. The fix injects the **Ricky environment loader** (`loadRickyWorkflowEnv`) and the **optional guard** (`assertRickyWorkflowEnv`) **once**, at the top of the generated file, and then re‑exports the original workflow unchanged. This guarantees that every step – including those that embed a `node --input-type=module` HEREDOC with literal `import … from 'node:fs'` / `import … from 'node:path'` strings – runs with the required variables available.

---

## 1️⃣ Contract – What the patch guarantees

| ✅ | Guarantee |
|---|------------|
| **Deterministic** | The same source → same patched output (no random IDs). |
| **Idempotent** | Re‑applying the patch on an already‑fixed artifact is a no‑op. |
| **Scope‑limited** | Only the top‑level module is touched; step bodies remain byte‑for‑byte. |
| **Verification gate** | After patch, a **pre‑run validation** (`assertRickyWorkflowEnv`) fails fast if any required env var is still missing. |
| **Rollback‑safe** | The original artifact is preserved 
...[truncated]

Must

  • Produce a repaired artifact that successfully loads under Node, not just one that contains the marker comment. The injected loadRickyWorkflowEnv body references rickyWorkflowFs.* and rickyWorkflowPath.*, so the repair must also add the corresponding import * as rickyWorkflowFs from 'node:fs' and import * as rickyWorkflowPath from 'node:path' aliases at module top level.
  • Detect existing alias imports by matching real top-level import * as <alias> from '<module>' statements, not by substring-matching the module specifier anywhere in the file (substrings inside HEREDOCs in .step({ command: ... }) bodies do not count as imports).
  • Leave the embedded shell HEREDOC contents untouched so the runtime-spawned child process still sees the literal import lines it expects.

Must Not

  • Skip adding the rickyWorkflowFs / rickyWorkflowPath aliases because from 'node:fs' or from 'node:path' already appears somewhere in the file as a string literal.
  • Inject loadRickyWorkflowEnv (or assertRickyWorkflowEnv) without the supporting alias imports, which produces a ReferenceError: rickyWorkflowPath is not defined at module load and burns the auto-fix budget on UNSUPPORTED_RUNTIME at runtime-launch.
  • Rewrite or escape the embedded HEREDOC text in step commands.
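
The difference between substring matching and matching a real import statement is easy to show. The sketch below is not the repair code from this PR series; `hasTopLevelNamespaceImport` and `ensureAliasImports` are hypothetical helpers, and a line-anchored regular expression is only an approximation of "top level" (a parser would be more robust). The alias and module names are the ones named in the rubric above.

```typescript
// Escape a string for literal use inside a RegExp.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True only for a whole-line `import * as <alias> from '<module>'` statement.
function hasTopLevelNamespaceImport(source: string, alias: string, moduleName: string): boolean {
  const pattern = new RegExp(
    `^import\\s+\\*\\s+as\\s+${escapeRegExp(alias)}\\s+from\\s+['"]${escapeRegExp(moduleName)}['"]\\s*;?\\s*$`,
    "m",
  );
  return pattern.test(source);
}

const REQUIRED_ALIASES = [
  { alias: "rickyWorkflowFs", moduleName: "node:fs" },
  { alias: "rickyWorkflowPath", moduleName: "node:path" },
];

// Prepends any missing alias imports; the original text is left byte-for-byte intact.
function ensureAliasImports(source: string): string {
  const missing = REQUIRED_ALIASES.filter(
    (entry) => !hasTopLevelNamespaceImport(source, entry.alias, entry.moduleName),
  );
  if (missing.length === 0) return source;
  const header = missing
    .map((entry) => `import * as ${entry.alias} from '${entry.moduleName}';`)
    .join("\n");
  return `${header}\n${source}`;
}

// An artifact whose step command embeds a HEREDOC containing a literal import line.
const artifact = [
  "workflow('demo')",
  "  .step({ command: `node --input-type=module <<'EOF'",
  "import { readFileSync } from 'node:fs'",
  "EOF` })",
].join("\n");

const patched = ensureAliasImports(artifact);
```

Here `artifact` contains the text `from 'node:fs'`, so a substring check would wrongly skip the alias; the statement-level check adds both imports, and running `ensureAliasImports` on `patched` again changes nothing.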

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.auto-fix-bounded-loop (runtime-recovery/manual)

User message

Run a local workflow with auto-fix enabled. The first attempt fails, the workflow artifact is repairable, and the failed step plus previous run id are available.

Ricky output

**Local Workflow Execution – Auto‑Fix Run**

Below is a deterministic, step‑by‑step contract you can copy‑paste into your terminal (or CI script) to run the workflow, trigger the auto‑fix path on the first failure, and verify the repaired artifact. The contract assumes you have the **AgentWorkforce CLI** (`aw`) installed and that the failed step name and the previous run‑ID are known.

---

## 1️⃣ Preconditions (verify before you start)

| Check | Command | Expected outcome |
|-------|---------|------------------|
| `aw` binary reachable | `which aw` | Path printed, e.g. `/usr/local/bin/aw` |
| Current directory is the workflow repo root | `git rev-parse --show-toplevel` | Prints repo root |
| Workflow definition exists | `ls .aw/workflows/<workflow_name>.yaml` | File present |
| Auto‑fix feature enabled in config | `cat .aw/config.yaml \| grep auto_fix` | `auto_fix: true` |
| Previous run metadata available | `aw run show <prev_run_id> --json` | JSON with `status: "failed"` and `failed_step: "<step_name>"` |

If any check fails, fix it **before** proceeding.

---

## 2️⃣ Execution contract

```bash
#!/usr/bin/env bash
set -euo pipefail   # deterministic failure handling
IFS=$'\n\t
...[truncated]

Must

  • Use a bounded retry budget and summarize every attempt.
  • Ask the Workforce workflow persona to repair the workflow artifact when a resolvable artifact exists.
  • Resume from the failed step with the previous run id when those values are available.

Must Not

  • Edit arbitrary repository source files as the default auto-fix surface.
  • Keep retrying after the configured max attempts.
  • Lose the single Ricky tracking run id across repair/resume attempts.
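
A bounded repair-and-resume loop that satisfies the items above can be quite small. In this sketch `runWithAutoFix`, `runOnce`, and `repairArtifact` are hypothetical: the two callbacks stand in for launching the workflow and for asking the workflow persona to repair the artifact, and are synchronous only to keep the example short.

```typescript
interface ResumePoint {
  step: string;
  previousRunId: string;
}

interface AttemptResult {
  ok: boolean;
  runId: string;
  failedStep?: string;
}

interface AttemptSummary {
  attempt: number;
  runId: string;
  ok: boolean;
  resumedFrom?: ResumePoint;
}

interface AutoFixOutcome {
  trackingRunId: string; // one tracking id for the whole repair/resume sequence
  ok: boolean;
  attempts: AttemptSummary[];
}

function runWithAutoFix(
  trackingRunId: string,
  maxAttempts: number,
  runOnce: (resume?: ResumePoint) => AttemptResult,
  repairArtifact: (failedStep: string) => boolean,
): AutoFixOutcome {
  const attempts: AttemptSummary[] = [];
  let resume: ResumePoint | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = runOnce(resume);
    // Every attempt is summarised, whether it passed or failed.
    attempts.push({ attempt, runId: result.runId, ok: result.ok, resumedFrom: resume });
    if (result.ok) return { trackingRunId, ok: true, attempts };
    const budgetLeft = attempt < maxAttempts;
    // Stop when the budget is spent, the failed step is unknown, or the repair fails.
    if (!budgetLeft || !result.failedStep || !repairArtifact(result.failedStep)) break;
    // Resume from the failed step, carrying the previous run id forward.
    resume = { step: result.failedStep, previousRunId: result.runId };
  }
  return { trackingRunId, ok: false, attempts };
}
```

With `maxAttempts` set to 1 the loop makes a single attempt and never calls `repairArtifact`, which is the behaviour expected under `--no-auto-fix`.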

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.no-auto-fix-preserves-single-attempt (runtime-recovery/manual)

User message

A user runs ricky run workflows/foo.ts --no-auto-fix and the workflow fails.

Ricky output

### Quick‑Start Debug Checklist for a Failed `ricky run workflows/foo.ts --no-auto-fix`

| Step | Action | Expected Outcome | How to Verify |
|------|--------|------------------|---------------|
| 1️⃣ | **Collect the raw run output** | Full console dump, including the *Ricky Runtime Log* and any *Workflow‑Specific* messages. | `ricky run workflows/foo.ts --no-auto-fix --log-level=debug > run.log 2>&1` |
| 2️⃣ | **Locate the failure gate** | Identify which **deterministic verification gate** (e.g., `pre‑run`, `schema‑validate`, `resource‑check`, `post‑run`) raised the error. | Search `run.log` for `=== GATE:` lines; the line immediately after will contain the error message. |
| 3️⃣ | **Inspect the contract violation** | The gate will reference a **workflow contract** (e.g., `FooInputSchema`, `FooOutputSchema`, `ResourceQuota`). | Open the contract file under `contracts/` and compare the expected shape to the actual payload shown in the log. |
| 4️⃣ | **Run the failing gate in isolation** | Re‑execute only the problematic gate to get a focused error report. | `ricky gate run contracts/FooInputSchema --input ./tmp/foo-input.json` |
| 5️⃣ | **Validate the input data** | Ensure the JSON
...[truncated]

Must

  • Preserve one-attempt behavior when auto-fix is disabled.
  • Return the classified blocker, diagnosis, recovery steps, and non-zero exit code.
  • Make clear that the user chose manual inspection over repair/resume automation.

Must Not

  • Start a repair loop despite --no-auto-fix.
  • Suppress the diagnosis because no repair was attempted.
  • Present the failure as a completed repair attempt.

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.in-process-local-runner (runtime-recovery/manual)

User message

Explain how Ricky should execute a local TypeScript workflow artifact in the primary local path.

Ricky output

**Ricky’s Standard Procedure for Executing a Local TypeScript Workflow Artifact**

Below is the concrete, repeatable contract that Ricky follows whenever a developer asks to run a TypeScript‑based workflow that lives in the *primary local path* (i.e. `./workflows/` at the repository root). The steps are written as a deterministic “run‑book” that can be copied into a PR description or a CI job script.

---

## 1️⃣ Preconditions (Verification Gate #1)

| Check | Command / Action | Expected Result |
|-------|------------------|-----------------|
| **Repo root** | `git rev-parse --show-toplevel` | Prints the absolute path to the repository root. |
| **Node version** | `node -v` | `>=18.x` (LTS) |
| **npm / pnpm / yarn** | `npm -v` (or `pnpm -v` / `yarn -v`) | Matches the lockfile (`package-lock.json`, `pnpm-lock.yaml`, or `yarn.lock`). |
| **TypeScript compiler** | `npx tsc --version` | `>=5.0` |
| **Workflow artifact exists** | `ls ./workflows/*.ts` | At least one `.ts` file is present. |
| **No stale build artifacts** | `git status --porcelain` | Clean working tree (no uncommitted compiled `.js` files). |

If any gate fails, **Ricky reports a blocker** with the exact command output a
...[truncated]

Must

  • Prefer the Node strip-types route or equivalent SDK/programmatic route over requiring the agent-relay binary on PATH.
  • Precheck that Node and @agent-relay/sdk are resolvable for the workflow.
  • Record the actual spawn command in execution evidence.

Must Not

  • Fail solely because agent-relay is not on PATH when the SDK route is available.
  • Hide the actual runtime command from evidence.
  • Conflate the user-facing reproduction command with the primary internal spawn route.
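
The precheck and evidence requirements above might be expressed as a small planning step that runs before anything is spawned. This is a sketch under stated assumptions: `SpawnPrecheck`, `SpawnPlan`, and `planLocalSpawn` are hypothetical, and the command uses Node's `--experimental-strip-types` flag as one possible form of the strip-types route; the command Ricky actually spawns is not shown in this report.

```typescript
// Hypothetical precheck results gathered by the caller before launch.
interface SpawnPrecheck {
  nodeAvailable: boolean;
  sdkResolvable: boolean; // whether @agent-relay/sdk resolves for the workflow
  relayBinaryOnPath: boolean; // recorded as evidence only
}

interface SpawnPlan {
  ok: boolean;
  command: string[]; // the exact command to record in execution evidence
  blocker?: string;
  notes: string[];
}

function planLocalSpawn(workflowPath: string, precheck: SpawnPrecheck): SpawnPlan {
  const notes: string[] = [];
  if (!precheck.relayBinaryOnPath) {
    // Not a failure: the SDK route does not need the binary on PATH.
    notes.push("agent-relay binary not on PATH; not required for the SDK route");
  }
  if (!precheck.nodeAvailable) {
    return { ok: false, command: [], blocker: "Node is not available", notes };
  }
  if (!precheck.sdkResolvable) {
    return { ok: false, command: [], blocker: "@agent-relay/sdk is not resolvable for this workflow", notes };
  }
  // Assumed command form for the strip-types route.
  return { ok: true, command: ["node", "--experimental-strip-types", workflowPath], notes };
}
```

Storing `command` verbatim in the run evidence keeps the internal spawn route distinct from whatever reproduction command is shown to the user.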

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.escalation-is-not-generic-failure (runtime-recovery/manual)

User message

Ricky reaches a boundary after a structural failure persists after a fix attempt.

Ricky output

**Ricky – Boundary‑Reached Protocol for Persistent Structural Failure**

When a structural failure remains after an attempted fix, Ricky must treat the situation as a *hard boundary* and initiate a deterministic, auditable remediation cycle. The following contract‑driven workflow guarantees traceability, rapid escalation, and clear hand‑off to human owners.

---

## 1. Preconditions (Gate A)

| Condition | Source | Verification |
|-----------|--------|--------------|
| **Failure ID** exists in the *Failure Registry* (FR‑<UUID>) | `fr.register()` | `fr.exists(id)` |
| **Fix Attempt** logged with status **“applied”** and a **timestamp** | `fix.log(attempt)` | `fix.status(attempt) == applied` |
| **Post‑Fix Test Suite** executed (≥ 3 independent checks) | `test.run()` | `test.passed >= 3` |
| **Failure persists** (error signal re‑emitted) | `monitor.detect()` | `monitor.signal(id) == true` |

If any condition fails, abort the boundary protocol and return to the normal repair loop.

---

## 2. Deterministic Verification Gate B – “Re‑Validate”

1. **Re‑run Full Regression Suite**  
   `test.run(full=true)` → capture **pass‑rate** and **error logs**.  
2. **Collect Runtime Telemetry** (C
...[truncated]

Must

  • Escalate with the attempted fix, failed validation, classified blocker, and recommendation.
  • Distinguish escalation from a generic product failure.
  • Preserve enough context for a human operator to continue.

Must Not

  • Retry speculative fixes indefinitely.
  • Collapse to "something went wrong" without the attempted actions.
  • Discard evidence from failed repair attempts.
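
One way to picture the escalation rubric is a bounded repair loop that keeps every attempt and returns a typed escalation instead of a generic error. This is a sketch under assumptions: `RepairAttempt`, `Escalation`, `BlockerClass`, and `repairOrEscalate` are hypothetical names, and the "stop when the same validation output repeats" rule is an illustrative choice rather than documented Ricky behavior.

```typescript
type BlockerClass = "environment" | "structural" | "unknown"; // hypothetical classes

interface RepairAttempt {
  description: string;      // what the fix tried to change
  validationOutput: string; // output of the validation run after the fix
  passed: boolean;
}

interface Escalation {
  kind: "escalation";                 // distinct from a generic product failure
  attempts: readonly RepairAttempt[]; // evidence from failed repairs is preserved
  blocker: BlockerClass;
  recommendation: string;
}

type RepairOutcome =
  | { kind: "repaired"; attempts: readonly RepairAttempt[] }
  | Escalation;

function repairOrEscalate(
  tryFix: (attemptNumber: number) => RepairAttempt,
  maxAttempts: number,
  classify: (attempts: readonly RepairAttempt[]) => { blocker: BlockerClass; recommendation: string },
): RepairOutcome {
  const attempts: RepairAttempt[] = [];
  for (let n = 1; n <= maxAttempts; n++) {
    const attempt = tryFix(n);
    attempts.push(attempt);
    if (attempt.passed) return { kind: "repaired", attempts };
    // Stop early when a fix fails exactly like the previous one:
    // further speculative retries are unlikely to help.
    const previous = attempts[attempts.length - 2];
    if (previous !== undefined && previous.validationOutput === attempt.validationOutput) break;
  }
  return { kind: "escalation", attempts, ...classify(attempts) };
}
```

With a fix that fails identically every time, the loop stops after the second attempt even if `maxAttempts` is larger, and the returned escalation still carries both attempts for a human operator.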

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0

REVIEW runtime-recovery.analytics-from-structured-evidence (runtime-recovery/manual)

User message

Produce a workflow health digest from many Ricky workflow runs.

Ricky output

**Ricky Workflow Health Digest – Summary Report**  
*(Aggregated from all Ricky workflow runs for the selected period)*  

| Metric | Value | Target / SLA | Status |
|--------|-------|--------------|--------|
| **Total Runs** | 1 842 | — | ✅ |
| **Successful Runs** | 1 679 | ≥ 95 % success | ✅ |
| **Failed Runs** | 163 | ≤ 5 % failure | ⚠️ |
| **Runs with Blockers** | 27 | 0 | ❌ |
| **Average Runtime** | 4 m 23 s | ≤ 5 m | ✅ |
| **95th‑Percentile Runtime** | 7 m 12 s | ≤ 10 m | ✅ |
| **Deterministic Verification Passes** | 1 815 | 100 % | ✅ |
| **Verification Gate Failures** | 27 | 0 | ❌ |
| **Review Artifacts Produced** | 1 842 | — | ✅ |
| **Average Review Cycle (hrs)** | 2.1 | ≤ 4 h | ✅ |
| **Loops Executed (80‑100 % validation)** | 1 842 × 92 ≈ 169 k | — | ✅ |
| **Scoped Branches / PRs Created** | 1 842 | — | ✅ |
| **Open PRs (stale > 48 h)** | 3 | ≤ 5 | ✅ |
| **Mean Time to Recovery (MTTR) – failures** | 1 h 14 m | ≤ 2 h | ✅ |

---

### 1. Success / Failure Breakdown
| Outcome | Count | % of Total |
|---------|-------|------------|
| ✅ Success (all gates passed) | 1 679 | 91.2 % |
| ⚠️ Partial success (passed but with non‑critical warnings) | 0 | 0 % |
| ❌ Failure (gate or bloc
...[truncated]

Must

  • Consume normalized WorkflowRunEvidence rather than raw logs as the primary input.
  • Identify recurring failure classes, weak validation, oversized steps, and runtime duration patterns.
  • Produce concrete recommendations tied to specific workflows, steps, or metrics.

Must Not

  • Mutate evidence while analyzing it.
  • Return generic advice like "improve your workflow" without references.
  • Mix environment failures and workflow-logic failures into one undifferentiated bucket.
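
As a rough illustration of the rubric, a digest can be derived from read-only evidence records while keeping failure classes apart. The field names on `WorkflowRunEvidence` below are assumptions (only the type name appears in the rubric), and `HealthDigest` and `buildDigest` are hypothetical.

```typescript
type FailureClass = "environment" | "workflow-logic"; // hypothetical split

// Assumed shape; only the name WorkflowRunEvidence comes from the rubric.
interface WorkflowRunEvidence {
  readonly workflowId: string;
  readonly status: "passed" | "failed";
  readonly failureClass?: FailureClass;
  readonly failedStep?: string;
  readonly durationMs: number;
}

interface RecurringFailure { workflowId: string; step: string; count: number; }

interface HealthDigest {
  totalRuns: number;
  successRatePct: number; // rounded to one decimal place
  meetsTarget: boolean;
  failuresByClass: Record<FailureClass, number>;
  recurringFailures: RecurringFailure[]; // tied to a specific workflow and step
}

function buildDigest(runs: readonly WorkflowRunEvidence[], targetPct: number): HealthDigest {
  const failuresByClass: Record<FailureClass, number> = { environment: 0, "workflow-logic": 0 };
  const byStep = new Map<string, RecurringFailure>();
  let passed = 0;

  for (const run of runs) { // reads evidence only; never writes to it
    if (run.status === "passed") { passed++; continue; }
    if (run.failureClass !== undefined) failuresByClass[run.failureClass]++;
    const step = run.failedStep ?? "(unknown step)";
    const key = `${run.workflowId}#${step}`;
    const entry = byStep.get(key) ?? { workflowId: run.workflowId, step, count: 0 };
    entry.count++;
    byStep.set(key, entry);
  }

  const successRatePct = runs.length === 0 ? 0 : Math.round((passed / runs.length) * 1000) / 10;
  return {
    totalRuns: runs.length,
    successRatePct,
    meetsTarget: successRatePct >= targetPct,
    failuresByClass,
    recurringFailures: Array.from(byStep.values())
      .filter((e) => e.count > 1)
      .sort((a, b) => b.count - a.count),
  };
}
```

Fed the run counts quoted in the digest above (1 679 passing out of 1 842), this sketch reports a success rate of 91.2 % and `meetsTarget: false` against a 95 % target.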

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0

REVIEW surfaces-ingress.co-equal-surfaces (surfaces-ingress/manual)

User message

Design a new Ricky Slack surface for workflow debugging.

Ricky output

**Ricky Slack Surface – “Workflow Debugger”**

---

## 1. Overview
The **Workflow Debugger** Slack surface gives engineers a deterministic, low‑latency UI for inspecting, stepping through, and fixing Ricky‑managed workflows directly from a Slack channel or DM. It follows the Ricky repository conventions (AGENTS.md) and the shared authoring rules:

| Aspect | Decision |
|--------|----------|
| **Entry point** | `/ricky debug <workflow‑id>` (slash command) |
| **Response format** | Block Kit messages with interactive components (buttons, select menus, date‑pickers) |
| **State persistence** | All debug sessions stored in `ricky/debug-sessions` (JSON schema, versioned) |
| **Security** | OAuth scopes: `commands`, `chat:write`, `chat:write.public`, `workflow:read`, `workflow:write` |
| **Testing** | 80‑100 automated validation loops (unit, integration, contract, end‑to‑end) |
| **Release gate** | Deterministic verification gate `debug-surface‑gate` (see §4) |
| **Blocker policy** | Any failure in contract validation or security scan is reported as a blocker in the PR checklist. |

---

## 2. User‑Facing Contract (Slack API)

### 2.1 Slash Command

/ricky debug

* **
...[truncated]

Must

  • Treat Slack as a co-equal product surface, not a wrapper around CLI.
  • Route domain work through the same normalization, executor, evidence, and specialist stages.
  • Keep Slack-specific formatting, thread handling, and interactive components in the surface layer.

Must Not

  • Put workflow generation or diagnosis domain logic directly in the Slack handler.
  • Degrade Slack to a developer shortcut with weaker routing than CLI.
  • Skip signature verification, dedup, or thread handling when the surface is implemented.
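
To make the layering concrete, here is a hedged sketch of a Slack ingress that does only surface work (signature check, dedup, parsing, thread bookkeeping) and hands a surface-agnostic request to the shared pipeline. All names (`DebugRequest`, `SlackCommand`, `createSlackIngress`, `deliveryId`) are hypothetical, and signature verification is injected as a function rather than implemented here.

```typescript
// Surface-agnostic request passed on to the shared stages (hypothetical shape).
interface DebugRequest {
  intent: "debug";
  workflowId: string;
}

// Slack-only details stay in the surface layer so replies land in the right thread.
interface SlackReplyTarget {
  channel: string;
  threadTs: string | null;
}

interface SlackCommand {
  deliveryId: string; // hypothetical dedup key for retried deliveries
  rawBody: string;
  signature: string;
  text: string;       // text after the slash command, e.g. "debug wf-42"
  channel: string;
  threadTs: string | null;
}

type IngressResult =
  | { accepted: true; request: DebugRequest; replyTo: SlackReplyTarget }
  | { accepted: false; reason: "bad-signature" | "duplicate" | "unparseable" };

function createSlackIngress(verify: (rawBody: string, signature: string) => boolean) {
  const seen = new Set<string>();

  return function ingest(cmd: SlackCommand): IngressResult {
    if (!verify(cmd.rawBody, cmd.signature)) return { accepted: false, reason: "bad-signature" };
    if (seen.has(cmd.deliveryId)) return { accepted: false, reason: "duplicate" };
    seen.add(cmd.deliveryId);

    const workflowId = /^debug\s+(\S+)$/.exec(cmd.text.trim())?.[1];
    if (workflowId === undefined) return { accepted: false, reason: "unparseable" };

    // No diagnosis or generation logic here: only translation into the shared request.
    return {
      accepted: true,
      request: { intent: "debug", workflowId },
      replyTo: { channel: cmd.channel, threadTs: cmd.threadTs },
    };
  };
}
```

The returned `request` carries no Slack fields, so the downstream stages can treat it the same way as a request that originated from the CLI.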

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0

REVIEW surfaces-ingress.normalizer-is-only-translation-boundary (surfaces-ingress/manual)

User message

Add a new web handoff type that submits a workflow spec and mode.

Ricky output

**New Hand‑off Type:** `web_submit_spec`

| Element | Value |
|---------|-------|
| **Name** | `web_submit_spec` |
| **Description** | Accepts a JSON‑encoded workflow specification together with an execution *mode* (`preview` \| `run`). The hand‑off validates the spec, stores it in the repository, and triggers the appropriate CI gate. |
| **Input Payload** | `{ "spec": { …workflow‑spec… }, "mode": "preview\|run" }` |
| **Output** | `{ "status": "accepted\|rejected", "run_id": "<uuid>", "message": "<human readable>" }` |
| **Side‑effects** | • Writes the spec to `workflows/<run_id>.json` in the `handoff/` branch.<br>• Opens a PR against `main` (or the target branch) with the spec file and a generated `README.md` summary.<br>• Triggers the deterministic verification gate (see below). |
| **Error Codes** | `400` – malformed JSON / missing fields.<br>`422` – spec fails schema validation.<br>`409` – duplicate `run_id`.<br>`500` – internal processing error. |

---

## 1. Contract & Schema

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "WebSubmitSpecPayload",
  "type": "object",
  "required": ["spec", "mode"],
  "properties": {
    "spec": { "
...[truncated]

Must

  • Add a handoff variant that normalizes into LocalInvocationRequest or CloudGenerateRequest.
  • Keep auth/session validation in the web surface before normalization.
  • Ensure downstream code does not need to know the request came from web.

Must Not

  • Short-circuit directly from the web handler to executors.
  • Create a parallel domain model for web requests.
  • Make the web surface the owner of local or Cloud routing semantics.
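
The rubric describes a single translation boundary. A minimal sketch of that idea follows, under stated assumptions: `LocalInvocationRequest` and `CloudGenerateRequest` are named in the rubric but their fields here are invented, and `SpecHandoff`, `acceptWebSubmission`, `normalize`, and the `"local" | "cloud"` mode values are hypothetical.

```typescript
// Field layouts are assumptions; only the two type names come from the rubric.
interface LocalInvocationRequest { kind: "local"; spec: string; }
interface CloudGenerateRequest { kind: "cloud"; spec: string; }
type NormalizedRequest = LocalInvocationRequest | CloudGenerateRequest;

// One handoff model shared by every surface; "web" is just another variant.
interface SpecHandoff {
  surface: "cli" | "web";
  spec: string;
  mode: "local" | "cloud";
}

// Web surface: session validation happens here, before normalization.
function acceptWebSubmission(
  sessionValid: boolean,
  body: { spec: string; mode: string },
): SpecHandoff | null {
  if (!sessionValid || body.spec.length === 0) return null;
  const mode = body.mode;
  if (mode === "local" || mode === "cloud") {
    return { surface: "web", spec: body.spec, mode };
  }
  return null;
}

// Normalizer: the only place a handoff becomes a domain request.
// The result deliberately carries no `surface` field.
function normalize(handoff: SpecHandoff): NormalizedRequest {
  return handoff.mode === "local"
    ? { kind: "local", spec: handoff.spec }
    : { kind: "cloud", spec: handoff.spec };
}
```

Because `normalize` drops the `surface` field, a web submission and a CLI submission with the same spec and mode produce identical requests, which is the property the rubric asks downstream code to rely on.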

Deterministic checks

  • PASS status: case executed
  • PASS maxToolCalls: expected <= 0 tool calls, got 0

REVIEW surfaces-ingress.mcp-claude-context-is-metadata (surfaces-ingress/manual)

User message

A Claude session hands Ricky a workflow spec plus conversation context and asks Ricky to determine whether to generate, debug, restart, analyze, or coordinate.

Ricky output

**Decision Matrix for the Claude‑to‑Ricky Hand‑off**

| Claude Input | What Ricky Should Do |
|--------------|----------------------|
| **Co

---

_Comment truncated to stay within GitHub limits. Download the `ricky-eval-run` artifact for the full `human-review.md`._

@khaliqgant khaliqgant merged commit 4e8b94a into main May 10, 2026
2 checks passed
@khaliqgant khaliqgant deleted the fix/master-renderer-scoped-test-command branch May 10, 2026 17:51