Execute Tactus local dispatch and optimizer fixes #262

Merged
endymion merged 41 commits into develop from
feature/plx-07dc0d-execute-tactus-mcp-tool
May 1, 2026

endymion commented May 1, 2026

Summary

  • fixes execute_tactus optimizer/local procedure dispatch behavior
  • adds dashboard handling/tests for local procedure dispatch state
  • preserves async child-budget accounting at the runtime handle boundary
  • commits Kanbus records for the associated work

Verification

  • pytest MCP/tools/tactus_runtime/execute_test.py
  • pytest plexus/cli/shared/command_dispatch_test.py plexus/cli/shared/test_command_dispatch.py plexus/cli/shared/test_experiment_runner.py plexus/lambda/test_task_dispatcher.py
  • cd dashboard && npm run ci:typecheck
  • cd dashboard && npm test -- --runTestsByPath components/tests/ProcedureTask.optimizer-auth.test.tsx --runInBand
  • GitHub Actions run 25233097621 passed for prior commit; run 25233535492 is queued for latest commit.

endymion and others added 30 commits April 29, 2026 01:32
Introduce the single-tool Tactus runtime path with tracing, budget gating, long-running operation guards, and direct feedback lookup so Plexus can be exercised as a programmable MCP runtime.

Made-with: Cursor
Add cooperative Task cancellation checkpoints so report and procedure workers stop cleanly after execute_tactus handle cancellation marks dashboard work cancelled.

Made-with: Cursor
Keep cooperative report cancellation checks from turning unavailable Task refreshes into report failures, and relax the legacy mock expectation to allow intentional status polling.

Made-with: Cursor
Capture the cooperative cancellation CI fix and local verification in Kanbus for the execute_tactus handle task.

Made-with: Cursor
Bridge Tactus runtime events and Plexus API call progress to FastMCP Context notifications while preserving the final execute_tactus response envelope.

Made-with: Cursor
Require explicit child budgets for async execute_tactus work so dispatched evaluations, reports, and procedures remain attached to the parent runtime budget.

Made-with: Cursor
Apply propagated execute_tactus child budgets inside evaluation, report, and procedure workers so long-running child executions fail early when wallclock, depth, or known spend exceeds their allocation.

Made-with: Cursor
Align the runtime validation contract with explicit async budgets and broaden helper aliases so generated Tactus can use the advertised Plexus API surface directly.

Made-with: Cursor
- Replace the entire legacy MCP tool catalog (scorecard, score,
  evaluation, feedback, item, prediction, dataset, report, rubric_memory,
  etc.) with a single `execute_tactus` tool that exposes all Plexus
  functionality via the `plexus.*` Tactus runtime API
- Add `plexus.score.contradictions` for rubric vs. code consistency checks
- Add `score_rubric_consistency_check` option to `plexus.evaluation.run`
- Add `plexus.procedure.optimize` shortcut for launching the feedback
  alignment optimizer with standard parameters
- Add `plexus execute` CLI command for local Tactus snippet testing
- Rename `run_experiment` → `run_procedure` and
  `run_experiment_with_task_tracking` → `run_procedure_with_task_tracking`
  throughout the codebase to match domain terminology
- Delete `procedure_sop_agent`, `sop_agent_base`, `demo_ai_mcp_integration`,
  `model_config_examples`, and all associated tests (legacy LangGraph-based
  optimizer prototype; superseded by Tactus procedures)
- Remove SOPAgent routing from `procedure_executor.py`; only `class: Tactus`
  procedures are supported going forward
- Reorganise and expand Plexus documentation under `plexus/docs/` with
  topic-based subdirectories and new guides for the Tactus runtime API

Made-with: Cursor
- Add score.pull/update/test, feedback.latest_update, rubric_memory.*
  namespaces to PlexusRuntimeModule DIRECT_HANDLERS with full
  _default_* implementations
- Add _default_report_runner_sync for synchronous report execution
  needed by optimizer; route plexus.report.run(sync=true) through it
- Add --emit-id-file CLI option to plexus evaluate accuracy/feedback
  so _default_evaluation_runner can capture evaluation_id from
  background subprocess for handle tracking
- Construct and register PlexusRuntimeModule in procedure_executor.py
  so Tactus/Lua procedure code can call plexus.* directly
- Create rubric_memory_toolset.py: in-process MCP tools for
  plexus_rubric_memory_* sub-agent tools
- Replace legacy MCP tool calls in ScoreEditorToolset with direct
  _default_score_pull / _default_score_update calls
- Rewrite feedback_alignment_optimizer.yaml call_plexus_tool to use
  plexus.* APIs directly; batch evaluations via handle protocol;
  synchronous reports via sync=true; score pull via temp files
- Update execute_test.py and test_score_editor_toolset.py to mock
  new direct-call interfaces

Made-with: Cursor
…code storage

Closes plx-62b442, plx-51488a, plx-f804a6.
Updates plx-61c332, plx-07dc0d.
Adds plx-71ad53 (remaining L4 integration tests).

## execute_tactus contract hardening (plx-f804a6)

- Add `_truncate_envelope` helper: caps execute_tactus JSON responses at 40 K chars
  to prevent LLM context-window overflow from large evaluation / scorecard payloads.
- `BudgetGate.carve_child`: when the parent gate is effectively infinite (usd=inf,
  wallclock=inf — as in the embedded chat MCP context), auto-supply a generous default
  child budget instead of raising ChildBudgetRequired. Callers inside chat no longer
  need explicit `budget = { ... }` for async evaluation / procedure calls.
- `_default_score_update`: set `isFeatured: "false"` on new ScoreVersion records so
  optimizer-created versions are not featured by default.
- `_default_score_test`: remove erroneous lambda wrapper around coroutine, fixing
  asyncio awaitable error.
- `_default_score_pull`: write YAML and guidelines to temp files and return their paths
  so sandboxed Lua code can read them via File.read() without needing the io library.
- `_default_procedure_optimize`: dispatch optimizer via background daemon thread so
  the chat agent receives procedure_id immediately (~49 s) instead of blocking for hours.
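
The 40 K cap on response envelopes can be pictured with a minimal sketch like the one below. This is not the actual `_truncate_envelope` implementation, which this PR page does not show; the function name `truncate_envelope`, the stub fields (`truncated`, `original_chars`, `preview`), and the decision to return a preview are all assumptions made for illustration.

```python
import json

MAX_ENVELOPE_CHARS = 40_000  # cap described in the commit message


def truncate_envelope(envelope: dict, max_chars: int = MAX_ENVELOPE_CHARS) -> str:
    """Serialize an envelope; swap an oversized payload for a small stub."""
    text = json.dumps(envelope, default=str)
    if len(text) <= max_chars:
        return text
    # Hypothetical fallback shape: keep a short preview and flag the
    # truncation so the caller still receives valid JSON. The preview is
    # limited to a quarter of the cap because escaping it can double its size.
    stub = {
        "truncated": True,
        "original_chars": len(text),
        "preview": text[: max_chars // 4],
    }
    return json.dumps(stub)
```

Returning a small JSON stub, rather than slicing the serialized string, keeps the response parseable for whatever consumes the tool result.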

## Console chat fixed end-to-end (plx-61c332, plx-62b442)

- `chat_agent.tac` `extract_text`: handle Lupa userdata (Lua receives Plexus Python
  objects as `userdata`, not `table`) using pcall attribute access; checks
  response/content/message/text keys and indexed first element.
- Remove `MessageHistory.get()` auto-load: history now comes exclusively from
  `console_session_history` passed by the caller, preventing cross-turn context bleed
  that caused 300 K–667 K token overflows.
- Add `assistant.output` fallback with garbage filter: filters out Python model reprs
  like "UsageStats" and "output=None" that appeared when the LLM returned without
  tool use.
- `mcp_transport.py`: pass a permissive BudgetGate (usd=inf, wallclock=inf, depth=20,
  tool_calls=500) to execute_tactus in the embedded procedure MCP context.
- `builtin_procedures.py`: increase chat agent max_tokens 220→1024, reasoning_effort
  low→medium; add explicit usage examples for evaluation.run (with budget),
  procedure.optimize (with budget), and evaluation.find_recent (with evaluation_type).

## S3-backed procedure code storage (plx-07dc0d)

- `service.py`: on procedure creation, upload YAML as `code.tac` to S3 and store the
  key in `procedure.metadata["code_s3_key"]`. On load, check S3 before falling back
  to template. Prevents DynamoDB 400 KB item limit from blocking large optimizer YAMLs.
- `s3_utils.py`: add `upload_procedure_file` and `download_procedure_code` helpers.
- `procedure.py` (model): add `metadata` field to Procedure GraphQL model.
- `resource.ts`: add `metadata` field to Procedure Amplify schema.
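
The load order described for `service.py` (S3 first, template as fallback) might look roughly like the sketch below. The downloader is passed in as a parameter so the example runs without AWS access; `load_procedure_code` and the `download` callable are hypothetical stand-ins, not the real `download_procedure_code` signature, which is not shown here.

```python
from typing import Callable, Optional


def load_procedure_code(
    metadata: Optional[dict],
    download: Callable[[str], Optional[str]],
    template: str,
) -> str:
    """Prefer code stored under metadata['code_s3_key']; else use the template."""
    key = (metadata or {}).get("code_s3_key")
    if key:
        code = download(key)  # assumed to return None when the object is missing
        if code is not None:
            return code
    return template
```

With a plain dict standing in for the bucket, `load_procedure_code({"code_s3_key": "k"}, {"k": "class: Tactus"}.get, "fallback")` returns the stored code, and a missing key or missing object returns the template.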

## procedure_executor.py (plx-07dc0d)

- Remove special `_create_console_plexus_dispatch_tool` branch for console chat;
  all procedures now use `PydanticAIMCPAdapter` uniformly to expose execute_tactus.
- Fix MCP dir path (one extra `..` removed).
- Inject `plexus` Lua global shim at the top of every procedure source so procedures
  that use `plexus.*` as a global (not via require) still work.
- Register an effectively unlimited BudgetGate for procedure-internal plexus.* calls
  so long-running procedures are not killed by the 60 s default budget.

## tactus_adapters/storage.py (plx-07dc0d)

- Wrap `OptimizerResultsService.index_optimizer_run` in try/except RuntimeError so
  missing `AMPLIFY_STORAGE_TASKATTACHMENTS_BUCKET_NAME` degrades gracefully with a
  warning instead of crashing the optimizer.

## feedback_alignment_optimizer.yaml (plx-07dc0d)

- Add nil guard before `rubric_memory_context.machine_context` access that caused
  "attempt to index a nil value" crashes during early optimizer turns.

## plexus execute CLI (plx-07dc0d)

- Fix sys.path construction so `plexus execute` finds the MCP module in all
  working-directory contexts.

Made-with: Cursor
…tion, S3 graceful degradation

- Fix plexus.score.predict system prompt (was plexus.predict, shorthand only in MCP wrapper)
- Add metadata-only score.update path (external_id, name, key, description) without creating a new version
- Fix CostEvent JSON serialization crash in procedure output with _json_safe() helper
- Graceful degradation when S3 bucket not configured in persist_task_output_artifact
- Fix console_chat_smoke.py: PLEXUS_CMD support, responseStatus/responseTarget schema fields

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
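
A helper in the spirit of `_json_safe()` could be sketched as follows. The real helper and the real `CostEvent` type are not shown in this PR, so `json_safe` and `DemoCostEvent` below are hypothetical, and the conversion rules (dataclasses to dicts, datetimes to ISO strings, `Decimal` to float, anything else to `str`) are assumptions.

```python
import dataclasses
import json
from datetime import date, datetime
from decimal import Decimal


@dataclasses.dataclass
class DemoCostEvent:  # hypothetical stand-in for the real CostEvent
    usd: Decimal
    at: datetime


def json_safe(value):
    """Recursively convert a value into something json.dumps accepts."""
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        return json_safe(dataclasses.asdict(value))
    if isinstance(value, dict):
        return {str(k): json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [json_safe(v) for v in value]
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return float(value)
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return str(value)  # last resort: fall back to the string form
```

Passing procedure output through such a function before `json.dumps` avoids a crash on the first non-serializable object, at the cost of losing type information for unknown values.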
The existing coercion only handled camelCase externalId; snake_case
external_id: 47833 passed through unmodified and failed schema validation
("not of type 'string'"), blocking every hypothesis submission.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
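
The coercion fix in the commit above can be illustrated with a small sketch that treats both spellings the same way. The function name `coerce_external_ids` is hypothetical; only the two key names and the example value `47833` come from the commit message.

```python
def coerce_external_ids(payload: dict) -> dict:
    """Return a copy with externalId / external_id values converted to str."""
    coerced = dict(payload)
    for key in ("externalId", "external_id"):
        value = coerced.get(key)
        if value is not None and not isinstance(value, str):
            coerced[key] = str(value)  # e.g. 47833 -> "47833"
    return coerced
```

Applying this before schema validation means an integer ID supplied under either key passes a `string` type check.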
Remove budget enforcement from all four async dispatch paths
(evaluation.run, report.run, procedure.run, procedure.optimize).
These are fire-and-forget calls that return a handle immediately —
the subprocess runs independently, so carving a child budget from
the MCP session cap ($0.25 / 60s) before dispatch was wrong and
blocked all async calls.

Also add configuration_id support to _default_report_runner so
full report configurations can be dispatched async via MCP, and
update the tool description to reflect that no budget table is
needed for async dispatches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Report blocks and configuration-based reports now automatically use
local thread execution in development (default) and remote task
dispatcher in cloud deployments (Lambda).

Set PLEXUS_REPORT_DISPATCH=remote to enqueue reports through the
remote dispatcher — required in Lambda where long-running threads
would time out. Omit it (or set to "local") for direct local
execution, matching CLI behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
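
The environment switch described in the commit above could be resolved along these lines. Only the variable name `PLEXUS_REPORT_DISPATCH` and the values `local` and `remote` come from the commit message; the function name, the case-insensitive matching, and the choice to raise on unknown values are assumptions. The sketch does not model any automatic detection of a Lambda environment.

```python
import os
from typing import Mapping, Optional


def resolve_report_dispatch_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'local' or 'remote' based on PLEXUS_REPORT_DISPATCH."""
    env = os.environ if env is None else env
    # Unset or empty means local execution, matching the documented default.
    mode = (env.get("PLEXUS_REPORT_DISPATCH") or "local").strip().lower()
    if mode not in {"local", "remote"}:
        # Assumption: reject typos loudly rather than silently running locally.
        raise ValueError(f"Unsupported PLEXUS_REPORT_DISPATCH value: {mode!r}")
    return mode
```

Accepting the environment as a parameter keeps the function testable without mutating `os.environ`.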
- Report subtitle now shows "Last N days" (or date range) instead of
  sentinel text — derived from _format_date_window_for_display() in
  _persist_block_result()
- Procedure name now correctly extracted from YAML by passing code=
  to Procedure.create()
- Feedback block description and block_title now include date range
  and scorecard name respectively
- Local async report dispatch spawns subprocess instead of thread
  (survives MCP server restarts)
- .mcp.json updated to point to py311 env and Plexus-4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Procedure.create now accepts explicit name param and skips storing
  code in DynamoDB when it exceeds 350KB (large optimizer YAML was
  hitting DynamoDB's 400KB item limit); code still goes to S3
- procedure.optimize and procedure.run now dispatch as independent
  subprocesses instead of daemon threads (survive MCP server restarts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
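
The size guard described in the commit above might be sketched as below. This is an illustration, not the actual `Procedure.create` logic: `build_procedure_item` is a hypothetical name, the record shape is simplified, and reading "350KB" as 350 × 1024 bytes of UTF-8 is an assumption.

```python
INLINE_CODE_LIMIT_BYTES = 350 * 1024  # "350KB" per the commit; KB = 1024 bytes is assumed


def build_procedure_item(name: str, code: str, code_s3_key: str) -> dict:
    """Build a record that inlines code only when it fits under the limit."""
    item = {"name": name, "metadata": {"code_s3_key": code_s3_key}}
    if len(code.encode("utf-8")) <= INLINE_CODE_LIMIT_BYTES:
        item["code"] = code  # small enough to store next to the record
    # Larger code is left out; the S3 key in metadata remains the source.
    return item
```

Keeping the threshold below the 400KB item limit leaves headroom for the record's other attributes.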
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… in DynamoDB

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AccuracyEvaluation.run() was missing the background metrics task drain that
the base Evaluation.run() performs in its finally block. This left background
tasks racing the outer code's final confusionMatrix write, allowing stale
intermediate values to persist after the evaluation completed.

Two fixes:
1. AccuracyEvaluation.run() now awaits pending metrics tasks (10s timeout)
   before returning, matching the base class pattern.
2. The dataset-backed accuracy final write now unconditionally writes
   confusionMatrix/predictedClassDistribution/datasetClassDistribution from
   final_metrics, so this write always wins over any earlier background write.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
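
The ordering problem fixed in the commit above (background writes racing the final write) can be reproduced in miniature with `asyncio`. The sketch is not the Plexus code: `drain_pending_tasks` and `demo` are hypothetical names, and cancelling tasks that outlive the timeout is an assumption about what a drain should do.

```python
import asyncio
from typing import Iterable


async def drain_pending_tasks(tasks: Iterable[asyncio.Task], timeout: float = 10.0) -> None:
    """Wait up to `timeout` seconds for unfinished background tasks."""
    pending = [task for task in tasks if not task.done()]
    if not pending:
        return
    _, still_pending = await asyncio.wait(pending, timeout=timeout)
    for task in still_pending:
        task.cancel()  # assumption: stragglers are cancelled, not left running


async def demo() -> dict:
    record: dict = {}

    async def background_write() -> None:
        await asyncio.sleep(0.01)
        record["confusionMatrix"] = "intermediate"

    task = asyncio.create_task(background_write())
    await drain_pending_tasks([task])
    record["confusionMatrix"] = "final"  # the final write now happens last
    return record
```

Without the drain, the final assignment would run before the background task wakes up, and the intermediate value would overwrite it.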
…ault

The accuracy evaluation dispatcher defaulted yaml=True, always appending
--yaml to the CLI command. The --yaml flag causes the evaluate accuracy
command to suppress scoreVersionId on the evaluation record, so the accuracy
baseline appeared to have no version while the feedback baseline had one.

Change the default to False so --yaml is only added when explicitly requested.
The optimizer always passes --version explicitly, so --yaml is not needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
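
The default flip described in the commit above amounts to making `--yaml` opt-in when the command line is assembled. The builder below is a hypothetical sketch: it models only the `--version` and `--yaml` flags named in the commit message and omits every other argument the real dispatcher passes.

```python
from typing import List, Optional


def build_accuracy_command(version: Optional[str] = None, yaml: bool = False) -> List[str]:
    """Build an argv list; only flags named in the commit message are modeled."""
    command = ["plexus", "evaluate", "accuracy"]
    if version is not None:
        command += ["--version", version]
    if yaml:  # opt-in after the fix; the old default was True
        command.append("--yaml")
    return command
```

With `yaml=False` as the default, a caller that passes only a version gets a command without `--yaml`, so the flag can no longer suppress the version on the evaluation record by accident.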
- Add `acceptance_rate` and `report_acceptance_rate` HELPER_BINDINGS aliases
- Add `("report", "acceptance_rate")` DIRECT_HANDLER → `_call_report_run`
- `_call_report_run` pre-populates `block_class = "AcceptanceRate"` and
  promotes top-level params (scorecard, score, days, include_item_acceptance_rate,
  max_items) into block_config when called as `plexus.report.acceptance_rate`
- Fix subprocess dispatch to pass `--include-item-acceptance-rate` and
  `--max-items` for AcceptanceRate blocks
- Update tool description to list `acceptance_rate` as a high-frequency alias
- Add `plexus/docs/evaluation-and-feedback/acceptance-rate.md` with full param
  reference, synonym list, and usage examples

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When acceptance_rate{ sync = true } is called, post-process the result:
- Parse the comment-header+JSON string output into a proper dict
- Drop the verbose shard-fetch log (hundreds of lines, not useful to LLMs)
- Strip the items array by default (can be thousands of rows); callers that
  need per-item rows pass include_items = true
- Drop raw_counts (internal diagnostic, not useful to consumers)

Also add include_items parameter to the doc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
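
The post-processing steps listed in the commit above could be sketched as follows. Several details are assumptions, because the raw output format is not shown: header lines are taken to start with `#`, the log is taken to live under a hypothetical `fetch_log` key, and `compact_acceptance_rate` is an illustrative name. Only `items`, `raw_counts`, and `include_items` come from the commit message.

```python
import json


def compact_acceptance_rate(raw: str, include_items: bool = False) -> dict:
    """Parse comment-header + JSON output and drop bulky fields."""
    # Assumption: header lines begin with '#'; everything else is the JSON body.
    body = "\n".join(
        line for line in raw.splitlines() if not line.lstrip().startswith("#")
    )
    result = json.loads(body)
    result.pop("raw_counts", None)  # internal diagnostic
    result.pop("fetch_log", None)   # hypothetical key for the shard-fetch log
    if not include_items:
        result.pop("items", None)   # can be thousands of rows
    return result
```

Callers that need per-item rows pass `include_items=True`; everyone else gets a summary small enough to hand to an LLM.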
The optimizer was tracking repeat-offender items across cycles but telling the
agent to AVOID them as "likely label noise" or "approaching the ceiling." For
small eval sets (e.g. 51 samples) where 100% accuracy is the goal, that framing
was self-defeating: it made the agent give up on exactly the items it needed to
fix.

Changes:

1. Inject deterministic item_recurrence summary into BOTH synthesis contexts
   (Strategy A, Strategy B) — it was previously only shown during hypothesis
   generation, leaving the code-editing phases blind to cross-cycle recurrence.

2. Flip the IMPLICATION text at every injection site to mark PERSISTENT,
   OSCILLATING, and FLIP_FLOP items as HIGHEST-PRIORITY targets rather than
   things to avoid. Encourage literal rules, example snippets, item-specific
   carve-outs, and explicit overfitting.

3. Rewrite the feedback landscape diagnostic's analysis task to produce
   per-item targeted fix recommendations and an "Aggressive Fix Strategy"
   section instead of an "Optimization Ceiling Assessment" and "Suspected
   Low-Quality Feedback Labels" list.

4. Flip the early-stop advisor context so repeat offenders prompt escalation
   to ultra_creative mode rather than acceptance of a ceiling.

5. Flip the accumulated_lessons item-recurrence instructions to treat
   recurring items as fixable targets and record which specific hypotheses
   failed (so future cycles try something different).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Procedure cards now show "Optimizer Procedure" (or any procedure_type
  set in YAML) in the badge below the ⋯ button, matching EvaluationTask
- procedure_type, score_name, scorecard_name seeded into procedure metadata
  at creation so subtitle and badge appear immediately without waiting for
  first Lua State checkpoint
- Procedure.create() now sets status='RUNNING' so Amplify Gen2 realtime
  subscriptions recognise new records (byStatus index populated)
- Procedure.update() now accepts status and name parameters
- onCreateProcedure subscription re-fetches full record to resolve
  @belongsTo relations (scorecard/score) that AppSync omits from payloads
- onUpdateProcedure subscription preserves existing scorecard/score/metadata
  instead of clobbering with nulls from bare subscription payload
- Optimizer procedure name set to "Optimizer: {scorecard}" (title line);
  score name appears as subtitle via linked score relation
- feedback_alignment_optimizer.yaml declares procedure_type: Optimizer Procedure
- Old runs named "Feedback Alignment Optimizer" or "Optimizer: ..." inferred
  as Optimizer Procedure for backward compatibility

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
endymion and others added 11 commits May 1, 2026 14:09
…ew consistency

Add a persistent dispatch-mode indicator to procedure grid and detail cards showing
how the procedure is running (Local / Claimed... / Announced...), positioned directly
below the score subtitle with no layout jiggle. Move timestamp and elapsed time into
the card content below the indicator. Set dispatch_mode in task metadata at creation
so the Local label is available immediately without waiting for CommandDispatch.

- Grid card: indicator in header left column (no gap from title/subtitle)
- Detail view: same ordered block — indicator → timestamp → elapsed → notes → segmented bar
- hideTaskStatus=true for both variants; explicit TaskStatus with hideElapsedTime
- workerNodeId + celeryTaskId added to TASK_CARD_FIELDS and wired through transformProcedure
- service.py seeds dispatch_mode into task metadata at creation time (defaults to "local")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
endymion requested a review from a team as a code owner May 1, 2026 21:17
endymion requested review from dereknorrbom and removed request for a team May 1, 2026 21:17
endymion merged commit 99d4682 into develop May 1, 2026
15 checks passed