Execute Tactus local dispatch and optimizer fixes #262
Merged
Introduce the single-tool Tactus runtime path with tracing, budget gating, long-running operation guards, and direct feedback lookup so Plexus can be exercised as a programmable MCP runtime. Made-with: Cursor
Add cooperative Task cancellation checkpoints so report and procedure workers stop cleanly after execute_tactus handle cancellation marks dashboard work cancelled. Made-with: Cursor
Keep cooperative report cancellation checks from turning unavailable Task refreshes into report failures, and relax the legacy mock expectation to allow intentional status polling. Made-with: Cursor
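The two cancellation commits above can be read as one pattern: workers poll the Task at safe points, stop when it is marked cancelled, and treat a failed refresh as "no information" rather than as a failure. The following is a minimal sketch of that pattern, not the Plexus implementation; `TaskCancelled`, `cancellation_checkpoint`, `run_blocks`, and the `"CANCELLED"` status string are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, List


class TaskCancelled(Exception):
    """Hypothetical signal raised when the tracked Task is marked cancelled."""


def cancellation_checkpoint(refresh_status: Callable[[], str]) -> None:
    """Raise TaskCancelled only when a refresh succeeds and reports cancellation."""
    try:
        status = refresh_status()
    except Exception:
        # An unavailable refresh is not evidence of cancellation, so the
        # worker keeps going instead of failing the report.
        return
    if status == "CANCELLED":  # assumed status value
        raise TaskCancelled()


def run_blocks(blocks: Iterable[str], refresh_status: Callable[[], str]) -> List[str]:
    """Process blocks in order, checking for cancellation before each one."""
    completed: List[str] = []
    for block in blocks:
        cancellation_checkpoint(refresh_status)
        completed.append(block)
    return completed
```

Because the checkpoint sits between units of work, a cancelled worker stops at a clean boundary instead of being interrupted mid-block.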
Capture the cooperative cancellation CI fix and local verification in Kanbus for the execute_tactus handle task. Made-with: Cursor
Bridge Tactus runtime events and Plexus API call progress to FastMCP Context notifications while preserving the final execute_tactus response envelope. Made-with: Cursor
Require explicit child budgets for async execute_tactus work so dispatched evaluations, reports, and procedures remain attached to the parent runtime budget. Made-with: Cursor
Apply propagated execute_tactus child budgets inside evaluation, report, and procedure workers so long-running child executions fail early when wallclock, depth, or known spend exceeds their allocation. Made-with: Cursor
Align the runtime validation contract with explicit async budgets and broaden helper aliases so generated Tactus can use the advertised Plexus API surface directly. Made-with: Cursor
…-execute-tactus-mcp-tool
- Replace the entire legacy MCP tool catalog (scorecard, score, evaluation, feedback, item, prediction, dataset, report, rubric_memory, etc.) with a single `execute_tactus` tool that exposes all Plexus functionality via the `plexus.*` Tactus runtime API
- Add `plexus.score.contradictions` for rubric vs. code consistency checks
- Add `score_rubric_consistency_check` option to `plexus.evaluation.run`
- Add `plexus.procedure.optimize` shortcut for launching the feedback alignment optimizer with standard parameters
- Add `plexus execute` CLI command for local Tactus snippet testing
- Rename `run_experiment` → `run_procedure` and `run_experiment_with_task_tracking` → `run_procedure_with_task_tracking` throughout the codebase to match domain terminology
- Delete `procedure_sop_agent`, `sop_agent_base`, `demo_ai_mcp_integration`, `model_config_examples`, and all associated tests (legacy LangGraph-based optimizer prototype; superseded by Tactus procedures)
- Remove SOPAgent routing from `procedure_executor.py`; only `class: Tactus` procedures are supported going forward
- Reorganise and expand Plexus documentation under `plexus/docs/` with topic-based subdirectories and new guides for the Tactus runtime API
Made-with: Cursor
- Add score.pull/update/test, feedback.latest_update, rubric_memory.* namespaces to PlexusRuntimeModule DIRECT_HANDLERS with full _default_* implementations
- Add _default_report_runner_sync for synchronous report execution needed by optimizer; route plexus.report.run(sync=true) through it
- Add --emit-id-file CLI option to plexus evaluate accuracy/feedback so _default_evaluation_runner can capture evaluation_id from background subprocess for handle tracking
- Construct and register PlexusRuntimeModule in procedure_executor.py so Tactus/Lua procedure code can call plexus.* directly
- Create rubric_memory_toolset.py: in-process MCP tools for plexus_rubric_memory_* sub-agent tools
- Replace legacy MCP tool calls in ScoreEditorToolset with direct _default_score_pull / _default_score_update calls
- Rewrite feedback_alignment_optimizer.yaml call_plexus_tool to use plexus.* APIs directly; batch evaluations via handle protocol; synchronous reports via sync=true; score pull via temp files
- Update execute_test.py and test_score_editor_toolset.py to mock new direct-call interfaces
Made-with: Cursor
…code storage
Closes plx-62b442, plx-51488a, plx-f804a6.
Updates plx-61c332, plx-07dc0d.
Adds plx-71ad53 (remaining L4 integration tests).
## execute_tactus contract hardening (plx-f804a6)
- Add `_truncate_envelope` helper: caps execute_tactus JSON responses at 40 K chars
to prevent LLM context-window overflow from large evaluation / scorecard payloads.
- `BudgetGate.carve_child`: when the parent gate is effectively infinite (usd=inf,
wallclock=inf — as in the embedded chat MCP context), auto-supply a generous default
child budget instead of raising ChildBudgetRequired. Callers inside chat no longer
need explicit `budget = { ... }` for async evaluation / procedure calls.
- `_default_score_update`: set `isFeatured: "false"` on new ScoreVersion records so
optimizer-created versions are not featured by default.
- `_default_score_test`: remove erroneous lambda wrapper around coroutine, fixing
asyncio awaitable error.
- `_default_score_pull`: write YAML and guidelines to temp files and return their paths
so sandboxed Lua code can read them via File.read() without needing the io library.
- `_default_procedure_optimize`: dispatch optimizer via background daemon thread so
the chat agent receives procedure_id immediately (~49 s) instead of blocking for hours.
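The `BudgetGate.carve_child` change described above can be illustrated with a small sketch. This is an assumption-laden model rather than the actual class: the field names (`usd`, `wallclock`, `depth`, `tool_calls`) and `ChildBudgetRequired` come from the commit message, while the default child values, the `is_effectively_infinite` helper, and the way a child is clamped to its parent are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical defaults; the commit only says "a generous default child budget".
DEFAULT_CHILD_USD = 5.0
DEFAULT_CHILD_WALLCLOCK = 3600.0  # seconds


class ChildBudgetRequired(Exception):
    """Raised when a finite parent is asked for a child without an explicit budget."""


@dataclass(frozen=True)
class BudgetGate:
    usd: float
    wallclock: float
    depth: int
    tool_calls: int

    def is_effectively_infinite(self) -> bool:
        # Matches the embedded chat context: usd=inf and wallclock=inf.
        return math.isinf(self.usd) and math.isinf(self.wallclock)

    def carve_child(self, budget: Optional[dict] = None) -> "BudgetGate":
        if budget is None:
            if not self.is_effectively_infinite():
                raise ChildBudgetRequired("explicit child budget required")
            budget = {"usd": DEFAULT_CHILD_USD, "wallclock": DEFAULT_CHILD_WALLCLOCK}
        # Assumption: a child can never exceed what its parent has left.
        return BudgetGate(
            usd=min(self.usd, float(budget["usd"])),
            wallclock=min(self.wallclock, float(budget["wallclock"])),
            depth=self.depth - 1,
            tool_calls=self.tool_calls,
        )
```

With this shape, a chat-context gate built as `BudgetGate(math.inf, math.inf, 20, 500)` can call `carve_child()` with no arguments, while a finite gate still has to pass an explicit budget.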
## Console chat fixed end-to-end (plx-61c332, plx-62b442)
- `chat_agent.tac` `extract_text`: handle Lupa userdata (Lua receives Plexus Python
objects as `userdata`, not `table`) using pcall attribute access; checks
response/content/message/text keys and indexed first element.
- Remove `MessageHistory.get()` auto-load: history now comes exclusively from
`console_session_history` passed by the caller, preventing cross-turn context bleed
that caused 300 K–667 K token overflows.
- Add `assistant.output` fallback with garbage filter: filters out Python model reprs
like "UsageStats" and "output=None" that appeared when the LLM returned without
tool use.
- `mcp_transport.py`: pass a permissive BudgetGate (usd=inf, wallclock=inf, depth=20,
tool_calls=500) to execute_tactus in the embedded procedure MCP context.
- `builtin_procedures.py`: increase chat agent max_tokens 220→1024, reasoning_effort
low→medium; add explicit usage examples for evaluation.run (with budget),
procedure.optimize (with budget), and evaluation.find_recent (with evaluation_type).
## S3-backed procedure code storage (plx-07dc0d)
- `service.py`: on procedure creation, upload YAML as `code.tac` to S3 and store the
key in `procedure.metadata["code_s3_key"]`. On load, check S3 before falling back
to template. Prevents DynamoDB 400 KB item limit from blocking large optimizer YAMLs.
- `s3_utils.py`: add `upload_procedure_file` and `download_procedure_code` helpers.
- `procedure.py` (model): add `metadata` field to Procedure GraphQL model.
- `resource.ts`: add `metadata` field to Procedure Amplify schema.
## procedure_executor.py (plx-07dc0d)
- Remove special `_create_console_plexus_dispatch_tool` branch for console chat;
all procedures now use `PydanticAIMCPAdapter` uniformly to expose execute_tactus.
- Fix MCP dir path (one extra `..` removed).
- Inject `plexus` Lua global shim at the top of every procedure source so procedures
that use `plexus.*` as a global (not via require) still work.
- Register an effectively unlimited BudgetGate for procedure-internal plexus.* calls
so long-running procedures are not killed by the 60 s default budget.
## tactus_adapters/storage.py (plx-07dc0d)
- Wrap `OptimizerResultsService.index_optimizer_run` in try/except RuntimeError so
missing `AMPLIFY_STORAGE_TASKATTACHMENTS_BUCKET_NAME` degrades gracefully with a
warning instead of crashing the optimizer.
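The graceful-degradation wrapper described above follows a common shape: catch the specific configuration error, log a warning, and let the caller continue. Below is a minimal sketch under that reading; `index_run_safely` and its callable argument are hypothetical stand-ins, and only the `RuntimeError` handling is taken from the commit message.

```python
import logging
from typing import Callable

logger = logging.getLogger("optimizer.storage")


def index_run_safely(index_fn: Callable[[str], None], run_id: str) -> bool:
    """Call index_fn(run_id); warn and return False if storage is not configured."""
    try:
        index_fn(run_id)
    except RuntimeError as exc:
        # Only RuntimeError is swallowed, so unrelated bugs still surface.
        logger.warning("Skipping optimizer run indexing for %s: %s", run_id, exc)
        return False
    return True
```

Catching only `RuntimeError` keeps the degradation narrow: a missing bucket name is tolerated, but other exception types still propagate to the optimizer.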
## feedback_alignment_optimizer.yaml (plx-07dc0d)
- Add nil guard before `rubric_memory_context.machine_context` access that caused
"attempt to index a nil value" crashes during early optimizer turns.
## plexus execute CLI (plx-07dc0d)
- Fix sys.path construction so `plexus execute` finds the MCP module in all
working-directory contexts.
Made-with: Cursor
…tion, S3 graceful degradation
- Fix plexus.score.predict system prompt (was plexus.predict, shorthand only in MCP wrapper)
- Add metadata-only score.update path (external_id, name, key, description) without creating a new version
- Fix CostEvent JSON serialization crash in procedure output with _json_safe() helper
- Graceful degradation when S3 bucket not configured in persist_task_output_artifact
- Fix console_chat_smoke.py: PLEXUS_CMD support, responseStatus/responseTarget schema fields
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
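The `_json_safe()` fix mentioned above is not shown in the PR text, so the following is only a sketch of what such a helper might look like. The `CostEvent` dataclass here is a stand-in with invented fields, and the conversion rules (dataclasses to dicts, unknown objects to strings) are assumptions.

```python
import dataclasses
import json
from typing import Any


@dataclasses.dataclass
class CostEvent:
    """Stand-in for the real CostEvent; these fields are hypothetical."""
    model: str
    usd: float


def json_safe(value: Any) -> Any:
    """Recursively convert a value into something json.dumps accepts."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        return json_safe(dataclasses.asdict(value))
    if isinstance(value, dict):
        return {str(key): json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [json_safe(item) for item in value]
    # Last resort: fall back to the string form rather than crashing.
    return str(value)
```

Running procedure output through a helper like this before `json.dumps` means one non-serializable object degrades to a string instead of failing the whole response.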
The existing coercion only handled camelCase externalId; snake_case
external_id: 47833 passed through unmodified and failed schema validation
("not of type 'string'"), blocking every hypothesis submission.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
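The coercion fix above amounts to treating both spellings of the key the same way. A minimal sketch, assuming the payload is a flat dict; `coerce_external_ids` and `ID_KEYS` are hypothetical names, and the real code may handle nested structures or other fields.

```python
from typing import Any, Dict

# Both spellings must be handled; previously only the camelCase one was.
ID_KEYS = ("externalId", "external_id")


def coerce_external_ids(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of payload with any external ID converted to a string."""
    coerced = dict(payload)
    for key in ID_KEYS:
        value = coerced.get(key)
        if value is not None and not isinstance(value, str):
            coerced[key] = str(value)
    return coerced
```

For example, `coerce_external_ids({"external_id": 47833})` yields `{"external_id": "47833"}`, which satisfies a schema that requires a string.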
Remove budget enforcement from all four async dispatch paths (evaluation.run, report.run, procedure.run, procedure.optimize). These are fire-and-forget calls that return a handle immediately — the subprocess runs independently, so carving a child budget from the MCP session cap ($0.25 / 60s) before dispatch was wrong and blocked all async calls. Also add configuration_id support to _default_report_runner so full report configurations can be dispatched async via MCP, and update the tool description to reflect that no budget table is needed for async dispatches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Report blocks and configuration-based reports now automatically use local thread execution in development (default) and remote task dispatcher in cloud deployments (Lambda). Set PLEXUS_REPORT_DISPATCH=remote to enqueue reports through the remote dispatcher — required in Lambda where long-running threads would time out. Omit it (or set to "local") for direct local execution, matching CLI behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
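The dispatch selection described above can be sketched as a single lookup on `PLEXUS_REPORT_DISPATCH`. The variable name and its `remote` / `local` values come from the commit message; the function name, the case-insensitive comparison, and the fallback to local for unrecognised values are assumptions.

```python
import os
from typing import Mapping, Optional


def report_dispatch_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "remote" only when PLEXUS_REPORT_DISPATCH asks for it, else "local"."""
    env = os.environ if env is None else env
    raw = env.get("PLEXUS_REPORT_DISPATCH", "local").strip().lower()
    # Assumption: anything other than "remote" falls back to local execution.
    return "remote" if raw == "remote" else "local"
```

Passing the environment as a parameter keeps the check easy to test without mutating `os.environ`.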
- Report subtitle now shows "Last N days" (or date range) instead of sentinel text, derived from _format_date_window_for_display() in _persist_block_result()
- Procedure name now correctly extracted from YAML by passing code= to Procedure.create()
- Feedback block description and block_title now include date range and scorecard name respectively
- Local async report dispatch spawns subprocess instead of thread (survives MCP server restarts)
- .mcp.json updated to point to py311 env and Plexus-4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Procedure.create now accepts explicit name param and skips storing code in DynamoDB when it exceeds 350KB (large optimizer YAML was hitting DynamoDB's 400KB item limit); code still goes to S3
- procedure.optimize and procedure.run now dispatch as independent subprocesses instead of daemon threads (survive MCP server restarts)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
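The size guard in `Procedure.create` can be pictured as a simple byte-length check before the DynamoDB write. This sketch is illustrative only: `plan_code_storage` and the returned keys are hypothetical, and reading "350KB" as 350 × 1024 bytes of UTF-8 is an assumption.

```python
from typing import Any, Dict

# Assumption: "350KB" is interpreted as 350 * 1024 bytes, leaving headroom
# under DynamoDB's 400KB item limit for the record's other attributes.
MAX_INLINE_CODE_BYTES = 350 * 1024


def plan_code_storage(code: str) -> Dict[str, Any]:
    """Decide whether procedure code is small enough to store inline."""
    size_bytes = len(code.encode("utf-8"))
    return {
        "size_bytes": size_bytes,
        # Large code is left out of the DynamoDB record entirely.
        "inline_code": code if size_bytes <= MAX_INLINE_CODE_BYTES else None,
        # The S3 copy is written regardless of size.
        "upload_to_s3": True,
    }
```

Measuring encoded bytes rather than `len(code)` matters for YAML containing non-ASCII text, where the character count understates the stored size.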
… in DynamoDB Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AccuracyEvaluation.run() was missing the background metrics task drain that the base Evaluation.run() performs in its finally block. This left background tasks racing the outer code's final confusionMatrix write, allowing stale intermediate values to persist after the evaluation completed.
Two fixes:
1. AccuracyEvaluation.run() now awaits pending metrics tasks (10s timeout) before returning, matching the base class pattern.
2. The dataset-backed accuracy final write now unconditionally writes confusionMatrix/predictedClassDistribution/datasetClassDistribution from final_metrics, so this write always wins over any earlier background write.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
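The first fix above is a drain-before-final-write pattern. The sketch below shows one way to express it with `asyncio.wait`; `drain_pending` and `demo` are hypothetical names, and cancelling tasks that outlive the timeout is an assumption, not something the commit states.

```python
import asyncio
from typing import Dict, Iterable, Tuple


async def drain_pending(tasks: Iterable[asyncio.Task], timeout: float = 10.0) -> int:
    """Wait for unfinished tasks; return how many were still running at timeout."""
    pending = [task for task in tasks if not task.done()]
    if not pending:
        return 0
    _, still_running = await asyncio.wait(pending, timeout=timeout)
    for task in still_running:
        task.cancel()  # assumption: stragglers are cancelled, not left to race
    return len(still_running)


async def demo() -> Tuple[int, Dict[str, str]]:
    record: Dict[str, str] = {}

    async def background_write() -> None:
        await asyncio.sleep(0.01)
        record["confusionMatrix"] = "intermediate"

    task = asyncio.create_task(background_write())
    leftover = await drain_pending([task])
    # The final write happens only after background writers have finished.
    record["confusionMatrix"] = "final"
    return leftover, record
```

Without the drain, the background task in `demo` could complete after the final assignment and leave the intermediate value in place, which is the race the commit describes.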
…ault
The accuracy evaluation dispatcher defaulted yaml=True, always appending --yaml to the CLI command. The --yaml flag causes the evaluate accuracy command to suppress scoreVersionId on the evaluation record, so the accuracy baseline appeared to have no version while the feedback baseline had one. Change the default to False so --yaml is only added when explicitly requested. The optimizer always passes --version explicitly, so --yaml is not needed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `acceptance_rate` and `report_acceptance_rate` HELPER_BINDINGS aliases
- Add `("report", "acceptance_rate")` DIRECT_HANDLER → `_call_report_run`
- `_call_report_run` pre-populates `block_class = "AcceptanceRate"` and
promotes top-level params (scorecard, score, days, include_item_acceptance_rate,
max_items) into block_config when called as `plexus.report.acceptance_rate`
- Fix subprocess dispatch to pass `--include-item-acceptance-rate` and
`--max-items` for AcceptanceRate blocks
- Update tool description to list `acceptance_rate` as a high-frequency alias
- Add `plexus/docs/evaluation-and-feedback/acceptance-rate.md` with full param
reference, synonym list, and usage examples
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When acceptance_rate{ sync = true } is called, post-process the result:
- Parse the comment-header+JSON string output into a proper dict
- Drop the verbose shard-fetch log (hundreds of lines, not useful to LLMs)
- Strip the items array by default (can be thousands of rows); callers that
need per-item rows pass include_items = true
- Drop raw_counts (internal diagnostic, not useful to consumers)
Also add include_items parameter to the doc.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
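The post-processing steps listed above can be sketched as a small function. The `items` and `raw_counts` keys and the `include_items` flag come from the commit message; the `#`-prefixed comment header format, the `fetch_log` key name, and the function name are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def postprocess_acceptance_rate(output: str, include_items: bool = False) -> Dict[str, Any]:
    """Turn comment-header + JSON text into a trimmed dict."""
    # Assumption: header lines start with "#"; everything else is the JSON body.
    body = "\n".join(
        line for line in output.splitlines() if not line.lstrip().startswith("#")
    )
    result = json.loads(body)
    result.pop("fetch_log", None)   # hypothetical key for the shard-fetch log
    result.pop("raw_counts", None)  # internal diagnostic
    if not include_items:
        result.pop("items", None)   # can be thousands of rows
    return result
```

Keeping `include_items` off by default means the common case returns a compact summary, and callers opt in to the per-item rows only when they need them.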
The optimizer was tracking repeat-offender items across cycles but telling the agent to AVOID them as "likely label noise" or "approaching the ceiling." For small eval sets (e.g. 51 samples) where 100% accuracy is the goal, that framing was self-defeating: it made the agent give up on exactly the items it needed to fix.
Changes:
1. Inject deterministic item_recurrence summary into BOTH synthesis contexts (Strategy A, Strategy B); it was previously only shown during hypothesis generation, leaving the code-editing phases blind to cross-cycle recurrence.
2. Flip the IMPLICATION text at every injection site to mark PERSISTENT, OSCILLATING, and FLIP_FLOP items as HIGHEST-PRIORITY targets rather than things to avoid. Encourage literal rules, example snippets, item-specific carve-outs, and explicit overfitting.
3. Rewrite the feedback landscape diagnostic's analysis task to produce per-item targeted fix recommendations and an "Aggressive Fix Strategy" section instead of an "Optimization Ceiling Assessment" and "Suspected Low-Quality Feedback Labels" list.
4. Flip the early-stop advisor context so repeat offenders prompt escalation to ultra_creative mode rather than acceptance of a ceiling.
5. Flip the accumulated_lessons item-recurrence instructions to treat recurring items as fixable targets and record which specific hypotheses failed (so future cycles try something different).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Procedure cards now show "Optimizer Procedure" (or any procedure_type
set in YAML) in the badge below the ⋯ button, matching EvaluationTask
- procedure_type, score_name, scorecard_name seeded into procedure metadata
at creation so subtitle and badge appear immediately without waiting for
first Lua State checkpoint
- Procedure.create() now sets status='RUNNING' so Amplify Gen2 realtime
subscriptions recognise new records (byStatus index populated)
- Procedure.update() now accepts status and name parameters
- onCreateProcedure subscription re-fetches full record to resolve
@belongsTo relations (scorecard/score) that AppSync omits from payloads
- onUpdateProcedure subscription preserves existing scorecard/score/metadata
instead of clobbering with nulls from bare subscription payload
- Optimizer procedure name set to "Optimizer: {scorecard}" (title line);
score name appears as subtitle via linked score relation
- feedback_alignment_optimizer.yaml declares procedure_type: Optimizer Procedure
- Old runs named "Feedback Alignment Optimizer" or "Optimizer: ..." inferred
as Optimizer Procedure for backward compatibility
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ew consistency
Add a persistent dispatch-mode indicator to procedure grid and detail cards showing how the procedure is running (Local / Claimed... / Announced...), positioned directly below the score subtitle with no layout jiggle. Move timestamp and elapsed time into the card content below the indicator. Set dispatch_mode in task metadata at creation so the Local label is available immediately without waiting for CommandDispatch.
- Grid card: indicator in header left column (no gap from title/subtitle)
- Detail view: same ordered block: indicator → timestamp → elapsed → notes → segmented bar
- hideTaskStatus=true for both variants; explicit TaskStatus with hideElapsedTime
- workerNodeId + celeryTaskId added to TASK_CARD_FIELDS and wired through transformProcedure
- service.py seeds dispatch_mode into task metadata at creation time (defaults to "local")
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-execute-tactus-mcp-tool
This reverts commit b000a54.