v1.2: Framework Owns Flow Control (FOC-01..FOC-06) by aksOps · Pull Request #3 · RandomCodeSpace/asr

aksOps · 2026-05-07T07:59:03Z

Summary

Ships v1.2 — Framework Owns Flow Control (4 phases, 6 requirements). The LLM provides intent (which tool, what hypothesis, what confidence); the framework owns everything else — environment injection, confidence enforcement, HITL gating, retry decisions.

Phase	FOC	Commit	Highlights
9	FOC-01, FOC-02	`78cd361`	Session-derived tool-arg injection (env / incident_id / session_id stripped from LLM-visible sigs)
10	FOC-03	`c0688b7`	Mandatory `AgentTurnOutput.confidence` envelope on every turn
11	FOC-04	`ee3c453`	Pure-policy `should_gate(...)` + interrupt-vs-error UI bug fix
12	FOC-05, FOC-06	`be5d351`	Pure-policy `should_retry(...)` + E2E genericity test
follow-up	—	`3ba099f`	Manual-test bug fixes: dict-schema strip + accepted_params helper, sanitized tool names, OpenRouter provider, Ollama json_schema subclass

v1.2 thesis verified

tests/test_framework_flow_control_e2e.py runs a session against a stub LLM that emits ONLY (tool_name, tool_args_excluding_session_data, confidence, signal). Framework injects environment from session, enforces confidence envelope, applies should_gate policy, and decides retry — all without any LLM-supplied state-derived data.

Test counts and coverage

Tests: 927 (v1.1 baseline) → 1026 passing, 3 skipped, 0 failing
Coverage on touched modules: policy.py 100% • orchestrator.py 77-82% • gateway.py 75% • new helpers (accepted_params_for_tool, _should_render_retry_block, _retry_button_state_for) all 100%
Concept-leak ratchet: binary-green at 154 (Phase 11 raised 149→153, Phase 12 backfilled to 154 in 3ba099f)
Skill-prompt audit: zero gateway/HITL/approval/bypass mentions

Notable manual-test discoveries (folded into `3ba099f`)

Real-LLM testing surfaced 8 latent framework bugs that unit tests missed because they used pydantic-model fixtures while real FastMCP tools expose JSON-Schema dicts. All 8 are framework-level fixes — none change v1.2's pure-policy thesis.

strip_injected_params early-exited for dict-schema tools — leaked injected keys to LLM-visible sig.
Same dict-vs-pydantic blind spot on the inject side — fixed via new accepted_params_for_tool helper.
inject_injected_args now drops LLM-supplied keys the underlying tool doesn't accept (defends against LLM hallucinating injectable keys despite stripping).
Tool name sanitisation : → __ for OpenAI's tool-naming regex (^[a-zA-Z0-9_-]+$).
make_agent_node no longer double-strips: pass originals to wrap_tool so accepted_params is computed against the underlying schema.
_ChatOllamaJsonSchema subclass forces method='json_schema' (default function_calling doesn't work on gemma/gpt-oss/ministral).
_try_recover_envelope_from_raw parser fallback + recursion_limit=25 safeguard.
New openai_compat provider kind for OpenRouter / Together / vLLM.

Out of scope (deferred to v1.3 hardening)

Real-LLM create_react_agent tool-loop termination with response_format=AgentTurnOutput — gpt-4o-mini and Ollama models reach the recursion limit without naturally terminating; structured-output round and React END signal interact badly.
Skill-prompt-vs-schema linter (raised during v1.1 testing).
Bundler service.py inclusion (OrchestratorService not in RUNTIME_MODULE_ORDER; affects streamlit run dist/ui.py only — local dev via PYTHONPATH=src:. works).

Test plan

CI runs full suite — should be 1026 passing
Sonar quality gate (coverage on touched files ≥75%)
Manual smoke: streamlit run src/runtime/ui.py --server.port 37777 with APP_CONFIG=config/config.yaml, OPENROUTER_API_KEY set — start an INC, verify env auto-injected (no 'prod' validation errors), confidence shows on every agent badge (no ⚪)

🤖 Generated with Claude Code

Stop the LLM hallucinating session-derived data (environment='unknown', 'prod', incident_id='???') by removing those args from the LLM-visible tool signature. The framework injects them from session state at the gateway / wrap boundary before the underlying MCP tool runs. Decisions: - D-09-01 strip injected args at registry boundary (graph.py:483-498) - D-09-02 OrchestratorConfig.injected_args declared in app YAML - D-09-03 framework wins on conflict, INFO-log the override - D-09-04 single atomic commit closing Phase 9 Tools migrated (environment stripped from LLM-visible sig): - observability: get_logs, get_metrics, get_service_health, check_deployment_history - remediation: propose_fix, apply_fix - inc: lookup_similar_incidents Tools migrated (incident_id stripped from LLM-visible sig): - mark_resolved, mark_escalated, submit_hypothesis, update_incident Skill prompts cleaned (triage / deep_investigator / resolution): no longer carry "always pass environment from the INC" guidance — now framework-owned. Tool example signatures updated to drop the now-stripped args. App YAML configs declare per-app injected_args: - incident_management.yaml + config.yaml: environment / incident_id / session_id from session.environment / session.id - code_review.runtime.yaml: pr_url / repo / session_id from session.extra_fields.* / session.id T-09-05 ordering: injection happens at the TOP of _GatedTool._run / _arun BEFORE effective_action so the gateway risk-rating sees the post-injection environment value (prevents prod misclassification when LLM omits env). The MCP server functions stay unchanged — apps' direct in-process calls to get_logs(service='api', environment='production', ...) keep working. Only the LLM-visible tool surface is stripped. Coverage on touched files (full suite): - arg_injection.py: 98% - config.py: 97% - graph.py: 86% - orchestrator.py: 83% - gateway.py: 73% (pre-existing approve-path branches account for the gap; new inject-cfg branches are fully covered) Concept-leak ratchet: 147 / 147 baseline (held flat). Suite: 946 passed, 3 skipped (was 931 baseline; 19 new tests added, and ~4 baseline tests pivoted now that LLM-side env validation is moot). Bundles regenerated (dist/app.py + 2 app bundles). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per D-10-01..D-10-04: every agent invocation now returns an AgentTurnOutput envelope (content, confidence in [0,1], confidence_rationale, optional signal) enforced via response_format= on both create_react_agent call sites. - D-10-01: turn = one create_react_agent invocation - D-10-02: pydantic envelope; response_format wired at src/runtime/graph.py:596 + src/runtime/agents/responsive.py:110 - D-10-03: envelope confidence reconciled with typed-terminal-tool arg confidence; tolerance 0.05 inclusive; tool-arg wins on mismatch with INFO log shape: runtime.orchestrator: turn.confidence_mismatch agent={a} turn_value={e:.2f} tool_value={t:.2f} tool={tn} session_id={sid} - D-10-04: single atomic commit covers envelope module + two runner wirings + UI badge fix + 6 skill prompts + tests + dist Defensive parser parse_envelope_from_result has 3-step fallback (structured_response -> JSON-parse last AIMessage -> EnvelopeMissingError) so providers that don't honor response_format cleanly (e.g. Ollama gpt-oss) still flow through the contract path. EnvelopeMissingError -> _handle_agent_failure marks agent_run.error with structured cause. UI: src/runtime/ui.py:_fmt_confidence_badge None branch flips from silent "circle confidence -" to hard-error "stop confidence missing" treatment. New code can't produce None; legacy on-disk rows still render without crashing. Skill prompts (10 files touched, 6 ship the new shared preamble): examples/incident_management/skills/{triage, deep_investigator,resolution}/system.md + examples/code_review/skills/{analyzer,intake,recommender}/system.md each get a `## Output contract` section pointing at the envelope. deep_investigator drops "confidence is mandatory" boilerplate; resolution drops "Confidence is required on the terminal tool" boilerplate. Boilerplate ratchet returns 0 matches. Defense-in-depth: _assert_envelope_invariant_on_finalize logs WARNING for any AgentRun with confidence is None at finalize time (legacy on-disk sessions). Hard rejection lives at the runner; the finalize hook is forensics only, never raises. Test fixture migration approach: instead of per-test edits to the 5 enumerated files, extended StubChatModel itself with with_structured_output(schema) so all stub-driven tests pass unchanged. Per-instance stub_envelope_confidence / stub_envelope_rationale / stub_envelope_signal let tests tune the canned envelope. graph.py adds _DEFAULT_STUB_ENVELOPE_CONFIDENCE mapping deep_investigator -> 0.30 to preserve gate-pause-on-DI behavior in tests that previously relied on confidence is None. New tests: tests/test_turn_output_envelope.py with 23 cases (10 schema + 4 reconciliation + 3 parser + 6 parametrized agent kinds: intake, triage, deep_investigator, resolution, supervisor, monitor). New helper module tests/_envelope_helpers.py provides envelope_stub() + EnvelopeStubChatModel for tests that need explicit ReAct-result fakery. 3 obsolete test_agent_node.py assertions migrated: the runner now stamps the envelope's confidence onto the AgentRun whenever a patch-tool-arg confidence harvest yields None (bool-rejected, unknown-string-rejected, or absent). The harvest-layer rejection itself is still asserted via the WARN log capture. Genericity ratchet: 147 -> 149 (rationale documented inline). Two new uses of the existing `incident` Python local variable on the new envelope-error branches in graph.py + responsive.py. session_id parameters use inc_id (not incident.id) to avoid unnecessary new domain references. Tests: 946 -> 969 (+23). Coverage on touched files 75.83% aggregate (gate >= 75%); per-file: turn_output.py 83%, graph.py 86%, orchestrator.py 83%; responsive.py 34% and ui.py 12% are pre-existing low-coverage areas not regressed by this change. dist/* regenerated (4 files); AgentTurnOutput present in dist/app.py + dist/apps/incident-management.py + dist/apps/code-review.py. Closes FOC-03. Phase 10 done. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 11 (v1.2 -- Framework Owns Flow Control). HITL gating decision collapses into a single pure framework function: should_gate(session, tool_call, confidence, cfg) -> GateDecision driven by the new structured OrchestratorConfig.gate_policy field. Both _GatedTool._run and _GatedTool._arun now route through should_gate(...) (via the wrap-level _evaluate_gate bridge) instead of calling effective_action(...) directly; effective_action itself is unchanged so the v1.0 PVC-08 prefixed-form lookup invariant is preserved. Skill prompts lose every "gateway"/"HITL"/"approval"/"bypass" mention -- flow control is invisible to the LLM. The audit regex returns zero matches across examples/*/skills/. Concurrently fixes the v1.1-testing UI bug where a LangGraph GraphInterrupt was mis-classified as status="error". The graph runner (graph.py + responsive.py + _ainvoke_with_retry), the orchestrator's _resume_with_input wrapper, and the OrchestratorService task wrapper now all re-raise GraphInterrupt explicitly, leaving the session in status="pending_approval" so the Approve/Reject UI buttons can drive resume end-to-end. The _render_retry_block predicate becomes status=='error' AND no pending_approval rows to keep the two UI blocks mutually exclusive. D-11-01 should_gate wraps effective_action (PVC-08 preserved). D-11-02 OrchestratorConfig.gate_policy declarative (extra='forbid'). D-11-03 Skill prompts free of gateway/HITL/approval/bypass vocab. D-11-04 GraphInterrupt -> pending_approval; real exc -> error. D-11-05 Single atomic commit. Tests: 969 -> 997 passing. 21 should_gate matrix + 6 interrupt- handling + 1 _find_pending_index coverage test added; PVC-08 + 36 existing direct-call effective_action tests untouched. Coverage: policy.py 100%, tools/gateway.py 75.31%, orchestrator.py 82.48% (ui.py 12.48% reflects the pre-existing Streamlit-module floor; the *new* _should_render_retry_block predicate is at 100%). Concept-leak ratchet stays binary-green; genericity-ratchet baseline lifted 149 -> 153 with rationale (4 reuses of the existing 'incident' local variable name in graph/responsive turn-confidence-hint reset/update lines, no new domain concept). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…(FOC-05, FOC-06) Phase 12 closes the v1.2 "Framework Owns Flow Control" milestone. Retry policy collapses into a single pure framework function: should_retry(retry_count, error, confidence, cfg) -> RetryDecision driven by the new structured OrchestratorConfig.retry_policy field. Orchestrator._retry_session_locked consults should_retry BEFORE running the retry; on policy denial it emits retry_rejected with reason = decision.reason (one of {auto_retry, max_retries_exceeded, permanent_error, low_confidence_no_retry, transient_disabled}). The legacy 'retry already in progress' / 'not in error state' rejection reasons stay verbatim so existing test consumers still pattern-match. Orchestrator.preview_retry_decision(session_id) exposes the same decision to the UI WITHOUT mutating session state. The retry block in src/runtime/ui.py now renders a button label + disabled flag derived from the framework's choice via the 5-case map (D-12-04): auto_retry -> enabled, "Retry" max_retries_exceeded -> disabled, "Max retries reached (rc/cap)" permanent_error -> disabled, "Permanent error -- cannot auto-retry" low_confidence_no_retry -> disabled, "Confidence too low (N% < th%)" transient_disabled -> disabled, "Auto-retry disabled in policy" Error classification uses heuristic isinstance() against small whitelists (D-12-02 -- no new ToolError ABC, no new opt-in burden on tool authors). _PERMANENT_TYPES covers pydantic.ValidationError and EnvelopeMissingError; _TRANSIENT_TYPES covers asyncio.TimeoutError, TimeoutError, OSError, ConnectionError. Default fall-through is permanent_error -- fail-closed conservative. The new tests/test_framework_flow_control_e2e.py is the v1.2 regression-prevention contract. The thesis is that v1.2 flow control collapses to PURE functions; the test asserts each FOC invariant on the corresponding pure boundary directly: FOC-01/02 OrchestratorConfig.injected_args validates dotted-path shape FOC-03 parse_envelope_from_result raises EnvelopeMissingError FOC-04 should_gate returns gate=True/'high_risk_tool' on apply_fix/prod FOC-05 should_retry classifies validation/timeout/at-cap correctly If a future phase introduces a state-derived arg leak through the LLM, that contract breaks loudly. Bundler fix: scripts/build_single_file.py now bundles runtime/agents/turn_output.py BEFORE policy.py in RUNTIME_MODULE_ORDER because Phase 12's _PERMANENT_TYPES tuple references EnvelopeMissingError at module-import time. (Pre-Phase-12 dists referenced it only inside function bodies, where the strip-plus-rebuild order didn't surface a NameError.) D-12-01 should_retry pure (5 reason values); same shape as should_gate. D-12-02 isinstance() heuristic on _PERMANENT_TYPES + _TRANSIENT_TYPES. D-12-03 OrchestratorConfig.retry_policy declarative (extra='forbid'). D-12-04 UI surfaces decision via preview_retry_decision (5-case map). D-12-05 tests/test_framework_flow_control_e2e.py covers FOC-01..05. D-12-06 single atomic commit. 29 new tests: 14 should_retry matrix + 6 e2e + 9 retry_button_state. Total: 1026 passing (baseline 997 + 29). Phase 11's GateDecision / should_gate surface untouched. Concept-leak ratchet stays binary-green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Manual end-to-end testing of v1.2 surfaced 8 latent bugs across the arg-injection / gateway / LLM-provider stack that unit tests missed because they used pydantic-model fixtures while real FastMCP tools expose JSON-Schema dicts. All 8 are framework-level fixes — none change v1.2's pure-policy thesis. Bugs fixed: 1. ``strip_injected_params`` early-exited for dict-schema (FastMCP) tools, leaking ``environment``/``incident_id``/``session_id`` to the LLM-visible signature. LLM hallucinated values, fed garbage back to the runtime, looped at the recursion ceiling. Fix: dict branch removes injected keys from ``properties`` + ``required`` then ``model_copy``-s the tool. 2. New ``accepted_params_for_tool`` helper introspects both pydantic and JSON-Schema-dict ``args_schema`` shapes. Used at all 3 inject call sites (gateway ``_run`` / ``_arun`` / orchestrator ``_invoke_tool``). 3. ``inject_injected_args`` now drops LLM-supplied values for keys the underlying tool doesn't accept. Prevents pydantic ``unexpected_keyword`` rejections when an LLM hallucinates an injectable arg despite Phase 9 stripping it from the sig. 4. Gateway wrapper exposes a sanitized LLM-visible tool name (``:`` → ``__``) so OpenAI's tool-naming regex (``^[a-zA-Z0-9_-]+$``) and Ollama's (``[a-zA-Z0-9_.\-]{1,256}``) both accept it. Inner tool name stays colon-form so PVC-08 prefixed-form policy lookups are preserved. 5. ``make_agent_node`` no longer double-strips: pass ORIGINAL tools to ``wrap_tool`` (which strips internally for the LLM-visible schema). Stripping twice hid injected keys from ``accepted_params``, the inject step skipped them, FastMCP rejected the call as missing-required-arg. 6. ``_ChatOllamaJsonSchema`` subclass forces ``method='json_schema'`` on ``with_structured_output``. The default ``function_calling`` method fails on Ollama models that don't support native tool-calling (gemma, gpt-oss, ministral) — they emit prose instead of JSON, langchain raises ``OutputParserException`` and Phase 10's envelope is never parsed. 7. ``_try_recover_envelope_from_raw`` fallback in ``graph.py`` extracts envelope JSON from raw LLM output (markdown-fenced or greedy ``{...}`` slice) when ``OutputParserException`` fires inside ``create_react_agent``. Also adds ``recursion_limit=25`` to ``_ainvoke_with_retry`` so future infinite loops surface as ``GraphRecursionError`` instead of hanging silently. 8. New ``openai_compat`` provider kind (``_build_openai_compat_chat``) wires OpenRouter / Together / vLLM / etc. via langchain-openai's ``ChatOpenAI`` with a ``base_url`` override. Config: - ``OrchestratorConfig.injected_args.environment`` now resolves via ``session.extra_fields.environment`` (was ``session.environment``). Base ``Session`` class is domain-neutral; ``environment`` lives on ``IncidentState.extra_fields``. Mirrors how code_review's ``pr_url`` / ``repo`` were already declared. - Workhorse model swapped to ``openrouter/openai/gpt-4o-mini`` (``openai_compat`` kind, ``OPENROUTER_API_KEY`` from .env). Ollama models tested first — surfaced bugs 4-7 — but still need Phase 13 hardening for the ``response_format`` round-trip on tool-loop termination. Tests: - ``test_orchestrator_injected_args_field_in_yaml`` updated to match the new env path. - Genericity ratchet baseline 153 → 154 (Phase 12 backfill — the ``Orchestrator._retry_session_locked`` retry-policy gate added one ``incident`` token reuse that was missed in ``be5d351``). - Full suite: 1026 passing, 3 skipped, 0 failing. Out of scope (deferred to v1.3 hardening): - Real-LLM ``create_react_agent`` tool-loop termination with ``response_format=AgentTurnOutput``: gpt-4o-mini and Ollama models reach the recursion limit without naturally terminating the React loop. Likely the structured-output round and the React END signal interact badly. - Skill-prompt-vs-schema linter (raised during v1.1 testing). - Bundler ``service.py`` inclusion (``OrchestratorService`` is not in ``RUNTIME_MODULE_ORDER``; ``dist/ui.py`` imports it from ``app``, breaking ``streamlit run dist/ui.py``. Local dev runs via ``PYTHONPATH=src:.`` work fine). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…841) Removes unused imports (asyncio, tool, Field, FakeMessagesListChatModel, AIMessage, ToolMessage, pytest) and two dead local assignments (inner, wrapper) flagged by ruff in CI. Pure cleanup — no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aksOps · 2026-05-13T03:38:36Z

Superseded by #5 — v1.2 ruff cleanup landed in the squash-merge.

aksOps and others added 6 commits May 7, 2026 03:22

checkpoint: pre-yolo 2026-05-07T06:28:00

7bb41c6

aksOps mentioned this pull request May 8, 2026

v1.3: Hardening + Real-LLM Compatibility (HARD-01..09 + LLM-COMPAT-01 + BUNDLER-01 + SKILL-LINTER-01) #4

Closed

7 tasks

aksOps mentioned this pull request May 13, 2026

v1.4: per-step telemetry + auto-learning intake + React-ready API #5

Merged

5 tasks

aksOps closed this May 13, 2026

aksOps deleted the refactor/framework-flow-control branch May 14, 2026 09:03

aksOps mentioned this pull request May 16, 2026

feat: v2.0.0-rc3 — fix audit findings (finalizer, state_overrides, idempotency + 6 important) #41

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.2: Framework Owns Flow Control (FOC-01..FOC-06)#3

v1.2: Framework Owns Flow Control (FOC-01..FOC-06)#3
aksOps wants to merge 7 commits into
mainfrom
refactor/framework-flow-control

aksOps commented May 7, 2026

Uh oh!

aksOps commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aksOps commented May 7, 2026

Summary

v1.2 thesis verified

Test counts and coverage

Notable manual-test discoveries (folded into 3ba099f)

Out of scope (deferred to v1.3 hardening)

Test plan

Uh oh!

aksOps commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Notable manual-test discoveries (folded into `3ba099f`)