Skip to content

v1.2: Framework Owns Flow Control (FOC-01..FOC-06)#3

Closed
aksOps wants to merge 7 commits into
mainfrom
refactor/framework-flow-control
Closed

v1.2: Framework Owns Flow Control (FOC-01..FOC-06)#3
aksOps wants to merge 7 commits into
mainfrom
refactor/framework-flow-control

Conversation

@aksOps
Copy link
Copy Markdown
Contributor

@aksOps aksOps commented May 7, 2026

Summary

Ships v1.2 — Framework Owns Flow Control (4 phases, 6 requirements). The LLM provides intent (which tool, what hypothesis, what confidence); the framework owns everything else — environment injection, confidence enforcement, HITL gating, retry decisions.

Phase FOC Commit Highlights
9 FOC-01, FOC-02 78cd361 Session-derived tool-arg injection (env / incident_id / session_id stripped from LLM-visible sigs)
10 FOC-03 c0688b7 Mandatory AgentTurnOutput.confidence envelope on every turn
11 FOC-04 ee3c453 Pure-policy should_gate(...) + interrupt-vs-error UI bug fix
12 FOC-05, FOC-06 be5d351 Pure-policy should_retry(...) + E2E genericity test
follow-up 3ba099f Manual-test bug fixes: dict-schema strip + accepted_params helper, sanitized tool names, OpenRouter provider, Ollama json_schema subclass

v1.2 thesis verified

tests/test_framework_flow_control_e2e.py runs a session against a stub LLM that emits ONLY (tool_name, tool_args_excluding_session_data, confidence, signal). Framework injects environment from session, enforces confidence envelope, applies should_gate policy, and decides retry — all without any LLM-supplied state-derived data.

Test counts and coverage

  • Tests: 927 (v1.1 baseline) → 1026 passing, 3 skipped, 0 failing
  • Coverage on touched modules: policy.py 100% • orchestrator.py 77-82% • gateway.py 75% • new helpers (accepted_params_for_tool, _should_render_retry_block, _retry_button_state_for) all 100%
  • Concept-leak ratchet: binary-green at 154 (Phase 11 raised 149→153, Phase 12 backfilled to 154 in 3ba099f)
  • Skill-prompt audit: zero gateway/HITL/approval/bypass mentions

Notable manual-test discoveries (folded into 3ba099f)

Real-LLM testing surfaced 8 latent framework bugs that unit tests missed because they used pydantic-model fixtures while real FastMCP tools expose JSON-Schema dicts. All 8 are framework-level fixes — none change v1.2's pure-policy thesis.

  1. strip_injected_params early-exited for dict-schema tools — leaked injected keys to LLM-visible sig.
  2. Same dict-vs-pydantic blind spot on the inject side — fixed via new accepted_params_for_tool helper.
  3. inject_injected_args now drops LLM-supplied keys the underlying tool doesn't accept (defends against LLM hallucinating injectable keys despite stripping).
  4. Tool name sanitisation :__ for OpenAI's tool-naming regex (^[a-zA-Z0-9_-]+$).
  5. make_agent_node no longer double-strips: pass originals to wrap_tool so accepted_params is computed against the underlying schema.
  6. _ChatOllamaJsonSchema subclass forces method='json_schema' (default function_calling doesn't work on gemma/gpt-oss/ministral).
  7. _try_recover_envelope_from_raw parser fallback + recursion_limit=25 safeguard.
  8. New openai_compat provider kind for OpenRouter / Together / vLLM.

Out of scope (deferred to v1.3 hardening)

  • Real-LLM create_react_agent tool-loop termination with response_format=AgentTurnOutput — gpt-4o-mini and Ollama models reach the recursion limit without naturally terminating; structured-output round and React END signal interact badly.
  • Skill-prompt-vs-schema linter (raised during v1.1 testing).
  • Bundler service.py inclusion (OrchestratorService not in RUNTIME_MODULE_ORDER; affects streamlit run dist/ui.py only — local dev via PYTHONPATH=src:. works).

Test plan

  • CI runs full suite — should be 1026 passing
  • Sonar quality gate (coverage on touched files ≥75%)
  • Manual smoke: streamlit run src/runtime/ui.py --server.port 37777 with APP_CONFIG=config/config.yaml, OPENROUTER_API_KEY set — start an INC, verify env auto-injected (no 'prod' validation errors), confidence shows on every agent badge (no ⚪)

🤖 Generated with Claude Code

aksOps and others added 6 commits May 7, 2026 03:22
Stop the LLM hallucinating session-derived data (environment='unknown',
'prod', incident_id='???') by removing those args from the LLM-visible
tool signature. The framework injects them from session state at the
gateway / wrap boundary before the underlying MCP tool runs.

Decisions:
- D-09-01 strip injected args at registry boundary (graph.py:483-498)
- D-09-02 OrchestratorConfig.injected_args declared in app YAML
- D-09-03 framework wins on conflict, INFO-log the override
- D-09-04 single atomic commit closing Phase 9

Tools migrated (environment stripped from LLM-visible sig):
- observability: get_logs, get_metrics, get_service_health,
  check_deployment_history
- remediation: propose_fix, apply_fix
- inc: lookup_similar_incidents

Tools migrated (incident_id stripped from LLM-visible sig):
- mark_resolved, mark_escalated, submit_hypothesis, update_incident

Skill prompts cleaned (triage / deep_investigator / resolution):
no longer carry "always pass environment from the INC" guidance —
now framework-owned. Tool example signatures updated to drop the
now-stripped args.

App YAML configs declare per-app injected_args:
- incident_management.yaml + config.yaml: environment / incident_id
  / session_id from session.environment / session.id
- code_review.runtime.yaml: pr_url / repo / session_id from
  session.extra_fields.* / session.id

T-09-05 ordering: injection happens at the TOP of _GatedTool._run /
_arun BEFORE effective_action so the gateway risk-rating sees the
post-injection environment value (prevents prod misclassification
when LLM omits env).

The MCP server functions stay unchanged — apps' direct in-process
calls to get_logs(service='api', environment='production', ...)
keep working. Only the LLM-visible tool surface is stripped.

Coverage on touched files (full suite):
- arg_injection.py:  98%
- config.py:         97%
- graph.py:          86%
- orchestrator.py:   83%
- gateway.py:        73% (pre-existing approve-path branches account
                          for the gap; new inject-cfg branches are
                          fully covered)

Concept-leak ratchet: 147 / 147 baseline (held flat).
Suite: 946 passed, 3 skipped (was 931 baseline; 19 new tests added,
and ~4 baseline tests pivoted now that LLM-side env validation is
moot).
Bundles regenerated (dist/app.py + 2 app bundles).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per D-10-01..D-10-04: every agent invocation now returns an
AgentTurnOutput envelope (content, confidence in [0,1],
confidence_rationale, optional signal) enforced via
response_format= on both create_react_agent call sites.

- D-10-01: turn = one create_react_agent invocation
- D-10-02: pydantic envelope; response_format wired at
  src/runtime/graph.py:596 + src/runtime/agents/responsive.py:110
- D-10-03: envelope confidence reconciled with typed-terminal-tool
  arg confidence; tolerance 0.05 inclusive; tool-arg wins on
  mismatch with INFO log shape:
    runtime.orchestrator: turn.confidence_mismatch agent={a}
    turn_value={e:.2f} tool_value={t:.2f} tool={tn} session_id={sid}
- D-10-04: single atomic commit covers envelope module + two
  runner wirings + UI badge fix + 6 skill prompts + tests + dist

Defensive parser parse_envelope_from_result has 3-step fallback
(structured_response -> JSON-parse last AIMessage ->
EnvelopeMissingError) so providers that don't honor
response_format cleanly (e.g. Ollama gpt-oss) still flow through
the contract path. EnvelopeMissingError -> _handle_agent_failure
marks agent_run.error with structured cause.

UI: src/runtime/ui.py:_fmt_confidence_badge None branch flips
from silent "circle confidence -" to hard-error "stop confidence
missing" treatment. New code can't produce None; legacy on-disk
rows still render without crashing.

Skill prompts (10 files touched, 6 ship the new shared
preamble): examples/incident_management/skills/{triage,
deep_investigator,resolution}/system.md +
examples/code_review/skills/{analyzer,intake,recommender}/system.md
each get a `## Output contract` section pointing at the envelope.
deep_investigator drops "confidence is mandatory" boilerplate;
resolution drops "Confidence is required on the terminal tool"
boilerplate. Boilerplate ratchet returns 0 matches.

Defense-in-depth: _assert_envelope_invariant_on_finalize logs
WARNING for any AgentRun with confidence is None at finalize
time (legacy on-disk sessions). Hard rejection lives at the
runner; the finalize hook is forensics only, never raises.

Test fixture migration approach: instead of per-test edits to
the 5 enumerated files, extended StubChatModel itself with
with_structured_output(schema) so all stub-driven tests pass
unchanged. Per-instance stub_envelope_confidence /
stub_envelope_rationale / stub_envelope_signal let tests tune
the canned envelope. graph.py adds _DEFAULT_STUB_ENVELOPE_CONFIDENCE
mapping deep_investigator -> 0.30 to preserve gate-pause-on-DI
behavior in tests that previously relied on confidence is None.

New tests: tests/test_turn_output_envelope.py with 23 cases
(10 schema + 4 reconciliation + 3 parser + 6 parametrized agent
kinds: intake, triage, deep_investigator, resolution, supervisor,
monitor). New helper module tests/_envelope_helpers.py provides
envelope_stub() + EnvelopeStubChatModel for tests that need
explicit ReAct-result fakery.

3 obsolete test_agent_node.py assertions migrated: the runner
now stamps the envelope's confidence onto the AgentRun whenever
a patch-tool-arg confidence harvest yields None (bool-rejected,
unknown-string-rejected, or absent). The harvest-layer rejection
itself is still asserted via the WARN log capture.

Genericity ratchet: 147 -> 149 (rationale documented inline).
Two new uses of the existing `incident` Python local variable
on the new envelope-error branches in graph.py + responsive.py.
session_id parameters use inc_id (not incident.id) to avoid
unnecessary new domain references.

Tests: 946 -> 969 (+23). Coverage on touched files 75.83%
aggregate (gate >= 75%); per-file: turn_output.py 83%,
graph.py 86%, orchestrator.py 83%; responsive.py 34% and
ui.py 12% are pre-existing low-coverage areas not regressed
by this change.

dist/* regenerated (4 files); AgentTurnOutput present in
dist/app.py + dist/apps/incident-management.py +
dist/apps/code-review.py.

Closes FOC-03. Phase 10 done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 11 (v1.2 -- Framework Owns Flow Control). HITL gating decision
collapses into a single pure framework function:

    should_gate(session, tool_call, confidence, cfg) -> GateDecision

driven by the new structured OrchestratorConfig.gate_policy field.
Both _GatedTool._run and _GatedTool._arun now route through
should_gate(...) (via the wrap-level _evaluate_gate bridge) instead
of calling effective_action(...) directly; effective_action itself
is unchanged so the v1.0 PVC-08 prefixed-form lookup invariant is
preserved.

Skill prompts lose every "gateway"/"HITL"/"approval"/"bypass"
mention -- flow control is invisible to the LLM. The audit regex
returns zero matches across examples/*/skills/.

Concurrently fixes the v1.1-testing UI bug where a LangGraph
GraphInterrupt was mis-classified as status="error". The graph
runner (graph.py + responsive.py + _ainvoke_with_retry), the
orchestrator's _resume_with_input wrapper, and the
OrchestratorService task wrapper now all re-raise GraphInterrupt
explicitly, leaving the session in status="pending_approval" so
the Approve/Reject UI buttons can drive resume end-to-end. The
_render_retry_block predicate becomes status=='error' AND no
pending_approval rows to keep the two UI blocks mutually exclusive.

D-11-01 should_gate wraps effective_action (PVC-08 preserved).
D-11-02 OrchestratorConfig.gate_policy declarative (extra='forbid').
D-11-03 Skill prompts free of gateway/HITL/approval/bypass vocab.
D-11-04 GraphInterrupt -> pending_approval; real exc -> error.
D-11-05 Single atomic commit.

Tests: 969 -> 997 passing. 21 should_gate matrix + 6 interrupt-
handling + 1 _find_pending_index coverage test added; PVC-08 + 36
existing direct-call effective_action tests untouched. Coverage:
policy.py 100%, tools/gateway.py 75.31%, orchestrator.py 82.48%
(ui.py 12.48% reflects the pre-existing Streamlit-module floor;
the *new* _should_render_retry_block predicate is at 100%).
Concept-leak ratchet stays binary-green; genericity-ratchet
baseline lifted 149 -> 153 with rationale (4 reuses of the
existing 'incident' local variable name in graph/responsive
turn-confidence-hint reset/update lines, no new domain concept).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(FOC-05, FOC-06)

Phase 12 closes the v1.2 "Framework Owns Flow Control" milestone.
Retry policy collapses into a single pure framework function:

    should_retry(retry_count, error, confidence, cfg) -> RetryDecision

driven by the new structured OrchestratorConfig.retry_policy field.
Orchestrator._retry_session_locked consults should_retry BEFORE
running the retry; on policy denial it emits retry_rejected with
reason = decision.reason (one of {auto_retry, max_retries_exceeded,
permanent_error, low_confidence_no_retry, transient_disabled}).
The legacy 'retry already in progress' / 'not in error state'
rejection reasons stay verbatim so existing test consumers still
pattern-match.

Orchestrator.preview_retry_decision(session_id) exposes the same
decision to the UI WITHOUT mutating session state. The retry block
in src/runtime/ui.py now renders a button label + disabled flag
derived from the framework's choice via the 5-case map (D-12-04):

    auto_retry              -> enabled, "Retry"
    max_retries_exceeded    -> disabled, "Max retries reached (rc/cap)"
    permanent_error         -> disabled, "Permanent error -- cannot auto-retry"
    low_confidence_no_retry -> disabled, "Confidence too low (N% < th%)"
    transient_disabled      -> disabled, "Auto-retry disabled in policy"

Error classification uses heuristic isinstance() against small
whitelists (D-12-02 -- no new ToolError ABC, no new opt-in burden
on tool authors). _PERMANENT_TYPES covers pydantic.ValidationError
and EnvelopeMissingError; _TRANSIENT_TYPES covers asyncio.TimeoutError,
TimeoutError, OSError, ConnectionError. Default fall-through is
permanent_error -- fail-closed conservative.

The new tests/test_framework_flow_control_e2e.py is the v1.2
regression-prevention contract. The thesis is that v1.2 flow control
collapses to PURE functions; the test asserts each FOC invariant on
the corresponding pure boundary directly:

  FOC-01/02 OrchestratorConfig.injected_args validates dotted-path shape
  FOC-03    parse_envelope_from_result raises EnvelopeMissingError
  FOC-04    should_gate returns gate=True/'high_risk_tool' on apply_fix/prod
  FOC-05    should_retry classifies validation/timeout/at-cap correctly

If a future phase introduces a state-derived arg leak through the
LLM, that contract breaks loudly.

Bundler fix: scripts/build_single_file.py now bundles
runtime/agents/turn_output.py BEFORE policy.py in RUNTIME_MODULE_ORDER
because Phase 12's _PERMANENT_TYPES tuple references EnvelopeMissingError
at module-import time. (Pre-Phase-12 dists referenced it only inside
function bodies, where the strip-plus-rebuild order didn't surface a
NameError.)

D-12-01 should_retry pure (5 reason values); same shape as should_gate.
D-12-02 isinstance() heuristic on _PERMANENT_TYPES + _TRANSIENT_TYPES.
D-12-03 OrchestratorConfig.retry_policy declarative (extra='forbid').
D-12-04 UI surfaces decision via preview_retry_decision (5-case map).
D-12-05 tests/test_framework_flow_control_e2e.py covers FOC-01..05.
D-12-06 single atomic commit.

29 new tests: 14 should_retry matrix + 6 e2e + 9 retry_button_state.
Total: 1026 passing (baseline 997 + 29). Phase 11's GateDecision /
should_gate surface untouched. Concept-leak ratchet stays binary-green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual end-to-end testing of v1.2 surfaced 8 latent bugs across the
arg-injection / gateway / LLM-provider stack that unit tests missed
because they used pydantic-model fixtures while real FastMCP tools
expose JSON-Schema dicts. All 8 are framework-level fixes — none
change v1.2's pure-policy thesis.

Bugs fixed:

1. ``strip_injected_params`` early-exited for dict-schema (FastMCP)
   tools, leaking ``environment``/``incident_id``/``session_id`` to
   the LLM-visible signature. LLM hallucinated values, fed garbage
   back to the runtime, looped at the recursion ceiling. Fix: dict
   branch removes injected keys from ``properties`` + ``required``
   then ``model_copy``-s the tool.

2. New ``accepted_params_for_tool`` helper introspects both pydantic
   and JSON-Schema-dict ``args_schema`` shapes. Used at all 3 inject
   call sites (gateway ``_run`` / ``_arun`` / orchestrator
   ``_invoke_tool``).

3. ``inject_injected_args`` now drops LLM-supplied values for keys
   the underlying tool doesn't accept. Prevents pydantic
   ``unexpected_keyword`` rejections when an LLM hallucinates an
   injectable arg despite Phase 9 stripping it from the sig.

4. Gateway wrapper exposes a sanitized LLM-visible tool name
   (``:`` → ``__``) so OpenAI's tool-naming regex
   (``^[a-zA-Z0-9_-]+$``) and Ollama's
   (``[a-zA-Z0-9_.\-]{1,256}``) both accept it. Inner tool name
   stays colon-form so PVC-08 prefixed-form policy lookups are
   preserved.

5. ``make_agent_node`` no longer double-strips: pass ORIGINAL tools
   to ``wrap_tool`` (which strips internally for the LLM-visible
   schema). Stripping twice hid injected keys from
   ``accepted_params``, the inject step skipped them, FastMCP
   rejected the call as missing-required-arg.

6. ``_ChatOllamaJsonSchema`` subclass forces
   ``method='json_schema'`` on ``with_structured_output``. The
   default ``function_calling`` method fails on Ollama models
   that don't support native tool-calling (gemma, gpt-oss,
   ministral) — they emit prose instead of JSON, langchain raises
   ``OutputParserException`` and Phase 10's envelope is never
   parsed.

7. ``_try_recover_envelope_from_raw`` fallback in ``graph.py``
   extracts envelope JSON from raw LLM output (markdown-fenced or
   greedy ``{...}`` slice) when ``OutputParserException`` fires
   inside ``create_react_agent``. Also adds ``recursion_limit=25``
   to ``_ainvoke_with_retry`` so future infinite loops surface as
   ``GraphRecursionError`` instead of hanging silently.

8. New ``openai_compat`` provider kind (``_build_openai_compat_chat``)
   wires OpenRouter / Together / vLLM / etc. via langchain-openai's
   ``ChatOpenAI`` with a ``base_url`` override.

Config:

- ``OrchestratorConfig.injected_args.environment`` now resolves via
  ``session.extra_fields.environment`` (was ``session.environment``).
  Base ``Session`` class is domain-neutral; ``environment`` lives on
  ``IncidentState.extra_fields``. Mirrors how code_review's
  ``pr_url`` / ``repo`` were already declared.
- Workhorse model swapped to ``openrouter/openai/gpt-4o-mini``
  (``openai_compat`` kind, ``OPENROUTER_API_KEY`` from .env). Ollama
  models tested first — surfaced bugs 4-7 — but still need Phase 13
  hardening for the ``response_format`` round-trip on tool-loop
  termination.

Tests:

- ``test_orchestrator_injected_args_field_in_yaml`` updated to match
  the new env path.
- Genericity ratchet baseline 153 → 154 (Phase 12 backfill — the
  ``Orchestrator._retry_session_locked`` retry-policy gate added one
  ``incident`` token reuse that was missed in ``be5d351``).
- Full suite: 1026 passing, 3 skipped, 0 failing.

Out of scope (deferred to v1.3 hardening):

- Real-LLM ``create_react_agent`` tool-loop termination with
  ``response_format=AgentTurnOutput``: gpt-4o-mini and Ollama
  models reach the recursion limit without naturally terminating
  the React loop. Likely the structured-output round and the
  React END signal interact badly.
- Skill-prompt-vs-schema linter (raised during v1.1 testing).
- Bundler ``service.py`` inclusion (``OrchestratorService`` is not
  in ``RUNTIME_MODULE_ORDER``; ``dist/ui.py`` imports it from
  ``app``, breaking ``streamlit run dist/ui.py``. Local dev runs
  via ``PYTHONPATH=src:.`` work fine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…841)

Removes unused imports (asyncio, tool, Field, FakeMessagesListChatModel,
AIMessage, ToolMessage, pytest) and two dead local assignments (inner,
wrapper) flagged by ruff in CI. Pure cleanup — no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aksOps
Copy link
Copy Markdown
Contributor Author

aksOps commented May 13, 2026

Superseded by #5 — v1.2 ruff cleanup landed in the squash-merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant