
Refactor API, bridge, guardrails, playbook, and eval orchestration into focused services#29

Merged
hwuiwon merged 3 commits into main from refactir on Mar 30, 2026

Conversation

@hwuiwon (Collaborator) commented Mar 30, 2026

Summary

This PR restructures several orchestration-heavy modules into smaller, single-purpose components without
changing the external API surface or core runtime behavior.

The main goal is to improve readability and maintainability by separating transport/bootstrap code from
business logic, isolating cross-cutting execution concerns, and reducing duplication in playbook/
evaluation flows.

What Changed

  • Split API server responsibilities into dedicated services:
    • api/modal_app.py
    • api/run_service.py
    • api/recording_service.py
    • kept api/server.py as thin FastAPI/Modal wiring (see the sketch after this list)
  • Broke bridge orchestration into clearer pieces:
    • extracted tool result formatting into bridge/tool_result.py
    • extracted background task management into bridge/background.py
    • extracted page observation helpers into bridge/observation.py
    • refactored bridge/execution.py into per-action handlers instead of one large branchy function
    • simplified bridge/router.py into a clearer guard/dispatch/postprocess pipeline
  • Split guardrail internals into focused collaborators:
    • guardrails/dns.py for SSRF/DNS checks
    • guardrails/destructive.py for destructive-click classification
    • reduced guardrails/engine.py to orchestration/policy glue
  • Separated playbook execution concerns:
    • playbooks/recovery.py for retry + LLM handoff behavior
    • playbooks/output.py for structured output extraction
    • simplified playbooks/runner.py
  • Deduplicated evaluation runtime setup:
    • added evaluation/runtime.py
    • simplified evaluation/runner.py by sharing browser/recording lifecycle
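
As a rough illustration of the "thin wiring" split, the sketch below shows how api/server.py might delegate to the new services. This is a hypothetical sketch, not code from the PR: the route paths, constructor signatures, and response models are assumptions, while the method names create_run, stream_run, and get_trace come from the review discussion below.

from fastapi import FastAPI, Request

from api.recording_service import RecordingService  # assumed import paths
from api.run_service import RunService

app = FastAPI()
run_service = RunService()
recording_service = RecordingService()

@app.post("/runs", status_code=201)
async def create_run(config: dict):
    # server.py keeps no lifecycle logic; it forwards to RunService.
    return await run_service.create_run(config)

@app.get("/runs/{run_id}/events")
async def stream_run_events(run_id: str, request: Request):
    # Live SSE proxying and persisted replay both live inside RunService.
    return await run_service.stream_run(run_id, request)

@app.get("/runs/{run_id}/trace")
async def get_trace(run_id: str):
    # Artifact retrieval (live proxy with volume fallback) belongs to RecordingService.
    return await recording_service.get_trace(run_id)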

Why

  • Large orchestration modules were combining too many responsibilities.
  • Execution flow was harder to reason about because telemetry, persistence, routing, formatting, and
    recovery logic were interleaved.
  • Playbook and evaluation code had repeated lifecycle/setup patterns.
  • Guardrail internals relied on more implicit state than necessary.

Validation

  • uv run ruff check .
  • uvx ty check
  • uv run pytest -q

Summary by CodeRabbit

  • New Features

    • Dedicated recording/artifact endpoints for manifests, traces, and screenshots, with fallback from live proxying to persisted data.
    • Safer clicks and network requests via destructive-action checks and DNS-based hostname protection.
    • Playbook recovery: retry + LLM handoff for failed steps and structured output extraction.
    • Improved page-context capture and screenshot support; recordings are persisted for external access.
  • Refactor

    • API surface simplified to delegate run orchestration and artifact handling to service layers.
    • Background task management and tool-result formatting centralized for reliability.

@coderabbitai Bot commented Mar 30, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5da3ec7d-b601-4010-86e4-fe1df1ec1083

📥 Commits

Reviewing files that changed from the base of the PR and between fce9551 and 01bd837.

📒 Files selected for processing (12)
  • README.md
  • agent/session_runner.py
  • api/modal_app.py
  • api/recording_service.py
  • api/run_service.py
  • api/server.py
  • bridge/observation.py
  • evaluation/runtime.py
  • guardrails/dns.py
  • playbooks/output.py
  • playbooks/recovery.py
  • playbooks/runner.py

📝 Walkthrough

Adds Modal bootstrap and volume wiring, extracts run and recording lifecycle logic into RunService and RecordingService, introduces DOM observation and guardrail policy modules, centralizes evaluation runtime, and refactors playbook recovery, routing, and background task handling. API routes now delegate to the new services.

Changes

  • API Modal & Volumes (api/modal_app.py): New Modal app/image bootstrap, PYTHONPATH, include/exclude file matcher, recording_volume named volume and VOLUME_MOUNT.
  • API Services (api/recording_service.py, api/run_service.py): New RecordingService for artifact/trace/screenshot retrieval with live-proxy → volume fallback; new RunService managing sandbox lifecycle, status, SSE streaming/proxy, and persisted replay.
  • API Surface (api/server.py): Routes simplified to delegate run/recording operations to RunService/RecordingService; removed previous in-file lifecycle and SSE logic.
  • Bridge: observation & execution (bridge/observation.py, bridge/execution.py): New Playwright page-inspection helpers (DOM snapshots, page map, mutation observer, screenshot) and refactored action dispatcher with per-action handlers and execution policy/context.
  • Bridge: router & tooling (bridge/router.py, bridge/tool_result.py, bridge/background.py): Introduced ActionRequest, centralized tool-result formatting, and BackgroundTasks abstraction replacing inline fire-and-forget task management.
  • Guardrails (guardrails/destructive.py, guardrails/dns.py, guardrails/engine.py): New destructive-click policy (regex + optional LLM) and DNS-based SSRF protection (resolver, cache, timeouts); GuardrailEngine refactored to delegate to these components.
  • Playbooks: recovery & output (playbooks/recovery.py, playbooks/output.py, playbooks/runner.py): Added StepRecoveryPolicy with retry + LLM handoff and extract_structured_data; playbook runner now delegates recovery and extraction logic.
  • Evaluation runtime (evaluation/runtime.py, evaluation/runner.py): New trial_runtime context manager bundling browser + recording lifecycle; runner updated to use it.
  • Agent session persistence (agent/session_runner.py): Commits named Modal recording volume on terminal paths to ensure persisted artifacts are available.
  • Docs (README.md): Project structure updated to reflect new top-level modules and settings file.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as FastAPI (server)
    participant RunSvc as RunService
    participant Modal as Modal Sandbox
    participant Volume as Recording Volume (Modal)
    participant Status as Sandbox /status endpoint

    Client->>API: POST /runs (create)
    API->>RunSvc: create_run(config)
    RunSvc->>Modal: create_cua_sandbox(config, env...)
    Modal-->>RunSvc: sandbox handle (id, tunnel info)
    RunSvc->>RunSvc: register RunHandle, increment metrics
    RunSvc->>API: RunResponse (run_id, urls)
    API->>Client: 201 Created (RunResponse)

    Client->>API: GET /runs/{id}/events (stream)
    API->>RunSvc: stream_run(run_id, request)
    alt live handle exists
        RunSvc->>Modal: proxy /events (SSE) via HTTP client
        Modal-->>RunSvc: SSE chunks
        RunSvc-->>API: stream proxied SSE to Client
    else sandbox finished or missing
        RunSvc->>Volume: load persisted status/events
        RunSvc-->>API: replay persisted SSE to Client
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 45.95%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly and specifically describes the main refactoring effort across multiple modules to separate concerns into focused services.


@coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (13)
api/run_service.py (1)

202-217: Consider adding validation for proxied status response.

Line 206 constructs RunStatus(**resp.json()) directly. If the sandbox returns an unexpected schema, this will raise a Pydantic ValidationError that propagates as a 500 error. Consider wrapping in try/except to fall back gracefully:

♻️ Proposed validation wrapper
         client = self._get_http_client()
         try:
             resp = await client.get(f"{handle.status_base_url}/status")
             resp.raise_for_status()
-            return RunStatus(**resp.json())
+            try:
+                return RunStatus(**resp.json())
+            except Exception:
+                log.warning("Invalid status response for run %s", run_id, exc_info=True)
+                # Fall through to remove handle and try persisted
         except httpx.HTTPError as exc:
-            self.remove_handle(run_id)
             log.warning("Status request failed for run %s: %s", run_id, exc)
-            persisted = self.load_persisted_status(run_id)
-            if persisted:
-                return persisted
-            return RunStatus(
-                run_id=run_id,
-                status=RunStatusValue.TERMINATED,
-                error="Sandbox is no longer reachable",
-            )
+        self.remove_handle(run_id)
+        persisted = self.load_persisted_status(run_id)
+        if persisted:
+            return persisted
+        return RunStatus(
+            run_id=run_id,
+            status=RunStatusValue.TERMINATED,
+            error="Sandbox is no longer reachable",
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/run_service.py` around lines 202 - 217, The code directly constructs
RunStatus(**resp.json()) in the status fetch path which will raise a Pydantic
ValidationError if the sandbox returns an unexpected schema; wrap the
deserialization in a try/except to catch pydantic.ValidationError (and possibly
ValueError/TypeError), log the validation failure, call
self.remove_handle(run_id) if appropriate, then fall back to
self.load_persisted_status(run_id) (as in the existing HTTPError path) and
finally return a default RunStatus with status=RunStatusValue.TERMINATED and an
explanatory error message when no persisted status exists; update the block
around client.get(...) / RunStatus(...) accordingly to handle these validation
errors gracefully.
api/modal_app.py (1)

13-29: Consider renaming _include_exts to _exclude_non_source for clarity.

The ~ operator correctly negates the FilePatternMatcher on line 16, so _include_exts matches files that do not have the listed extensions. Combined with _ignore returning True when either matcher matches, the logic correctly:

  • Excludes files in specified directories
  • Excludes files without source extensions (keeping only .py, .js, .json, .yaml, .yml, .toml, .lock, .sh)

The naming _include_exts is misleading for a negated matcher. Renaming to _exclude_non_source or _exclude_other_extensions would clarify the intent.
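
For illustration, the renamed matcher could read as follows. The patterns are examples rather than the repo's actual lists, and the sketch assumes FilePatternMatcher supports negation (~) and path-predicate calls as described above.

from modal import FilePatternMatcher

_exclude_dirs = FilePatternMatcher("**/.git/**", "**/__pycache__/**")
_exclude_non_source = ~FilePatternMatcher(
    "**/*.py", "**/*.js", "**/*.json", "**/*.yaml",
    "**/*.yml", "**/*.toml", "**/*.lock", "**/*.sh",
)

def _ignore(path) -> bool:
    # Ignore a file if it is in an excluded directory or lacks a source extension.
    return _exclude_dirs(path) or _exclude_non_source(path)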

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/modal_app.py` around lines 13 - 29, Rename the misleading negated matcher
variable _include_exts to something clearer like _exclude_non_source (or
_exclude_other_extensions) and update all references (notably in the _ignore
function) to use the new name; leave the negation (~FilePatternMatcher(...)) and
matcher patterns unchanged so behavior is preserved, and update any nearby
docstring/comments to reflect that this matcher excludes files that do not have
the listed source extensions while _exclude_dirs continues to exclude
directories.
playbooks/runner.py (3)

131-135: getattr is redundant; _output_schema is always set.

_output_schema is assigned in __init__ (line 38), so getattr(self, "_output_schema", None) is equivalent to just self._output_schema. The getattr adds unnecessary indirection.

♻️ Simplify to direct attribute access
         data = await extract_structured_data(
             step_results,
             playbook_name=playbook.name,
-            output_schema=getattr(self, "_output_schema", None),
+            output_schema=self._output_schema,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/runner.py` around lines 131 - 135, The getattr call is unnecessary
because _output_schema is always set in __init__; replace getattr(self,
"_output_schema", None) with direct attribute access self._output_schema in the
extract_structured_data call (the call using
extract_structured_data(step_results, playbook_name=playbook.name,
output_schema=...)). Update the invocation in the Runner method where
step_results and playbook.name are passed so output_schema uses
self._output_schema.

187-194: Accessing private method _llm_complete_remaining on policy is fragile.

The compatibility wrapper calls self._get_recovery_policy()._llm_complete_remaining(...), which reaches into a private method. If StepRecoveryPolicy refactors this method, the wrapper will break. Consider exposing a public method on the policy or documenting this coupling.
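
A minimal public forwarder could look like this sketch; the public method name is hypothetical, and only _llm_complete_remaining comes from the code under review.

class StepRecoveryPolicy:
    async def complete_remaining_with_llm(self, *args, **kwargs):
        # Public entry point; delegates to the existing private implementation
        # so callers outside the class no longer depend on a private name.
        return await self._llm_complete_remaining(*args, **kwargs)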

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/runner.py` around lines 187 - 194, The compatibility wrapper in
run_playbook (or wherever this wrapper lives) calls the private policy method
_llm_complete_remaining via self._get_recovery_policy(), which is fragile;
change the policy API to expose a public method (e.g., llm_complete_remaining or
complete_remaining_with_llm) on StepRecoveryPolicy and update this wrapper to
call that public method instead of _llm_complete_remaining, or alternatively add
a thin public forwarder on the policy class that delegates to the existing
private implementation; update references of _llm_complete_remaining in the
wrapper to the new public method to avoid depending on private internals.

172-177: Lazy initialization is redundant since __init__ already sets _recovery.

_get_recovery_policy does lazy initialization, but __init__ unconditionally creates and assigns self._recovery on line 39. The lazy check at lines 173-176 will never trigger unless a subclass bypasses __init__.

♻️ Simplify getter to direct return
     def _get_recovery_policy(self) -> StepRecoveryPolicy:
-        recovery = getattr(self, "_recovery", None)
-        if recovery is None:
-            recovery = self._create_recovery_policy()
-            self._recovery = recovery
-        return recovery
+        return self._recovery
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/runner.py` around lines 172 - 177, The _get_recovery_policy method
currently does redundant lazy initialization even though __init__ already
assigns self._recovery; simplify _get_recovery_policy to directly return
self._recovery (remove the getattr/creation branch) so the method just returns
the existing StepRecoveryPolicy instance created in __init__ (method names:
_get_recovery_policy, _create_recovery_policy, and attribute _recovery).
guardrails/dns.py (3)

68-71: ThreadPoolExecutor created per resolution is inefficient.

A new ThreadPoolExecutor is instantiated for each hostname resolution. Consider using a shared executor instance initialized in __init__ and shut down explicitly, or using asyncio.to_thread() which manages a default executor.

♻️ Optional: Use asyncio.to_thread for simpler threading
     def _resolve_and_check(self, hostname: str) -> str | None:
-        def do_resolve():
-            return socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
-
-        try:
-            with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
-                future = executor.submit(do_resolve)
-                addrinfos = future.result(timeout=self._timeout_s)
+        import asyncio
+        async def _resolve():
+            try:
+                return await asyncio.wait_for(
+                    asyncio.to_thread(
+                        socket.getaddrinfo, hostname, None, proto=socket.IPPROTO_TCP
+                    ),
+                    timeout=self._timeout_s,
+                )
+            except (socket.gaierror, asyncio.TimeoutError):
+                return None

Note: This would require making _resolve_and_check async, which ripples through check_hostname. If the current sync interface is required, keeping the executor is acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guardrails/dns.py` around lines 68 - 71, The per-call ThreadPoolExecutor
creation inside _resolve_and_check is inefficient; refactor to use a shared
executor created in the class __init__ (e.g., self._executor =
ThreadPoolExecutor(...)) and reuse it for executor.submit(do_resolve) in
_resolve_and_check, and ensure it is cleanly shut down (e.g., in a close()
method or __del__); alternatively, if you can convert the sync API, make
_resolve_and_check async and replace executor.submit/future.result with
asyncio.to_thread(do_resolve) and adjust check_hostname accordingly.

86-88: Failing open on DNS errors may warrant logging.

DNS resolution failures (timeout, gaierror) silently allow the request. This is valid for availability, but consider logging at DEBUG level for operational visibility.

🔍 Optional: Add debug logging for DNS failures
         except (socket.gaierror, concurrent.futures.TimeoutError):
-            pass
+            log.debug("DNS resolution failed for %s; allowing request", hostname)
         return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guardrails/dns.py` around lines 86 - 88, The except block that catches
socket.gaierror and concurrent.futures.TimeoutError currently swallows DNS
errors; update that handler to emit a DEBUG-level log with contextual details
and the exception before returning None. Specifically, in the except
(socket.gaierror, concurrent.futures.TimeoutError): block add a call like
logger.debug("DNS resolution failed for %s: %s", host_or_query, e,
exc_info=True) (or use the module's logger name) so the failure is recorded for
ops visibility, then return None as before. Ensure you reference the same
exception variable (e) and include enough context (host/query and that it’s a
DNS resolution timeout/gaierror).

57-62: Consider LRU eviction instead of full cache clear.

Clearing the entire cache when hitting cache_max discards potentially valuable entries. An LRU-style eviction (e.g., functools.lru_cache or collections.OrderedDict) would be more efficient.

♻️ Optional: Use OrderedDict for LRU-style eviction
+from collections import OrderedDict
+
 class DnsProtection:
     """Resolve hostnames and block private/internal network targets."""

     def __init__(self, *, timeout_s: float = 2.0, cache_max: int = 1024) -> None:
         self._timeout_s = timeout_s
         self._cache_max = cache_max
-        self._cache: dict[str, str | None] = {}
+        self._cache: OrderedDict[str, str | None] = OrderedDict()

     # ...

     def _check_resolved_hostname(self, hostname: str) -> str | None:
         if hostname in self._cache:
+            self._cache.move_to_end(hostname)  # Mark as recently used
             return self._cache[hostname]

         reason = self._resolve_and_check(hostname)
         if self._cache_max > 0:
             if len(self._cache) >= self._cache_max:
-                self._cache.clear()
+                self._cache.popitem(last=False)  # Evict oldest entry
             self._cache[hostname] = reason
         return reason
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guardrails/dns.py` around lines 57 - 62, The current caching code in the
method that calls self._resolve_and_check(hostname) clears the whole dict when
len(self._cache) >= self._cache_max; replace this full-clear strategy with an
LRU eviction policy by turning self._cache into a collections.OrderedDict (or
using functools.lru_cache around _resolve_and_check) and on set: if
len(self._cache) >= self._cache_max call self._cache.popitem(last=False) to
evict the least-recently-used entry, then assign self._cache[hostname] = reason
and call self._cache.move_to_end(hostname) on hits; update the cache
initialization and any cache-access sites accordingly (references: self._cache,
self._cache_max, _resolve_and_check).
playbooks/output.py (2)

26-36: Token usage from structured extraction is discarded.

The extract_structured_output call returns (data, input_tokens, output_tokens), but the token counts are ignored. If cost/usage tracking is needed at the playbook level, consider returning or logging these values.

♻️ Optional: Return token counts for usage tracking

If tracking is needed, modify the return type:

 async def extract_structured_data(
     step_results: list[StepResult],
     *,
     playbook_name: str,
     output_schema: dict[str, Any] | None,
-) -> dict[str, Any] | None:
+) -> tuple[dict[str, Any] | None, int, int]:
     """Run structured extraction from successful extracted texts."""
     if not output_schema:
-        return None
+        return None, 0, 0
     # ...
-        data, _, _ = await extract_structured_output(...)
-        return data
+        data, input_tokens, output_tokens = await extract_structured_output(...)
+        return data, input_tokens, output_tokens
     except Exception:
-        return None
+        return None, 0, 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/output.py` around lines 26 - 36, The call to
extract_structured_output currently discards the returned token counts; capture
the full tuple (data, input_tokens, output_tokens) from
extract_structured_output and either return the token counts alongside data
(e.g., return data, input_tokens, output_tokens) or log them for cost/usage
tracking (use your logger). Update the error path to mirror the new return type
or logging behavior and adjust any callers of this function accordingly;
reference extract_structured_output, the variables input_tokens and
output_tokens, and playbook_name when adding contextual log messages.

35-36: Bare except Exception silently swallows extraction failures.

Consider logging the exception at WARNING or DEBUG level to aid debugging when structured extraction unexpectedly returns None.

🔍 Optional: Log extraction failures
+import logging
+
+log = logging.getLogger(__name__)
+
     try:
         from agent.output import extract_structured_output

         data, _, _ = await extract_structured_output(
             summary=f"Playbook '{playbook_name}' completed successfully.",
             extracted_texts=extracted_texts,
             output_schema=output_schema,
         )
         return data
-    except Exception:
+    except Exception as exc:
+        log.warning("Structured extraction failed for playbook '%s': %s", playbook_name, exc)
         return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/output.py` around lines 35 - 36, The bare "except Exception: return
None" in playbooks/output.py swallows errors; modify that except to "except
Exception as e:" and log the failure (e.g. logger.warning or logger.debug)
including the exception message and optional traceback before returning None so
extraction failures are visible; use the existing logger or import logging and
call logger.exception/logger.debug with the exception to capture details.
guardrails/destructive.py (1)

53-55: Approving unmatched selectors when LLM is disabled may be permissive.

When llm_enabled=False, any selector not matching either regex is auto-approved and cached. If the destructive regex misses an edge case, it will be permanently allowed. Consider whether this is the intended behavior or if a more conservative default (block unknown) would be safer.
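
A fail-closed variant might look like the following sketch. The flag _allow_unknown_when_llm_disabled is hypothetical; _llm_enabled and _approved_selectors come from the comment above, and the return convention (string reason = block, None = allow) is an assumption.

def _evaluate_unmatched(self, normalized: str) -> str | None:
    if self._llm_enabled:
        return None  # defer to the LLM classification path
    if self._allow_unknown_when_llm_disabled:
        self._approved_selectors.add(normalized)  # explicit opt-in only
        return None
    # Fail closed: with the LLM check off, unknown selectors are blocked
    # rather than permanently cached as approved.
    return "selector matched no known-safe pattern and LLM check is disabled"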

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guardrails/destructive.py` around lines 53 - 55, The current logic
auto-approves any unmatched selector when self._llm_enabled is False by adding
normalized to self._approved_selectors and returning None; change this to a
conservative default: do not add unmatched selectors to self._approved_selectors
and instead treat them as denied (e.g., add to a self._denied_selectors set or
return an explicit deny value) unless an explicit opt-in flag is set. Add a new
config flag (e.g., self._allow_unknown_when_llm_disabled) defaulting to False,
check that flag where self._llm_enabled is used, and update the branch that
references self._approved_selectors/normalized so unknown selectors are blocked
unless the flag is true; also add/update unit tests for both flag values.
evaluation/runtime.py (1)

74-78: Consider independent cleanup to prevent resource leaks on partial failure.

The cleanup order is correct (recording stops before browser closes, ensuring trace data is flushed while the context is alive). However, if recording.stop() raises an unexpected exception, browser.close() will never execute, potentially leaking browser processes.

Proposed fix for independent cleanup
     finally:
-        await recording.stop()
-        await browser.close()
+        try:
+            await recording.stop()
+        finally:
+            await browser.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/runtime.py` around lines 74 - 78, The finally block currently
awaits recording.stop() then awaits browser.close(), so if recording.stop()
raises the browser may not be closed; update the cleanup to perform the two
shutdowns independently—e.g., call await recording.stop() inside its own
try/except (or run both await recording.stop() and await browser.close() via
asyncio.gather(..., return_exceptions=True)) and ensure browser.close() is
always awaited even when recording.stop() fails. Apply this change around the
TrialRuntime yield cleanup to guarantee browser.close() runs regardless of
recording.stop() errors.
playbooks/recovery.py (1)

135-145: Clarify guardrails vs blinders in recovery handoff.

The ActionRouter is correctly initialized with guardrail_config=playbook.guardrails.to_runtime_config() at line 137, so playbook guardrails ARE propagated. However, the recovery handoff passes allowed_actions=ALL_ACTIONS to run_agent, which may bypass guardrail-based action restrictions.

The playbook schema has no blinders field (only guardrails), so DOM-level blinders cannot be propagated from the playbook. If the recovery agent should respect the same action restrictions as the original playbook execution, consider whether allowed_actions should be constrained based on the playbook's guardrails instead of using ALL_ACTIONS.
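
One possible shape for that constraint, as a sketch: the allowed_actions attribute on the runtime config is hypothetical, while ALL_ACTIONS and to_runtime_config() appear in the code under review.

def recovery_allowed_actions(playbook) -> list:
    # Prefer the guardrail-derived action set; fall back to ALL_ACTIONS only
    # when the runtime config does not constrain actions.
    config = playbook.guardrails.to_runtime_config()
    constrained = getattr(config, "allowed_actions", None)
    return list(constrained) if constrained else ALL_ACTIONS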

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/recovery.py` around lines 135 - 145, The recovery handoff currently
forces allowed_actions=ALL_ACTIONS when calling run_agent which can bypass
playbook guardrails; instead derive the allowed actions from the playbook
guardrails (use the runtime config returned by
playbook.guardrails.to_runtime_config() or a helper like
guardrails.allowed_actions()) and pass that to run_agent (or omit
allowed_actions so ActionRouter's guardrail_config is authoritative) so the
recovery agent is constrained by the same guardrail-derived action set as the
original run; update the call site where run_agent(...) is invoked (referencing
run_agent, ActionRouter, allowed_actions, ALL_ACTIONS, and
playbook.guardrails.to_runtime_config()) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/recording_service.py`:
- Around line 59-87: The try/except currently only wraps the generator
definition, so runtime httpx.HTTPError inside proxy_trace() is unhandled; update
get_trace to catch streaming errors by (1) moving the httpx.HTTPError handling
into the async generator proxy_trace(): wrap the async with client.stream(...)
and the async for loop in a try/except that catches httpx.HTTPError, logs the
error (with run_id and exc_info), and raises a small sentinel exception (e.g.,
StreamProxyError) or re-raise a custom exception; (2) then wrap the
StreamingResponse creation/return in a try/except that catches that sentinel
(StreamProxyError) and falls back to the persisted-file path logic (returning
FileResponse) so streaming failures correctly fall back to
_volume_path(...,"trace.zip"); reference symbols: get_trace, proxy_trace,
StreamingResponse, client.stream, httpx.HTTPError, _volume_path, FileResponse.
- Around line 34-40: The path traversal guard in _volume_path uses a string
startswith check that can be bypassed (e.g., /recordings/run123 vs
/recordings/run123_malicious); replace the startswith logic with a proper
Path-based containment check by resolving both base and result and using
Path.is_relative_to(result, base.resolve()) (or
result.is_relative_to(base.resolve()) on Python 3.9+) to validate that result is
inside base, raising the same HTTPException when the check fails.
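
For reference, a containment check along those lines might look like this sketch (Python 3.9+ for Path.is_relative_to; the base directory and error details are illustrative):

from pathlib import Path

from fastapi import HTTPException

RECORDINGS_BASE = Path("/recordings")  # illustrative base directory

def _volume_path(run_id: str, filename: str) -> Path:
    result = (RECORDINGS_BASE / run_id / filename).resolve()
    # Path-based containment instead of a string startswith check, so a
    # sibling like /recordings/run123_malicious no longer passes for run123.
    if not result.is_relative_to(RECORDINGS_BASE.resolve()):
        raise HTTPException(status_code=400, detail="Invalid artifact path")
    return result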

In `@api/run_service.py`:
- Around line 129-165: The active_sessions() counter is incremented on sandbox
creation but only decremented on the creation failure path; ensure it is
decremented on every exit path by moving or adding active_sessions().add(-1) for
successful completions and cleanup; specifically, decrement after the run
finishes or when the sandbox is stopped/cleaned (hooks/functions: stop_run,
cleanup_finished_sandbox) or add a finally/finalizer around the lifecycle that
calls active_sessions().add(-1) so every created sandbox (created via
create_cua_sandbox and referenced by run_id) always decrements the counter,
mirroring the pattern used in agent/session_runner.py.
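
A sketch of the balanced counter; the wrapper and wait helper names are hypothetical, while create_cua_sandbox and active_sessions come from the comment above.

async def _run_sandbox_lifecycle(self, run_id: str, config) -> None:
    sandbox = await create_cua_sandbox(config)
    active_sessions().add(1)
    try:
        await self._wait_until_finished(run_id, sandbox)  # hypothetical helper
    finally:
        # Decrement on every exit path, not only when creation fails.
        active_sessions().add(-1)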

In `@bridge/observation.py`:
- Around line 59-78: get_dom_snapshot currently parses JSON from page.evaluate
and indexes keys directly, which can raise JSONDecodeError or KeyError and
violate the docstring contract; update get_dom_snapshot to catch JSON parsing
and key access errors (e.g., json.JSONDecodeError, KeyError, and optionally
TypeError) around json.loads(raw) and the subsequent access to data['title'],
data['url'], and data['dom'], and return None on those exceptions (matching
quick_dom_snapshot behavior) so the function reliably returns None when the
snapshot is unavailable.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e22df133-3e8e-4492-ac30-060389300535

📥 Commits

Reviewing files that changed from the base of the PR and between aaca53e and fce9551.

📒 Files selected for processing (17)
  • api/modal_app.py
  • api/recording_service.py
  • api/run_service.py
  • api/server.py
  • bridge/background.py
  • bridge/execution.py
  • bridge/observation.py
  • bridge/router.py
  • bridge/tool_result.py
  • evaluation/runner.py
  • evaluation/runtime.py
  • guardrails/destructive.py
  • guardrails/dns.py
  • guardrails/engine.py
  • playbooks/output.py
  • playbooks/recovery.py
  • playbooks/runner.py

hwuiwon merged commit af1e705 into main on Mar 30, 2026
1 check passed
hwuiwon deleted the refactir branch on March 30, 2026 20:28
