Caution: Review failed. The pull request is closed.
📝 Walkthrough

Adds Modal bootstrap and volume wiring, extracts run and recording lifecycle logic into RunService and RecordingService, introduces DOM observation and guardrail policy modules, centralizes evaluation runtime, and refactors playbook recovery, routing, and background task handling. API routes now delegate to the new services.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant API as FastAPI (server)
participant RunSvc as RunService
participant Modal as Modal Sandbox
participant Volume as Recording Volume (Modal)
participant Status as Sandbox /status endpoint
Client->>API: POST /runs (create)
API->>RunSvc: create_run(config)
RunSvc->>Modal: create_cua_sandbox(config, env...)
Modal-->>RunSvc: sandbox handle (id, tunnel info)
RunSvc->>RunSvc: register RunHandle, increment metrics
RunSvc->>API: RunResponse (run_id, urls)
API->>Client: 201 Created (RunResponse)
Client->>API: GET /runs/{id}/events (stream)
API->>RunSvc: stream_run(run_id, request)
alt live handle exists
RunSvc->>Modal: proxy /events (SSE) via HTTP client
Modal-->>RunSvc: SSE chunks
RunSvc-->>API: stream proxied SSE to Client
else sandbox finished or missing
RunSvc->>Volume: load persisted status/events
RunSvc-->>API: replay persisted SSE to Client
end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
🧹 Nitpick comments (13)
api/run_service.py (1)
202-217: Consider adding validation for proxied status response.

Line 206 constructs `RunStatus(**resp.json())` directly. If the sandbox returns an unexpected schema, this will raise a Pydantic `ValidationError` that propagates as a 500 error. Consider wrapping in try/except to fall back gracefully:

♻️ Proposed validation wrapper

```diff
 client = self._get_http_client()
 try:
     resp = await client.get(f"{handle.status_base_url}/status")
     resp.raise_for_status()
-    return RunStatus(**resp.json())
+    try:
+        return RunStatus(**resp.json())
+    except Exception:
+        log.warning("Invalid status response for run %s", run_id, exc_info=True)
+        # Fall through to remove handle and try persisted
 except httpx.HTTPError as exc:
-    self.remove_handle(run_id)
     log.warning("Status request failed for run %s: %s", run_id, exc)
-    persisted = self.load_persisted_status(run_id)
-    if persisted:
-        return persisted
-    return RunStatus(
-        run_id=run_id,
-        status=RunStatusValue.TERMINATED,
-        error="Sandbox is no longer reachable",
-    )
+    self.remove_handle(run_id)
+    persisted = self.load_persisted_status(run_id)
+    if persisted:
+        return persisted
+    return RunStatus(
+        run_id=run_id,
+        status=RunStatusValue.TERMINATED,
+        error="Sandbox is no longer reachable",
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/run_service.py` around lines 202 - 217, The code directly constructs RunStatus(**resp.json()) in the status fetch path which will raise a Pydantic ValidationError if the sandbox returns an unexpected schema; wrap the deserialization in a try/except to catch pydantic.ValidationError (and possibly ValueError/TypeError), log the validation failure, call self.remove_handle(run_id) if appropriate, then fall back to self.load_persisted_status(run_id) (as in the existing HTTPError path) and finally return a default RunStatus with status=RunStatusValue.TERMINATED and an explanatory error message when no persisted status exists; update the block around client.get(...) / RunStatus(...) accordingly to handle these validation errors gracefully.

api/modal_app.py (1)
13-29: Consider renaming `_include_exts` to `_exclude_non_source` for clarity.

The `~` operator correctly negates the `FilePatternMatcher` on line 16, so `_include_exts` matches files that do not have the listed extensions. Combined with `_ignore` returning `True` when either matcher matches, the logic correctly:

- Excludes files in specified directories
- Excludes files without source extensions (keeping only `.py`, `.js`, `.json`, `.yaml`, `.yml`, `.toml`, `.lock`, `.sh`)

The naming `_include_exts` is misleading for a negated matcher. Renaming to `_exclude_non_source` or `_exclude_other_extensions` would clarify the intent.
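A rough sketch of the rename, under assumptions: the directory patterns below are placeholders (the real lists live in api/modal_app.py), and `FilePatternMatcher` is used as a callable path predicate the way Modal's `ignore=` parameter expects; only the variable names change.

```python
from pathlib import Path

from modal import FilePatternMatcher

# Placeholder directory patterns; the real list lives in api/modal_app.py.
_exclude_dirs = FilePatternMatcher("**/__pycache__/**", "**/.git/**", "**/node_modules/**")

# Negated matcher: matches any file that does NOT carry one of the source extensions,
# hence the clearer name `_exclude_non_source` instead of `_include_exts`.
_exclude_non_source = ~FilePatternMatcher(
    "**/*.py", "**/*.js", "**/*.json", "**/*.yaml", "**/*.yml",
    "**/*.toml", "**/*.lock", "**/*.sh",
)


def _ignore(path: Path) -> bool:
    """Ignore a file if it sits in an excluded directory or is not a source file."""
    return _exclude_dirs(path) or _exclude_non_source(path)
```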
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/modal_app.py` around lines 13 - 29, Rename the misleading negated matcher variable _include_exts to something clearer like _exclude_non_source (or _exclude_other_extensions) and update all references (notably in the _ignore function) to use the new name; leave the negation (~FilePatternMatcher(...)) and matcher patterns unchanged so behavior is preserved, and update any nearby docstring/comments to reflect that this matcher excludes files that do not have the listed source extensions while _exclude_dirs continues to exclude directories.

playbooks/runner.py (3)
131-135: `getattr` is redundant; `_output_schema` is always set.

`_output_schema` is assigned in `__init__` (line 38), so `getattr(self, "_output_schema", None)` is equivalent to just `self._output_schema`. The `getattr` adds unnecessary indirection.

♻️ Simplify to direct attribute access

```diff
 data = await extract_structured_data(
     step_results,
     playbook_name=playbook.name,
-    output_schema=getattr(self, "_output_schema", None),
+    output_schema=self._output_schema,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@playbooks/runner.py` around lines 131 - 135, The getattr call is unnecessary because _output_schema is always set in __init__; replace getattr(self, "_output_schema", None) with direct attribute access self._output_schema in the extract_structured_data call (the call using extract_structured_data(step_results, playbook_name=playbook.name, output_schema=...)). Update the invocation in the Runner method where step_results and playbook.name are passed so output_schema uses self._output_schema.
187-194: Accessing private method `_llm_complete_remaining` on policy is fragile.

The compatibility wrapper calls `self._get_recovery_policy()._llm_complete_remaining(...)`, which reaches into a private method. If `StepRecoveryPolicy` refactors this method, the wrapper will break. Consider exposing a public method on the policy or documenting this coupling.
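One low-risk shape is a thin public forwarder on the policy. A minimal sketch, assuming illustrative parameter names since the real signature of `_llm_complete_remaining` is not shown here:

```python
class StepRecoveryPolicy:
    """Sketch: expose a public entry point so callers stop reaching into a private method."""

    async def _llm_complete_remaining(self, remaining_steps, context):
        # Existing private implementation (elided in this sketch).
        raise NotImplementedError

    async def llm_complete_remaining(self, remaining_steps, context):
        """Public API; delegates to the existing private implementation."""
        return await self._llm_complete_remaining(remaining_steps, context)
```

The compatibility wrapper would then call `self._get_recovery_policy().llm_complete_remaining(...)` and no longer depend on private internals.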
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@playbooks/runner.py` around lines 187 - 194, The compatibility wrapper in run_playbook (or wherever this wrapper lives) calls the private policy method _llm_complete_remaining via self._get_recovery_policy(), which is fragile; change the policy API to expose a public method (e.g., llm_complete_remaining or complete_remaining_with_llm) on StepRecoveryPolicy and update this wrapper to call that public method instead of _llm_complete_remaining, or alternatively add a thin public forwarder on the policy class that delegates to the existing private implementation; update references of _llm_complete_remaining in the wrapper to the new public method to avoid depending on private internals.
172-177: Lazy initialization is redundant since `__init__` already sets `_recovery`.

`_get_recovery_policy` does lazy initialization, but `__init__` unconditionally creates and assigns `self._recovery` on line 39. The lazy check at lines 173-176 will never trigger unless a subclass bypasses `__init__`.

♻️ Simplify getter to direct return

```diff
 def _get_recovery_policy(self) -> StepRecoveryPolicy:
-    recovery = getattr(self, "_recovery", None)
-    if recovery is None:
-        recovery = self._create_recovery_policy()
-        self._recovery = recovery
-    return recovery
+    return self._recovery
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@playbooks/runner.py` around lines 172 - 177, The _get_recovery_policy method currently does redundant lazy initialization even though __init__ already assigns self._recovery; simplify _get_recovery_policy to directly return self._recovery (remove the getattr/creation branch) so the method just returns the existing StepRecoveryPolicy instance created in __init__ (method names: _get_recovery_policy, _create_recovery_policy, and attribute _recovery).

guardrails/dns.py (3)
68-71: ThreadPoolExecutor created per resolution is inefficient.

A new `ThreadPoolExecutor` is instantiated for each hostname resolution. Consider using a shared executor instance initialized in `__init__` and shut down explicitly, or using `asyncio.to_thread()`, which manages a default executor.

♻️ Optional: Use asyncio.to_thread for simpler threading

```diff
 def _resolve_and_check(self, hostname: str) -> str | None:
-    def do_resolve():
-        return socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
-
-    try:
-        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
-            future = executor.submit(do_resolve)
-            addrinfos = future.result(timeout=self._timeout_s)
+    import asyncio
+    async def _resolve():
+        try:
+            return await asyncio.wait_for(
+                asyncio.to_thread(
+                    socket.getaddrinfo, hostname, None, proto=socket.IPPROTO_TCP
+                ),
+                timeout=self._timeout_s,
+            )
+        except (socket.gaierror, asyncio.TimeoutError):
+            return None
```

Note: This would require making `_resolve_and_check` async, which ripples through `check_hostname`. If the current sync interface is required, keeping the executor is acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guardrails/dns.py` around lines 68 - 71, The per-call ThreadPoolExecutor creation inside _resolve_and_check is inefficient; refactor to use a shared executor created in the class __init__ (e.g., self._executor = ThreadPoolExecutor(...)) and reuse it for executor.submit(do_resolve) in _resolve_and_check, and ensure it is cleanly shut down (e.g., in a close() method or __del__); alternatively, if you can convert the sync API, make _resolve_and_check async and replace executor.submit/future.result with asyncio.to_thread(do_resolve) and adjust check_hostname accordingly.
86-88: Failing open on DNS errors may warrant logging.

DNS resolution failures (timeout, gaierror) silently allow the request. This is valid for availability, but consider logging at `DEBUG` level for operational visibility.

🔍 Optional: Add debug logging for DNS failures

```diff
 except (socket.gaierror, concurrent.futures.TimeoutError):
-    pass
+    log.debug("DNS resolution failed for %s; allowing request", hostname)
 return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guardrails/dns.py` around lines 86 - 88, The except block that catches socket.gaierror and concurrent.futures.TimeoutError currently swallows DNS errors; update that handler to emit a DEBUG-level log with contextual details and the exception before returning None. Specifically, in the except (socket.gaierror, concurrent.futures.TimeoutError): block add a call like logger.debug("DNS resolution failed for %s: %s", host_or_query, e, exc_info=True) (or use the module's logger name) so the failure is recorded for ops visibility, then return None as before. Ensure you reference the same exception variable (e) and include enough context (host/query and that it’s a DNS resolution timeout/gaierror).
57-62: Consider LRU eviction instead of full cache clear.

Clearing the entire cache when hitting `cache_max` discards potentially valuable entries. An LRU-style eviction (e.g., `functools.lru_cache` or `collections.OrderedDict`) would be more efficient.

♻️ Optional: Use OrderedDict for LRU-style eviction

```diff
+from collections import OrderedDict
+
 class DnsProtection:
     """Resolve hostnames and block private/internal network targets."""

     def __init__(self, *, timeout_s: float = 2.0, cache_max: int = 1024) -> None:
         self._timeout_s = timeout_s
         self._cache_max = cache_max
-        self._cache: dict[str, str | None] = {}
+        self._cache: OrderedDict[str, str | None] = OrderedDict()

     # ...

     def _check_resolved_hostname(self, hostname: str) -> str | None:
         if hostname in self._cache:
+            self._cache.move_to_end(hostname)  # Mark as recently used
             return self._cache[hostname]
         reason = self._resolve_and_check(hostname)
         if self._cache_max > 0:
             if len(self._cache) >= self._cache_max:
-                self._cache.clear()
+                self._cache.popitem(last=False)  # Evict oldest entry
             self._cache[hostname] = reason
         return reason
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guardrails/dns.py` around lines 57 - 62, The current caching code in the method that calls self._resolve_and_check(hostname) clears the whole dict when len(self._cache) >= self._cache_max; replace this full-clear strategy with an LRU eviction policy by turning self._cache into a collections.OrderedDict (or using functools.lru_cache around _resolve_and_check) and on set: if len(self._cache) >= self._cache_max call self._cache.popitem(last=False) to evict the least-recently-used entry, then assign self._cache[hostname] = reason and call self._cache.move_to_end(hostname) on hits; update the cache initialization and any cache-access sites accordingly (references: self._cache, self._cache_max, _resolve_and_check).

playbooks/output.py (2)
26-36: Token usage from structured extraction is discarded.

The `extract_structured_output` call returns `(data, input_tokens, output_tokens)`, but the token counts are ignored. If cost/usage tracking is needed at the playbook level, consider returning or logging these values.

♻️ Optional: Return token counts for usage tracking

If tracking is needed, modify the return type:

```diff
 async def extract_structured_data(
     step_results: list[StepResult],
     *,
     playbook_name: str,
     output_schema: dict[str, Any] | None,
-) -> dict[str, Any] | None:
+) -> tuple[dict[str, Any] | None, int, int]:
     """Run structured extraction from successful extracted texts."""
     if not output_schema:
-        return None
+        return None, 0, 0
     # ...
-        data, _, _ = await extract_structured_output(...)
-        return data
+        data, input_tokens, output_tokens = await extract_structured_output(...)
+        return data, input_tokens, output_tokens
     except Exception:
-        return None
+        return None, 0, 0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@playbooks/output.py` around lines 26 - 36, The call to extract_structured_output currently discards the returned token counts; capture the full tuple (data, input_tokens, output_tokens) from extract_structured_output and either return the token counts alongside data (e.g., return data, input_tokens, output_tokens) or log them for cost/usage tracking (use your logger). Update the error path to mirror the new return type or logging behavior and adjust any callers of this function accordingly; reference extract_structured_output, the variables input_tokens and output_tokens, and playbook_name when adding contextual log messages.
35-36: Bare `except Exception` silently swallows extraction failures.

Consider logging the exception at `WARNING` or `DEBUG` level to aid debugging when structured extraction unexpectedly returns `None`.

🔍 Optional: Log extraction failures

```diff
+import logging
+
+log = logging.getLogger(__name__)
+
     try:
         from agent.output import extract_structured_output

         data, _, _ = await extract_structured_output(
             summary=f"Playbook '{playbook_name}' completed successfully.",
             extracted_texts=extracted_texts,
             output_schema=output_schema,
         )
         return data
-    except Exception:
+    except Exception as exc:
+        log.warning("Structured extraction failed for playbook '%s': %s", playbook_name, exc)
         return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@playbooks/output.py` around lines 35 - 36, The bare "except Exception: return None" in playbooks/output.py swallows errors; modify that except to "except Exception as e:" and log the failure (e.g. logger.warning or logger.debug) including the exception message and optional traceback before returning None so extraction failures are visible; use the existing logger or import logging and call logger.exception/logger.debug with the exception to capture details.

guardrails/destructive.py (1)
53-55: Approving unmatched selectors when LLM is disabled may be permissive.

When `llm_enabled=False`, any selector not matching either regex is auto-approved and cached. If the destructive regex misses an edge case, it will be permanently allowed. Consider whether this is the intended behavior or if a more conservative default (block unknown) would be safer.
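A minimal, self-contained sketch of a block-by-default shape with an explicit opt-in flag; the class, regexes, and method names here are hypothetical and simplified relative to the real guardrails/destructive.py:

```python
import re


class DestructiveSelectorGate:
    """Sketch: deny unknown selectors by default when LLM review is disabled."""

    def __init__(self, *, llm_enabled: bool, allow_unknown_when_llm_disabled: bool = False) -> None:
        self._llm_enabled = llm_enabled
        self._allow_unknown_when_llm_disabled = allow_unknown_when_llm_disabled
        self._approved_selectors: set[str] = set()
        self._destructive_re = re.compile(r"delete|remove|destroy", re.IGNORECASE)
        self._safe_re = re.compile(r"cancel|close|dismiss", re.IGNORECASE)

    def check(self, selector: str) -> str | None:
        """Return a block reason, or None when the action may proceed."""
        normalized = selector.strip().lower()
        if normalized in self._approved_selectors or self._safe_re.search(normalized):
            return None
        if self._destructive_re.search(normalized):
            return "selector matches a destructive pattern"
        if not self._llm_enabled:
            if self._allow_unknown_when_llm_disabled:
                # Previous permissive behavior, now gated behind an explicit opt-in.
                self._approved_selectors.add(normalized)
                return None
            return "unknown selector blocked while LLM review is disabled"
        return None  # the real implementation would defer unmatched selectors to the LLM reviewer
```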
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@guardrails/destructive.py` around lines 53 - 55, The current logic auto-approves any unmatched selector when self._llm_enabled is False by adding normalized to self._approved_selectors and returning None; change this to a conservative default: do not add unmatched selectors to self._approved_selectors and instead treat them as denied (e.g., add to a self._denied_selectors set or return an explicit deny value) unless an explicit opt-in flag is set. Add a new config flag (e.g., self._allow_unknown_when_llm_disabled) defaulting to False, check that flag where self._llm_enabled is used, and update the branch that references self._approved_selectors/normalized so unknown selectors are blocked unless the flag is true; also add/update unit tests for both flag values.

evaluation/runtime.py (1)
74-78: Consider independent cleanup to prevent resource leaks on partial failure.

The cleanup order is correct (recording stops before browser closes, ensuring trace data is flushed while the context is alive). However, if `recording.stop()` raises an unexpected exception, `browser.close()` will never execute, potentially leaking browser processes.

Proposed fix for independent cleanup

```diff
 finally:
-    await recording.stop()
-    await browser.close()
+    try:
+        await recording.stop()
+    finally:
+        await browser.close()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/runtime.py` around lines 74 - 78, The finally block currently awaits recording.stop() then awaits browser.close(), so if recording.stop() raises the browser may not be closed; update the cleanup to perform the two shutdowns independently—e.g., call await recording.stop() inside its own try/except (or run both await recording.stop() and await browser.close() via asyncio.gather(..., return_exceptions=True)) and ensure browser.close() is always awaited even when recording.stop() fails. Apply this change around the TrialRuntime yield cleanup to guarantee browser.close() runs regardless of recording.stop() errors.playbooks/recovery.py (1)
135-145: Clarify guardrails vs blinders in recovery handoff.

The `ActionRouter` is correctly initialized with `guardrail_config=playbook.guardrails.to_runtime_config()` at line 137, so playbook guardrails ARE propagated. However, the recovery handoff passes `allowed_actions=ALL_ACTIONS` to `run_agent`, which may bypass guardrail-based action restrictions.

The playbook schema has no `blinders` field (only `guardrails`), so DOM-level blinders cannot be propagated from the playbook. If the recovery agent should respect the same action restrictions as the original playbook execution, consider whether `allowed_actions` should be constrained based on the playbook's guardrails instead of using `ALL_ACTIONS`.
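If that direction is taken, one possible shape is to prefer a guardrail-derived action set and fall back to `ALL_ACTIONS` only when the guardrails impose no restriction. This is a sketch under assumptions: the `allowed_actions` attribute on the runtime config and the placeholder action set are illustrative, not the actual schema:

```python
# Placeholder action set; the real ALL_ACTIONS constant lives elsewhere in the codebase.
ALL_ACTIONS: frozenset[str] = frozenset({"click", "type", "scroll", "navigate", "extract"})


def recovery_allowed_actions(runtime_config: object) -> frozenset[str]:
    """Prefer guardrail-derived action restrictions; fall back to ALL_ACTIONS otherwise."""
    restricted = getattr(runtime_config, "allowed_actions", None)
    return frozenset(restricted) if restricted else ALL_ACTIONS
```

The recovery handoff would then pass `recovery_allowed_actions(playbook.guardrails.to_runtime_config())` to `run_agent` instead of the unconditional `ALL_ACTIONS`.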
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@playbooks/recovery.py` around lines 135 - 145, The recovery handoff currently forces allowed_actions=ALL_ACTIONS when calling run_agent which can bypass playbook guardrails; instead derive the allowed actions from the playbook guardrails (use the runtime config returned by playbook.guardrails.to_runtime_config() or a helper like guardrails.allowed_actions()) and pass that to run_agent (or omit allowed_actions so ActionRouter's guardrail_config is authoritative) so the recovery agent is constrained by the same guardrail-derived action set as the original run; update the call site where run_agent(...) is invoked (referencing run_agent, ActionRouter, allowed_actions, ALL_ACTIONS, and playbook.guardrails.to_runtime_config()) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/recording_service.py`:
- Around line 59-87: The try/except currently only wraps the generator
definition, so runtime httpx.HTTPError inside proxy_trace() is unhandled; update
get_trace to catch streaming errors by (1) moving the httpx.HTTPError handling
into the async generator proxy_trace(): wrap the async with client.stream(...)
and the async for loop in a try/except that catches httpx.HTTPError, logs the
error (with run_id and exc_info), and raises a small sentinel exception (e.g.,
StreamProxyError) or re-raise a custom exception; (2) then wrap the
StreamingResponse creation/return in a try/except that catches that sentinel
(StreamProxyError) and falls back to the persisted-file path logic (returning
FileResponse) so streaming failures correctly fall back to
_volume_path(...,"trace.zip"); reference symbols: get_trace, proxy_trace,
StreamingResponse, client.stream, httpx.HTTPError, _volume_path, FileResponse.
- Around line 34-40: The path traversal guard in _volume_path uses a string
startswith check that can be bypassed (e.g., /recordings/run123 vs
/recordings/run123_malicious); replace the startswith logic with a proper
Path-based containment check by resolving both base and result and using
Path.is_relative_to(result, base.resolve()) (or
result.is_relative_to(base.resolve()) on Python 3.9+) to validate that result is
inside base, raising the same HTTPException when the check fails.
In `@api/run_service.py`:
- Around line 129-165: The active_sessions() counter is incremented on sandbox
creation but only decremented on the creation failure path; ensure it is
decremented on every exit path by moving or adding active_sessions().add(-1) for
successful completions and cleanup; specifically, decrement after the run
finishes or when the sandbox is stopped/cleaned (hooks/functions: stop_run,
cleanup_finished_sandbox) or add a finally/finalizer around the lifecycle that
calls active_sessions().add(-1) so every created sandbox (created via
create_cua_sandbox and referenced by run_id) always decrements the counter,
mirroring the pattern used in agent/session_runner.py.
In `@bridge/observation.py`:
- Around line 59-78: get_dom_snapshot currently parses JSON from page.evaluate
and indexes keys directly, which can raise JSONDecodeError or KeyError and
violate the docstring contract; update get_dom_snapshot to catch JSON parsing
and key access errors (e.g., json.JSONDecodeError, KeyError, and optionally
TypeError) around json.loads(raw) and the subsequent access to data['title'],
data['url'], and data['dom'], and return None on those exceptions (matching
quick_dom_snapshot behavior) so the function reliably returns None when the
snapshot is unavailable.
---
Nitpick comments:
In `@api/modal_app.py`:
- Around line 13-29: Rename the misleading negated matcher variable
_include_exts to something clearer like _exclude_non_source (or
_exclude_other_extensions) and update all references (notably in the _ignore
function) to use the new name; leave the negation (~FilePatternMatcher(...)) and
matcher patterns unchanged so behavior is preserved, and update any nearby
docstring/comments to reflect that this matcher excludes files that do not have
the listed source extensions while _exclude_dirs continues to exclude
directories.
In `@api/run_service.py`:
- Around line 202-217: The code directly constructs RunStatus(**resp.json()) in
the status fetch path which will raise a Pydantic ValidationError if the sandbox
returns an unexpected schema; wrap the deserialization in a try/except to catch
pydantic.ValidationError (and possibly ValueError/TypeError), log the validation
failure, call self.remove_handle(run_id) if appropriate, then fall back to
self.load_persisted_status(run_id) (as in the existing HTTPError path) and
finally return a default RunStatus with status=RunStatusValue.TERMINATED and an
explanatory error message when no persisted status exists; update the block
around client.get(...) / RunStatus(...) accordingly to handle these validation
errors gracefully.
In `@evaluation/runtime.py`:
- Around line 74-78: The finally block currently awaits recording.stop() then
awaits browser.close(), so if recording.stop() raises the browser may not be
closed; update the cleanup to perform the two shutdowns independently—e.g., call
await recording.stop() inside its own try/except (or run both await
recording.stop() and await browser.close() via asyncio.gather(...,
return_exceptions=True)) and ensure browser.close() is always awaited even when
recording.stop() fails. Apply this change around the TrialRuntime yield cleanup
to guarantee browser.close() runs regardless of recording.stop() errors.
In `@guardrails/destructive.py`:
- Around line 53-55: The current logic auto-approves any unmatched selector when
self._llm_enabled is False by adding normalized to self._approved_selectors and
returning None; change this to a conservative default: do not add unmatched
selectors to self._approved_selectors and instead treat them as denied (e.g.,
add to a self._denied_selectors set or return an explicit deny value) unless an
explicit opt-in flag is set. Add a new config flag (e.g.,
self._allow_unknown_when_llm_disabled) defaulting to False, check that flag
where self._llm_enabled is used, and update the branch that references
self._approved_selectors/normalized so unknown selectors are blocked unless the
flag is true; also add/update unit tests for both flag values.
In `@guardrails/dns.py`:
- Around line 68-71: The per-call ThreadPoolExecutor creation inside
_resolve_and_check is inefficient; refactor to use a shared executor created in
the class __init__ (e.g., self._executor = ThreadPoolExecutor(...)) and reuse it
for executor.submit(do_resolve) in _resolve_and_check, and ensure it is cleanly
shut down (e.g., in a close() method or __del__); alternatively, if you can
convert the sync API, make _resolve_and_check async and replace
executor.submit/future.result with asyncio.to_thread(do_resolve) and adjust
check_hostname accordingly.
- Around line 86-88: The except block that catches socket.gaierror and
concurrent.futures.TimeoutError currently swallows DNS errors; update that
handler to emit a DEBUG-level log with contextual details and the exception
before returning None. Specifically, in the except (socket.gaierror,
concurrent.futures.TimeoutError): block add a call like logger.debug("DNS
resolution failed for %s: %s", host_or_query, e, exc_info=True) (or use the
module's logger name) so the failure is recorded for ops visibility, then return
None as before. Ensure you reference the same exception variable (e) and include
enough context (host/query and that it’s a DNS resolution timeout/gaierror).
- Around line 57-62: The current caching code in the method that calls
self._resolve_and_check(hostname) clears the whole dict when len(self._cache) >=
self._cache_max; replace this full-clear strategy with an LRU eviction policy by
turning self._cache into a collections.OrderedDict (or using functools.lru_cache
around _resolve_and_check) and on set: if len(self._cache) >= self._cache_max
call self._cache.popitem(last=False) to evict the least-recently-used entry,
then assign self._cache[hostname] = reason and call
self._cache.move_to_end(hostname) on hits; update the cache initialization and
any cache-access sites accordingly (references: self._cache, self._cache_max,
_resolve_and_check).
In `@playbooks/output.py`:
- Around line 26-36: The call to extract_structured_output currently discards
the returned token counts; capture the full tuple (data, input_tokens,
output_tokens) from extract_structured_output and either return the token counts
alongside data (e.g., return data, input_tokens, output_tokens) or log them for
cost/usage tracking (use your logger). Update the error path to mirror the new
return type or logging behavior and adjust any callers of this function
accordingly; reference extract_structured_output, the variables input_tokens and
output_tokens, and playbook_name when adding contextual log messages.
- Around line 35-36: The bare "except Exception: return None" in
playbooks/output.py swallows errors; modify that except to "except Exception as
e:" and log the failure (e.g. logger.warning or logger.debug) including the
exception message and optional traceback before returning None so extraction
failures are visible; use the existing logger or import logging and call
logger.exception/logger.debug with the exception to capture details.
In `@playbooks/recovery.py`:
- Around line 135-145: The recovery handoff currently forces
allowed_actions=ALL_ACTIONS when calling run_agent which can bypass playbook
guardrails; instead derive the allowed actions from the playbook guardrails (use
the runtime config returned by playbook.guardrails.to_runtime_config() or a
helper like guardrails.allowed_actions()) and pass that to run_agent (or omit
allowed_actions so ActionRouter's guardrail_config is authoritative) so the
recovery agent is constrained by the same guardrail-derived action set as the
original run; update the call site where run_agent(...) is invoked (referencing
run_agent, ActionRouter, allowed_actions, ALL_ACTIONS, and
playbook.guardrails.to_runtime_config()) accordingly.
In `@playbooks/runner.py`:
- Around line 131-135: The getattr call is unnecessary because _output_schema is
always set in __init__; replace getattr(self, "_output_schema", None) with
direct attribute access self._output_schema in the extract_structured_data call
(the call using extract_structured_data(step_results,
playbook_name=playbook.name, output_schema=...)). Update the invocation in the
Runner method where step_results and playbook.name are passed so output_schema
uses self._output_schema.
- Around line 187-194: The compatibility wrapper in run_playbook (or wherever
this wrapper lives) calls the private policy method _llm_complete_remaining via
self._get_recovery_policy(), which is fragile; change the policy API to expose a
public method (e.g., llm_complete_remaining or complete_remaining_with_llm) on
StepRecoveryPolicy and update this wrapper to call that public method instead of
_llm_complete_remaining, or alternatively add a thin public forwarder on the
policy class that delegates to the existing private implementation; update
references of _llm_complete_remaining in the wrapper to the new public method to
avoid depending on private internals.
- Around line 172-177: The _get_recovery_policy method currently does redundant
lazy initialization even though __init__ already assigns self._recovery;
simplify _get_recovery_policy to directly return self._recovery (remove the
getattr/creation branch) so the method just returns the existing
StepRecoveryPolicy instance created in __init__ (method names:
_get_recovery_policy, _create_recovery_policy, and attribute _recovery).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e22df133-3e8e-4492-ac30-060389300535
📒 Files selected for processing (17)
- api/modal_app.py
- api/recording_service.py
- api/run_service.py
- api/server.py
- bridge/background.py
- bridge/execution.py
- bridge/observation.py
- bridge/router.py
- bridge/tool_result.py
- evaluation/runner.py
- evaluation/runtime.py
- guardrails/destructive.py
- guardrails/dns.py
- guardrails/engine.py
- playbooks/output.py
- playbooks/recovery.py
- playbooks/runner.py
…to focused services
Summary
This PR restructures several orchestration-heavy modules into smaller, single-purpose components without
changing the external API surface or core runtime behavior.
The main goal is to improve readability and maintainability by separating transport/bootstrap code from
business logic, isolating cross-cutting execution concerns, and reducing duplication in playbook/
evaluation flows.
What Changed
Why
recovery logic were interleaved.
Validation
Summary by CodeRabbit
New Features
Refactor