Refactor run/session lifecycle into dedicated packages and tighten browser execution boundaries by hwuiwon · Pull Request #47 · AppliedLabsAI/cua

hwuiwon · 2026-03-31T04:25:41Z

Summary

This PR reorganizes the run/session lifecycle code into clearer packages, extracts repeated lifecycle
logic into dedicated helpers, and tightens the bridge/playbook boundaries to make the codebase easier to
reason about and test.

What changed

Moved API run lifecycle code into api/runs/
- api/runs/service.py
- api/runs/registry.py
- api/runs/store.py
Moved agent session lifecycle code into agent/session/
- agent/session/runner.py
- agent/session/finalizer.py
Updated imports across the app, scripts, evaluation flow, and tests to use the new package layout
Updated README.md project structure to reflect the new directories

Refactors

Extracted sandbox terminal-state handling into RunOutcome and RunFinalizer
- centralizes persistence, cleanup, metrics, and trace finalization
- removes repeated success/failure/cancel/setup-error handling from the session runner
Extracted persisted run status/replay handling into PersistedRunStore
- narrows RunService to orchestration concerns
- simplifies persisted SSE fallback behavior
Cleaned up ActionRouter
- separated action recording from dispatch
- added explicit post-navigation validation flow
- replaced verifier private-method access with a public check_post_navigation() hook
- isolated DOM blinders post-filtering
Cleaned up browser execution
- replaced private browser constant imports with public exports
- introduced SequenceExecutor to isolate execute_sequence handling
Simplified playbook runner/test seams
- PlaybookRunner now delegates directly to StepRecoveryPolicy
- tests target bind_step_params and StepRecoveryPolicy directly instead of private runner helpers
Fixed the runtime import regression caused by the browser constant rename in bridge/observation.py
Fixed an invalid f-string in playbooks/recovery.py
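
As a rough illustration of the RunOutcome extraction described above, the sketch below shows how named constructors can replace repeated success/failure/setup-error branches. Only the names RunOutcome, from_agent_result, and setup_failed come from this PR; the fields, status strings, and exit codes are assumptions made for the example and are not the PR's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass(frozen=True)
class RunOutcome:
    """Illustrative terminal state of a run; all fields are assumptions."""

    status: str               # e.g. "completed" or "failed"
    exit_code: int
    error: str | None = None

    @classmethod
    def from_agent_result(cls, result: Any) -> RunOutcome:
        # Map the agent's success flag onto a terminal status and exit code.
        if result.success:
            return cls("completed", 0)
        return cls("failed", 1, error=getattr(result, "error", None))

    @classmethod
    def setup_failed(cls, run_id: str, exc: BaseException) -> RunOutcome:
        # Setup errors never reach the agent loop, so the outcome is built
        # from the exception instead of an agent result.
        return cls("failed", 2, error=f"{run_id}: {type(exc).__name__}: {exc}")


# Stand-in agent result; the real AgentResult type is not shown in this PR page.
ok = RunOutcome.from_agent_result(SimpleNamespace(success=True))
failed = RunOutcome.setup_failed("run-123", RuntimeError("browser did not start"))
print(ok.status, failed.exit_code)
```

With every terminal path producing the same value type, a single finalizer can consume the outcome without knowing which path produced it.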

Tests

Added:

tests/test_run_service.py
tests/test_session_finalizer.py

Updated:

run registry / streaming imports
playbook tests to target public seams
blinders test for post-navigation verifier behavior

Verification

uv run ruff check .
uv run ty check
uv run pytest -q tests/test_run_service.py tests/test_run_registry.py tests/test_streaming.py tests/test_session_finalizer.py tests/test_dry_run.py tests/test_playbooks.py
uv run python -c "import api.server, agent.main, evaluation.runner, scripts.run_local; print('ok')"

Summary by CodeRabbit

New Features
- Session finalization with detailed run outcomes, exit codes, telemetry, and safe cleanup
- Persisted run status storage with replayable SSE streaming for run recovery
- Post-navigation URL validation to strengthen guardrails
Documentation
- README updated to reflect reorganized session and run service structure
Refactor
- Session and run lifecycle reorganized into dedicated packages; action routing and DOM/navigation flow clarified
Tests
- Added tests for finalization, persisted runs, and post-navigation checks

coderabbitai · 2026-03-31T04:25:55Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 211f02fe-8ff3-46d2-926b-0375b06ed028

📥 Commits

Reviewing files that changed from the base of the PR and between ab99288 and 96de8f9.

📒 Files selected for processing (8)

agent/session/__init__.py
agent/session/finalizer.py
api/runs/__init__.py
api/runs/service.py
api/runs/store.py
api/server.py
tests/test_run_service.py
tests/test_session_finalizer.py

📝 Walkthrough

Walkthrough

Reorganizes sandbox session lifecycle into a new agent/session/ package (runner + finalizer), introduces persisted run storage and SSE replay under api/runs/, removes the legacy agent/session_runner.py, refactors ActionRouter to drop recording dependency, and updates related imports and tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| README Project Structure `README.md` | Documented new `agent/session/` subdirectory and `api/runs/` structure; updated API/session role descriptions. |
| Session package (new) `agent/session/__init__.py`, `agent/session/runner.py`, `agent/session/finalizer.py` | Added `agent.session` package. `runner.py` provides `run_sandbox_session` orchestration with spans, browser launch, optional recording, blinders, agent loop. `finalizer.py` adds `RunOutcome` and `RunFinalizer` for persistence, recording commit, cleanup, and telemetry. |
| Removed legacy session `agent/session_runner.py` | Deleted previous sandbox session lifecycle implementation (migrated to `agent/session/`). |
| Session entrypoint update `agent/main.py` | Updated import to load `run_sandbox_session` from `agent.session.runner`. |
| API runs package (new) `api/runs/__init__.py`, `api/runs/store.py`, `api/runs/service.py` | Added `PersistedRunStore` to load `status.json` and build SSE replay streams; `service.py` now delegates persisted-status and event-stream handling to the store and centralizes 404/replay logic. |
| API import updates `api/recording_service.py`, `api/server.py`, `tests/test_dry_run.py`, `tests/test_run_registry.py`, `tests/test_streaming.py` | Redirected imports to `api.runs.*` modules (registry/service) where applicable. |
| ActionRouter refactor `bridge/router.py`, `evaluation/runner.py`, `playbooks/recovery.py`, `scripts/run_local.py` | Removed `recording` parameter from `ActionRouter` and call sites; refactored dispatch to centralize post-navigation logic into `_post_navigation_phase`, `_check_post_navigation`, `_apply_dom_blinders`, and added `_record_action` for logging/persistence. |
| Sequence execution refactor `bridge/execution.py` | Replaced private `_execute_sequence` with frozen `SequenceExecutor` dataclass; moved per-step validation/error handling into methods; updated call sites. |
| DOM constants & observation `bridge/browser.py`, `bridge/observation.py` | Renamed `_DOM_MAX_CHARS` → `DOM_MAX_CHARS` and `_AUTO_DOM_MAX_CHARS` → `AUTO_DOM_MAX_CHARS`; updated observation defaults to use public constants. |
| Blinders verifier extension `blinders/verifier.py`, `tests/test_blinders.py` | Added `ScopeVerifier.check_post_navigation(url)` to perform domain scoping and guardrail checks after navigation; added test for post-navigation behavior. |
| Playbooks runner cleanup `playbooks/runner.py`, `playbooks/recovery.py`, `tests/test_playbooks.py` | Simplified PlaybookRunner recovery wiring (direct `StepRecoveryPolicy` use), removed private helpers, and updated tests to use public helpers/policy. |
| Session finalizer & run service tests `tests/test_session_finalizer.py`, `tests/test_run_service.py` | Added tests for `RunOutcome.from_agent_result` and `RunFinalizer.finalize`; added persisted-run store and `RunService` fallback tests exercising SSE replay and not-found behavior. |
| Misc tests/import fixes `tests/...` (various) | Updated test imports to new `api.runs` locations and adjusted small call sites to match refactors. |
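
To make the post-navigation check in the table above more concrete, here is a minimal sketch of domain scoping applied to the URL a navigation actually landed on. It is a free function with a hypothetical ScopeViolation exception and an allowed_domains parameter, all of which are assumptions; the PR's ScopeVerifier.check_post_navigation(url) is a method whose internals are not shown here.

```python
from __future__ import annotations

from urllib.parse import urlparse


class ScopeViolation(Exception):
    """Hypothetical error raised when navigation lands outside the task scope."""


def check_post_navigation(url: str, allowed_domains: set[str]) -> None:
    # urlparse lowercases the hostname; it is None for URLs without a scheme.
    host = urlparse(url).hostname or ""
    # Accept exact matches and subdomains, but not lookalike suffixes
    # such as "notexample.com" for "example.com".
    in_scope = any(host == d or host.endswith("." + d) for d in allowed_domains)
    if not in_scope:
        raise ScopeViolation(f"navigation left scope: {host or url}")


check_post_navigation("https://app.example.com/orders", {"example.com"})
```

Checking the final URL rather than the requested one is what lets this kind of guard catch redirects that leave the allowed domains.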

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent/Main
    participant Runner as run_sandbox_session
    participant Browser as BrowserManager
    participant Recording as RecordingManager
    participant Blinders as DOMBlinders
    participant Router as ActionRouter
    participant LLM as Agent/Model
    participant Finalizer as RunFinalizer

    Agent->>Runner: run_sandbox_session(run_id, config, ...)
    activate Runner
    Runner->>Browser: Launch browser
    activate Browser
    alt recording enabled
        Runner->>Recording: Start recording
        activate Recording
    end
    Runner->>Blinders: Extract task scope & build blinders
    Runner->>Router: Create ActionRouter(browser, blinders, ...)
    loop agent loop
        Runner->>LLM: run_agent(router,...)
        LLM->>Router: push_action / execute
        Router->>Browser: perform browser action
        Router->>Blinders: apply DOM blinders / guardrails
        Router-->>LLM: action result
    end
    Runner->>Finalizer: finalize(outcome, result)
    activate Finalizer
    Finalizer->>Finalizer: persist status (complete_run / persist_status)
    alt recording present
        Finalizer->>Recording: stop & commit/upload recordings
        deactivate Recording
    end
    Finalizer->>Browser: close browser
    deactivate Browser
    Finalizer->>Finalizer: emit telemetry & update metrics
    deactivate Finalizer
    deactivate Runner
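
The teardown order in the diagram above (persist status, stop recording, close browser, emit telemetry) could be sketched roughly as follows. The class name, constructor parameters, and the choice to log and continue when a step fails are all assumptions for illustration; the PR's RunFinalizer may handle errors differently.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FinalizerSketch:
    """Runs teardown steps in the order shown in the sequence diagram."""

    def __init__(self, persist, browser, emit, recording=None):
        self._persist = persist      # async callable taking the outcome
        self._browser = browser      # object with an async close()
        self._emit = emit            # async callable taking the outcome
        self._recording = recording  # object with an async stop(), or None

    async def _guarded(self, label, coro):
        # Assumption: a failing step is logged and does not block later steps.
        try:
            await coro
        except Exception:
            logger.exception("finalize step %r failed", label)

    async def finalize(self, outcome):
        await self._guarded("persist", self._persist(outcome))
        if self._recording is not None:
            await self._guarded("recording", self._recording.stop())
        await self._guarded("browser", self._browser.close())
        await self._guarded("telemetry", self._emit(outcome))
        return outcome
```

Guarding each step separately means a failed recording upload would still leave the browser closed and telemetry emitted, which is one way to read the "safe cleanup" mentioned in the summary.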

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Refactor run/session lifecycle into dedicated packages and tighten browser execution boundaries #47 — Implements the same refactor moving session lifecycle into agent/session and introducing api/runs; closely overlaps runner/finalizer changes.
Add SSE replay, event persistence, and multi-container support #20 — Adds persisted run storage and SSE replay support; directly related to api/runs/store.py and api/runs/service.py.
Add structured run errors, reduce stuck false positives, and harden sandbox shutdown #38 — Modifies run_sandbox_session relocation and error-persistence/finalization paths; related to runner/finalizer relocation.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main refactoring effort: reorganizing run/session lifecycle code into dedicated packages and improving browser execution boundaries. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor-2

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

agent/session/finalizer.py (1)

101-105: Consider adding a type hint for the result parameter.

The result parameter lacks a type annotation, which reduces IDE support and type checker coverage. If circular import is a concern, consider using TYPE_CHECKING with a string annotation.

💡 Suggested type annotation

+    from typing import TYPE_CHECKING
+    if TYPE_CHECKING:
+        from agent.loop import AgentResult
+
     @classmethod
     def from_agent_result(
         cls,
-        result,
+        result: "AgentResult",
     ) -> RunOutcome:

Or at minimum, document the expected interface in a docstring.
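
For reference, a self-contained version of the TYPE_CHECKING pattern might look like the sketch below. It places the guarded import at module level and uses postponed annotations instead of a quoted string; the agent.loop.AgentResult path is taken from the suggestion above, while the single-field RunOutcome is a simplified stand-in.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers, so nothing is imported at runtime
    # and a circular import between the two modules cannot occur.
    from agent.loop import AgentResult


@dataclass
class RunOutcome:
    success: bool  # simplified stand-in for the real outcome fields

    @classmethod
    def from_agent_result(cls, result: AgentResult) -> RunOutcome:
        # With postponed annotations, AgentResult is never resolved at runtime.
        return cls(success=bool(result.success))


# Any object with a `success` attribute works at runtime.
outcome = RunOutcome.from_agent_result(SimpleNamespace(success=True))
```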

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@agent/session/finalizer.py` around lines 101 - 105, The classmethod
from_agent_result is missing a type annotation for its result parameter; add an
appropriate type hint (e.g., the AgentResult type or a protocol describing the
expected interface) to the result parameter in from_agent_result and update
imports using typing.TYPE_CHECKING with a string annotation if needed to avoid
circular imports, or use a forward-reference string like "AgentResult" and/or
document the expected attributes in the method docstring; ensure the return type
remains RunOutcome and adjust imports to include typing.TYPE_CHECKING if you
import AgentResult only for type checking.

tests/test_session_finalizer.py (1)

16-18: Consider documenting the stub's expected fields.

The _AgentResult stub declares only success: bool but tests pass additional fields via SimpleNamespace kwargs. While this works, it could mask issues if the real AgentResult schema changes.

💡 Optional: Add comments documenting expected fields

 class _AgentResult(SimpleNamespace):
+    """Stub for AgentResult. Expected fields depend on test:
+    - success, summary, data, extracted_texts, error (for from_agent_result)
+    - action_count, total_input_tokens, total_output_tokens, total_duration_ms (for finalize)
+    """
     success: bool
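
A stricter alternative to documenting the stub is to declare its fields explicitly, so a misspelled or unexpected keyword fails at construction time. The sketch below uses the field names listed in the suggested docstring; the types and defaults are assumptions, since the real AgentResult schema is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentResultStub:
    """Test stub with explicit fields; types and defaults are assumptions."""

    success: bool
    # Fields read by from_agent_result
    summary: str = ""
    data: dict[str, Any] = field(default_factory=dict)
    extracted_texts: list[str] = field(default_factory=list)
    error: str | None = None
    # Fields read by finalize
    action_count: int = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_duration_ms: int = 0


stub = AgentResultStub(success=True, action_count=3)
```

Unlike SimpleNamespace, the dataclass rejects unknown keyword arguments with a TypeError, so a stale test would fail loudly once the stub is updated to match a changed schema.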

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_session_finalizer.py` around lines 16 - 18, The _AgentResult test
stub currently subclasses SimpleNamespace and only declares success: bool, which
can hide missing fields passed via kwargs; update the _AgentResult stub in
tests/test_session_finalizer.py to document (via an inline comment or docstring)
the expected fields used by tests (e.g., success, message, output, etc.) and,
optionally, add explicit attributes with types to match the real AgentResult
shape so the tests fail if the real schema changes; reference the _AgentResult
class and SimpleNamespace base when making these additions.

agent/session/runner.py (1)

70-76: Consider extracting finalizer creation to avoid duplication.

The RunFinalizer is created twice: once before setup (with recording=None) and again after successful setup (with the actual recording). While this is logically correct (early failure has no recording), the duplication could be reduced.

💡 Optional: Create finalizer only after setup or use a builder pattern

One approach is to defer finalizer creation until after setup:

         browser = BrowserManager()
-        finalizer = RunFinalizer(
-            run_id=run_id,
-            browser=browser,
-            recording=recording,
-            recording_upload=config.recording_config.upload,
-        )
+        finalizer: RunFinalizer | None = None
         try:
             with tracer.start_as_current_span(AGENT_SETUP) as setup_span:
                 # ... setup code ...
         except Exception as exc:
             logger.error("Setup failed: %s", exc)
             run_span.record_exception(exc)
             outcome = RunOutcome.setup_failed(run_id, exc)
             run_span.set_status(outcome.trace_status, outcome.trace_message or "")
+            finalizer = RunFinalizer(run_id=run_id, browser=browser, recording=None, recording_upload=False)
             return await finalizer.finalize(outcome)

         finalizer = RunFinalizer(
             run_id=run_id,
             browser=browser,
             recording=recording,
             recording_upload=config.recording_config.upload,
         )

This makes it clearer that the finalizer configuration depends on setup success.

Also applies to: 117-122
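
The helper-function variant of this suggestion could look roughly like the sketch below. RunFinalizer is replaced by a stand-in dataclass with the four constructor arguments visible in the diff, and build_finalizer is a hypothetical helper; disabling upload when there is no recording is an assumption that mirrors the setup-failure path in the diff.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RunFinalizer:
    """Stand-in carrying only the constructor arguments visible in the diff."""

    run_id: str
    browser: Any
    recording: Optional[Any]
    recording_upload: bool


def build_finalizer(
    run_id: str,
    browser: Any,
    recording: Optional[Any] = None,
    upload: bool = False,
) -> RunFinalizer:
    # Assumption: uploading is meaningless without a recording, so force it off.
    return RunFinalizer(
        run_id=run_id,
        browser=browser,
        recording=recording,
        recording_upload=upload and recording is not None,
    )


browser = object()  # placeholder for BrowserManager()
early = build_finalizer("run-123", browser)                              # setup failed
ready = build_finalizer("run-123", browser, recording=object(), upload=True)
```

Both code paths then call the same function and differ only in the recording argument, which removes the duplicated constructor call without changing when the finalizer is created.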

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@agent/session/runner.py` around lines 70 - 76, Refactor to eliminate the
duplicated RunFinalizer construction by deferring or centralizing its creation:
remove the initial RunFinalizer(...) instantiation before setup and instead
create the finalizer after setup succeeds using the actual recording (use
run_id, BrowserManager(), recording, and config.recording_config.upload), or add
a small helper/builder function (e.g., build_finalizer(run_id, browser,
recording, upload)) that both pre- and post-setup code paths call with the
appropriate recording value; update any references to the finalizer variable
accordingly (locations around the current RunFinalizer usage and the later
re-creation at lines that reference recording and finalizer).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/runs/store.py`:
- Around line 20-22: The constructor __init__ should annotate the volume
parameter as optional (e.g., Optional[...] or the correct volume type) and store
it as an Optional on self; then guard any usage of self._volume.reload.aio()
(and similar calls) with a None check or early return so that when volume is
None (local/non-Modal) you don't call attributes on None; update the type import
(typing.Optional) and ensure methods that call self._volume (referenced by
self._volume.reload.aio()) handle the None case safely.
- Around line 62-64: The SSE "complete" event currently serializes only the
status field; update the yield in the generator that emits the complete event so
it includes the full RunStatus object from the persisted run (the same RunStatus
that contains result, error, data, duration_ms, etc.) instead of {'status':
persisted.status}. Serialize the entire RunStatus (e.g., convert
persisted.status to a dict/JSON via its existing
to_dict()/dict()/asdict()/__dict__ or JSON helper) and use json.dumps on that
object when forming the f"event: complete\ndata: ..." payload so clients receive
the full run outcome.
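
Both inline comments can be illustrated together with a minimal sketch: an optional volume that is reloaded only when present, and a "complete" SSE event that carries the whole status object. The RunStatus fields, the class name, and the replay method are assumptions for the example; only the `self._volume.reload.aio()` call and the event name come from the comments above.

```python
from __future__ import annotations

import asyncio
import json
from dataclasses import asdict, dataclass
from typing import Any, AsyncIterator, Optional


@dataclass
class RunStatus:
    """Stand-in for the persisted status; the field set is an assumption."""

    status: str
    result: Optional[str] = None
    error: Optional[str] = None
    duration_ms: Optional[int] = None


class PersistedRunStoreSketch:
    def __init__(self, volume: Optional[Any] = None) -> None:
        self._volume = volume  # None when running locally without a volume

    async def _reload(self) -> None:
        # Guard the reload so local runs never touch attributes on None.
        if self._volume is None:
            return
        await self._volume.reload.aio()

    async def replay(self, persisted: RunStatus) -> AsyncIterator[str]:
        await self._reload()
        # Serialize the full status, not just the status string.
        payload = json.dumps(asdict(persisted))
        yield f"event: complete\ndata: {payload}\n\n"


async def _collect() -> list[str]:
    store = PersistedRunStoreSketch()  # no volume: local mode
    status = RunStatus("completed", result="ok", duration_ms=12)
    return [chunk async for chunk in store.replay(status)]


chunks = asyncio.run(_collect())
```

Because json.dumps emits a single line by default, the payload fits in one `data:` field and the blank line after it terminates the event.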


🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3644c5ba-722f-4d45-b4c5-080407ec20c1

📥 Commits

Reviewing files that changed from the base of the PR and between ff9b2a9 and 3bcb639.

📒 Files selected for processing (28)

README.md
agent/main.py
agent/session/__init__.py
agent/session/finalizer.py
agent/session/runner.py
agent/session_runner.py
api/recording_service.py
api/runs/__init__.py
api/runs/registry.py
api/runs/service.py
api/runs/store.py
api/server.py
blinders/verifier.py
bridge/browser.py
bridge/execution.py
bridge/observation.py
bridge/router.py
evaluation/runner.py
playbooks/recovery.py
playbooks/runner.py
scripts/run_local.py
tests/test_blinders.py
tests/test_dry_run.py
tests/test_playbooks.py
tests/test_run_registry.py
tests/test_run_service.py
tests/test_session_finalizer.py
tests/test_streaming.py

💤 Files with no reviewable changes (3)

evaluation/runner.py
scripts/run_local.py
agent/session_runner.py

Refactor run/session lifecycle into dedicated packages and tighten br…

3bcb639

…owser execution boundaries

github-code-quality Bot found potential problems Mar 31, 2026

View reviewed changes

Comment thread api/runs/service.py Fixed

Comment thread tests/test_run_service.py Fixed

fix

ab99288

coderabbitai Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread api/runs/store.py Outdated

Comment thread api/runs/store.py Outdated

hwuiwon added 2 commits March 31, 2026 00:35

fix

177e8cf

fix

96de8f9

hwuiwon merged commit fd4a818 into main Mar 31, 2026
3 checks passed

hwuiwon deleted the refactor-2 branch March 31, 2026 04:37

coderabbitai Bot mentioned this pull request Apr 7, 2026

Fix sandbox errors, add project docs, and update stale documentation #56

Merged

4 tasks
