
Refactor run/session lifecycle into dedicated packages and tighten browser execution boundaries #47

Merged: hwuiwon merged 4 commits into main from refactor-2 on Mar 31, 2026
Conversation

@hwuiwon (Collaborator) commented Mar 31, 2026

Summary

This PR reorganizes the run/session lifecycle code into clearer packages, extracts repeated lifecycle
logic into dedicated helpers, and tightens the bridge/playbook boundaries to make the codebase easier to
reason about and test.

What changed

  • Moved API run lifecycle code into api/runs/
    • api/runs/service.py
    • api/runs/registry.py
    • api/runs/store.py
  • Moved agent session lifecycle code into agent/session/
    • agent/session/runner.py
    • agent/session/finalizer.py
  • Updated imports across the app, scripts, evaluation flow, and tests to use the new package layout
  • Updated README.md project structure to reflect the new directories

Refactors

  • Extracted sandbox terminal-state handling into RunOutcome and RunFinalizer
    • centralizes persistence, cleanup, metrics, and trace finalization
    • removes repeated success/failure/cancel/setup-error handling from the session runner
  • Extracted persisted run status/replay handling into PersistedRunStore
    • narrows RunService to orchestration concerns
    • simplifies persisted SSE fallback behavior
  • Cleaned up ActionRouter
    • separated action recording from dispatch
    • added explicit post-navigation validation flow
    • replaced verifier private-method access with a public check_post_navigation() hook
    • isolated DOM blinders post-filtering
  • Cleaned up browser execution
    • replaced private browser constant imports with public exports
    • introduced SequenceExecutor to isolate execute_sequence handling
  • Simplified playbook runner/test seams
    • PlaybookRunner now delegates directly to StepRecoveryPolicy
    • tests target bind_step_params and StepRecoveryPolicy directly instead of private runner helpers
  • Fixed the runtime import regression caused by the browser constant rename in bridge/observation.py
  • Fixed an invalid f-string in playbooks/recovery.py
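
To illustrate the RunOutcome / RunFinalizer split described above, here is a minimal sketch of the pattern, not the code from this PR. The field names, exit-code values, the in-memory `statuses` dict, and `FakeBrowser` are all assumptions made for the example; only the names `RunOutcome`, `RunFinalizer`, `setup_failed`, and `finalize` come from the PR text.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunOutcome:
    """Terminal state of a run. Field names and exit codes are illustrative."""
    run_id: str
    status: str  # e.g. "completed", "failed", "cancelled", "setup_failed"
    exit_code: int
    error: Optional[str] = None

    @classmethod
    def succeeded(cls, run_id: str) -> "RunOutcome":
        return cls(run_id, "completed", 0)

    @classmethod
    def setup_failed(cls, run_id: str, exc: Exception) -> "RunOutcome":
        return cls(run_id, "setup_failed", 2, error=str(exc))


class FakeBrowser:
    """Hypothetical stand-in for the real browser manager."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class RunFinalizer:
    """One place that persists status and cleans up, whatever the outcome."""

    def __init__(self, run_id: str, browser: FakeBrowser, statuses: dict) -> None:
        self.run_id = run_id
        self.browser = browser
        self.statuses = statuses  # hypothetical in-memory status store

    async def finalize(self, outcome: RunOutcome) -> int:
        try:
            self.statuses[self.run_id] = outcome.status
        finally:
            # Cleanup runs even if persistence raises.
            await self.browser.close()
        return outcome.exit_code


async def demo() -> tuple:
    statuses: dict = {}
    browser = FakeBrowser()
    finalizer = RunFinalizer("run-1", browser, statuses)
    outcome = RunOutcome.setup_failed("run-1", RuntimeError("no browser"))
    code = await finalizer.finalize(outcome)
    return code, statuses, browser.closed


result = asyncio.run(demo())
```

With this shape the session runner only has to build an outcome and hand it over; the success, failure, cancel, and setup-error paths all go through the same `finalize` call.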

Tests

Added:

  • tests/test_run_service.py
  • tests/test_session_finalizer.py

Updated:

  • run registry / streaming imports
  • playbook tests to target public seams
  • blinders test for post-navigation verifier behavior

Verification

  • uv run ruff check .
  • uv run ty check
  • uv run pytest -q tests/test_run_service.py tests/test_run_registry.py tests/test_streaming.py tests/test_session_finalizer.py tests/test_dry_run.py tests/test_playbooks.py
  • uv run python -c "import api.server, agent.main, evaluation.runner, scripts.run_local; print('ok')"
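
The last check is an import smoke test. The same idea could be written as a small helper that reports every failing module instead of stopping at the first one. This is a sketch; `smoke_import` is a hypothetical name, and the demo call uses placeholder module names rather than the project's modules.

```python
import importlib


def smoke_import(module_names):
    """Try to import each module; return {name: error text} for the failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # any import-time error counts as a failure
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Placeholder names for the demo; in the repo this would be the list
# api.server, agent.main, evaluation.runner, scripts.run_local.
failures = smoke_import(["json", "definitely_not_a_real_module_xyz"])
```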

Summary by CodeRabbit

  • New Features

    • Session finalization with detailed run outcomes, exit codes, telemetry, and safe cleanup
    • Persisted run status storage with replayable SSE streaming for run recovery
    • Post-navigation URL validation to strengthen guardrails
  • Documentation

    • README updated to reflect reorganized session and run service structure
  • Refactor

    • Session and run lifecycle reorganized into dedicated packages; action routing and DOM/navigation flow clarified
  • Tests

    • Added tests for finalization, persisted runs, and post-navigation checks
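
For readers unfamiliar with the post-navigation check, a rough sketch of what such a hook might look like follows. Only the names `ScopeVerifier` and `check_post_navigation(url)` come from this PR; the `allowed_domains` field, the return convention (None for allowed, a reason string otherwise), and the matching rules are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ScopeVerifier:
    allowed_domains: tuple  # hypothetical config, e.g. ("example.com",)

    def check_post_navigation(self, url: str) -> Optional[str]:
        """Return None if the landed URL is in scope, else a reason string."""
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            return f"blocked scheme: {parts.scheme or '(none)'}"
        host = (parts.hostname or "").lower()
        for domain in self.allowed_domains:
            # Accept the domain itself or any subdomain of it.
            if host == domain or host.endswith("." + domain):
                return None
        return f"out-of-scope host: {host}"


verifier = ScopeVerifier(allowed_domains=("example.com",))
in_scope = verifier.check_post_navigation("https://shop.example.com/cart")
lookalike = verifier.check_post_navigation("https://example.com.evil.test/")
```

Checking after navigation, rather than only before, covers redirects that land on a host the original request never named.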

coderabbitai Bot commented Mar 31, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 211f02fe-8ff3-46d2-926b-0375b06ed028

📥 Commits

Reviewing files that changed from the base of the PR and between ab99288 and 96de8f9.

📒 Files selected for processing (8)
  • agent/session/__init__.py
  • agent/session/finalizer.py
  • api/runs/__init__.py
  • api/runs/service.py
  • api/runs/store.py
  • api/server.py
  • tests/test_run_service.py
  • tests/test_session_finalizer.py

Walkthrough

Reorganizes sandbox session lifecycle into a new agent/session/ package (runner + finalizer), introduces persisted run storage and SSE replay under api/runs/, removes the legacy agent/session_runner.py, refactors ActionRouter to drop recording dependency, and updates related imports and tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| README Project Structure: README.md | Documented new agent/session/ subdirectory and api/runs/ structure; updated API/session role descriptions. |
| Session package (new): agent/session/__init__.py, agent/session/runner.py, agent/session/finalizer.py | Added agent.session package. runner.py provides run_sandbox_session orchestration with spans, browser launch, optional recording, blinders, agent loop. finalizer.py adds RunOutcome and RunFinalizer for persistence, recording commit, cleanup, and telemetry. |
| Removed legacy session: agent/session_runner.py | Deleted previous sandbox session lifecycle implementation (migrated to agent/session/). |
| Session entrypoint update: agent/main.py | Updated import to load run_sandbox_session from agent.session.runner. |
| API runs package (new): api/runs/__init__.py, api/runs/store.py, api/runs/service.py | Added PersistedRunStore to load status.json and build SSE replay streams; service.py now delegates persisted-status and event-stream handling to the store and centralizes 404/replay logic. |
| API import updates: api/recording_service.py, api/server.py, tests/test_dry_run.py, tests/test_run_registry.py, tests/test_streaming.py | Redirected imports to api.runs.* modules (registry/service) where applicable. |
| ActionRouter refactor: bridge/router.py, evaluation/runner.py, playbooks/recovery.py, scripts/run_local.py | Removed recording parameter from ActionRouter and call sites; refactored dispatch to centralize post-navigation logic into _post_navigation_phase, _check_post_navigation, _apply_dom_blinders, and added _record_action for logging/persistence. |
| Sequence execution refactor: bridge/execution.py | Replaced private _execute_sequence with frozen SequenceExecutor dataclass; moved per-step validation/error handling into methods; updated call sites. |
| DOM constants & observation: bridge/browser.py, bridge/observation.py | Renamed _DOM_MAX_CHARS to DOM_MAX_CHARS and _AUTO_DOM_MAX_CHARS to AUTO_DOM_MAX_CHARS; updated observation defaults to use public constants. |
| Blinders verifier extension: blinders/verifier.py, tests/test_blinders.py | Added ScopeVerifier.check_post_navigation(url) to perform domain scoping and guardrail checks after navigation; added test for post-navigation behavior. |
| Playbooks runner cleanup: playbooks/runner.py, playbooks/recovery.py, tests/test_playbooks.py | Simplified PlaybookRunner recovery wiring (direct StepRecoveryPolicy use), removed private helpers, and updated tests to use public helpers/policy. |
| Session finalizer & run service tests: tests/test_session_finalizer.py, tests/test_run_service.py | Added tests for RunOutcome.from_agent_result and RunFinalizer.finalize; added persisted-run store and RunService fallback tests exercising SSE replay and not-found behavior. |
| Misc tests/import fixes: tests/... (various) | Updated test imports to new api.runs locations and adjusted small call sites to match refactors. |
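
As an illustration of the "frozen SequenceExecutor dataclass" row, the sketch below shows one way to move per-step validation and error handling into methods. It assumes a step is a dict with an "action" key and that execution stops at the first failure; the real step schema, result shape, and stop policy in bridge/execution.py are not shown in this PR page, so treat every detail other than the class name as hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class SequenceExecutor:
    run_step: Callable[[dict], Awaitable[str]]  # hypothetical per-step handler
    max_steps: int = 10

    def _validate(self, step: object) -> Optional[str]:
        """Return an error message for a malformed step, or None if it is fine."""
        if not isinstance(step, dict) or "action" not in step:
            return "step must be a dict with an 'action' key"
        return None

    async def execute(self, steps: list) -> list:
        results = []
        for index, step in enumerate(steps[: self.max_steps]):
            problem = self._validate(step)
            if problem is not None:
                results.append({"index": index, "ok": False, "error": problem})
                break  # stop at the first invalid step
            try:
                output = await self.run_step(step)
            except Exception as exc:
                results.append({"index": index, "ok": False, "error": str(exc)})
                break  # stop at the first failing step
            results.append({"index": index, "ok": True, "output": output})
        return results


async def fake_step(step: dict) -> str:
    """Demo handler: fails on the 'boom' action, succeeds otherwise."""
    if step["action"] == "boom":
        raise RuntimeError("click failed")
    return f"did {step['action']}"


results = asyncio.run(
    SequenceExecutor(fake_step).execute(
        [{"action": "click"}, {"action": "boom"}, {"action": "type"}]
    )
)
```

Freezing the dataclass keeps the executor's configuration immutable after construction, which makes it safe to share between call sites.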

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent/Main
    participant Runner as run_sandbox_session
    participant Browser as BrowserManager
    participant Recording as RecordingManager
    participant Blinders as DOMBlinders
    participant Router as ActionRouter
    participant LLM as Agent/Model
    participant Finalizer as RunFinalizer

    Agent->>Runner: run_sandbox_session(run_id, config, ...)
    activate Runner
    Runner->>Browser: Launch browser
    activate Browser
    alt recording enabled
        Runner->>Recording: Start recording
        activate Recording
    end
    Runner->>Blinders: Extract task scope & build blinders
    Runner->>Router: Create ActionRouter(browser, blinders, ...)
    loop agent loop
        Runner->>LLM: run_agent(router,...)
        LLM->>Router: push_action / execute
        Router->>Browser: perform browser action
        Router->>Blinders: apply DOM blinders / guardrails
        Router-->>LLM: action result
    end
    Runner->>Finalizer: finalize(outcome, result)
    activate Finalizer
    Finalizer->>Finalizer: persist status (complete_run / persist_status)
    alt recording present
        Finalizer->>Recording: stop & commit/upload recordings
        deactivate Recording
    end
    Finalizer->>Browser: close browser
    deactivate Browser
    Finalizer->>Finalizer: emit telemetry & update metrics
    deactivate Finalizer
    deactivate Runner

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main refactoring effort: reorganizing run/session lifecycle code into dedicated packages and improving browser execution boundaries. |


Comment thread api/runs/service.py Fixed
Comment thread tests/test_run_service.py Fixed
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
agent/session/finalizer.py (1)

101-105: Consider adding a type hint for the result parameter.

The result parameter lacks a type annotation, which reduces IDE support and type checker coverage. If circular import is a concern, consider using TYPE_CHECKING with a string annotation.

💡 Suggested type annotation
+    from typing import TYPE_CHECKING
+    if TYPE_CHECKING:
+        from agent.loop import AgentResult
+
     @classmethod
     def from_agent_result(
         cls,
-        result,
+        result: "AgentResult",
     ) -> RunOutcome:

Or at minimum, document the expected interface in a docstring.
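
If importing AgentResult is awkward, the "protocol describing the expected interface" route could look roughly like the sketch below. `AgentResultLike` is a hypothetical name, the attribute list is a subset of the fields named in the test stub suggestion on this page, and the `RunOutcome` here is a simplified stand-in rather than the class in agent/session/finalizer.py.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional, Protocol


class AgentResultLike(Protocol):
    """Structural type: anything with these attributes is accepted."""
    success: bool
    summary: str
    error: Optional[str]


@dataclass(frozen=True)
class RunOutcome:
    status: str
    message: str

    @classmethod
    def from_agent_result(cls, result: AgentResultLike) -> "RunOutcome":
        if result.success:
            return cls("completed", result.summary)
        return cls("failed", result.error or "unknown error")


# Any object with matching attributes works at runtime, including test stubs.
outcome = RunOutcome.from_agent_result(
    SimpleNamespace(success=False, summary="", error="timeout")
)
```

A Protocol avoids the import cycle entirely, since the finalizer module never has to import the agent loop, while still giving the type checker something to verify call sites against.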

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent/session/finalizer.py` around lines 101 - 105, The classmethod
from_agent_result is missing a type annotation for its result parameter; add an
appropriate type hint (e.g., the AgentResult type or a protocol describing the
expected interface) to the result parameter in from_agent_result and update
imports using typing.TYPE_CHECKING with a string annotation if needed to avoid
circular imports, or use a forward-reference string like "AgentResult" and/or
document the expected attributes in the method docstring; ensure the return type
remains RunOutcome and adjust imports to include typing.TYPE_CHECKING if you
import AgentResult only for type checking.
tests/test_session_finalizer.py (1)

16-18: Consider documenting the stub's expected fields.

The _AgentResult stub declares only success: bool but tests pass additional fields via SimpleNamespace kwargs. While this works, it could mask issues if the real AgentResult schema changes.

💡 Optional: Add comments documenting expected fields
 class _AgentResult(SimpleNamespace):
+    """Stub for AgentResult. Expected fields depend on test:
+    - success, summary, data, extracted_texts, error (for from_agent_result)
+    - action_count, total_input_tokens, total_output_tokens, total_duration_ms (for finalize)
+    """
     success: bool
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_session_finalizer.py` around lines 16 - 18, The _AgentResult test
stub currently subclasses SimpleNamespace and only declares success: bool, which
can hide missing fields passed via kwargs; update the _AgentResult stub in
tests/test_session_finalizer.py to document (via an inline comment or docstring)
the expected fields used by tests (e.g., success, summary, data, error) and,
optionally, add explicit attributes with types to match the real AgentResult
shape so the tests fail if the real schema changes; reference the _AgentResult
class and SimpleNamespace base when making these additions.
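
One possible shape for a stricter stub is a dataclass that lists every field the tests rely on, so a misspelled or removed field fails at construction time. This is a sketch: `AgentResultStub` is a hypothetical name, the field names are taken from the docstring suggestion above, and the types and defaults are guesses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentResultStub:
    """Explicit test double; types and defaults below are assumptions."""
    success: bool
    summary: str = ""
    data: Optional[dict] = None
    extracted_texts: list = field(default_factory=list)
    error: Optional[str] = None
    action_count: int = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_duration_ms: int = 0


stub = AgentResultStub(success=True, summary="done")

# Unlike SimpleNamespace, an unknown keyword is rejected immediately.
try:
    AgentResultStub(success=True, sumary="typo")
    rejected = False
except TypeError:
    rejected = True
```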
agent/session/runner.py (1)

70-76: Consider extracting finalizer creation to avoid duplication.

The RunFinalizer is created twice: once before setup (with recording=None) and again after successful setup (with the actual recording). While this is logically correct (early failure has no recording), the duplication could be reduced.

💡 Optional: Create finalizer only after setup or use a builder pattern

One approach is to defer finalizer creation until after setup:

         browser = BrowserManager()
-        finalizer = RunFinalizer(
-            run_id=run_id,
-            browser=browser,
-            recording=recording,
-            recording_upload=config.recording_config.upload,
-        )
+        finalizer: RunFinalizer | None = None
         try:
             with tracer.start_as_current_span(AGENT_SETUP) as setup_span:
                 # ... setup code ...
         except Exception as exc:
             logger.error("Setup failed: %s", exc)
             run_span.record_exception(exc)
             outcome = RunOutcome.setup_failed(run_id, exc)
             run_span.set_status(outcome.trace_status, outcome.trace_message or "")
+            finalizer = RunFinalizer(run_id=run_id, browser=browser, recording=None, recording_upload=False)
             return await finalizer.finalize(outcome)

         finalizer = RunFinalizer(
             run_id=run_id,
             browser=browser,
             recording=recording,
             recording_upload=config.recording_config.upload,
         )

This makes it clearer that the finalizer configuration depends on setup success.

Also applies to: 117-122

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent/session/runner.py` around lines 70 - 76, Refactor to eliminate the
duplicated RunFinalizer construction by deferring or centralizing its creation:
remove the initial RunFinalizer(...) instantiation before setup and instead
create the finalizer after setup succeeds using the actual recording (use
run_id, BrowserManager(), recording, and config.recording_config.upload), or add
a small helper/builder function (e.g., build_finalizer(run_id, browser,
recording, upload)) that both pre- and post-setup code paths call with the
appropriate recording value; update any references to the finalizer variable
accordingly (locations around the current RunFinalizer usage and the later
re-creation at lines that reference recording and finalizer).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/runs/store.py`:
- Around line 20-22: The constructor __init__ should annotate the volume
parameter as optional (e.g., Optional[...] or the correct volume type) and store
it as an Optional on self; then guard any usage of self._volume.reload.aio()
(and similar calls) with a None check or early return so that when volume is
None (local/non-Modal) you don't call attributes on None; update the type import
(typing.Optional) and ensure methods that call self._volume (referenced by
self._volume.reload.aio()) handle the None case safely.
- Around line 62-64: The SSE "complete" event currently serializes only the
status field; update the yield in the generator that emits the complete event so
it includes the full RunStatus object from the persisted run (the same RunStatus
that contains result, error, data, duration_ms, etc.) instead of {'status':
persisted.status}. Serialize the entire RunStatus (e.g., convert
persisted.status to a dict/JSON via its existing
to_dict()/dict()/asdict()/__dict__ or JSON helper) and use json.dumps on that
object when forming the f"event: complete\ndata: ..." payload so clients receive
the full run outcome.
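
Both store.py findings can be sketched together. The example below is a simplified stand-in for PersistedRunStore, not the merged code: the `RunStatus` fields, the `replay` method name, and the `_refresh` helper are assumptions, and the `reload.aio()` call is only reproduced from the review comment above and is never reached when `volume` is None.

```python
import asyncio
import json
from dataclasses import asdict, dataclass
from typing import Any, AsyncIterator, Optional


@dataclass
class RunStatus:
    """Illustrative subset of the persisted run status."""
    status: str
    result: Optional[str] = None
    error: Optional[str] = None
    duration_ms: Optional[int] = None


class PersistedRunStore:
    def __init__(self, volume: Optional[Any] = None) -> None:
        # None means local / non-Modal execution: nothing to reload.
        self._volume = volume

    async def _refresh(self) -> None:
        if self._volume is None:
            return
        await self._volume.reload.aio()

    async def replay(self, persisted: RunStatus) -> AsyncIterator[str]:
        await self._refresh()
        # Send the whole status, not just the status string.
        payload = json.dumps(asdict(persisted))
        yield f"event: complete\ndata: {payload}\n\n"


async def collect() -> list:
    store = PersistedRunStore(volume=None)
    status = RunStatus(status="completed", result="ok", duration_ms=1200)
    return [chunk async for chunk in store.replay(status)]


events = asyncio.run(collect())
```

Clients replaying a finished run then receive the result, error, and duration in the same `complete` event, without a second request for the status.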

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3644c5ba-722f-4d45-b4c5-080407ec20c1

📥 Commits

Reviewing files that changed from the base of the PR and between ff9b2a9 and 3bcb639.

📒 Files selected for processing (28)
  • README.md
  • agent/main.py
  • agent/session/__init__.py
  • agent/session/finalizer.py
  • agent/session/runner.py
  • agent/session_runner.py
  • api/recording_service.py
  • api/runs/__init__.py
  • api/runs/registry.py
  • api/runs/service.py
  • api/runs/store.py
  • api/server.py
  • blinders/verifier.py
  • bridge/browser.py
  • bridge/execution.py
  • bridge/observation.py
  • bridge/router.py
  • evaluation/runner.py
  • playbooks/recovery.py
  • playbooks/runner.py
  • scripts/run_local.py
  • tests/test_blinders.py
  • tests/test_dry_run.py
  • tests/test_playbooks.py
  • tests/test_run_registry.py
  • tests/test_run_service.py
  • tests/test_session_finalizer.py
  • tests/test_streaming.py
💤 Files with no reviewable changes (3)
  • evaluation/runner.py
  • scripts/run_local.py
  • agent/session_runner.py

Comment thread api/runs/store.py Outdated
Comment thread api/runs/store.py Outdated
@hwuiwon hwuiwon merged commit fd4a818 into main Mar 31, 2026
3 checks passed
@hwuiwon hwuiwon deleted the refactor-2 branch March 31, 2026 04:37