feat: reuse mock-LLM E2E tests for Docker image validation #992
Add a Docker-specific Playwright config (playwright.mock-llm-docker.config.ts)
that runs the exact same test specs and helpers against the agent-canvas Docker
image instead of the npm build path (bin/agent-canvas.mjs + uvx).
Key changes:
- Split MOCK_LLM_BASE_URL into two constants in mock-llm-helpers.ts:
  - MOCK_LLM_BASE_URL: always host-local, used by tests for admin API
  - MOCK_LLM_AGENT_URL: env-overridable, used when configuring the LLM
    profile (the URL the agent-server uses for inference). Defaults to
    MOCK_LLM_BASE_URL for backward compatibility with the npm path.
- New playwright.mock-llm-docker.config.ts:
  - Starts the mock LLM server on the host (same as npm path)
  - Runs the Docker container with --network host (Linux CI)
  - Points to the same testDir (tests/e2e/mock-llm/) and specs
  - Separate output dirs to avoid collision with npm path results
- New CI workflow (.github/workflows/mock-llm-docker-e2e.yml):
  - Builds the Docker image from current code (or uses a pre-built image)
  - Runs the same specs against the container
  - Posts PR comment with differentiated report title
- render-mock-llm-report.mjs: accept --title flag for Docker vs npm reports
- npm run test:e2e:mock-llm:docker script added
- .gitignore updated for docker test output dirs
The npm path (test:e2e:mock-llm) is fully backward-compatible — no env var
override needed since MOCK_LLM_AGENT_URL defaults to MOCK_LLM_BASE_URL.
Co-authored-by: openhands <openhands@all-hands.dev>
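The two-constant split described above can be sketched roughly as follows. This is a minimal illustration, not the code in mock-llm-helpers.ts: the `resolveAgentUrl` helper and the `Env` type are hypothetical names, and the port 9999 is an assumption taken from the Docker Desktop override mentioned in the PR notes.

```typescript
// Hypothetical sketch; the real mock-llm-helpers.ts may be structured differently.
type Env = Record<string, string | undefined>;

// Host-local URL the tests use for the mock server's admin API.
// Assumption: port 9999, as in the host.docker.internal example in the PR notes.
const MOCK_LLM_BASE_URL = "http://localhost:9999";

// URL the agent-server should use for inference.
// Falls back to the host-local URL when no override is set (npm path).
function resolveAgentUrl(env: Env): string {
  const override = (env.MOCK_LLM_AGENT_URL ?? "").trim();
  return override !== "" ? override : MOCK_LLM_BASE_URL;
}
```

With an empty environment the function returns the host-local URL, so the npm path needs no extra configuration; a Docker Desktop run would pass `MOCK_LLM_AGENT_URL=http://host.docker.internal:9999`.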
✅ Mock-LLM E2E Tests: 7/7 passed
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses)
❌ Mock-LLM Docker E2E Test Results: 5/7 passed · 1 failed · 1 skipped
🔍 Failure details (1)
❌ mock-llm-automation.spec.ts › mock-LLM automation lifecycle › step 2: create automation and dispatch run via the UI
Instead of rebuilding the Docker image in the E2E workflow (duplicating ~10-15 min of Docker build time), use workflow_run to trigger automatically after the existing 'Docker' workflow completes successfully.
The workflow now:
- Triggers on: workflow_run (Docker completed) + workflow_dispatch (manual)
- Derives the image tag from the Docker build's commit SHA (ghcr.io/openhands/agent-canvas:sha-<short>-amd64)
- Pulls the already-built image from GHCR (no rebuild needed)
- Checks out code at the same SHA as the Docker build
- Extracts PR number from workflow_run.pull_requests[] for comments
Removed: Docker build steps, Buildx setup, build-arg resolution. All image building stays in docker.yml where it belongs.
Co-authored-by: openhands <openhands@all-hands.dev>
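The tag derivation described above might look like the sketch below. The `imageTagForSha` function is a hypothetical helper, and the 7-character short SHA is an assumption inferred from the `sha-7b048d3` tag shown elsewhere on this page; the actual workflow presumably does this in shell.

```typescript
// Hypothetical helper; names are illustrative, not taken from the workflow file.
const IMAGE_REPO = "ghcr.io/openhands/agent-canvas";

// Builds the arch-specific tag from a full or abbreviated commit SHA.
// Assumption: the short SHA is the first 7 hex characters.
function imageTagForSha(sha: string, arch: string = "amd64"): string {
  if (!/^[0-9a-f]{7,40}$/i.test(sha)) {
    throw new Error("not a commit SHA: " + sha);
  }
  const short = sha.slice(0, 7).toLowerCase();
  return IMAGE_REPO + ":sha-" + short + "-" + arch;
}
```

Validating the SHA before building the tag keeps a malformed input from turning into a confusing "image not found" error at pull time.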
873eeaf to d0d3086
❌ Mock-LLM E2E Tests: 4/7 passed · 1 failed · 2 skipped
🔍 Failure details (1)
❌ mock-llm-conversation.spec.ts › mock-LLM agent-server conversation › step 2: activate the mock-llm profile and verify settings API
The 'Active badge' check in step 2 used a hardcoded 1-second waitForTimeout before reloading. On a loaded CI runner the profile activation mutation may not persist in time, causing the reload to show stale state. This is a pre-existing flake (identical test code passed on the first push and failed on the second).
Replace with expect.poll() that retries the reload+check cycle with increasing intervals (1s, 2s, 3s) up to 15 seconds total.
Co-authored-by: openhands <openhands@all-hands.dev>
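To make the retry budget concrete, the sketch below computes when probes would fire for the intervals and timeout quoted above. It is a plain-TypeScript model, not Playwright code: `pollSchedule` is a hypothetical function, and it assumes the first probe runs immediately, the last interval repeats once the list is exhausted, and each probe itself takes no time.

```typescript
// Hypothetical model of a polling schedule; not part of Playwright.
// Returns the elapsed time (ms) at which each probe would start.
function pollSchedule(intervalsMs: number[], timeoutMs: number): number[] {
  if (intervalsMs.length === 0) {
    throw new Error("need at least one interval");
  }
  const attempts: number[] = [0]; // assumption: first probe is immediate
  let elapsed = 0;
  for (let i = 0; ; i++) {
    // Assumption: once the list runs out, the last interval is reused.
    const wait = intervalsMs[Math.min(i, intervalsMs.length - 1)];
    if (wait <= 0) {
      throw new Error("intervals must be positive");
    }
    if (elapsed + wait > timeoutMs) {
      break;
    }
    elapsed += wait;
    attempts.push(elapsed);
  }
  return attempts;
}
```

Under those assumptions, `pollSchedule([1000, 2000, 3000], 15000)` yields probes at 0, 1, 3, 6, 9, 12 and 15 seconds, so the reload+check cycle gets up to seven chances instead of the single check after a fixed one-second wait. In a real run each reload takes time, so fewer probes would fit.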
✅ Mock-LLM E2E Tests: 7/7 passed
workflow_run only fires when the workflow file exists on the default branch (main). Since mock-llm-docker-e2e.yml is new and only on the PR branch, GitHub doesn't recognize it as a workflow_run listener yet.
Add pull_request trigger (gated by 'e2e-tests' label, skip forks) that polls the Docker workflow via gh API until it completes for the PR's head SHA, then pulls the already-built image from GHCR and runs tests.
After merge, workflow_run takes over as the primary automatic trigger. The pull_request path remains as a fallback for label-gated runs.
Co-authored-by: openhands <openhands@all-hands.dev>
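The gating rule for the pull_request path (label present, not a fork) can be expressed as a small predicate. This is only an illustration of the logic: the real check lives in a job-level `if:` expression in the workflow, and the `PullRequestInfo` shape and `shouldRunDockerE2E` name are hypothetical.

```typescript
// Hypothetical shape; the workflow reads these values from the event payload.
interface PullRequestInfo {
  labels: string[];
  headRepo: string; // e.g. "OpenHands/agent-canvas"
  baseRepo: string;
}

// Runs only for same-repo PRs that carry the 'e2e-tests' label.
function shouldRunDockerE2E(pr: PullRequestInfo): boolean {
  const isFork = pr.headRepo !== pr.baseRepo;
  const hasLabel = pr.labels.indexOf("e2e-tests") !== -1;
  return !isFork && hasLabel;
}
```

Skipping forks matters here because the job pulls an image from GHCR that the Docker workflow pushed, which a fork's run would not normally have produced.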
✅ Mock-LLM E2E Tests: 7/7 passed
❌ Mock-LLM Docker E2E Test Results: 5/7 passed · 1 failed · 1 skipped
🔍 Failure details (1)
❌ mock-llm-automation.spec.ts › mock-LLM automation lifecycle › step 2: create automation and dispatch run via the UI
✅ Mock-LLM E2E Tests: 7/7 passed
❌ Mock-LLM Docker E2E Test Results: 5/7 passed · 1 failed · 1 skipped
🔍 Failure details (1)
❌ mock-llm-automation.spec.ts › mock-LLM automation lifecycle › step 2: create automation and dispatch run via the UI
fix: add FILE_STORE, AUTOMATION_BASE_URL, AUTOMATION_WORKSPACE_BASE to Docker entrypoint
The Docker entrypoint was missing several environment variables that the npm path (dev-with-automation.mjs) sets for the automation backend:
- FILE_STORE=local: without this, the automation backend may fall back to cloud storage (S3/GCS) which fails without credentials, causing tarball-based presets (preset/prompt, preset/plugin) to silently error
- LOCAL_STORAGE_PATH: where to store files on the local filesystem
- AUTOMATION_BASE_URL: publicly-reachable base URL for callback URLs
- AUTOMATION_WORKSPACE_BASE: where automation runs unpack tarballs
This explains the Docker E2E failure: the agent's curl to create an automation via /api/automation/v1/preset/prompt returned an error (likely 500 from missing storage config), but the mock LLM doesn't care about terminal output and proceeded to return the scripted final reply. The test then found 0 automations.
Co-authored-by: openhands <openhands@all-hands.dev>
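Since the failure above was silent, a startup check that names the unset variables would have surfaced it earlier. The sketch below is one way to do that; `missingAutomationEnv` is a hypothetical helper and is not part of the entrypoint, which is a shell script. Only the four variable names come from the commit message.

```typescript
// Variable names from the commit message; the check itself is illustrative.
const AUTOMATION_ENV_VARS: string[] = [
  "FILE_STORE",
  "LOCAL_STORAGE_PATH",
  "AUTOMATION_BASE_URL",
  "AUTOMATION_WORKSPACE_BASE",
];

// Returns the names of automation variables that are unset or blank.
function missingAutomationEnv(env: Record<string, string | undefined>): string[] {
  return AUTOMATION_ENV_VARS.filter(function (name) {
    return (env[name] ?? "").trim() === "";
  });
}
```

A caller could log the returned names and exit early instead of letting the automation backend fall back to a storage backend that has no credentials.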
✅ Mock-LLM E2E Tests: 7/7 passed
# Conflicts:
#	tests/e2e/mock-llm/utils/mock-llm-helpers.ts
✅ Mock-LLM E2E Tests: 12/12 passed
❌ Mock-LLM Docker E2E Test Results: 8/12 passed · 1 failed · 3 skipped
🔍 Failure details (1)
❌ mock-llm-auth-modes.spec.ts › auth mode: public gate › shows the auth screen when no key is configured
The mock-llm-auth-modes.spec.ts tests npm-binary-specific --auth-required behaviour (a second static-server instance on port 18301). The Docker image doesn't provide this second server; it has its own auth handling. Exclude the spec from the Docker test run via testIgnore.
Co-authored-by: openhands <openhands@all-hands.dev>
✅ Mock-LLM E2E Tests: 12/12 passed
Instead of excluding the auth-modes spec from the Docker E2E run or spinning up a host-side static server with a duplicate build/ directory, the Docker entrypoint now supports an optional PUBLIC_MODE_PORT env var.
When set, entrypoint.sh starts a second static-server instance from the same baked-in frontend assets with --auth-required (no session key injected). This tests the actual Docker image's auth gate behaviour, not a host-side approximation.
The Playwright Docker config passes -e PUBLIC_MODE_PORT=18301 to the container and exports MOCK_LLM_PUBLIC_MODE_URL so the auth-modes spec can reach it. With --network host the port is accessible from the host.
Co-authored-by: openhands <openhands@all-hands.dev>
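A rough sketch of how a config might assemble the container arguments and the matching URL is shown below. `dockerRunArgs` and `publicModeUrl` are hypothetical helpers, and the `http://localhost:<port>` form of the URL is an assumption; only `--network host`, `-e PUBLIC_MODE_PORT=18301` and the env var names come from the commit message.

```typescript
// Hypothetical helpers; the real Playwright config may build the command differently.
interface DockerRunOptions {
  image: string;
  publicModePort?: number; // when set, the entrypoint starts the second static server
}

// Arguments to pass after the `docker` executable.
function dockerRunArgs(opts: DockerRunOptions): string[] {
  const args: string[] = ["run", "--network", "host"];
  if (opts.publicModePort !== undefined) {
    args.push("-e", "PUBLIC_MODE_PORT=" + opts.publicModePort);
  }
  args.push(opts.image);
  return args;
}

// Assumption: with host networking the port is reachable on localhost.
function publicModeUrl(port: number): string {
  return "http://localhost:" + port;
}
```

The value of `publicModeUrl(18301)` is what would be exported as MOCK_LLM_PUBLIC_MODE_URL for the auth-modes spec.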
✅ Mock-LLM E2E Tests: 12/12 passed
✅ Mock-LLM Docker E2E Test Results: 12/12 passed
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.
all-hands-bot left a comment
Summary
Solid PR that cleanly extends the existing mock-LLM E2E test infrastructure to validate the Docker all-in-one image, reusing specs and helpers with minimal changes. The MOCK_LLM_BASE_URL → MOCK_LLM_AGENT_URL abstraction is the right design for Docker networking, the three-trigger CI chain (workflow_run / pull_request / workflow_dispatch) handles the ordering constraint elegantly, and the expect.poll refactor in the spec is a genuine CI robustness improvement.
A few items worth addressing before merge, noted inline.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
address review feedback: drop unlabeled trigger, improve error messages, document env vars
- Drop 'unlabeled' from pull_request trigger types to avoid wasted workflow runs when any label is removed (the job-level if: condition would skip immediately anyway)
- Distinguish 'no Docker run found' vs 'didn't complete in time' in the polling loop's final error message
- Add comment explaining /api/automation/v1 probe returns 200 without auth so the readiness check won't spin for 180s
- Document FILE_STORE, LOCAL_STORAGE_PATH, AUTOMATION_BASE_URL, and AUTOMATION_WORKSPACE_BASE in the entrypoint header; these affect production deployments, not just E2E tests
Co-authored-by: openhands <openhands@all-hands.dev>
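The distinction between the two polling failures can be illustrated as below. This is a sketch of the idea, not the workflow's shell loop: `describePollFailure` and the `WorkflowRun` shape are hypothetical, and the message wording is invented for the example.

```typescript
// Hypothetical shape of the last run seen while polling; null means none was found.
interface WorkflowRun {
  status: string; // e.g. "queued", "in_progress", "completed"
  conclusion: string | null;
}

// Picks an error message that tells the reader which failure actually happened.
function describePollFailure(lastSeen: WorkflowRun | null, waitedSeconds: number): string {
  if (lastSeen === null) {
    return "No Docker workflow run found for this SHA after " + waitedSeconds + "s";
  }
  if (lastSeen.status !== "completed") {
    return "Docker workflow run did not complete within " + waitedSeconds +
      "s (last status: " + lastSeen.status + ")";
  }
  return "Docker workflow run completed with conclusion: " + (lastSeen.conclusion ?? "unknown");
}
```

The first message points at a missing or mistargeted Docker run, while the second points at a slow build, which call for different follow-up actions.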
✅ Mock-LLM E2E Tests: 12/12 passed
✅ Mock-LLM Docker E2E Test Results: 12/12 passed
✅ Mock-LLM E2E Tests: 12/12 passed
✅ Mock-LLM Docker E2E Test Results: 12/12 passed
📸 Snapshot Test Report✅ All snapshots match the main branch baselines.
✅ Unchanged snapshots (73)
Generated by the Snapshot Tests workflow. This comment was created by an AI agent (OpenHands) on behalf of the repo maintainers. |
* feat: reuse mock-LLM E2E tests for Docker image validation
Add a Docker-specific Playwright config (playwright.mock-llm-docker.config.ts)
that runs the exact same test specs and helpers against the agent-canvas Docker
image instead of the npm build path (bin/agent-canvas.mjs + uvx).
Key changes:
- Split MOCK_LLM_BASE_URL into two constants in mock-llm-helpers.ts:
  - MOCK_LLM_BASE_URL: always host-local, used by tests for admin API
  - MOCK_LLM_AGENT_URL: env-overridable, used when configuring the LLM
    profile (the URL the agent-server uses for inference). Defaults to
    MOCK_LLM_BASE_URL for backward compatibility with the npm path.
- New playwright.mock-llm-docker.config.ts:
  - Starts the mock LLM server on the host (same as npm path)
  - Runs the Docker container with --network host (Linux CI)
  - Points to the same testDir (tests/e2e/mock-llm/) and specs
  - Separate output dirs to avoid collision with npm path results
- New CI workflow (.github/workflows/mock-llm-docker-e2e.yml):
  - Builds the Docker image from current code (or uses a pre-built image)
  - Runs the same specs against the container
  - Posts PR comment with differentiated report title
- render-mock-llm-report.mjs: accept --title flag for Docker vs npm reports
- npm run test:e2e:mock-llm:docker script added
- .gitignore updated for docker test output dirs
The npm path (test:e2e:mock-llm) is fully backward-compatible — no env var
override needed since MOCK_LLM_AGENT_URL defaults to MOCK_LLM_BASE_URL.
Co-authored-by: openhands <openhands@all-hands.dev>
* refactor: chain Docker E2E off existing Docker CI via workflow_run
Instead of rebuilding the Docker image in the E2E workflow (duplicating
~10-15 min of Docker build time), use workflow_run to trigger automatically
after the existing 'Docker' workflow completes successfully.
The workflow now:
- Triggers on: workflow_run (Docker completed) + workflow_dispatch (manual)
- Derives the image tag from the Docker build's commit SHA
(ghcr.io/openhands/agent-canvas:sha-<short>-amd64)
- Pulls the already-built image from GHCR — no rebuild needed
- Checks out code at the same SHA as the Docker build
- Extracts PR number from workflow_run.pull_requests[] for comments
Removed: Docker build steps, Buildx setup, build-arg resolution.
All image building stays in docker.yml where it belongs.
Co-authored-by: openhands <openhands@all-hands.dev>
* fix: replace flaky 1s timeout with polling for Active badge assertion
The 'Active badge' check in step 2 used a hardcoded 1-second
waitForTimeout before reloading. On a loaded CI runner the profile
activation mutation may not persist in time, causing the reload to
show stale state. This is a pre-existing flake (identical test code
passed on the first push and failed on the second).
Replace with expect.poll() that retries the reload+check cycle with
increasing intervals (1s, 2s, 3s) up to 15 seconds total.
Co-authored-by: openhands <openhands@all-hands.dev>
* fix: add pull_request trigger for Docker E2E (workflow_run bootstrap)
workflow_run only fires when the workflow file exists on the default
branch (main). Since mock-llm-docker-e2e.yml is new and only on the
PR branch, GitHub doesn't recognize it as a workflow_run listener yet.
Add pull_request trigger (gated by 'e2e-tests' label, skip forks) that
polls the Docker workflow via gh API until it completes for the PR's
head SHA, then pulls the already-built image from GHCR and runs tests.
After merge, workflow_run takes over as the primary automatic trigger.
The pull_request path remains as a fallback for label-gated runs.
Co-authored-by: openhands <openhands@all-hands.dev>
* fix: add FILE_STORE, AUTOMATION_BASE_URL, AUTOMATION_WORKSPACE_BASE to Docker entrypoint
The Docker entrypoint was missing several environment variables that the npm
path (dev-with-automation.mjs) sets for the automation backend:
- FILE_STORE=local — without this, the automation backend may fall back to
cloud storage (S3/GCS) which fails without credentials, causing tarball-
based presets (preset/prompt, preset/plugin) to silently error
- LOCAL_STORAGE_PATH — where to store files on the local filesystem
- AUTOMATION_BASE_URL — publicly-reachable base URL for callback URLs
- AUTOMATION_WORKSPACE_BASE — where automation runs unpack tarballs
This explains the Docker E2E failure: the agent's curl to create an automation
via /api/automation/v1/preset/prompt returned an error (likely 500 from missing
storage config), but the mock LLM doesn't care about terminal output and
proceeded to return the scripted final reply. The test then found 0 automations.
Co-authored-by: openhands <openhands@all-hands.dev>
* fix: exclude auth-modes spec from Docker E2E tests
The mock-llm-auth-modes.spec.ts tests npm-binary-specific --auth-required
behaviour (a second static-server instance on port 18301). The Docker image
doesn't provide this second server — it has its own auth handling. Exclude
the spec from the Docker test run via testIgnore.
Co-authored-by: openhands <openhands@all-hands.dev>
* feat: run auth-modes tests inside Docker via PUBLIC_MODE_PORT
Instead of excluding the auth-modes spec from the Docker E2E run or
spinning up a host-side static server with a duplicate build/ directory,
the Docker entrypoint now supports an optional PUBLIC_MODE_PORT env var.
When set, entrypoint.sh starts a second static-server instance from the
same baked-in frontend assets with --auth-required (no session key
injected). This tests the actual Docker image's auth gate behaviour —
not a host-side approximation.
The Playwright Docker config passes -e PUBLIC_MODE_PORT=18301 to the
container and exports MOCK_LLM_PUBLIC_MODE_URL so the auth-modes spec
can reach it. With --network host the port is accessible from the host.
Co-authored-by: openhands <openhands@all-hands.dev>
* address review feedback: drop unlabeled trigger, improve error messages, document env vars
- Drop 'unlabeled' from pull_request trigger types to avoid wasted
workflow runs when any label is removed (the job-level if: condition
would skip immediately anyway)
- Distinguish 'no Docker run found' vs 'didn't complete in time' in
the polling loop's final error message
- Add comment explaining /api/automation/v1 probe returns 200 without
auth so the readiness check won't spin for 180s
- Document FILE_STORE, LOCAL_STORAGE_PATH, AUTOMATION_BASE_URL, and
AUTOMATION_WORKSPACE_BASE in the entrypoint header — these affect
production deployments, not just E2E tests
Co-authored-by: openhands <openhands@all-hands.dev>
---------
Co-authored-by: openhands <openhands@all-hands.dev>
Why
Related to #511
The mock-LLM E2E tests currently only validate the npm build path (bin/agent-canvas.mjs + uvx). The Docker all-in-one image (ghcr.io/openhands/agent-canvas) has no automated behavioral validation: a broken entrypoint, misconfigured proxy route, or missing dependency would only be caught manually. The test specs and helpers are already well-factored and infrastructure-agnostic, so reusing them for Docker validation is straightforward.
Summary
- Split MOCK_LLM_BASE_URL into test-facing (MOCK_LLM_BASE_URL) and agent-facing (MOCK_LLM_AGENT_URL) constants for Docker networking compatibility
- New playwright.mock-llm-docker.config.ts that launches a Docker container (--network host) instead of bin/agent-canvas.mjs, pointing at the exact same test specs
- New CI workflow (.github/workflows/mock-llm-docker-e2e.yml) that chains off the existing Docker CI via workflow_run: pulls the already-built image from GHCR (no rebuild), runs tests against it, posts PR comment
Issue Number
N/A
How to Test
npm path (unchanged behavior): run the test:e2e:mock-llm npm script.
Docker path (new): run the test:e2e:mock-llm:docker npm script.
CI: The Docker E2E workflow triggers automatically after the existing Docker workflow completes successfully (via workflow_run). It can also be triggered manually via workflow_dispatch with a custom image tag.
Type
Notes
- MOCK_LLM_AGENT_URL defaults to MOCK_LLM_BASE_URL when not set
- The Docker container runs with --network host (Linux-only). For macOS/Windows Docker Desktop, set MOCK_LLM_AGENT_URL=http://host.docker.internal:9999
- The CI workflow chains off the existing Docker workflow and pulls the already-pushed sha-<short>-amd64 tag from GHCR
- workflow_dispatch still available for testing specific image versions manually
- render-mock-llm-report.mjs now accepts --title flag to differentiate Docker vs npm reports
This PR was created by an AI agent (OpenHands) on behalf of the user.
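For the --title flag mentioned in the notes, a minimal argument parser could look like the following. It is a sketch only: `parseTitle` is a hypothetical function, the supported forms (`--title X` and `--title=X`) are assumptions about how render-mock-llm-report.mjs might accept the flag, and the default title is taken from the report headings seen in this conversation.

```typescript
// Hypothetical parser; render-mock-llm-report.mjs may parse its arguments differently.
function parseTitle(argv: string[], fallback: string): string {
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    // Form 1: --title "Some Title"
    if (arg === "--title" && i + 1 < argv.length) {
      return argv[i + 1];
    }
    // Form 2: --title=Some Title
    if (arg.indexOf("--title=") === 0) {
      return arg.slice("--title=".length);
    }
  }
  return fallback;
}
```

Calling `parseTitle(["--title", "Mock-LLM Docker E2E Test Results"], "Mock-LLM E2E Tests")` selects the Docker heading, while an argument list without the flag keeps the npm heading, so existing invocations are unaffected.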
🐳 Docker images for this PR
• GHCR package: https://github.com/OpenHands/agent-canvas/pkgs/container/agent-canvas
• ghcr.io/openhands/agent-canvas
• ghcr.io/openhands/agent-server:1.24.0-python
• openhands-automation==1.0.0a5
• 7b048d3246efbd60aba89133dedb322cb8f89b42
Pull (multi-arch manifest)
# Multi-arch manifest: Docker automatically pulls the correct architecture
docker pull ghcr.io/openhands/agent-canvas:sha-7b048d3
Run
All tags pushed for this build
About Multi-Architecture Support
• sha-7b048d3 is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (sha-7b048d3-amd64) are also available if needed