Add read-only PR gate intake validation by Am1n3e · Pull Request #33 · ServiceNow/webarena-verified

Am1n3e · 2026-03-07T12:18:42Z

Summary

Add a new Leaderboard PR Gate workflow for leaderboard-submissions pull requests touching submissions/inbox/**, with read-only permissions and credential-less checkout.
Introduce dev/leaderboard/pr_gate_intake_validator.py plus inv dev.leaderboard.pr-gate-intake-validate to enforce one-intake scope, required files, manifest hash/size integrity, and task .missing xor artifact invariants.
Align leaderboard payload types with StrEnum values and replace PR validator dataclasses with frozen Pydantic models; add focused validator/type tests and update Lane C planning docs.
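
The one-intake scope rule can be pictured with a small sketch. This is not the code from `pr_gate_intake_validator.py`: the function name `select_single_intake`, the use of `ValueError`, and the choice to reject any path outside `submissions/inbox/<intake-id>/` are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Path prefix taken from the PR description (submissions/inbox/**).
INBOX_PREFIX = ("submissions", "inbox")


def select_single_intake(changed_paths: list[str]) -> str:
    """Return the one intake ID touched by the diff, or raise ValueError.

    Hypothetical helper: the real validator reports coded diagnostics instead.
    """
    intake_ids = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Expect submissions/inbox/<intake-id>/<something>.
        if len(parts) < 4 or parts[:2] != INBOX_PREFIX:
            raise ValueError(f"path outside an intake directory: {raw}")
        intake_ids.add(parts[2])
    if len(intake_ids) != 1:
        raise ValueError(f"expected exactly one intake, found {sorted(intake_ids)}")
    return intake_ids.pop()
```

Deriving the intake ID from the changed paths, rather than from PR metadata, keeps the check dependent only on the diff itself.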

Validation

uv run pytest tests/leaderboard/test_pr_gate_intake_validator.py tests/leaderboard/test_submission_pr_validator.py tests/types/test_leaderboard_types.py

Copilot

Pull request overview

Adds a read-only “PR Gate” validation layer for leaderboard-submissions PRs that modify submissions/inbox/**, plus associated validator logic/types/tests to ensure intake payloads are structurally valid and produce a deterministic score preview.

Changes:

Introduces Leaderboard PR Gate GitHub Actions workflow with read-only permissions and credential-less checkout for submissions/inbox/** diffs.
Adds dev/leaderboard/pr_gate_intake_validator.py and an invoke task to validate intake layout, manifest hash/size checks, and task .missing vs artifacts invariants, emitting a score preview.
Aligns leaderboard payload/type fields to StrEnum values and migrates PR validator context/file models to frozen Pydantic models; adds/updates targeted tests and planning docs.
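
As a rough illustration of the manifest hash/size check listed above, the sketch below compares one declared entry against the file on disk. The `ManifestEntry` field names and the choice of SHA-256 are assumptions; this page does not show the `IntakeManifest` schema or the hash algorithm the validator uses.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ManifestEntry:
    # Hypothetical field names, chosen for this sketch only.
    path: str
    sha256: str
    size_bytes: int


def check_entry(intake_root: Path, entry: ManifestEntry) -> list[str]:
    """Return the problems found for one declared file (empty list means it matches)."""
    # Note: no path-traversal guard here; the PR uses _resolve_declared_file for resolution.
    target = intake_root / entry.path
    if not target.is_file():
        return [f"missing file: {entry.path}"]
    data = target.read_bytes()
    problems = []
    if len(data) != entry.size_bytes:
        problems.append(f"size mismatch: {entry.path}")
    if hashlib.sha256(data).hexdigest() != entry.sha256:
        problems.append(f"hash mismatch: {entry.path}")
    return problems
```

Checking size as well as the digest gives a clearer diagnostic when a file was truncated or replaced.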

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| `.github/workflows/leaderboard-pr-gate.yml` | Adds the PR Gate workflow that runs intake validation + publishes a step summary. |
| `dev/leaderboard/pr_gate_intake_validator.py` | Implements intake tree validation, manifest integrity checks, task invariants, and score preview generation. |
| `dev/leaderboard/tasks.py` | Wires a new invoke task to run the PR gate intake validator and print parsed outputs. |
| `dev/leaderboard/submission_pr_validator.py` | Replaces dataclasses with frozen, extra-forbid Pydantic models for PR validation inputs. |
| `src/webarena_verified/types/leaderboard/submission_payload.py` | Introduces `SubmissionLeaderboard` as a `StrEnum` and updates `SubmissionMetadata` typing accordingly. |
| `src/webarena_verified/types/leaderboard/leaderboard_data.py` | Introduces `LeaderboardView` as a `StrEnum` and updates `LeaderboardTableFile.leaderboard` typing accordingly. |
| `src/webarena_verified/types/leaderboard/__init__.py` | Re-exports the new enums from the leaderboard types package. |
| `tests/leaderboard/test_pr_gate_intake_validator.py` | Adds unit tests for intake selection, manifest hashing, task invariants, and the run wrapper. |
| `tests/types/test_leaderboard_types.py` | Updates type tests to assert `StrEnum` values instead of raw strings. |
| `.wip/lane-c-pr-gate.md` | Updates Lane C checklist items to completed with implementation references. |
| `.wip/lane-c-pr-gate-implementation-plan.md` | Adds a detailed code-level implementation plan and test matrix for the PR Gate. |


Copilot · 2026-03-07T13:08:04Z

dev/leaderboard/pr_gate_intake_validator.py

+def validate_manifest_integrity(intake_root: Path, manifest: IntakeManifest) -> None:
+    for entry in manifest.files:
+        target = _resolve_declared_file(intake_root, entry.path)
+        if not target.is_file():
+            _fail("C04_MANIFEST_PATH_MISSING", f"Manifest declared path does not exist as a file: {entry.path}")
+


validate_manifest_integrity() only verifies the files listed in manifest.files. As a result, an intake can omit required artifacts (e.g., task JSON/HAR files) from the manifest or add extra undeclared files under the intake tree and still pass hash/size validation. To make the “manifest integrity” guarantee meaningful, consider enforcing that the manifest covers all required files (at least submission.json and all validated task files) and/or that there are no extra files on disk that are not declared in the manifest (excluding manifest.json itself).
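
One way to implement the suggested coverage check is a set comparison between the files on disk and the paths the manifest declares. The sketch below is illustrative only: `manifest_coverage_gaps` is a hypothetical helper, and it assumes declared paths are POSIX-style and relative to the intake root.

```python
from pathlib import Path


def manifest_coverage_gaps(
    intake_root: Path,
    declared_paths: set[str],
    manifest_name: str = "manifest.json",
) -> tuple[set[str], set[str]]:
    """Return (undeclared_on_disk, declared_but_missing) as relative POSIX paths."""
    on_disk = set()
    for path in intake_root.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(intake_root).as_posix()
        # The manifest cannot list its own hash, so it is excluded from coverage.
        if relative != manifest_name:
            on_disk.add(relative)
    return on_disk - declared_paths, declared_paths - on_disk
```

A validator could fail the intake when either returned set is non-empty, which closes both gaps described in the comment: required files left out of the manifest and extra files that were never hashed.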

Copilot · 2026-03-07T13:08:05Z

dev/leaderboard/pr_gate_intake_validator.py

+    for task_dir in task_dirs:
+        names = {entry.name for entry in task_dir.iterdir() if entry.is_file()}
+        if not names:
+            _fail("C05_TASK_EMPTY_DIR", f"Task directory has no files: {task_dir}")


validate_task_invariants() builds names from entry.is_file() only, so directories/symlinks inside a task folder are silently ignored. This allows extra content (including nested directories) to bypass both the “allowed files” check and manifest hashing. Consider validating that each task directory contains only regular files (no subdirectories/symlinks) and failing if any non-file entries exist (and similarly rejecting non-directory entries directly under tasks/).
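
A stricter listing along the lines suggested here might look like the sketch below. The helper name `list_task_files_strict` and the use of `ValueError` are assumptions; the PR's validator reports failures through `_fail` with a coded diagnostic instead.

```python
from pathlib import Path


def list_task_files_strict(task_dir: Path) -> set[str]:
    """Return the regular file names in task_dir, raising if anything else is present."""
    names = set()
    for entry in task_dir.iterdir():
        # Check is_symlink() first: is_file() follows symlinks and would
        # otherwise accept a link that points at a regular file.
        if entry.is_symlink() or not entry.is_file():
            raise ValueError(f"non-regular entry in task directory: {entry.name}")
        names.add(entry.name)
    return names
```

Failing on the first non-regular entry means nested directories and symlinks can no longer slip past the allowed-files check unnoticed.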

Copilot · 2026-03-07T13:08:05Z

tests/leaderboard/test_pr_gate_intake_validator.py

+def test_run_pr_gate_intake_validation_rejects_evaluator_version_mismatch(monkeypatch, tmp_path: Path):
+    _create_valid_intake(tmp_path, intake_id="intake-xyz")
+
+    monkeypatch.setattr(
+        validator,
+        "_run_git_diff_paths",
+        lambda **_: [
+            "submissions/inbox/intake-xyz/submission.json",
+            "submissions/inbox/intake-xyz/manifest.json",
+            "submissions/inbox/intake-xyz/tasks/1/agent_response.json",
+        ],
+    )
+
+    with pytest.raises(validator.PRGateValidationError, match="C06_EVALUATOR_VERSION_MISMATCH"):
+        validator.run_pr_gate_intake_validation(
+            tmp_path,
+            base_sha="abc",
+            head_sha="def",
+            expected_evaluator_version="0.0.0",
+        )


test_run_pr_gate_intake_validation_rejects_evaluator_version_mismatch depends on whatever importlib_metadata.version("webarena-verified") returns in the test environment. If the project version ever starts with 0.0.0 (or the expected prefix matches for other reasons), this test will become flaky/incorrect. Consider monkeypatching validator.resolve_evaluator_version() to return a known value and then asserting the mismatch against a different expected prefix.
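
The pinning idea can be shown in isolation with stand-ins. In the sketch below, `validator` is a `SimpleNamespace` standing in for the real module, and `check_evaluator_version` is a hypothetical prefix check written for this example; neither is the PR's code. It uses `unittest.mock` so that it runs without pytest.

```python
from types import SimpleNamespace
from unittest import mock


def check_evaluator_version(validator, expected_prefix: str) -> None:
    """Hypothetical check: fail unless the resolved version starts with the prefix."""
    actual = validator.resolve_evaluator_version()
    if not actual.startswith(expected_prefix):
        raise ValueError(
            f"C06_EVALUATOR_VERSION_MISMATCH: expected {expected_prefix}*, got {actual}"
        )


# Stand-in for the real module, whose resolver reads installed package metadata.
validator = SimpleNamespace(resolve_evaluator_version=lambda: "unknown-at-test-time")


def test_version_mismatch_is_deterministic() -> None:
    # Pin the resolved version so the outcome no longer depends on the environment.
    with mock.patch.object(validator, "resolve_evaluator_version", return_value="1.2.3"):
        check_evaluator_version(validator, "1.2")  # matching prefix passes
        try:
            check_evaluator_version(validator, "9.9")
        except ValueError as exc:
            assert "C06_EVALUATOR_VERSION_MISMATCH" in str(exc)
        else:
            raise AssertionError("expected a version mismatch")
```

In the existing pytest test, the equivalent would presumably be a `monkeypatch.setattr(validator, "resolve_evaluator_version", ...)` call placed before `run_pr_gate_intake_validation` is invoked.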

Am1n3e requested a review from Copilot March 7, 2026 12:59

Copilot started reviewing on behalf of Am1n3e March 7, 2026 13:00

Am1n3e added 2 commits March 7, 2026 14:01

Add read-only PR gate intake validation for leaderboard submissions

128bc91

Add score preview output and coded diagnostics to PR gate

3b617dc

Am1n3e force-pushed the wip/implementation-plan-check branch from cc5b0c1 to 3b617dc March 7, 2026 13:01

Am1n3e merged commit 79a6a63 into feature/add-leaderboard Mar 7, 2026

Am1n3e deleted the wip/implementation-plan-check branch March 7, 2026 13:02

Am1n3e merged 2 commits into feature/add-leaderboard from wip/implementation-plan-check
