Add read-only PR gate intake validation #33
Pull request overview
Adds a read-only “PR Gate” validation layer for leaderboard-submissions PRs that modify submissions/inbox/**, plus associated validator logic/types/tests to ensure intake payloads are structurally valid and produce a deterministic score preview.
Changes:
- Introduces the `Leaderboard PR Gate` GitHub Actions workflow with read-only permissions and credential-less checkout for `submissions/inbox/**` diffs.
- Adds `dev/leaderboard/pr_gate_intake_validator.py` and an invoke task to validate intake layout, manifest hash/size checks, and task `.missing` vs artifacts invariants, emitting a score preview.
- Aligns leaderboard payload/type fields to `StrEnum` values and migrates PR validator context/file models to frozen Pydantic models; adds/updates targeted tests and planning docs.
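The manifest hash/size check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `file_matches_manifest` is hypothetical, and SHA-256 is an assumption, since the overview does not say which hash algorithm the manifest uses.

```python
import hashlib
from pathlib import Path


def file_matches_manifest(path: Path, expected_sha256: str, expected_size: int) -> bool:
    """Return True when the file's size and digest match the declared values.

    Hypothetical helper; SHA-256 is assumed, not confirmed by the PR.
    """
    data = path.read_bytes()
    # Check the size first: it is cheap and catches truncated files early.
    if len(data) != expected_size:
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()
```

Comparing the size before hashing keeps the failure reason easy to report separately, which fits a validator that emits distinct error codes.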
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `.github/workflows/leaderboard-pr-gate.yml` | Adds the PR Gate workflow that runs intake validation + publishes a step summary. |
| `dev/leaderboard/pr_gate_intake_validator.py` | Implements intake tree validation, manifest integrity checks, task invariants, and score preview generation. |
| `dev/leaderboard/tasks.py` | Wires a new invoke task to run the PR gate intake validator and print parsed outputs. |
| `dev/leaderboard/submission_pr_validator.py` | Replaces dataclasses with frozen, extra-forbid Pydantic models for PR validation inputs. |
| `src/webarena_verified/types/leaderboard/submission_payload.py` | Introduces `SubmissionLeaderboard` as a `StrEnum` and updates `SubmissionMetadata` typing accordingly. |
| `src/webarena_verified/types/leaderboard/leaderboard_data.py` | Introduces `LeaderboardView` as a `StrEnum` and updates `LeaderboardTableFile.leaderboard` typing accordingly. |
| `src/webarena_verified/types/leaderboard/__init__.py` | Re-exports the new enums from the leaderboard types package. |
| `tests/leaderboard/test_pr_gate_intake_validator.py` | Adds unit tests for intake selection, manifest hashing, task invariants, and the run wrapper. |
| `tests/types/test_leaderboard_types.py` | Updates type tests to assert `StrEnum` values instead of raw strings. |
| `.wip/lane-c-pr-gate.md` | Updates Lane C checklist items to completed with implementation references. |
| `.wip/lane-c-pr-gate-implementation-plan.md` | Adds a detailed code-level implementation plan and test matrix for the PR Gate. |
def validate_manifest_integrity(intake_root: Path, manifest: IntakeManifest) -> None:
    for entry in manifest.files:
        target = _resolve_declared_file(intake_root, entry.path)
        if not target.is_file():
            _fail("C04_MANIFEST_PATH_MISSING", f"Manifest declared path does not exist as a file: {entry.path}")
validate_manifest_integrity() only verifies the files listed in manifest.files. As a result, an intake can omit required artifacts (e.g., task JSON/HAR files) from the manifest or add extra undeclared files under the intake tree and still pass hash/size validation. To make the “manifest integrity” guarantee meaningful, consider enforcing that the manifest covers all required files (at least submission.json and all validated task files) and/or that there are no extra files on disk that are not declared in the manifest (excluding manifest.json itself).
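One way to enforce the coverage suggested above is to compare the set of paths declared in the manifest against the files actually present under the intake root. The sketch below is illustrative only: `find_manifest_coverage_gaps` is a hypothetical helper, it assumes manifest paths are relative POSIX-style strings, and it assumes the manifest file is named `manifest.json` at the intake root.

```python
from pathlib import Path


def find_manifest_coverage_gaps(
    intake_root: Path, declared_paths: set[str]
) -> tuple[set[str], set[str]]:
    """Return (undeclared, missing) relative paths for an intake tree.

    Hypothetical helper: `undeclared` are files on disk that the manifest
    does not list; `missing` are declared paths with no file on disk.
    """
    on_disk = set()
    for path in intake_root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(intake_root).as_posix()
        # The manifest cannot meaningfully declare its own hash, so skip it.
        if rel != "manifest.json":
            on_disk.add(rel)
    return on_disk - declared_paths, declared_paths - on_disk
```

A validator could fail with a dedicated error code whenever either returned set is non-empty, so that both omitted artifacts and extra undeclared files are rejected.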
for task_dir in task_dirs:
    names = {entry.name for entry in task_dir.iterdir() if entry.is_file()}
    if not names:
        _fail("C05_TASK_EMPTY_DIR", f"Task directory has no files: {task_dir}")
validate_task_invariants() builds names from entry.is_file() only, so directories/symlinks inside a task folder are silently ignored. This allows extra content (including nested directories) to bypass both the “allowed files” check and manifest hashing. Consider validating that each task directory contains only regular files (no subdirectories/symlinks) and failing if any non-file entries exist (and similarly rejecting non-directory entries directly under tasks/).
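A check along the lines suggested above could list every entry in a task directory that is not a plain regular file. This is a sketch under assumptions: `find_non_regular_entries` is a hypothetical name, and how the real validator would report the offenders (for example via `_fail` with a new error code) is not shown in the PR excerpt.

```python
from pathlib import Path


def find_non_regular_entries(task_dir: Path) -> list[str]:
    """List names of entries in task_dir that are symlinks or not regular files.

    Hypothetical helper for illustration only.
    """
    offenders = []
    for entry in sorted(task_dir.iterdir()):
        # Path.is_file() follows symlinks, so a symlink to a regular file
        # would pass it; test is_symlink() explicitly first.
        if entry.is_symlink() or not entry.is_file():
            offenders.append(entry.name)
    return offenders
```

The explicit `is_symlink()` test matters because `is_file()` alone returns `True` for a symlink that points at a regular file, which would let linked content slip past the check.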
def test_run_pr_gate_intake_validation_rejects_evaluator_version_mismatch(monkeypatch, tmp_path: Path):
    _create_valid_intake(tmp_path, intake_id="intake-xyz")

    monkeypatch.setattr(
        validator,
        "_run_git_diff_paths",
        lambda **_: [
            "submissions/inbox/intake-xyz/submission.json",
            "submissions/inbox/intake-xyz/manifest.json",
            "submissions/inbox/intake-xyz/tasks/1/agent_response.json",
        ],
    )

    with pytest.raises(validator.PRGateValidationError, match="C06_EVALUATOR_VERSION_MISMATCH"):
        validator.run_pr_gate_intake_validation(
            tmp_path,
            base_sha="abc",
            head_sha="def",
            expected_evaluator_version="0.0.0",
        )
test_run_pr_gate_intake_validation_rejects_evaluator_version_mismatch depends on whatever importlib_metadata.version("webarena-verified") returns in the test environment. If the project version ever starts with 0.0.0 (or the expected prefix matches for other reasons), this test will become flaky/incorrect. Consider monkeypatching validator.resolve_evaluator_version() to return a known value and then asserting the mismatch against a different expected prefix.
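The idea of pinning the resolved version can be shown with a self-contained sketch. Everything here except the name `resolve_evaluator_version` and the error code is an assumption: the `validator` object is a stand-in namespace rather than the real module, `check_evaluator_version` is a hypothetical simplification of the prefix comparison, and `unittest.mock` is used in place of pytest's `monkeypatch` so the example runs on its own.

```python
from types import SimpleNamespace
from unittest import mock


def _installed_version() -> str:
    # Stand-in for a lookup whose result depends on the test environment.
    return "unknown-at-test-time"


# Hypothetical stand-in for the validator module.
validator = SimpleNamespace(resolve_evaluator_version=_installed_version)


def check_evaluator_version(expected_prefix: str) -> None:
    """Raise if the resolved version does not start with expected_prefix."""
    actual = validator.resolve_evaluator_version()
    if not actual.startswith(expected_prefix):
        raise ValueError(
            f"C06_EVALUATOR_VERSION_MISMATCH: {actual!r} vs {expected_prefix!r}"
        )


def test_version_mismatch_is_deterministic() -> None:
    # Pin the resolved version so the outcome no longer depends on the
    # installed package metadata.
    with mock.patch.object(validator, "resolve_evaluator_version", return_value="1.2.3"):
        check_evaluator_version("1.2")  # matching prefix passes
        try:
            check_evaluator_version("9.9")
        except ValueError as exc:
            assert "C06_EVALUATOR_VERSION_MISMATCH" in str(exc)
        else:
            raise AssertionError("expected a mismatch error")
```

In the actual pytest test, the equivalent step would presumably be `monkeypatch.setattr(validator, "resolve_evaluator_version", lambda: "1.2.3")` before calling the run wrapper with a deliberately different expected prefix.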
Summary
- Add `Leaderboard PR Gate` workflow for `leaderboard-submissions` pull requests touching `submissions/inbox/**`, with read-only permissions and credential-less checkout.
- Add `dev/leaderboard/pr_gate_intake_validator.py` plus `inv dev.leaderboard.pr-gate-intake-validate` to enforce one-intake scope, required files, manifest hash/size integrity, and task `.missing` xor artifact invariants.
- Align leaderboard payload/type fields to `StrEnum` values and replace PR validator dataclasses with frozen Pydantic models; add focused validator/type tests and update Lane C planning docs.

Validation
uv run pytest tests/leaderboard/test_pr_gate_intake_validator.py tests/leaderboard/test_submission_pr_validator.py tests/types/test_leaderboard_types.py