
Add read-only PR gate intake validation #33

Merged
Am1n3e merged 2 commits into feature/add-leaderboard from wip/implementation-plan-check
Mar 7, 2026

Conversation

Am1n3e (Contributor) commented Mar 7, 2026

Summary

  • Add a new Leaderboard PR Gate workflow for leaderboard-submissions pull requests touching submissions/inbox/**, with read-only permissions and credential-less checkout.
  • Introduce dev/leaderboard/pr_gate_intake_validator.py plus inv dev.leaderboard.pr-gate-intake-validate to enforce one-intake scope, required files, manifest hash/size integrity, and task .missing xor artifact invariants.
  • Align leaderboard payload types with StrEnum values and replace PR validator dataclasses with frozen Pydantic models; add focused validator/type tests and update Lane C planning docs.
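
The ".missing xor artifact" invariant above can be read as: each task directory holds either a `.missing` marker or at least one artifact file, never both and never neither. The following is a minimal sketch under that reading, not the code in `pr_gate_intake_validator.py`; the function name and the use of `ValueError` are hypothetical.

```python
from pathlib import Path

# Assumption: the marker is a file literally named ".missing".
MISSING_MARKER = ".missing"


def check_missing_xor_artifacts(task_dir: Path) -> None:
    """Raise ValueError unless task_dir has the marker or artifacts, but not both."""
    names = {entry.name for entry in task_dir.iterdir() if entry.is_file()}
    has_marker = MISSING_MARKER in names
    has_artifacts = bool(names - {MISSING_MARKER})
    # XOR holds exactly when the two flags differ.
    if has_marker == has_artifacts:
        state = "both" if has_marker else "neither"
        raise ValueError(
            f"{task_dir.name}: expected .missing xor artifacts, found {state}"
        )
```

Comparing the two booleans for equality keeps both failure modes (marker plus artifacts, and an empty directory) in a single branch.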

Validation

  • uv run pytest tests/leaderboard/test_pr_gate_intake_validator.py tests/leaderboard/test_submission_pr_validator.py tests/types/test_leaderboard_types.py

Am1n3e force-pushed the wip/implementation-plan-check branch from cc5b0c1 to 3b617dc on March 7, 2026 13:01
Am1n3e merged commit 79a6a63 into feature/add-leaderboard on Mar 7, 2026
Am1n3e deleted the wip/implementation-plan-check branch on March 7, 2026 13:02

Copilot AI left a comment


Pull request overview

Adds a read-only “PR Gate” validation layer for leaderboard-submissions PRs that modify submissions/inbox/**, plus associated validator logic/types/tests to ensure intake payloads are structurally valid and produce a deterministic score preview.

Changes:

  • Introduces Leaderboard PR Gate GitHub Actions workflow with read-only permissions and credential-less checkout for submissions/inbox/** diffs.
  • Adds dev/leaderboard/pr_gate_intake_validator.py and an invoke task to validate intake layout, manifest hash/size checks, and task .missing vs artifacts invariants, emitting a score preview.
  • Aligns leaderboard payload/type fields to StrEnum values and migrates PR validator context/file models to frozen Pydantic models; adds/updates targeted tests and planning docs.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| .github/workflows/leaderboard-pr-gate.yml | Adds the PR Gate workflow that runs intake validation and publishes a step summary. |
| dev/leaderboard/pr_gate_intake_validator.py | Implements intake tree validation, manifest integrity checks, task invariants, and score preview generation. |
| dev/leaderboard/tasks.py | Wires a new invoke task to run the PR gate intake validator and print parsed outputs. |
| dev/leaderboard/submission_pr_validator.py | Replaces dataclasses with frozen, extra-forbid Pydantic models for PR validation inputs. |
| src/webarena_verified/types/leaderboard/submission_payload.py | Introduces SubmissionLeaderboard as a StrEnum and updates SubmissionMetadata typing accordingly. |
| src/webarena_verified/types/leaderboard/leaderboard_data.py | Introduces LeaderboardView as a StrEnum and updates LeaderboardTableFile.leaderboard typing accordingly. |
| src/webarena_verified/types/leaderboard/__init__.py | Re-exports the new enums from the leaderboard types package. |
| tests/leaderboard/test_pr_gate_intake_validator.py | Adds unit tests for intake selection, manifest hashing, task invariants, and the run wrapper. |
| tests/types/test_leaderboard_types.py | Updates type tests to assert StrEnum values instead of raw strings. |
| .wip/lane-c-pr-gate.md | Updates Lane C checklist items to completed with implementation references. |
| .wip/lane-c-pr-gate-implementation-plan.md | Adds a detailed code-level implementation plan and test matrix for the PR Gate. |


Comment on lines +226 to +231
def validate_manifest_integrity(intake_root: Path, manifest: IntakeManifest) -> None:
    for entry in manifest.files:
        target = _resolve_declared_file(intake_root, entry.path)
        if not target.is_file():
            _fail("C04_MANIFEST_PATH_MISSING", f"Manifest declared path does not exist as a file: {entry.path}")


Copilot AI Mar 7, 2026


validate_manifest_integrity() only verifies the files listed in manifest.files. As a result, an intake can omit required artifacts (e.g., task JSON/HAR files) from the manifest or add extra undeclared files under the intake tree and still pass hash/size validation. To make the “manifest integrity” guarantee meaningful, consider enforcing that the manifest covers all required files (at least submission.json and all validated task files) and/or that there are no extra files on disk that are not declared in the manifest (excluding manifest.json itself).
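
One way to express the suggested coverage check is a two-way set comparison between the files on disk and the paths the manifest declares. This is a minimal sketch, assuming declared paths are POSIX-style strings relative to the intake root; `find_manifest_coverage_gaps` and its parameters are hypothetical names, not part of the validator.

```python
from pathlib import Path


def find_manifest_coverage_gaps(
    intake_root: Path, declared_paths: set[str]
) -> tuple[set[str], set[str]]:
    """Return (undeclared, missing) relative paths for an intake tree.

    undeclared: regular files on disk that the manifest does not list.
    missing: paths the manifest lists that are not regular files on disk.
    """
    on_disk = {
        path.relative_to(intake_root).as_posix()
        for path in intake_root.rglob("*")
        if path.is_file()
    }
    # Assumption: manifest.json sits at the intake root and is excluded,
    # since a manifest cannot declare its own hash.
    on_disk.discard("manifest.json")
    return on_disk - declared_paths, declared_paths - on_disk
```

A caller could fail the gate when either set is non-empty, which would close both gaps described above: required files left out of the manifest and extra files left on disk.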

Comment on lines +255 to +258
for task_dir in task_dirs:
    names = {entry.name for entry in task_dir.iterdir() if entry.is_file()}
    if not names:
        _fail("C05_TASK_EMPTY_DIR", f"Task directory has no files: {task_dir}")

Copilot AI Mar 7, 2026


validate_task_invariants() builds names from entry.is_file() only, so directories/symlinks inside a task folder are silently ignored. This allows extra content (including nested directories) to bypass both the “allowed files” check and manifest hashing. Consider validating that each task directory contains only regular files (no subdirectories/symlinks) and failing if any non-file entries exist (and similarly rejecting non-directory entries directly under tasks/).
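
A sketch of the stricter check might look like the following. It relies on the fact that `Path.is_file()` follows symlinks, so symlinks have to be tested separately; `find_non_regular_entries` is a hypothetical helper, not an existing function in the validator.

```python
from pathlib import Path


def find_non_regular_entries(task_dir: Path) -> list[str]:
    """Return sorted names of entries that are symlinks or not regular files."""
    return sorted(
        entry.name
        for entry in task_dir.iterdir()
        # is_file() is True for a symlink that points at a file,
        # so the symlink test has to come first.
        if entry.is_symlink() or not entry.is_file()
    )
```

The validator could then fail whenever this list is non-empty, before building the set of file names it checks against the allowed list.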

Comment on lines +143 to +162
def test_run_pr_gate_intake_validation_rejects_evaluator_version_mismatch(monkeypatch, tmp_path: Path):
    _create_valid_intake(tmp_path, intake_id="intake-xyz")

    monkeypatch.setattr(
        validator,
        "_run_git_diff_paths",
        lambda **_: [
            "submissions/inbox/intake-xyz/submission.json",
            "submissions/inbox/intake-xyz/manifest.json",
            "submissions/inbox/intake-xyz/tasks/1/agent_response.json",
        ],
    )

    with pytest.raises(validator.PRGateValidationError, match="C06_EVALUATOR_VERSION_MISMATCH"):
        validator.run_pr_gate_intake_validation(
            tmp_path,
            base_sha="abc",
            head_sha="def",
            expected_evaluator_version="0.0.0",
        )

Copilot AI Mar 7, 2026


test_run_pr_gate_intake_validation_rejects_evaluator_version_mismatch depends on whatever importlib_metadata.version("webarena-verified") returns in the test environment. If the project version ever starts with 0.0.0 (or the expected prefix matches for other reasons), this test will become flaky/incorrect. Consider monkeypatching validator.resolve_evaluator_version() to return a known value and then asserting the mismatch against a different expected prefix.
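
The pattern can be illustrated in isolation. The sketch below uses a stand-in `validator` namespace and a simplified prefix check rather than the real module, so everything except the `resolve_evaluator_version` name and the `C06_EVALUATOR_VERSION_MISMATCH` code is hypothetical. It uses `unittest.mock` from the standard library; in the actual test, `monkeypatch.setattr` would play the same role.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the validator module; the real resolver
# reads the installed package version, which varies by environment.
validator = SimpleNamespace(
    resolve_evaluator_version=lambda: "environment-dependent"
)


def check_evaluator_version(expected_prefix: str) -> None:
    """Raise ValueError when the resolved version lacks the expected prefix."""
    actual = validator.resolve_evaluator_version()
    if not actual.startswith(expected_prefix):
        raise ValueError(
            f"C06_EVALUATOR_VERSION_MISMATCH: expected {expected_prefix}*, got {actual}"
        )


def test_version_mismatch_is_deterministic() -> None:
    # Pin the resolved version so the outcome no longer depends on
    # whatever version is installed in the test environment.
    with mock.patch.object(
        validator, "resolve_evaluator_version", return_value="1.2.3"
    ):
        check_evaluator_version("1.2")  # matching prefix passes
        try:
            check_evaluator_version("0.0.0")
        except ValueError as exc:
            assert "C06_EVALUATOR_VERSION_MISMATCH" in str(exc)
        else:
            raise AssertionError("expected a version mismatch")
```

With the version pinned to a known value, the mismatch assertion holds regardless of the project's real version number.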
