Frontier cs 2.0 blackbox evaluator #116

Merged
joyemang33 merged 2 commits into main from frontier-cs-2.0-blackbox-evaluator
May 27, 2026

@joyemang33
Contributor

Summary

  • Add a black-box judge sidecar for Frontier-CS 2.0 Harbor tasks so agents can call submit.sh without seeing the evaluator implementation.
  • Harden 2.0 evaluator execution: hide the evaluator source, run each submitted solution from an isolated temp file, drop privileges when possible, and keep solution stdout/stderr and internal tracebacks out of judge feedback.
  • Add erdos_demo, a small (N = 10) Erdős unit distance task for fast, visual sanity checks.
  • Update 2.0 docs and Harbor adapter docs for the black-box workflow and demo task.
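
For readers who want a feel for the hardening described above, here is a minimal sketch of the isolation idea, not the code in this PR. It assumes a solution is a standalone Python script; `run_submission` and the feedback strings are hypothetical names chosen for illustration. Privilege dropping and the sidecar transport are left out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_submission(source: str, timeout_s: float = 10.0) -> str:
    """Run solution source in a throwaway directory and return generic feedback.

    Hypothetical helper: the real judge's interface is not shown in this PR page.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # The solution lives in an isolated temp file, away from evaluator code.
        solution_path = Path(workdir) / "solution.py"
        solution_path.write_text(source)
        try:
            completed = subprocess.run(
                # -I runs Python in isolated mode (no user site, no env overrides).
                [sys.executable, "-I", str(solution_path)],
                cwd=workdir,
                stdin=subprocess.DEVNULL,
                # Discard solution output so it can never reach judge feedback.
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
                check=False,
            )
        except subprocess.TimeoutExpired:
            return "time limit exceeded"
    # Only a fixed, generic message is returned; no traceback text is forwarded.
    if completed.returncode != 0:
        return "runtime error"
    return "ok"
```

Because the feedback is chosen from a fixed set of strings, anything the solution prints or raises stays inside the child process.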

Testing

  • Ran Python syntax checks for the updated 2.0 evaluators, adapter, submit helper, and judge server.
  • Verified frontier list 2.0 shows both erdos_demo and erdos_unit_distance.
  • Verified erdos_demo reference evaluates successfully with score 0.
  • Verified a simple 10-point triangular construction scores 44.4444.
  • Verified a malicious solution attempting to print evaluator contents does not leak evaluator source.
  • Verified the 2.0 Harbor adapter generates both erdos_demo and erdos_unit_distance tasks.
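
As a quick illustration of the quantity behind the demo task, the sketch below counts unit-distance pairs in a 10-point triangular arrangement. It assumes "triangular construction" means rows of 4, 3, 2, and 1 points on a unit triangular lattice, which may differ from the construction actually tested. The scoring formula that yields 44.4444 is not stated here, so only the raw pair count is computed; `triangular_points` and `count_unit_distances` are hypothetical helper names.

```python
import math
from itertools import combinations


def triangular_points(rows: int) -> list[tuple[float, float]]:
    """Points of a unit triangular lattice arranged in rows of rows, rows-1, ..., 1."""
    points = []
    for j in range(rows):
        for i in range(rows - j):
            # Each row is shifted by half a unit and raised by sqrt(3)/2.
            points.append((i + j / 2, j * math.sqrt(3) / 2))
    return points


def count_unit_distances(points, tol: float = 1e-9) -> int:
    """Count unordered pairs of points whose distance is 1 within a tolerance."""
    return sum(
        1
        for p, q in combinations(points, 2)
        if abs(math.dist(p, q) - 1.0) < tol
    )


points = triangular_points(4)
print(len(points), count_unit_distances(points))  # 10 18
```

The tolerance is needed because the lattice coordinates involve sqrt(3) and are not exact in floating point.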

@joyemang33 joyemang33 marked this pull request as ready for review May 27, 2026 15:19
@joyemang33 joyemang33 merged commit b8e58d0 into main May 27, 2026
1 check passed