chore: improve reward eval harness auditing by jbarnes850 · Pull Request #79 · Arc-Computer/atlas-sdk

jbarnes850 · 2025-10-24T02:16:21Z

Summary

add configs/eval/reward_system.yaml so judge presets/combos can be edited without touching code, falling back to baked-in defaults if missing
extend scripts/eval_reward_models.py with a --collect-audit flag, safe audit payload serialization, and automatic Markdown report emission alongside JSON output
document the new flags and config-driven workflow in docs/reward_eval.md
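The fallback behaviour described above could look roughly like the sketch below. It is not the code from this PR: the function name `load_reward_config`, the `DEFAULT_CONFIG` keys, and the preset and combo names are hypothetical, and PyYAML is assumed to be the YAML parser.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any

# Hypothetical baked-in defaults; the real preset and combo names live in the script.
DEFAULT_CONFIG: dict[str, Any] = {
    "judge_presets": {"example_judge": {"model": "example-model"}},
    "judge_combos": {"example_pair": ["example_judge", "example_judge"]},
}


def load_reward_config(path: str | Path) -> dict[str, Any]:
    """Return the YAML config at `path`, or a copy of the defaults if the file is absent."""
    config_path = Path(path)
    if not config_path.is_file():
        # Missing file: fall back to the defaults without raising.
        return {key: dict(value) for key, value in DEFAULT_CONFIG.items()}

    import yaml  # PyYAML (assumed); imported lazily so the fallback path needs no extra dependency

    loaded = yaml.safe_load(config_path.read_text(encoding="utf-8")) or {}
    if not isinstance(loaded, dict):
        raise ValueError(f"{config_path} must contain a mapping at the top level")

    # Top-level keys missing from the file keep their default values.
    merged = {key: dict(value) for key, value in DEFAULT_CONFIG.items()}
    merged.update(loaded)
    return merged
```

Returning a copy rather than the module-level dictionary keeps later mutations from leaking into the defaults.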

Testing

python -m scripts.eval_reward_models --dataset atlas/data/reward_eval_trajectories.jsonl --judge-combos gemini_pair --baseline gemini_pair --repeats 1 --concurrency 1 --collect-audit --output results/reward/latest_gemini.json
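For readers unfamiliar with the flags in the command above, here is a minimal `argparse` sketch that would accept the same invocation. The flag names come from the command; the types, defaults, `nargs` choice, and help strings are assumptions and may differ from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags used in the test command (assumed shapes)."""
    parser = argparse.ArgumentParser(prog="eval_reward_models")
    parser.add_argument("--dataset", required=True, help="Path to the JSONL trajectories file")
    parser.add_argument("--judge-combos", nargs="+", default=[], help="Combo names to evaluate")
    parser.add_argument("--baseline", help="Combo used as the comparison baseline")
    parser.add_argument("--repeats", type=int, default=1, help="Runs per combo")
    parser.add_argument("--concurrency", type=int, default=1, help="Parallel evaluations")
    parser.add_argument("--collect-audit", action="store_true", help="Capture LLM prompts and responses")
    parser.add_argument("--output", help="Destination for the JSON results")
    return parser


# Parse the same arguments as the command in the Testing section.
args = build_parser().parse_args([
    "--dataset", "atlas/data/reward_eval_trajectories.jsonl",
    "--judge-combos", "gemini_pair",
    "--baseline", "gemini_pair",
    "--repeats", "1",
    "--concurrency", "1",
    "--collect-audit",
    "--output", "results/reward/latest_gemini.json",
])
```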

Copilot

Pull Request Overview

This PR introduces configuration-driven judge presets and audit logging for the reward evaluation harness. It moves judge configuration from hardcoded constants into an external YAML file, adds a --collect-audit flag to capture LLM interactions for debugging, and automatically generates Markdown summary reports alongside JSON output.

Key Changes:

Externalized judge presets and combos to configs/eval/reward_system.yaml with fallback to defaults
Added audit collection capability to track LLM prompts/responses during evaluation
Implemented automatic Markdown report generation with timestamp-based naming
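Audit payloads captured from LLM calls can contain values that `json.dumps` rejects. The sketch below shows one way a sanitization helper might work; the name `to_jsonable` and the conversion rules are assumptions, not the helper added in this PR.

```python
import dataclasses
import json
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Recursively convert `value` into something json.dumps can serialize."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        # Walk the fields directly so nested values go through the same rules.
        return {f.name: to_jsonable(getattr(value, f.name)) for f in dataclasses.fields(value)}
    if isinstance(value, dict):
        # JSON object keys must be strings.
        return {str(key): to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [to_jsonable(item) for item in value]
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    # Last resort: keep a readable representation instead of failing the whole dump.
    return repr(value)
```

Falling back to `repr` trades fidelity for robustness: the audit file always gets written, even when a payload holds an object with no natural JSON form.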

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/eval_reward_models.py | Added YAML config loading, audit collection in evaluator, JSON sanitization helper, and Markdown report generation |
| configs/eval/reward_system.yaml | New configuration file defining judge presets and combos for reward evaluation |
| docs/reward_eval.md | Updated documentation to reference config-first workflow and new CLI flags |
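Markdown report emission with timestamp-based naming might be sketched as follows. The function names, the timestamp format, the report layout, and the metric name in the usage note are all hypothetical; the PR's actual naming scheme is not shown on this page.

```python
from __future__ import annotations

from datetime import datetime, timezone
from pathlib import Path


def render_markdown(results: dict[str, dict[str, float]]) -> str:
    """Render per-combo metrics as a small Markdown table."""
    lines = [
        "# Reward evaluation summary",
        "",
        "| Combo | Metric | Value |",
        "| --- | --- | --- |",
    ]
    for combo, metrics in results.items():
        for metric, value in metrics.items():
            lines.append(f"| {combo} | {metric} | {value:.3f} |")
    return "\n".join(lines) + "\n"


def markdown_path_for(json_path: str | Path, now: datetime | None = None) -> Path:
    """Place the report next to the JSON output, with a UTC timestamp in the file name."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    json_path = Path(json_path)
    return json_path.with_name(f"{json_path.stem}_{stamp}.md")
```

A caller would write the report with something like `markdown_path_for(args.output).write_text(render_markdown(summary))`, where `summary` maps combo names to metric values.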

Copilot AI review requested due to automatic review settings October 24, 2025 02:16

github-project-automation Bot added this to Arc Project Board & Issue Tracker Oct 24, 2025

Copilot AI reviewed Oct 24, 2025

jbarnes850 self-assigned this Oct 24, 2025

jbarnes850 added the enhancement New feature or request label Oct 24, 2025

jbarnes850 moved this to In review in Arc Project Board & Issue Tracker Oct 24, 2025

feat: improve reward eval harness auditing and outputs

6801278

jbarnes850 force-pushed the chore/reward-eval-audit-config branch from 060f994 to 6801278 Compare October 24, 2025 02:35

jbarnes850 merged commit c4b7bae into main Oct 24, 2025
1 check passed

github-project-automation Bot moved this from In review to Done in Arc Project Board & Issue Tracker Oct 24, 2025

jbarnes850 deleted the chore/reward-eval-audit-config branch October 24, 2025 02:37

jbarnes850 merged 1 commit into main from chore/reward-eval-audit-config
