Add Reviewer 3 (closed-source HTTP API) as a benchmark system by dangng2004 · Pull Request #85 · ChicagoHAI/OpenAIReview

dangng2004 · 2026-05-15T20:31:14Z

Summary

Wires Reviewer 3 into both the perturbation benchmark (run_benchmark.py) and the conference outcomes study (run_competitors.py) so it can be benchmarked on the exact same paper sets as openaireview and coarse.
Perturbation side: LaTeX-as-md sources compile to PDF via pdflatex before submission; cfg knobs (review_mode, poll_*) thread from YAML through Reviewer3System into the adapter via job payload; new comment fields (citedText, title, severity 1–4) get normalized into our schema.
Conference side: new Reviewer3Adapter reuses the perturbation HTTP submit/poll/normalize helpers (no duplication); fixed method key reviewer3__reviewer3 since R3 has no model selector.
Consolidates severity mapping into benchmarks/perturbation/_severity.py — perturbation adapter, compute_auc.py, and report_scaleup.py now share one source of truth (collapses three inlined copies of COARSE_SEVERITY_MAP).
8 perturbation configs (full_<domain>_reviewer3.yaml) mirror the matching full_<domain>_coarse.yaml knobs for max_tokens / min_perturbations.
requests>=2.31 added to [project.optional-dependencies] benchmarks; .env.example documents REVIEWER3_API_KEY / REVIEWER3_USER_ID (UUID from the vendor's web UI session JSON, not an email).

Test plan

Adapter smoke run end-to-end on 1 cs_CC paper via run_benchmark.py (10 min wall, 11 comments, recall 10/17 on LLM judge).
Conference adapter imports + --dry-run lists all 197 v2 papers without submitting.
All 8 perturbation configs load via run_benchmark.load_config(...) and infer system=reviewer3 correctly.
Severity normalization unit checks for openaireview / coarse / reviewer3.
Full launch in flight at PR time: ~222 perturbation cells + 197 conference papers running concurrently.

Operational notes (not in the diff)

conference_study/{manifests,papers,results} are gitignored data dirs that live only in the sibling worktree. To run conference_study locally, symlink them in:

cd benchmarks/conference_study
ln -s ../../../OpenAIReview/benchmarks/conference_study/manifests manifests
ln -s ../../../OpenAIReview/benchmarks/conference_study/papers    papers
ln -s ../../../OpenAIReview/benchmarks/conference_study/results   results

run_competitors.py doesn't load .env (only run_benchmark.py does). Export REVIEWER3_* before launching, or add dotenv.load_dotenv() at the top of run_competitors.py in a follow-up.
Adapter has no 429/backoff handling; a transient rate-limit response fails a job. Re-running the same command is idempotent (completed JSONs are skipped) so stragglers can be retried.

🤖 Generated with Claude Code

Wires reviewer3 into both the perturbation benchmark (run_benchmark.py) and the conference outcomes study (run_competitors.py), so it can be benchmarked on the exact same paper sets as openaireview and coarse. Highlights - Perturbation: LaTeX-as-md sources are compiled to PDF with pdflatex before submission (cs_CC corpus is full LaTeX); other domains pass through. PDF MIME is set; cfg knobs (review_mode, poll_interval_s, poll_timeout_s) are threaded from YAML through Reviewer3System into the adapter via the job payload. - Comment normalization picks up the recently-added Reviewer 3 fields: citedText (-> quote), title, severity (1-4 -> canonical 3-tier scale). - Severity mapping is consolidated into benchmarks/perturbation/_severity.py so the perturbation adapter, compute_auc.py, and report_scaleup.py share one source of truth (collapses three inlined copies of COARSE_SEVERITY_MAP). - Conference adapter reuses the perturbation HTTP submit/poll/normalize helpers via sys.path import; no duplication. R3 has no model selector, so method_key is fixed at reviewer3__reviewer3. - 8 perturbation configs (full_<domain>_reviewer3.yaml) mirror the full_<domain>_coarse.yaml knobs for max_tokens / min_perturbations. - conference_study/configs/reviewer3.yaml pins models: [reviewer3] so the per-(paper, model) loop fires once per paper rather than once per manifest model. - requests>=2.31 added to [project.optional-dependencies] benchmarks. - .env.example documents REVIEWER3_API_KEY / REVIEWER3_USER_ID (UUID, not email — see neurips_2026 setup notes). Operational note (not in this diff) - conference_study/{manifests,papers,results} are gitignored data dirs that live only in the sibling worktree. To run, symlink them in: cd benchmarks/conference_study ln -s ../../../OpenAIReview/benchmarks/conference_study/manifests manifests ln -s ../../../OpenAIReview/benchmarks/conference_study/papers papers ln -s ../../../OpenAIReview/benchmarks/conference_study/results results Known gap - run_competitors.py does not load .env (only run_benchmark.py does); export REVIEWER3_* before launching, or add a dotenv.load_dotenv() at the top of run_competitors.py in a follow-up. Smoke-validated end-to-end on 1 cs_CC paper through the perturbation runner; recall 10/17 on the staged perturbations (LLM judge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…figs Two cleanups so `git status` reflects only real work: Root .gitignore - Add slash-less `benchmarks/conference_study/papers` (the existing `papers/` rule in conference_study/.gitignore uses a trailing slash and wouldn't match a symlink — same situation the file already handles for manifests/ and results/). - Ignore `benchmarks/experimental_perturbations/` (removed from the repo in 6373fad but the local dir lingers). benchmarks/perturbation/.gitignore - Extend the "ephemeral configs" rule beyond `configs/_*` to cover the per-domain configs we generate locally but don't check in: configs/cs_*scaleup*.yaml configs/full_*.yaml configs/grok_*.yaml configs/longtail_*.yaml configs/subset_*.yaml configs/r3_smoke*.yaml Add `!configs/full_*_reviewer3.yaml` exception so the canonical reviewer3 configs that ARE tracked don't get hidden by the bulk rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ilures) The token-based truncation in `prepare_units` cuts the LaTeX-as-md staged file at a token boundary, which routinely leaves the document mid-environment. pdflatex on the staged file then "produces no usable PDF" for a fraction of papers, surfacing as a hard failure in the reviewer3 run. Switches the reviewer3 system to compile the FULL pre-truncation source (`u.src_corrupted`) and then trim the rendered PDF to its first N pages. This matches the `max_pages: 20` convention conference_study/configs/coarse.yaml already uses for coarse, so reviewer3 sees roughly the same content window the other systems see. Three pdflatex robustness fixes layered in: 1. Strip orphan `\input{...}` / `\include{...}` whose target file isn't bundled. pdflatex aborts hard on a missing \input, killing the compile for the whole paper even when the body is fine (paper_005 cs_CC had `\input{mypreamble.tex}`). 2. Inject a defensive preamble of `\providecommand` fallbacks for common author-defined shortcuts (\bbR, \calA, \bfx, \eps, \vvirg, \ootimes, etc.). Authors typically define these in private preamble files we don't have; \providecommand is a no-op when the command is already defined, so the injection is safe blanket coverage. 3. subprocess.run uses bytes (text=False) instead of text=True so the pdflatex log's non-UTF-8 accent bytes don't blow up Python's decoder (paper_009 cs_CC had byte 0xaa at offset ~57k). Changes - Reviewer3System.build_jobs threads `u.src_corrupted` (full path) and `cfg["max_pages"]` into the job payload. - Reviewer3Job adopts `source` + `max_pages` fields; `_submit` / `_ensure_pdf` forward them. - `_ensure_pdf` prefers `source` over `paper` for the compile when set; caches alongside the source with `.trim.pdf` suffix when trimmed. - `_trim_pages_to` (in-place) and `_maybe_trim_pages` (for already-PDF inputs) use pymupdf to cap pages. - `max_pages: 20` added to all 8 `full_*_reviewer3.yaml` configs. - run_benchmark.py Config gains `max_pages: int | None = None`. Smoke-validated on three previously-failing cs_CC papers (2604.19872v1 with missing \input + custom commands, 2604.24325v1 with same pattern, 2604.24879v1 with non-UTF8 bytes in pdflatex output) — all three now produce 20-page trimmed PDFs in 2–4s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…to 1h Two operational fixes after observing the live runs: 1. Conference Reviewer3Adapter wasn't honoring max_pages — it called _submit(pdf) directly without forwarding max_pages, so R3 received the full PDF (sometimes 50+ pages, 5+ MB) and a chunk of those tripped R3's HTTP 413 limit. Adapter now reads `max_pages` from the top-level config (falling back to reviewer3_options.max_pages) and threads it through to _submit -> _ensure_pdf -> _maybe_trim_pages. 2. poll_timeout_s bumped from 1800 (30 min) to 3600 (60 min) in all 8 perturbation configs and conference reviewer3.yaml. Observed wall time per paper under 10-concurrent load was routinely 25-40 min, with a long tail past 30 — causing dozens of false-timeout failures even though R3 was still processing. The session remains live on R3's side regardless, but the adapter abandoned them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three layered fixes for the remaining pdflatex-aborts: 1. `_force_known_documentclass` — rewrite \documentclass[opts]{X} to \documentclass{amsart} when X.cls isn't installed (kpsewhich miss). Many papers use journal classes not in TeX Live (aq-amsart, sn-jnl, atlasdoc, aastex631/701, cas-sc, iopjournal, svjour3, ieeeconf, informs3, revtex4 without -1/-2). amsart is the math-friendly fallback, preserving theorem/lemma envs. Regex now matches only the **first uncommented** \documentclass so commented-out example lines (e.g. q-bio.GN papers carry `%%\documentclass[...]{sn-jnl}` followed by the active variant) don't short-circuit the lookup. 2. `_strip_missing_packages` — same idea for \usepackage{X} when X.sty isn't installed. pdflatex aborts hard on a missing package too; the common offender on hep-ex papers is `\usepackage{jinstpub}`. Comment them out; the body's references to missing-package commands degrade to undefined-control-sequence warnings (pdflatex in nonstopmode still produces a usable PDF). 3. Rescue preamble now runs inside \AtBeginDocument{...} so its \providecommand falls AFTER all \usepackage{...} loads. Previously, the rescue defined \Bbbk before amsfonts loaded, then amsfonts `\DeclareSymbolFont` errored with "Command \Bbbk already defined". Spot-test across 8 domains: 7 of 8 sample papers compile cleanly. The holdout is q-bio.GN paper_001 (uses sn-jnl class with extensive class- specific commands that don't degrade gracefully); 9 of 10 q-bio.GN papers use bundled classes and work. Expected pdflatex-failure cells dropped from 39 to ~3 (one paper × three error types). kpsewhich results are cached per process for the common-class / common-package set so repeated rewrites across the 222-cell run pay the subprocess cost once per (class|package). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…waste Adds `.sid` file alongside each review output. On run: - If `.sid` exists and points to a still-valid R3 session, we POLL that session instead of submitting a new one. Avoids the duplicate-session credit waste observed when runs were killed mid-poll (~34% of credits spent on duplicates across yesterday's runs, per the rescue audit). - If the sid fetch hard-fails (403 "not found" / 404), drop the stale file and submit fresh. - On submit success: write the sid IMMEDIATELY so a SIGKILL between submit and first poll-tick still leaves a recovery path. - On any failure: keep the sid file — next run resumes the same session. - On success: leave the sid file in place as an audit trail (cheap; the out_json's presence is the real "done" marker for skip-completed). Perturbation: - `Reviewer3Job` gains `sid_file: Path | None`. `Reviewer3System.build_jobs` computes `<review_dir>/<stem>.sid` per cell and threads it through. - `_run_one` handles the resume vs submit branch. Conference: - `run_competitors.py` injects `cfg["_sid_file"]` next to the merged paper JSON (under `<results>/.sids/<slug>.<method_key>.sid`). - Conference `Reviewer3Adapter.review()` honors the underscore-prefixed key and persists/resumes the same way. Also drops the conference `max_per_model: 1` back to 5 after confirming R3's throttle from yesterday has lifted (single-paper probe transitioned waiting -> processing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two small additions: 1. `benchmarks/perturbation/rescue_sessions.py` — CLI that walks every `.sid` file under both result trees, fetches its R3 session via the API, and writes results to disk for any session that completed in the meantime (status=completed). Idempotent — `out_json` with content is skipped — so safe to re-run. The pattern emerged when ~30+ R3 sessions completed on the server after our local poll loop had timed out and abandoned them. Rescue recovered them without re-submitting (no duplicate-credit cost). Reviewer3 is the only system in the benchmark with async server-side state, so this stays reviewer3-specific by design. 2. `.gitignore` — add slash-less entries for the three symlinks we create when running R3 from a sibling worktree: - `.venv` (the existing `.venv/` rule misses symlinks) - `benchmarks/perturbation/data` - `benchmarks/perturbation/results` No files were tracked at these paths; this just cleans up `git status` noise when the symlinks are present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dangng2004 and others added 3 commits May 15, 2026 15:30

dangng2004 mentioned this pull request May 15, 2026

reviewer3: compile FULL source + page-trim PDF (fix pdflatex failures) #86

Merged

2 tasks

dangng2004 and others added 4 commits May 15, 2026 20:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Reviewer 3 (closed-source HTTP API) as a benchmark system#85

Add Reviewer 3 (closed-source HTTP API) as a benchmark system#85
dangng2004 wants to merge 7 commits into
mainfrom
reviewer3-integration

dangng2004 commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dangng2004 commented May 15, 2026

Summary

Test plan

Operational notes (not in the diff)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant