Skip to content

Add Reviewer 3 (closed-source HTTP API) as a benchmark system#85

Open
dangng2004 wants to merge 7 commits into
mainfrom
reviewer3-integration
Open

Add Reviewer 3 (closed-source HTTP API) as a benchmark system#85
dangng2004 wants to merge 7 commits into
mainfrom
reviewer3-integration

Conversation

@dangng2004
Copy link
Copy Markdown
Contributor

Summary

  • Wires Reviewer 3 into both the perturbation benchmark (run_benchmark.py) and the conference outcomes study (run_competitors.py) so it can be benchmarked on the exact same paper sets as openaireview and coarse.
  • Perturbation side: LaTeX-as-md sources compile to PDF via pdflatex before submission; cfg knobs (review_mode, poll_*) thread from YAML through Reviewer3System into the adapter via job payload; new comment fields (citedText, title, severity 1–4) get normalized into our schema.
  • Conference side: new Reviewer3Adapter reuses the perturbation HTTP submit/poll/normalize helpers (no duplication); fixed method key reviewer3__reviewer3 since R3 has no model selector.
  • Consolidates severity mapping into benchmarks/perturbation/_severity.py — perturbation adapter, compute_auc.py, and report_scaleup.py now share one source of truth (collapses three inlined copies of COARSE_SEVERITY_MAP).
  • 8 perturbation configs (full_<domain>_reviewer3.yaml) mirror the matching full_<domain>_coarse.yaml knobs for max_tokens / min_perturbations.
  • requests>=2.31 added to [project.optional-dependencies] benchmarks; .env.example documents REVIEWER3_API_KEY / REVIEWER3_USER_ID (UUID from the vendor's web UI session JSON, not an email).

Test plan

  • Adapter smoke run end-to-end on 1 cs_CC paper via run_benchmark.py (10 min wall, 11 comments, recall 10/17 on LLM judge).
  • Conference adapter imports + --dry-run lists all 197 v2 papers without submitting.
  • All 8 perturbation configs load via run_benchmark.load_config(...) and infer system=reviewer3 correctly.
  • Severity normalization unit checks for openaireview / coarse / reviewer3.
  • Full launch in flight at PR time: ~222 perturbation cells + 197 conference papers running concurrently.

Operational notes (not in the diff)

  • conference_study/{manifests,papers,results} are gitignored data dirs that live only in the sibling worktree. To run conference_study locally, symlink them in:
    cd benchmarks/conference_study
    ln -s ../../../OpenAIReview/benchmarks/conference_study/manifests manifests
    ln -s ../../../OpenAIReview/benchmarks/conference_study/papers    papers
    ln -s ../../../OpenAIReview/benchmarks/conference_study/results   results
    
  • run_competitors.py doesn't load .env (only run_benchmark.py does). Export REVIEWER3_* before launching, or add dotenv.load_dotenv() at the top of run_competitors.py in a follow-up.
  • Adapter has no 429/backoff handling; a transient rate-limit response fails a job. Re-running the same command is idempotent (completed JSONs are skipped) so stragglers can be retried.

🤖 Generated with Claude Code

dangng2004 and others added 3 commits May 15, 2026 15:30
Wires reviewer3 into both the perturbation benchmark (run_benchmark.py) and
the conference outcomes study (run_competitors.py), so it can be benchmarked
on the exact same paper sets as openaireview and coarse.

Highlights
- Perturbation: LaTeX-as-md sources are compiled to PDF with pdflatex
  before submission (cs_CC corpus is full LaTeX); other domains pass
  through. PDF MIME is set; cfg knobs (review_mode, poll_interval_s,
  poll_timeout_s) are threaded from YAML through Reviewer3System into
  the adapter via the job payload.
- Comment normalization picks up the recently-added Reviewer 3 fields:
  citedText (-> quote), title, severity (1-4 -> canonical 3-tier scale).
- Severity mapping is consolidated into benchmarks/perturbation/_severity.py
  so the perturbation adapter, compute_auc.py, and report_scaleup.py share
  one source of truth (collapses three inlined copies of COARSE_SEVERITY_MAP).
- Conference adapter reuses the perturbation HTTP submit/poll/normalize
  helpers via sys.path import; no duplication. R3 has no model selector,
  so method_key is fixed at reviewer3__reviewer3.
- 8 perturbation configs (full_<domain>_reviewer3.yaml) mirror the
  full_<domain>_coarse.yaml knobs for max_tokens / min_perturbations.
- conference_study/configs/reviewer3.yaml pins models: [reviewer3] so
  the per-(paper, model) loop fires once per paper rather than once per
  manifest model.
- requests>=2.31 added to [project.optional-dependencies] benchmarks.
- .env.example documents REVIEWER3_API_KEY / REVIEWER3_USER_ID
  (UUID, not email — see neurips_2026 setup notes).

Operational note (not in this diff)
- conference_study/{manifests,papers,results} are gitignored data dirs
  that live only in the sibling worktree. To run, symlink them in:
    cd benchmarks/conference_study
    ln -s ../../../OpenAIReview/benchmarks/conference_study/manifests manifests
    ln -s ../../../OpenAIReview/benchmarks/conference_study/papers    papers
    ln -s ../../../OpenAIReview/benchmarks/conference_study/results   results

Known gap
- run_competitors.py does not load .env (only run_benchmark.py does);
  export REVIEWER3_* before launching, or add a dotenv.load_dotenv() at
  the top of run_competitors.py in a follow-up.

Smoke-validated end-to-end on 1 cs_CC paper through the perturbation
runner; recall 10/17 on the staged perturbations (LLM judge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…figs

Two cleanups so `git status` reflects only real work:

Root .gitignore
- Add slash-less `benchmarks/conference_study/papers` (the existing
  `papers/` rule in conference_study/.gitignore uses a trailing slash and
  wouldn't match a symlink — same situation the file already handles for
  manifests/ and results/).
- Ignore `benchmarks/experimental_perturbations/` (removed from the repo
  in 6373fad but the local dir lingers).

benchmarks/perturbation/.gitignore
- Extend the "ephemeral configs" rule beyond `configs/_*` to cover the
  per-domain configs we generate locally but don't check in:
    configs/cs_*scaleup*.yaml
    configs/full_*.yaml
    configs/grok_*.yaml
    configs/longtail_*.yaml
    configs/subset_*.yaml
    configs/r3_smoke*.yaml
  Add `!configs/full_*_reviewer3.yaml` exception so the canonical
  reviewer3 configs that ARE tracked don't get hidden by the bulk rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ilures)

The token-based truncation in `prepare_units` cuts the LaTeX-as-md staged file
at a token boundary, which routinely leaves the document mid-environment.
pdflatex on the staged file then "produces no usable PDF" for a fraction of
papers, surfacing as a hard failure in the reviewer3 run.

Switches the reviewer3 system to compile the FULL pre-truncation source
(`u.src_corrupted`) and then trim the rendered PDF to its first N pages.
This matches the `max_pages: 20` convention conference_study/configs/coarse.yaml
already uses for coarse, so reviewer3 sees roughly the same content window
the other systems see.

Three pdflatex robustness fixes layered in:

1. Strip orphan `\input{...}` / `\include{...}` whose target file isn't
   bundled. pdflatex aborts hard on a missing \input, killing the compile
   for the whole paper even when the body is fine (paper_005 cs_CC had
   `\input{mypreamble.tex}`).

2. Inject a defensive preamble of `\providecommand` fallbacks for common
   author-defined shortcuts (\bbR, \calA, \bfx, \eps, \vvirg, \ootimes,
   etc.). Authors typically define these in private preamble files we
   don't have; \providecommand is a no-op when the command is already
   defined, so the injection is safe blanket coverage.

3. subprocess.run uses bytes (text=False) instead of text=True so the
   pdflatex log's non-UTF-8 accent bytes don't blow up Python's decoder
   (paper_009 cs_CC had byte 0xaa at offset ~57k).

Changes
- Reviewer3System.build_jobs threads `u.src_corrupted` (full path) and
  `cfg["max_pages"]` into the job payload.
- Reviewer3Job adopts `source` + `max_pages` fields; `_submit` / `_ensure_pdf`
  forward them.
- `_ensure_pdf` prefers `source` over `paper` for the compile when set;
  caches alongside the source with `.trim.pdf` suffix when trimmed.
- `_trim_pages_to` (in-place) and `_maybe_trim_pages` (for already-PDF inputs)
  use pymupdf to cap pages.
- `max_pages: 20` added to all 8 `full_*_reviewer3.yaml` configs.
- run_benchmark.py Config gains `max_pages: int | None = None`.

Smoke-validated on three previously-failing cs_CC papers (2604.19872v1 with
missing \input + custom commands, 2604.24325v1 with same pattern, 2604.24879v1
with non-UTF8 bytes in pdflatex output) — all three now produce 20-page
trimmed PDFs in 2–4s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dangng2004 and others added 4 commits May 15, 2026 20:58
…to 1h

Two operational fixes after observing the live runs:

1. Conference Reviewer3Adapter wasn't honoring max_pages — it called
   _submit(pdf) directly without forwarding max_pages, so R3 received the
   full PDF (sometimes 50+ pages, 5+ MB) and a chunk of those tripped R3's
   HTTP 413 limit. Adapter now reads `max_pages` from the top-level config
   (falling back to reviewer3_options.max_pages) and threads it through to
   _submit -> _ensure_pdf -> _maybe_trim_pages.

2. poll_timeout_s bumped from 1800 (30 min) to 3600 (60 min) in all 8
   perturbation configs and conference reviewer3.yaml. Observed wall time
   per paper under 10-concurrent load was routinely 25-40 min, with a
   long tail past 30 — causing dozens of false-timeout failures even
   though R3 was still processing. The session remains live on R3's side
   regardless, but the adapter abandoned them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three layered fixes for the remaining pdflatex-aborts:

1. `_force_known_documentclass` — rewrite \documentclass[opts]{X} to
   \documentclass{amsart} when X.cls isn't installed (kpsewhich miss).
   Many papers use journal classes not in TeX Live (aq-amsart, sn-jnl,
   atlasdoc, aastex631/701, cas-sc, iopjournal, svjour3, ieeeconf,
   informs3, revtex4 without -1/-2). amsart is the math-friendly fallback,
   preserving theorem/lemma envs.

   Regex now matches only the **first uncommented** \documentclass so
   commented-out example lines (e.g. q-bio.GN papers carry
   `%%\documentclass[...]{sn-jnl}` followed by the active variant) don't
   short-circuit the lookup.

2. `_strip_missing_packages` — same idea for \usepackage{X} when X.sty
   isn't installed. pdflatex aborts hard on a missing package too; the
   common offender on hep-ex papers is `\usepackage{jinstpub}`. Comment
   them out; the body's references to missing-package commands degrade
   to undefined-control-sequence warnings (pdflatex in nonstopmode still
   produces a usable PDF).

3. Rescue preamble now runs inside \AtBeginDocument{...} so its
   \providecommand falls AFTER all \usepackage{...} loads. Previously,
   the rescue defined \Bbbk before amsfonts loaded, then amsfonts
   `\DeclareSymbolFont` errored with "Command \Bbbk already defined".

Spot-test across 8 domains: 7 of 8 sample papers compile cleanly. The
holdout is q-bio.GN paper_001 (uses sn-jnl class with extensive class-
specific commands that don't degrade gracefully); 9 of 10 q-bio.GN papers
use bundled classes and work. Expected pdflatex-failure cells dropped
from 39 to ~3 (one paper × three error types).

kpsewhich results are cached per process for the common-class /
common-package set so repeated rewrites across the 222-cell run pay the
subprocess cost once per (class|package).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…waste

Adds `.sid` file alongside each review output. On run:
- If `.sid` exists and points to a still-valid R3 session, we POLL that
  session instead of submitting a new one. Avoids the duplicate-session
  credit waste observed when runs were killed mid-poll (~34% of credits
  spent on duplicates across yesterday's runs, per the rescue audit).
- If the sid fetch hard-fails (403 "not found" / 404), drop the stale
  file and submit fresh.
- On submit success: write the sid IMMEDIATELY so a SIGKILL between
  submit and first poll-tick still leaves a recovery path.
- On any failure: keep the sid file — next run resumes the same session.
- On success: leave the sid file in place as an audit trail (cheap; the
  out_json's presence is the real "done" marker for skip-completed).

Perturbation:
- `Reviewer3Job` gains `sid_file: Path | None`. `Reviewer3System.build_jobs`
  computes `<review_dir>/<stem>.sid` per cell and threads it through.
- `_run_one` handles the resume vs submit branch.

Conference:
- `run_competitors.py` injects `cfg["_sid_file"]` next to the merged
  paper JSON (under `<results>/.sids/<slug>.<method_key>.sid`).
- Conference `Reviewer3Adapter.review()` honors the underscore-prefixed
  key and persists/resumes the same way.

Also drops the conference `max_per_model: 1` back to 5 after confirming
R3's throttle from yesterday has lifted (single-paper probe transitioned
waiting -> processing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small additions:

1. `benchmarks/perturbation/rescue_sessions.py` — CLI that walks every
   `.sid` file under both result trees, fetches its R3 session via the
   API, and writes results to disk for any session that completed in
   the meantime (status=completed). Idempotent — `out_json` with content
   is skipped — so safe to re-run.

   The pattern emerged when ~30+ R3 sessions completed on the server
   after our local poll loop had timed out and abandoned them.
   Rescue recovered them without re-submitting (no duplicate-credit
   cost). Reviewer3 is the only system in the benchmark with async
   server-side state, so this stays reviewer3-specific by design.

2. `.gitignore` — add slash-less entries for the three symlinks we
   create when running R3 from a sibling worktree:
     - `.venv`           (the existing `.venv/` rule misses symlinks)
     - `benchmarks/perturbation/data`
     - `benchmarks/perturbation/results`
   No files were tracked at these paths; this just cleans up
   `git status` noise when the symlinks are present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant