spike(eval): M17 Phase C — multimodal LLM judge gates (script + report stub) #113
Merged
Conversation
…t stub)

Mirrors the Phase A spike protocol (scripts/m17_phase_a_validation.py + docs/m17_phase_a_validation_report.md) for E5 — the Multimodal LLM Judge evaluator from docs/automated_eval_design.md §E5. This unlocks the eval loop that issues #92 and #97 are currently blocked on (both require native-speaker listening, which doesn't scale).

Script (scripts/m17_phase_c_validation.py):
- 6-clip set: 4 corpus typology spans (SV/IT/NEG/NEU, all M2a-wettest agg_m_30-45_001) + 2 degraded variants of sp_sv_a_0001_00 (white noise at +10 dB and -10 dB SNR, RMS-matched to the clean source so the judge is scoring spectral content, not amplitude).
- Two models: gemini-2.5-pro (audio input) and gpt-4o-audio-preview.
- Two reruns per (clip, model) for within-model variance measurement.
- Structured JSON output via a fixed schema (8 quality dimensions on a 1-5 scale + artifacts_detected + confidence_in_assessment + summary).
- Four gate outcomes per model (see the sketch after this commit message):
  - refusal_gate: ≥ 5/6 clips scored under DV-research framing
  - discrimination_gate: mean(corpus) − mean(severe-degraded) ≥ 0.5
  - variance_gate: per-dim std across 2 reruns ≤ 0.5
  - shay_correlation_gate: Spearman ρ ≥ 0.3 vs encoded expected ranking
- Gemini BLOCK_NONE safety per design doc §E5 content-sensitivity guidance (research framing only — DV-content tolerance is required to avoid refusals on the metadata; the audio is entirely synthetic and contains no real persons).
- Hard budget cap via SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD (default $5); aborts pre-flight on estimate, aborts mid-run on cumulative actual.
- --dry-run flag prints the prompt + schema without calling any API, for pre-flight prompt audit.
- Lazy imports for google-genai / openai so the module is cheap to import even without the deps installed.

Report stub (docs/m17_phase_c_validation_report.md):
- Same structure as the Phase A report (TL;DR gate table, Reproduce, Clip-set manifest, Gate definitions, Per-model results, Failure-mode notes, Recommendation matrix, Limitations, Cost report).
- All numeric cells marked TBD pending real-run results.
- Recommendation matrix is pre-filled per gate-outcome scenario so the go/no-go decision is mechanical once the script runs.

Cost estimate at default settings: ~$3.05 total ($0.35 Gemini + $2.70 GPT-4o), well within the $5 cap. Dry-run verified end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
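For readers skimming the gate definitions above, a minimal sketch of how the four checks could be computed from per-run scores. This is illustrative only: the record layout (`runs` as dicts with `clip`, `dims`, `overall_quality`, `refused`) and the function name are assumptions, not the script's actual structures; only the thresholds come from the commit message.

```python
from statistics import mean, pstdev
from scipy.stats import spearmanr

def evaluate_gates(runs, corpus_clips, severe_degraded_clips, expected_rank):
    """runs: list of dicts {clip, dims: {name: 1-5}, overall_quality, refused} (hypothetical layout)."""
    # Refusal gate: at least 5 of the 6 clips received a scored (non-refused) response.
    scored = {r["clip"] for r in runs if not r["refused"]}
    refusal_pass = len(scored) >= 5

    def clip_mean(clips):
        vals = [r["overall_quality"] for r in runs if r["clip"] in clips and not r["refused"]]
        return mean(vals) if vals else float("nan")

    # Discrimination gate: corpus clips must outscore the severely degraded variant by >= 0.5.
    discrimination_pass = clip_mean(corpus_clips) - clip_mean(severe_degraded_clips) >= 0.5

    # Variance gate: per-dimension std across the 2 reruns of each clip must stay <= 0.5.
    stds = []
    for clip in scored:
        reruns = [r for r in runs if r["clip"] == clip and not r["refused"]]
        for dim in reruns[0]["dims"]:
            stds.append(pstdev([r["dims"][dim] for r in reruns]))
    variance_pass = all(s <= 0.5 for s in stds)

    # Shay-correlation gate: Spearman rho >= 0.3 between judge means and the encoded expected ranking.
    clips = sorted(corpus_clips)
    rho, _ = spearmanr([clip_mean({c}) for c in clips], [expected_rank[c] for c in clips])
    shay_pass = rho >= 0.3

    return {"refusal": refusal_pass, "discrimination": discrimination_pass,
            "variance": variance_pass, "shay_correlation": shay_pass}
```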
Pull request overview
Adds the Phase C “multimodal LLM judge” spike artifacts for M17 E5: a repeatable evaluation run (Gemini + OpenAI audio models) with per-model gate outcomes and a report template, unblocking issues #92/#97 without another immediate round of native-speaker listening tests.
Changes:
- Introduces `scripts/m17_phase_c_validation.py` to run a 6-clip × 2-model × 2-rerun structured-scoring experiment and compute four acceptance gates.
- Adds `docs/m17_phase_c_validation_report.md` as the narrative report stub, meant to be filled from spike outputs in `state/spikes/m17_phase_c/`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/m17_phase_c_validation.py | New Phase C spike runner: clip prep (incl. degradations), model calls, cost/budget handling, gate evaluation, and auto-report generation. |
| docs/m17_phase_c_validation_report.md | New report stub describing gates, clip set, reproduction steps, and how to interpret outcomes. |
Comments suppressed due to low confidence (2)
scripts/m17_phase_c_validation.py:223
- The docstring and inline docs refer to “structured JSON responses against a Pydantic schema” and include `pydantic` in the install command, but the script never imports/uses Pydantic; it builds a plain JSON Schema dict. Please either switch to an actual Pydantic model (and derive JSON Schema from it) or update the docs/install instructions to remove the Pydantic reference so the setup instructions match reality.
# --- Pydantic schema ---------------------------------------------------------
def make_response_schema() -> dict:
"""Build the JSON Schema used to constrain both Gemini and OpenAI output."""
return {
"type": "object",
"properties": {
**{d: {"type": "integer", "minimum": 1, "maximum": 5} for d in DIMENSIONS},
"artifacts_detected": {"type": "boolean"},
"artifact_notes": {"type": "string"},
"confidence_in_assessment": {"type": "integer", "minimum": 1, "maximum": 5},
"summary": {"type": "string"},
},
"required": [
*DIMENSIONS,
"artifacts_detected",
"artifact_notes",
"confidence_in_assessment",
"summary",
],
"additionalProperties": False,
}
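If the Pydantic route suggested above were taken, a sketch could look like the following — `JudgeScores` is a hypothetical model name, only two of the eight dimension fields are shown, and this assumes Pydantic v2:

```python
from pydantic import BaseModel, conint

Score = conint(ge=1, le=5)  # 1-5 integer range, mirroring the JSON Schema bounds above

class JudgeScores(BaseModel):
    # One Score field per quality dimension would be declared here; two shown as examples.
    emotional_expression: Score
    overall_quality: Score
    artifacts_detected: bool
    artifact_notes: str
    confidence_in_assessment: Score
    summary: str

# Derive the JSON Schema that currently lives in make_response_schema().
schema = JudgeScores.model_json_schema()
```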
scripts/m17_phase_c_validation.py:544
- Same as Gemini path: if JSON decoding fails, the run is marked refused but `error` is left None. Recording the JSONDecodeError (or at least a sentinel like "invalid_json") would make it much easier to triage schema/prompt issues from true policy refusals.
raw = resp.choices[0].message.content or ""
try:
parsed = json.loads(raw)
refused = False
except json.JSONDecodeError:
parsed = None
refused = True
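A minimal, self-contained version of the suggested fix (the helper name and the `(parsed, refused, error)` return shape are illustrative, not the script's actual API):

```python
import json

def parse_judge_response(raw: str):
    """Return (parsed, refused, error) for one raw model response."""
    try:
        return json.loads(raw), False, None
    except json.JSONDecodeError as exc:
        # Record why parsing failed so triage can tell malformed JSON from a true policy refusal.
        return None, True, f"invalid_json: {exc}"
```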
Comment on lines +845 to +852

```python
if cumulative_usd >= budget_cap:
    print(
        f"ABORT: cumulative ${cumulative_usd:.2f} ≥ budget ${budget_cap:.2f}. "
        "Partial results will still be written.",
        flush=True,
    )
    break
if model == "gemini":
```
Comment on lines +331 to +339

```python
# Per Anthropic/Google/OpenAI pricing snapshots as of January 2026. Update if
# the spike is rerun against newer model versions. These are upper-bound
# estimates: actual cost is metered per-token by each provider and the
# script's running total uses that, not these.
GEMINI_AUDIO_USD_PER_MIN = 0.0125  # gemini-2.5-pro audio input
GEMINI_OUTPUT_USD_PER_KTOK = 0.0050
OPENAI_AUDIO_INPUT_USD_PER_MIN = 0.10  # gpt-4o-audio-preview
OPENAI_OUTPUT_USD_PER_KTOK = 0.020
```
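As a rough illustration of how such constants translate into a pre-flight estimate, a sketch along these lines (the function, its parameters, and the 400-token output guess are hypothetical; it reuses the constants quoted above):

```python
def estimate_run_cost_usd(clip_durations_s, reruns=2, est_out_tokens=400):
    """Pre-flight upper-bound estimate for one full run across both models (sketch)."""
    calls = len(clip_durations_s) * reruns
    audio_min = sum(clip_durations_s) / 60.0 * reruns
    out_ktok = calls * est_out_tokens / 1000.0
    gemini = audio_min * GEMINI_AUDIO_USD_PER_MIN + out_ktok * GEMINI_OUTPUT_USD_PER_KTOK
    openai = audio_min * OPENAI_AUDIO_INPUT_USD_PER_MIN + out_ktok * OPENAI_OUTPUT_USD_PER_KTOK
    return gemini + openai
```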
Comment on lines +720 to +721

```python
for model, g in payload["gates"].items():
    ref = f"{g['refusal_gate']['scored_clips']}/{g['refusal_gate']['min_required']}"
```

```
.venv/bin/python scripts/m17_phase_c_validation.py
.venv/bin/python scripts/m17_phase_c_validation.py --dry-run  # prompts only, no API calls

Cost ceiling (rough): ~$0.50–1.50 total across both models at 6 clips × 2 reruns.
```
Comment on lines +31 to +49

```bash
uv pip install --python .venv/bin/python \
    google-genai openai pydantic soundfile numpy scipy

export GEMINI_API_KEY=...
export OPENAI_API_KEY=...
export SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD=5.0  # hard cap

# Optional dry run — prints the prompt + JSON schema, no API calls, no spend.
.venv/bin/python scripts/m17_phase_c_validation.py --dry-run

# Full run.
.venv/bin/python scripts/m17_phase_c_validation.py
```

The script prepares 6 clip records (4 corpus typology spans + 2 degraded
variants of `sp_sv_a_0001_00`), sends each to each model twice, parses
structured JSON responses against a Pydantic schema, and writes gate
outcomes to `results.json` + an auto-generated markdown summary.
Comment on lines +457 to +464

```python
raw = resp.text or ""
try:
    parsed = json.loads(raw)
    refused = False
except json.JSONDecodeError:
    parsed = None
    refused = True
```
Comment on lines +86 to +88

```python
# ordinal: 4 = best perceived, 1 = worst. Ties are deliberate where evidence
# doesn't separate clips. Used only for the Spearman gate; if you don't have
# a strong prior for a clip, leave it tied with neighbours rather than guess.
```
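To make the Spearman gate concrete, this is roughly how such an ordinal encoding would be consumed. The clip keys and values below are placeholders, not the real manifest:

```python
from scipy.stats import spearmanr

# Hypothetical encoded prior (ties allowed) and judge overall_quality means for the 4 corpus clips.
expected_rank = {"clip_a": 4, "clip_b": 3, "clip_c": 3, "clip_d": 1}
judge_overall = {"clip_a": 4.5, "clip_b": 3.5, "clip_c": 4.0, "clip_d": 2.0}

clips = sorted(expected_rank)
rho, p = spearmanr([expected_rank[c] for c in clips], [judge_overall[c] for c in clips])
print(f"spearman rho={rho:.2f} (gate: rho >= 0.3); p={p:.2f} is not meaningful at n=4")
```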
…ore running
Critical:
- Replace wn_snr_+10db mid-anchor with synth_rate_slow_0.7x — tests synthesis
defect detection (unnatural tempo + pitch-shift + truncated arc) instead of
measuring only whether the model can hear noise
- Add --probe-metadata-bias flag: runs a no_arc prompt variant on corpus clips
to measure whether emotional_expression / escalation_arc are scored from audio
or inferred from the intensity-arc label; delta is reported per-clip per-dim
in both results.json and report_auto.md
High:
- Add failure_reason field to JudgeResult ("ok" / "content_refusal" /
"json_parse_error" / "api_error") so only content refusals count against the
refusal gate; api_error and json_parse_error are retried, not penalised
- Incremental write: results_partial.jsonl is appended after each call;
--resume loads prior results and skips completed (clip, model, run, variant)
tuples — an API failure or budget-cap abort no longer loses all spend
- Drop shay_correlation and variance from overall_pass (advisory only);
add notes: variance is trivially PASS at TEMPERATURE=0 greedy decoding,
Shay rho is not statistically significant at n=4
- Discrimination gate now tests two independent arms — noise_corruption and
synth_failure — reporting both separations; passing either arm clears the gate
Low:
- Fix report stub: gate table columns aligned with report_auto.md output,
hand-written per-run tables removed (replaced with pointers to auto-report),
limitations section updated to reflect all of the above
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
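The "Incremental write" item in the list above roughly corresponds to an append-and-skip JSONL pattern like the following. The file name matches the commit text; the record fields and helper names are assumptions:

```python
import json
from pathlib import Path

PARTIAL = Path("state/spikes/m17_phase_c/results_partial.jsonl")

def load_completed() -> set[tuple]:
    """(clip, model, run, variant) keys already recorded by a previous run (for --resume)."""
    if not PARTIAL.exists():
        return set()
    with PARTIAL.open() as fh:
        return {tuple(json.loads(line)["key"]) for line in fh if line.strip()}

def record_result(key: tuple, result: dict) -> None:
    """Append one call's outcome immediately so a crash or budget abort loses nothing."""
    with PARTIAL.open("a") as fh:
        fh.write(json.dumps({"key": list(key), **result}) + "\n")
```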
…imination gate

Update the gate bullet-list in the module docstring to match what the code actually does after the review-fixes commit: two discrimination arms (noise_corruption comparability + synth_failure real target), advisory notes on variance and Shay-correlation, failure_reason semantics for refusal gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pr-agent-context report: This run includes unresolved review comments on PR #113 in repository https://github.com/DataHackIL/SynthBanshee
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: scripts/m17_phase_c_validation.py
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267579
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
Mid-run budget abort only breaks out of the innermost rerun loop, so the script will continue iterating over subsequent clips/models and repeatedly print the ABORT message without actually stopping the run. Consider breaking out of all loops (e.g., return immediately, raise SystemExit, or use a flag checked at each nesting level) so the budget cap reliably halts further API calls and reduces confusing output.
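One way to make the cap halt everything, per this comment, is to flatten the nesting into a single call plan and raise past it — a sketch under assumed names (`call_plan`, `call_judge`), not the script's actual structure:

```python
class BudgetExceeded(Exception):
    """Raised when cumulative spend crosses the cap; caught once, outside all loops."""

def run_all(call_plan, budget_cap, call_judge):
    """call_plan: iterable of (model, clip, run_idx); call_judge returns the call's USD cost."""
    cumulative_usd = 0.0
    try:
        for model, clip, run_idx in call_plan:
            if cumulative_usd >= budget_cap:
                raise BudgetExceeded
            cumulative_usd += call_judge(model, clip, run_idx)
    except BudgetExceeded:
        print(f"ABORT: cumulative ${cumulative_usd:.2f} >= budget ${budget_cap:.2f}. "
              "Partial results will still be written.", flush=True)
    return cumulative_usd
```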
## COPILOT-2
Location: scripts/m17_phase_c_validation.py:429
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267639
Root author: copilot-pull-request-reviewer
Comment:
The comments claim the “running total uses” provider-metered costs, but usd_cost is computed from hard-coded price constants plus audio duration + output tokens only. This can undercount (e.g., ignores text prompt tokens and any provider-specific audio token accounting), making the mid-run budget cap unreliable. Either (a) compute cost from provider usage fields that correspond to billable units (if available) or (b) clearly label usd_cost/cumulative_usd as an estimate and enforce the budget conservatively (e.g., include prompt tokens, add a safety margin, or stop based on estimated remaining spend).
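A conservative variant along the lines suggested here might fold prompt tokens and a safety margin into the estimate. Sketch only: the function, the 1.2 margin, and the single per-ktok rate for prompt and output tokens are simplifying assumptions, not how either provider actually bills:

```python
SAFETY_MARGIN = 1.2  # arbitrary headroom so the estimate errs high

def estimate_call_cost_usd(audio_min, prompt_tok, out_tok, audio_usd_per_min, tok_usd_per_ktok):
    """Upper-bound cost for one call: audio + prompt + output tokens, padded by a margin."""
    base = audio_min * audio_usd_per_min + (prompt_tok + out_tok) / 1000.0 * tok_usd_per_ktok
    return base * SAFETY_MARGIN
```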
## COPILOT-3
Location: scripts/m17_phase_c_validation.py:957
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267673
Root author: copilot-pull-request-reviewer
Comment:
Refusal gate display uses `{scored_clips}/{min_required}` (e.g., `6/5`), which is confusing because the criterion is “≥ 5/6 clips scored”. It would be clearer to display `{scored_clips}/{len(clips)}` (and optionally keep the threshold separately) so the table communicates the actual denominator.
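i.e., something like the following, shown with example values:

```python
# Display "scored/total" with the threshold alongside, instead of the confusing "6/5".
scored, total, min_required = 6, 6, 5  # example values
ref = f"{scored}/{total} scored (threshold >= {min_required})"
print(ref)  # -> "6/6 scored (threshold >= 5)"
```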
## COPILOT-4
Location: scripts/m17_phase_c_validation.py:49
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267704
Root author: copilot-pull-request-reviewer
Comment:
The top-level docstring’s cost estimate (“~$0.50–1.50 total across both models”) appears inconsistent with the script’s own pricing constants (OpenAI audio input alone is $0.10/min, so ~25+ minutes of audio × reruns can exceed that). Please update the stated range (or remove it) so it doesn’t mislead someone about expected spend.
This issue also appears on line 202 of the same file.
## COPILOT-5
Location: docs/m17_phase_c_validation_report.md:57
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267738
Root author: copilot-pull-request-reviewer
Comment:
This report (and the reproduce install command) says the script “parses structured JSON responses against a Pydantic schema” and lists `pydantic` as a dependency, but `scripts/m17_phase_c_validation.py` currently uses a hand-built JSON Schema dict and does not import Pydantic. Update the report to match the implementation (or update the script to actually use Pydantic-derived schema) so the reproduction steps are accurate.
## COPILOT-6
Location: scripts/m17_phase_c_validation.py:610
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267759
Root author: copilot-pull-request-reviewer
Comment:
On JSON parse failure, the result is marked refused but `error` remains None, which makes it harder to debug whether the model refused vs. returned malformed JSON vs. a truncation issue. Consider setting `error` (e.g., to the JSONDecodeError message) when decoding fails so partial runs have actionable diagnostics.
This issue also appears on line 537 of the same file.
## COPILOT-7
Location: scripts/m17_phase_c_validation.py:100
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267787
Root author: copilot-pull-request-reviewer
Comment:
The comment says `expected_quality_rank` is an ordinal where 4=best and 1=worst, but the degraded severe clip is assigned rank 0. Either adjust the encoding comment (e.g., allow 0 as “worse than worst corpus”) or keep the ranks within the documented 1–4 range to avoid confusion when interpreting `results.json` / manifest tables.
Comment on lines +49 to +50

```
Cost ceiling (rough): ~$0.50–1.50 total across both models at 6 clips × 2 reruns.
The hard cap defaults to $5; raise/lower via SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD.
```
Comment on lines +611 to +617

```python
usage = getattr(resp, "usage_metadata", None)
in_tok = getattr(usage, "prompt_token_count", None) or 0
out_tok = getattr(usage, "candidates_token_count", None) or 0
cost = (clip.duration_s / 60.0) * GEMINI_AUDIO_USD_PER_MIN + (
    out_tok / 1000.0
) * GEMINI_OUTPUT_USD_PER_KTOK
```
Comment on lines +706 to +708

```python
cost = (clip.duration_s / 60.0) * OPENAI_AUDIO_INPUT_USD_PER_MIN + (
    out_tok / 1000.0
) * OPENAI_OUTPUT_USD_PER_KTOK
```
Comment on lines +784 to +788

```python
scored_clips = {
    r.clip_label for r in runs if r.failure_reason not in ("content_refusal",) and r.parsed
}
content_refusals = [r for r in runs if r.failure_reason == "content_refusal"]
refusal_pass = len(scored_clips) >= REFUSAL_GATE_MIN_SCORED
```
```python
]
print(f"call plan: {len(call_plan)} total, {len(pending)} pending", flush=True)

# --- Run -----------------------------------------------------------------
```
```
Setup:

    uv pip install --python .venv/bin/python \\
        google-genai openai pydantic soundfile numpy scipy
```
```bash
uv pip install --python .venv/bin/python \
    google-genai openai pydantic soundfile numpy scipy
```
Problem
Issues #92 and #97 are both blocked on the same input: a native-speaker listening test. The May-3 and May-6 sessions already consumed Shay's listening time on overlapping symptoms and we still don't have a closed loop on the I3-I5 distress-cue question or the broader naturalness backlog. Every TTS-side lever (mstts:express-as styles, disfluency density, per-phrase prosody jitter, Google Chirp VIC, …) currently requires a fresh listening test to validate — which means most of them stay unvalidated.
The eval-loop gap is well-scoped in `docs/automated_eval_design.md` §E5 (Multimodal LLM Judge). Phase A landed E1 (ASR) and blocked E2 (UTMOS) in `docs/m17_phase_a_validation_report.md`. E5 has been done once, manually, on a single clip (`docs/debug_run_1/llm_feedbacks.md`, 2026-04-14) — never as a repeatable workflow.

What this PR does
Lands the Phase C spike artifacts ready for execution. Same shape as the Phase A spike that produced `docs/m17_phase_a_validation_report.md`:

- `scripts/m17_phase_c_validation.py` — the spike runner. 6 clips × 2 models (Gemini 2.5 Pro + GPT-4o-audio-preview) × 2 reruns. Structured JSON output via a fixed 8-dimension schema. Four independent gates per model, with Gemini safety set to BLOCK_NONE for HARM_CATEGORY_DANGEROUS_CONTENT etc. per §E5 mitigation. The Shay-correlation gate compares overall_quality against Shay's encoded expected ranking on the 4 corpus clips, encoded from `docs/audio_quality_feedback.md`, `docs/debug_run_1/llm_feedbacks.md`, and the priors in issues #92 ("tts: aggregate Hebrew TTS naturalness backlog from 2026-05-06 listening test") and #97 ("TTS distress cue absent at I3–I5: rate + pitch are not sufficient signal").
- `docs/m17_phase_c_validation_report.md` — narrative report stub. All numeric cells marked TBD; recommendation matrix pre-filled per gate-outcome scenario so the go/no-go is mechanical once the script runs.

Clip set
- sp_sv_a_0001_00
- sp_it_a_0001_00
- sp_neg_a_0001_00 (has_violence: false)
- sp_neu_a_0001_00
- sp_sv_a_0001_00__wn_snr_+10db
- sp_sv_a_0001_00__wn_snr_-10db

Degradations use the same `apply_white_noise` + `rms_normalize_to_match` helpers as Phase A so the spectral content (not loudness) is what the judge has to discriminate; a rough sketch of those helpers follows below.
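Minimal numpy sketch of noise-at-SNR plus RMS matching. The names mirror the helpers mentioned above, but these signatures and bodies are assumptions for illustration, not the Phase A implementations:

```python
import numpy as np

def apply_white_noise(clean: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white noise scaled so the mix sits at the requested SNR relative to the clean signal."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(clean))
    clean_rms = np.sqrt(np.mean(clean ** 2))
    target_noise_rms = clean_rms / (10 ** (snr_db / 20))
    return clean + noise * (target_noise_rms / np.sqrt(np.mean(noise ** 2)))

def rms_normalize_to_match(degraded: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rescale the degraded clip so its RMS equals the clean reference's RMS (loudness-matched)."""
    ref_rms = np.sqrt(np.mean(reference ** 2))
    deg_rms = np.sqrt(np.mean(degraded ** 2))
    return degraded * (ref_rms / deg_rms)
```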
Pre-flight cost

Hard cap: `SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD = $5` (env var, configurable). Script aborts pre-flight if the estimate exceeds the cap and mid-run if cumulative actual spend exceeds it.

What happens after this lands
Run the spike: set `GEMINI_API_KEY` + `OPENAI_API_KEY` in `.envrc`. Single execution, ~$3 of spend, ~30 min wall-clock. Fills in the TBD cells in `docs/m17_phase_c_validation_report.md` and writes `state/spikes/m17_phase_c/results.json`. If the gates pass, follow-on work is the `synthbanshee/eval/llm_judge.py` module, an anchor-set regression CI workflow, and calibration against Shay's listening tests.

Test plan
- `--dry-run` mode prints the prompt + JSON schema without calling any API — verified locally; output reviewed for clarity and content-safety framing
- Degraded-variant WAVs are written to `state/spikes/m17_phase_c/sp_sv_a_0001_00__wn_snr_*.wav` under the gitignored spike dir
- `ruff check` + `ruff format --check` pass on `scripts/m17_phase_c_validation.py`
- `mypy scripts/m17_phase_c_validation.py` passes
GEMINI_API_KEY/OPENAI_API_KEYto.envrc— defer until you've decided to run the spendsynthbanshee/eval/llm_judge.pymodule — Phase C MVP, conditional on this spike passingdocs/implementation_plan.mdto reflect Phase C kickoff — handle separately after the spike runs