spike(eval): M17 Phase C — multimodal LLM judge gates (script + report stub) #113
Merged
Conversation
…t stub)

Mirrors the Phase A spike protocol (scripts/m17_phase_a_validation.py + docs/m17_phase_a_validation_report.md) for E5 — the Multimodal LLM Judge evaluator from docs/automated_eval_design.md §E5. This unlocks the eval loop that issues #92 and #97 are currently blocked on (both require native-speaker listening, which doesn't scale).

Script (scripts/m17_phase_c_validation.py):
- 6-clip set: 4 corpus typology spans (SV/IT/NEG/NEU, all M2a-wettest agg_m_30-45_001) + 2 degraded variants of sp_sv_a_0001_00 (white noise at +10 dB and -10 dB SNR, RMS-matched to the clean source so the judge is scoring spectral content, not amplitude).
- Two models: gemini-2.5-pro (audio input) and gpt-4o-audio-preview.
- Two reruns per (clip, model) for within-model variance measurement.
- Structured JSON output via a fixed schema (8 quality dimensions on a 1-5 scale + artifacts_detected + confidence_in_assessment + summary).
- Four gate outcomes per model (see the sketch after this commit message):
  - refusal_gate: ≥ 5/6 clips scored under DV-research framing
  - discrimination_gate: mean(corpus) − mean(severe-degraded) ≥ 0.5
  - variance_gate: per-dim std across 2 reruns ≤ 0.5
  - shay_correlation_gate: Spearman ρ ≥ 0.3 vs encoded expected ranking
- Gemini BLOCK_NONE safety per design doc §E5 content-sensitivity guidance (research framing only — DV-content tolerance is required to avoid refusals on the metadata; the audio is entirely synthetic and contains no real persons).
- Hard budget cap via SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD (default $5); aborts pre-flight on estimate, aborts mid-run on cumulative actual.
- --dry-run flag prints the prompt + schema without calling any API, for pre-flight prompt audit.
- Lazy imports for google-genai / openai so the module is cheap to import even without the deps installed.

Report stub (docs/m17_phase_c_validation_report.md):
- Same structure as the Phase A report (TL;DR gate table, Reproduce, Clip-set manifest, Gate definitions, Per-model results, Failure-mode notes, Recommendation matrix, Limitations, Cost report).
- All numeric cells marked TBD pending real-run results.
- Recommendation matrix is pre-filled per gate-outcome scenario so the go/no-go decision is mechanical once the script runs.

Cost estimate at default settings: ~$3.05 total ($0.35 Gemini + $2.70 GPT-4o), well within the $5 cap. Dry-run verified end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
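For readers skimming the gate definitions above, a minimal sketch of how the four checks could be computed from per-run scores. This is illustrative only: the record layout (`runs` as dicts with `clip`, `dims`, `overall_quality`, `refused`) and the function name are assumptions, not the script's actual structures; only the thresholds come from the commit message.

```python
from statistics import mean, pstdev
from scipy.stats import spearmanr

def evaluate_gates(runs, corpus_clips, severe_degraded_clips, expected_rank):
    """runs: list of dicts {clip, dims: {name: 1-5}, overall_quality, refused} (hypothetical layout)."""
    # Refusal gate: at least 5 of the 6 clips received a scored (non-refused) response.
    scored = {r["clip"] for r in runs if not r["refused"]}
    refusal_pass = len(scored) >= 5

    def clip_mean(clips):
        vals = [r["overall_quality"] for r in runs if r["clip"] in clips and not r["refused"]]
        return mean(vals) if vals else float("nan")

    # Discrimination gate: corpus clips must outscore the severely degraded variant by >= 0.5.
    discrimination_pass = clip_mean(corpus_clips) - clip_mean(severe_degraded_clips) >= 0.5

    # Variance gate: per-dimension std across the 2 reruns of each clip must stay <= 0.5.
    stds = []
    for clip in scored:
        reruns = [r for r in runs if r["clip"] == clip and not r["refused"]]
        for dim in reruns[0]["dims"]:
            stds.append(pstdev([r["dims"][dim] for r in reruns]))
    variance_pass = all(s <= 0.5 for s in stds)

    # Shay-correlation gate: Spearman rho >= 0.3 between judge means and the encoded expected ranking.
    clips = sorted(corpus_clips)
    rho, _ = spearmanr([clip_mean({c}) for c in clips], [expected_rank[c] for c in clips])
    shay_pass = rho >= 0.3

    return {"refusal": refusal_pass, "discrimination": discrimination_pass,
            "variance": variance_pass, "shay_correlation": shay_pass}
```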
Pull request overview
Adds the Phase C “multimodal LLM judge” spike artifacts for M17 E5: a repeatable evaluation run (Gemini + OpenAI audio models) with per-model gate outcomes and a report template, unblocking issues #92/#97 without another immediate round of native-speaker listening tests.
Changes:
- Introduces `scripts/m17_phase_c_validation.py` to run a 6-clip × 2-model × 2-rerun structured-scoring experiment and compute four acceptance gates.
- Adds `docs/m17_phase_c_validation_report.md` as the narrative report stub, meant to be filled from spike outputs in `state/spikes/m17_phase_c/`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/m17_phase_c_validation.py | New Phase C spike runner: clip prep (incl. degradations), model calls, cost/budget handling, gate evaluation, and auto-report generation. |
| docs/m17_phase_c_validation_report.md | New report stub describing gates, clip set, reproduction steps, and how to interpret outcomes. |
Comments suppressed due to low confidence (2)
scripts/m17_phase_c_validation.py:223
- The docstring and inline docs refer to “structured JSON responses against a Pydantic schema” and include `pydantic` in the install command, but the script never imports/uses Pydantic; it builds a plain JSON Schema dict. Please either switch to an actual Pydantic model (and derive JSON Schema from it) or update the docs/install instructions to remove the Pydantic reference so the setup instructions match reality.
# --- Pydantic schema ---------------------------------------------------------
def make_response_schema() -> dict:
"""Build the JSON Schema used to constrain both Gemini and OpenAI output."""
return {
"type": "object",
"properties": {
**{d: {"type": "integer", "minimum": 1, "maximum": 5} for d in DIMENSIONS},
"artifacts_detected": {"type": "boolean"},
"artifact_notes": {"type": "string"},
"confidence_in_assessment": {"type": "integer", "minimum": 1, "maximum": 5},
"summary": {"type": "string"},
},
"required": [
*DIMENSIONS,
"artifacts_detected",
"artifact_notes",
"confidence_in_assessment",
"summary",
],
"additionalProperties": False,
}
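If the Pydantic route suggested above were taken, a sketch could look like the following — `JudgeScores` is a hypothetical model name, only two of the eight dimension fields are shown, and this assumes Pydantic v2:

```python
from pydantic import BaseModel, conint

Score = conint(ge=1, le=5)  # 1-5 integer range, mirroring the JSON Schema bounds above

class JudgeScores(BaseModel):
    # One Score field per quality dimension would be declared here; two shown as examples.
    emotional_expression: Score
    overall_quality: Score
    artifacts_detected: bool
    artifact_notes: str
    confidence_in_assessment: Score
    summary: str

# Derive the JSON Schema that currently lives in make_response_schema().
schema = JudgeScores.model_json_schema()
```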
scripts/m17_phase_c_validation.py:544
- Same as Gemini path: if JSON decoding fails, the run is marked refused but `error` is left None. Recording the JSONDecodeError (or at least a sentinel like "invalid_json") would make it much easier to triage schema/prompt issues from true policy refusals.
raw = resp.choices[0].message.content or ""
try:
parsed = json.loads(raw)
refused = False
except json.JSONDecodeError:
parsed = None
refused = True
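A minimal, self-contained version of the suggested fix (the helper name and the `(parsed, refused, error)` return shape are illustrative, not the script's actual API):

```python
import json

def parse_judge_response(raw: str):
    """Return (parsed, refused, error) for one raw model response."""
    try:
        return json.loads(raw), False, None
    except json.JSONDecodeError as exc:
        # Record why parsing failed so triage can tell malformed JSON from a true policy refusal.
        return None, True, f"invalid_json: {exc}"
```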
Comment on lines +845 to +852

```python
if cumulative_usd >= budget_cap:
    print(
        f"ABORT: cumulative ${cumulative_usd:.2f} ≥ budget ${budget_cap:.2f}. "
        "Partial results will still be written.",
        flush=True,
    )
    break
if model == "gemini":
```
Comment on lines +331 to +339

```python
# Per Anthropic/Google/OpenAI pricing snapshots as of January 2026. Update if
# the spike is rerun against newer model versions. These are upper-bound
# estimates: actual cost is metered per-token by each provider and the
# script's running total uses that, not these.
GEMINI_AUDIO_USD_PER_MIN = 0.0125  # gemini-2.5-pro audio input
GEMINI_OUTPUT_USD_PER_KTOK = 0.0050
OPENAI_AUDIO_INPUT_USD_PER_MIN = 0.10  # gpt-4o-audio-preview
OPENAI_OUTPUT_USD_PER_KTOK = 0.020
```
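As a rough illustration of how such constants translate into a pre-flight estimate, a sketch along these lines (the function, its parameters, and the 400-token output guess are hypothetical; it reuses the constants quoted above):

```python
def estimate_run_cost_usd(clip_durations_s, reruns=2, est_out_tokens=400):
    """Pre-flight upper-bound estimate for one full run across both models (sketch)."""
    calls = len(clip_durations_s) * reruns
    audio_min = sum(clip_durations_s) / 60.0 * reruns
    out_ktok = calls * est_out_tokens / 1000.0
    gemini = audio_min * GEMINI_AUDIO_USD_PER_MIN + out_ktok * GEMINI_OUTPUT_USD_PER_KTOK
    openai = audio_min * OPENAI_AUDIO_INPUT_USD_PER_MIN + out_ktok * OPENAI_OUTPUT_USD_PER_KTOK
    return gemini + openai
```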
Comment on lines +720 to +721

```python
for model, g in payload["gates"].items():
    ref = f"{g['refusal_gate']['scored_clips']}/{g['refusal_gate']['min_required']}"
```

```
.venv/bin/python scripts/m17_phase_c_validation.py
.venv/bin/python scripts/m17_phase_c_validation.py --dry-run  # prompts only, no API calls

Cost ceiling (rough): ~$0.50–1.50 total across both models at 6 clips × 2 reruns.
```
Comment on lines +31 to +49

```bash
uv pip install --python .venv/bin/python \
    google-genai openai pydantic soundfile numpy scipy

export GEMINI_API_KEY=...
export OPENAI_API_KEY=...
export SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD=5.0  # hard cap

# Optional dry run — prints the prompt + JSON schema, no API calls, no spend.
.venv/bin/python scripts/m17_phase_c_validation.py --dry-run

# Full run.
.venv/bin/python scripts/m17_phase_c_validation.py
```

The script prepares 6 clip records (4 corpus typology spans + 2 degraded
variants of `sp_sv_a_0001_00`), sends each to each model twice, parses
structured JSON responses against a Pydantic schema, and writes gate
outcomes to `results.json` + an auto-generated markdown summary.
Comment on lines +457 to +464

```python
raw = resp.text or ""
try:
    parsed = json.loads(raw)
    refused = False
except json.JSONDecodeError:
    parsed = None
    refused = True
```
Comment on lines +86 to +88

```python
# ordinal: 4 = best perceived, 1 = worst. Ties are deliberate where evidence
# doesn't separate clips. Used only for the Spearman gate; if you don't have
# a strong prior for a clip, leave it tied with neighbours rather than guess.
```
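To make the Spearman gate concrete, this is roughly how such an ordinal encoding would be consumed. The clip keys and values below are placeholders, not the real manifest:

```python
from scipy.stats import spearmanr

# Hypothetical encoded prior (ties allowed) and judge overall_quality means for the 4 corpus clips.
expected_rank = {"clip_a": 4, "clip_b": 3, "clip_c": 3, "clip_d": 1}
judge_overall = {"clip_a": 4.5, "clip_b": 3.5, "clip_c": 4.0, "clip_d": 2.0}

clips = sorted(expected_rank)
rho, p = spearmanr([expected_rank[c] for c in clips], [judge_overall[c] for c in clips])
print(f"spearman rho={rho:.2f} (gate: rho >= 0.3); p={p:.2f} is not meaningful at n=4")
```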
…ore running
Critical:
- Replace wn_snr_+10db mid-anchor with synth_rate_slow_0.7x — tests synthesis
defect detection (unnatural tempo + pitch-shift + truncated arc) instead of
measuring only whether the model can hear noise
- Add --probe-metadata-bias flag: runs a no_arc prompt variant on corpus clips
to measure whether emotional_expression / escalation_arc are scored from audio
or inferred from the intensity-arc label; delta is reported per-clip per-dim
in both results.json and report_auto.md
High:
- Add failure_reason field to JudgeResult ("ok" / "content_refusal" /
"json_parse_error" / "api_error") so only content refusals count against the
refusal gate; api_error and json_parse_error are retried, not penalised
- Incremental write: results_partial.jsonl is appended after each call;
--resume loads prior results and skips completed (clip, model, run, variant)
tuples — an API failure or budget-cap abort no longer loses all spend
- Drop shay_correlation and variance from overall_pass (advisory only);
add notes: variance is trivially PASS at TEMPERATURE=0 greedy decoding,
Shay rho is not statistically significant at n=4
- Discrimination gate now tests two independent arms — noise_corruption and
synth_failure — reporting both separations; passing either arm clears the gate
Low:
- Fix report stub: gate table columns aligned with report_auto.md output,
hand-written per-run tables removed (replaced with pointers to auto-report),
limitations section updated to reflect all of the above
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
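The "Incremental write" item in the list above roughly corresponds to an append-and-skip JSONL pattern like the following. The file name matches the commit text; the record fields and helper names are assumptions:

```python
import json
from pathlib import Path

PARTIAL = Path("state/spikes/m17_phase_c/results_partial.jsonl")

def load_completed() -> set[tuple]:
    """(clip, model, run, variant) keys already recorded by a previous run (for --resume)."""
    if not PARTIAL.exists():
        return set()
    with PARTIAL.open() as fh:
        return {tuple(json.loads(line)["key"]) for line in fh if line.strip()}

def record_result(key: tuple, result: dict) -> None:
    """Append one call's outcome immediately so a crash or budget abort loses nothing."""
    with PARTIAL.open("a") as fh:
        fh.write(json.dumps({"key": list(key), **result}) + "\n")
```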
…imination gate

Update the gate bullet-list in the module docstring to match what the code actually does after the review-fixes commit: two discrimination arms (noise_corruption comparability + synth_failure real target), advisory notes on variance and Shay-correlation, failure_reason semantics for refusal gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pr-agent-context report: This run includes unresolved review comments on PR #113 in repository https://github.com/DataHackIL/SynthBanshee
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: scripts/m17_phase_c_validation.py
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267579
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
Mid-run budget abort only breaks out of the innermost rerun loop, so the script will continue iterating over subsequent clips/models and repeatedly print the ABORT message without actually stopping the run. Consider breaking out of all loops (e.g., return immediately, raise SystemExit, or use a flag checked at each nesting level) so the budget cap reliably halts further API calls and reduces confusing output.
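One way to make the cap halt everything, per this comment, is to flatten the nesting into a single call plan and raise past it — a sketch under assumed names (`call_plan`, `call_judge`), not the script's actual structure:

```python
class BudgetExceeded(Exception):
    """Raised when cumulative spend crosses the cap; caught once, outside all loops."""

def run_all(call_plan, budget_cap, call_judge):
    """call_plan: iterable of (model, clip, run_idx); call_judge returns the call's USD cost."""
    cumulative_usd = 0.0
    try:
        for model, clip, run_idx in call_plan:
            if cumulative_usd >= budget_cap:
                raise BudgetExceeded
            cumulative_usd += call_judge(model, clip, run_idx)
    except BudgetExceeded:
        print(f"ABORT: cumulative ${cumulative_usd:.2f} >= budget ${budget_cap:.2f}. "
              "Partial results will still be written.", flush=True)
    return cumulative_usd
```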
## COPILOT-2
Location: scripts/m17_phase_c_validation.py:429
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267639
Root author: copilot-pull-request-reviewer
Comment:
The comments claim the “running total uses” provider-metered costs, but usd_cost is computed from hard-coded price constants plus audio duration + output tokens only. This can undercount (e.g., ignores text prompt tokens and any provider-specific audio token accounting), making the mid-run budget cap unreliable. Either (a) compute cost from provider usage fields that correspond to billable units (if available) or (b) clearly label usd_cost/cumulative_usd as an estimate and enforce the budget conservatively (e.g., include prompt tokens, add a safety margin, or stop based on estimated remaining spend).
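A conservative variant along the lines suggested here might fold prompt tokens and a safety margin into the estimate. Sketch only: the function, the 1.2 margin, and the single per-ktok rate for prompt and output tokens are simplifying assumptions, not how either provider actually bills:

```python
SAFETY_MARGIN = 1.2  # arbitrary headroom so the estimate errs high

def estimate_call_cost_usd(audio_min, prompt_tok, out_tok, audio_usd_per_min, tok_usd_per_ktok):
    """Upper-bound cost for one call: audio + prompt + output tokens, padded by a margin."""
    base = audio_min * audio_usd_per_min + (prompt_tok + out_tok) / 1000.0 * tok_usd_per_ktok
    return base * SAFETY_MARGIN
```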
## COPILOT-3
Location: scripts/m17_phase_c_validation.py:957
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267673
Root author: copilot-pull-request-reviewer
Comment:
Refusal gate display uses `{scored_clips}/{min_required}` (e.g., `6/5`), which is confusing because the criterion is “≥ 5/6 clips scored”. It would be clearer to display `{scored_clips}/{len(clips)}` (and optionally keep the threshold separately) so the table communicates the actual denominator.
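i.e., something like the following, shown with example values:

```python
# Display "scored/total" with the threshold alongside, instead of the confusing "6/5".
scored, total, min_required = 6, 6, 5  # example values
ref = f"{scored}/{total} scored (threshold >= {min_required})"
print(ref)  # -> "6/6 scored (threshold >= 5)"
```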
## COPILOT-4
Location: scripts/m17_phase_c_validation.py:49
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267704
Root author: copilot-pull-request-reviewer
Comment:
The top-level docstring’s cost estimate (“~$0.50–1.50 total across both models”) appears inconsistent with the script’s own pricing constants (OpenAI audio input alone is $0.10/min, so ~25+ minutes of audio × reruns can exceed that). Please update the stated range (or remove it) so it doesn’t mislead someone about expected spend.
This issue also appears on line 202 of the same file.
## COPILOT-5
Location: docs/m17_phase_c_validation_report.md:57
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267738
Root author: copilot-pull-request-reviewer
Comment:
This report (and the reproduce install command) says the script “parses structured JSON responses against a Pydantic schema” and lists `pydantic` as a dependency, but `scripts/m17_phase_c_validation.py` currently uses a hand-built JSON Schema dict and does not import Pydantic. Update the report to match the implementation (or update the script to actually use Pydantic-derived schema) so the reproduction steps are accurate.
## COPILOT-6
Location: scripts/m17_phase_c_validation.py:610
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267759
Root author: copilot-pull-request-reviewer
Comment:
On JSON parse failure, the result is marked refused but `error` remains None, which makes it harder to debug whether the model refused vs. returned malformed JSON vs. a truncation issue. Consider setting `error` (e.g., to the JSONDecodeError message) when decoding fails so partial runs have actionable diagnostics.
This issue also appears on line 537 of the same file.
## COPILOT-7
Location: scripts/m17_phase_c_validation.py:100
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267787
Root author: copilot-pull-request-reviewer
Comment:
The comment says `expected_quality_rank` is an ordinal where 4=best and 1=worst, but the degraded severe clip is assigned rank 0. Either adjust the encoding comment (e.g., allow 0 as “worse than worst corpus”) or keep the ranks within the documented 1–4 range to avoid confusion when interpreting `results.json` / manifest tables.
Comment on lines +49 to +50

```
Cost ceiling (rough): ~$0.50–1.50 total across both models at 6 clips × 2 reruns.
The hard cap defaults to $5; raise/lower via SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD.
```
Comment on lines +611 to +617

```python
usage = getattr(resp, "usage_metadata", None)
in_tok = getattr(usage, "prompt_token_count", None) or 0
out_tok = getattr(usage, "candidates_token_count", None) or 0
cost = (clip.duration_s / 60.0) * GEMINI_AUDIO_USD_PER_MIN + (
    out_tok / 1000.0
) * GEMINI_OUTPUT_USD_PER_KTOK
```
Comment on lines +706 to +708

```python
cost = (clip.duration_s / 60.0) * OPENAI_AUDIO_INPUT_USD_PER_MIN + (
    out_tok / 1000.0
) * OPENAI_OUTPUT_USD_PER_KTOK
```
Comment on lines +784 to +788

```python
scored_clips = {
    r.clip_label for r in runs if r.failure_reason not in ("content_refusal",) and r.parsed
}
content_refusals = [r for r in runs if r.failure_reason == "content_refusal"]
refusal_pass = len(scored_clips) >= REFUSAL_GATE_MIN_SCORED
```
```python
]
print(f"call plan: {len(call_plan)} total, {len(pending)} pending", flush=True)

# --- Run -----------------------------------------------------------------
```
```
Setup:

    uv pip install --python .venv/bin/python \\
        google-genai openai pydantic soundfile numpy scipy
```
```bash
uv pip install --python .venv/bin/python \
    google-genai openai pydantic soundfile numpy scipy
```
Problem
Issues #92 and #97 are both blocked on the same input: a native-speaker listening test. The May-3 and May-6 sessions already consumed Shay's listening time on overlapping symptoms and we still don't have a closed loop on the I3-I5 distress-cue question or the broader naturalness backlog. Every TTS-side lever (mstts:express-as styles, disfluency density, per-phrase prosody jitter, Google Chirp VIC, …) currently requires a fresh listening test to validate — which means most of them stay unvalidated.
The eval-loop gap is well-scoped in `docs/automated_eval_design.md` §E5 (Multimodal LLM Judge). Phase A landed E1 (ASR) and blocked E2 (UTMOS) in `docs/m17_phase_a_validation_report.md`. E5 has been done once, manually, on a single clip (`docs/debug_run_1/llm_feedbacks.md`, 2026-04-14) — never as a repeatable workflow.

What this PR does
Lands the Phase C spike artifacts ready for execution. Same shape as the Phase A spike that produced `docs/m17_phase_a_validation_report.md`:

- `scripts/m17_phase_c_validation.py` — the spike runner. 6 clips × 2 models (Gemini 2.5 Pro + GPT-4o-audio-preview) × 2 reruns. Structured JSON output via a fixed 8-dimension schema. Four independent gates per model, with Gemini safety set to BLOCK_NONE for HARM_CATEGORY_DANGEROUS_CONTENT etc. per §E5 mitigation. The Shay-correlation gate compares overall_quality against Shay's encoded expected ranking on the 4 corpus clips, encoded from `docs/audio_quality_feedback.md`, `docs/debug_run_1/llm_feedbacks.md`, and the priors in issues #92 ("tts: aggregate Hebrew TTS naturalness backlog from 2026-05-06 listening test") and #97 ("TTS distress cue absent at I3–I5: rate + pitch are not sufficient signal").
- `docs/m17_phase_c_validation_report.md` — narrative report stub. All numeric cells marked TBD; recommendation matrix pre-filled per gate-outcome scenario so the go/no-go is mechanical once the script runs.

Clip set
- sp_sv_a_0001_00
- sp_it_a_0001_00
- sp_neg_a_0001_00 (has_violence: false)
- sp_neu_a_0001_00
- sp_sv_a_0001_00__wn_snr_+10db
- sp_sv_a_0001_00__wn_snr_-10db

Degradations use the same `apply_white_noise` + `rms_normalize_to_match` helpers as Phase A so the spectral content (not loudness) is what the judge has to discriminate; a rough sketch of those helpers follows below.
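Minimal numpy sketch of noise-at-SNR plus RMS matching. The names mirror the helpers mentioned above, but these signatures and bodies are assumptions for illustration, not the Phase A implementations:

```python
import numpy as np

def apply_white_noise(clean: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white noise scaled so the mix sits at the requested SNR relative to the clean signal."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(clean))
    clean_rms = np.sqrt(np.mean(clean ** 2))
    target_noise_rms = clean_rms / (10 ** (snr_db / 20))
    return clean + noise * (target_noise_rms / np.sqrt(np.mean(noise ** 2)))

def rms_normalize_to_match(degraded: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rescale the degraded clip so its RMS equals the clean reference's RMS (loudness-matched)."""
    ref_rms = np.sqrt(np.mean(reference ** 2))
    deg_rms = np.sqrt(np.mean(degraded ** 2))
    return degraded * (ref_rms / deg_rms)
```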
Pre-flight cost

Hard cap: `SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD = $5` (env var, configurable). Script aborts pre-flight if the estimate exceeds the cap and mid-run if cumulative actual spend exceeds it.

What happens after this lands
Run the spike: set `GEMINI_API_KEY` + `OPENAI_API_KEY` in `.envrc`. Single execution, ~$3 of spend, ~30 min wall-clock. Fills in the TBD cells in `docs/m17_phase_c_validation_report.md` and writes `state/spikes/m17_phase_c/results.json`. If the gates pass, follow-on work is the `synthbanshee/eval/llm_judge.py` module, an anchor-set regression CI workflow, and calibration against Shay's listening tests.

Test plan
- `--dry-run` mode prints the prompt + JSON schema without calling any API — verified locally; output reviewed for clarity and content-safety framing
- Degraded-variant WAVs are written to `state/spikes/m17_phase_c/sp_sv_a_0001_00__wn_snr_*.wav` under the gitignored spike dir
- `ruff check` + `ruff format --check` pass on `scripts/m17_phase_c_validation.py`
- `mypy scripts/m17_phase_c_validation.py` passes
GEMINI_API_KEY/OPENAI_API_KEYto.envrc— defer until you've decided to run the spendsynthbanshee/eval/llm_judge.pymodule — Phase C MVP, conditional on this spike passingdocs/implementation_plan.mdto reflect Phase C kickoff — handle separately after the spike runs