fix(eval): #M17 spike — paired UTMOS comparison + stable doc refs by shaypal5 · Pull Request #79 · DataHackIL/SynthBanshee

shaypal5 · 2026-05-05T10:55:51Z

Summary

Tiny follow-up to PR #77 (M17 Phase A validation spike). Addresses 4 of the 5 unresolved Copilot threads that landed post-merge on that PR (3 are duplicates of the same underlying issue).

Changes

File	Purpose
`scripts/m17_phase_a_validation.py`	Compute `sample_clean_mean` over the degradation-sample clips and use it for `primary_separation` / `any_separation_passes` instead of the all-clip `clean_mean`. Expose both means + the sample clip IDs in the JSON. Auto-template uses paired comparison too.
`docs/m17_phase_a_validation_report.md`	Replace `lines 418–422` reference with `§"Acceptance Criteria"` for stability against design-doc edits.

Why

The canonical report tables in PR #77 already used the paired same-5 comparison (so its published conclusions are unaffected), but the script's JSON output + auto-template computed primary_separation = clean_mean − degradation_mean where clean_mean was over all 10 clips and degradation_mean over only 5. That's an unpaired comparison and yields a different number from the one the report cites.

Three Copilot threads on the merged PR flagged this as the same issue; this PR resolves all three.

Verification

Recomputed paired separations from the merged JSON match what the canonical report publishes within rounding:

Degradation	Report cites	Now in JSON
white noise −20 dB SNR	+0.067	+0.067
white noise −10 dB SNR	+0.135	+0.133
white noise 0 dB SNR	+0.137	+0.136
white noise +10 dB SNR	+0.208	+0.207
2 kHz lowpass	+0.153	+0.152

Differences are <= 0.002 and explained by the canonical report rounding to 3 decimals at a different intermediate step. No conclusion in PR #77 changes.

What this PR is NOT

Not addressing the 4th unresolved Copilot thread (torch.hub.load(trust_repo=True)). That's the fourth recurrence of the same supply-chain concern; it was already pinned to an immutable commit + acknowledged in the report's Limitations §8 in PR spike(eval): M17 Phase A validation — Whisper + UTMOS on Hebrew clips #77. Adding an opt-in env-var gate would be pure ceremony for a one-shot spike script.

Test plan

pytest -q clean — 1707 passed, 1 skipped (unrelated).
Script compiles + ruff clean.
Paired separations from existing JSON match canonical report numbers within rounding.

🤖 Generated with Claude Code

Tiny follow-up to PR #77. Addresses two of the five post-merge Copilot threads on that PR; the third (torch.hub.load trust_repo=True) is a fourth recurrence of an already-treated concern (pinned to immutable commit + documented in report Limitations §8) and is intentionally not re-addressed here. UTMOS sample-population mismatch (3 of the 5 threads, all the same issue): The script computed `clean_mean` over all 10 clips but each degradation's `mean_utmos` over only the first 5 (DEGRADATION_SAMPLE_N). The resulting `primary_separation` and `any_separation_passes` fields in results.json were therefore comparing different populations. The canonical PR #77 report tables already used the paired same-5 comparison (so the shipped conclusions are unaffected), but the script's emitted JSON + auto-template were inconsistent with the report. Now: introduce `sample_clean_mean` (mean over the sample clips only), use it for the gate computations, and expose both `clean_mean` and `sample_clean_mean` in the JSON along with the new `degradation_sample_clip_ids` list so a reader can verify the populations match. Auto-template report uses the sample mean for the clean − deg column. Verified against existing JSON: paired separations are +0.067 / +0.133 / +0.136 / +0.207 / +0.152 (white-noise -20/-10/0/+10 plus 2 kHz lowpass), matching the canonical report's published numbers within rounding (the report cites +0.067 / +0.135 / +0.137 / +0.208 / +0.153). Doc-refs (1 thread): swap "lines 418–422" reference in the report's intro for the stable section heading "§Acceptance Criteria". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR is a small follow-up to the M17 Phase A validation spike script/report, fixing an unpaired UTMOS separation calculation by switching the gate/auto-template to use a clean mean computed over the same degradation-sample clip subset, and stabilizing a design-doc reference in the narrative report.

Changes:

Compute sample_clean_mean over the degradation-sample clips and use it for UTMOS separations/gates (primary_separation, any_separation_passes) and the auto-generated report table.
Expose sample_clean_mean and degradation_sample_clip_ids in the emitted JSON for transparency/reproducibility.
Replace a brittle line-range reference in the report with a stable section reference to the design doc’s “Acceptance Criteria”.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`scripts/m17_phase_a_validation.py`	Switches UTMOS separation math to a paired (same-clip) clean baseline; adds JSON fields to make the pairing explicit and updates the auto-report accordingly.
`docs/m17_phase_a_validation_report.md`	Updates the design-doc citation to a stable section reference instead of line numbers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

github-actions · 2026-05-05T10:58:33Z

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #79 in repository https://github.com/DataHackIL/SynthBanshee. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: pull request opened
Workflow run: 25372254592 attempt 1
Comment timestamp: 2026-05-05T10:57:45.081252+00:00
PR head commit: 3edc2250d7c54846db142ed246e79bd5e5ac9e59

…ft (#96) * docs: sync .agent-plan.md to current state (M16 done, M17 in flight) The state tracker was last updated 2026-04-22 and claimed the active task was M8b. Since then M8b, M9a, M9b, M10a, M10b, M11, M13, M14, M15, M16 have all merged, and M17 (automated evaluation) has had its design (#73), Phase A spike (#77, #79), and a wave of bug-fix PRs (#82, #85, #86, #90) land — anyone reading this file got an 8-milestone-stale picture. Rewrote: - Current system state — list every merged V3 milestone with one-line summaries; promoted the loudness contract (#78) and effective-prosody cap (#87) to the architectural-invariants list since both are load-bearing for any agent editing TTS or preprocessing code. - Active / next task — replaced "M8b" with the M17 ASR regression thread, noted PR #90 as the partial fix and #91 (rate-floor lift R) as the queued next experiment, including the WER ≤ 0.10 + listening-test pass criterion. - Open threads table — new section listing the threads agents most often need to know about (M12 gate, M17 full automation, #62 word merging, #72 SSML parse, #88 CI ASR deferral). - Context pointers — added splendor-brief as the orientation entrypoint and the .venv-vs-~/.local PATH trap. - CI / Workflow notes — added the Tier-3 ASR local-only policy summary so PRs touching audio-rendering files don't merge without it. This file is a quick-orientation summary — added a header line marking the design-doc tracker as authoritative when details disagree, so the next drift gets caught in a tracker diff rather than a stale summary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: delete .agent-plan.md (Option A; supersedes prior commit) Self-review of commit 4361497 (the .agent-plan.md rewrite) caught that the file fundamentally fails the duplication test: milestone status duplicates docs/audio_generation_v3_design.md → Implementation Tracker; open-thread state duplicates GitHub Issues + splendor brief; pointers duplicate AGENTS.md; and the only non-duplicate section (architectural invariants) is already covered in AGENTS.md (FLOAT subtype line 72, MixedScene shift line 37, validate_audio peak line 85, full #78 loudness contract line 35, full #87 effective-prosody cap line 73). The rewrite also reproduced the "if details disagree, the tracker wins" disclaimer antipattern — the same one PR #94's review removed from docs/spec.md §3.1 just one PR ago. A docs file whose explicit charter is "I am allowed to be wrong relative to the canonical source" is structured drift bait. The original .agent-plan.md got 15 days stale; "rewrite more carefully" was the wrong response. Changes: - Delete .agent-plan.md (75-line summary that duplicated load-bearing state held authoritatively elsewhere). - Update .claude/skills/open-feature-pr.md step 2 to drop the ".agent-plan.md" fallback for milestone-ID inference; the branch name + parent issue's milestone field already cover it, and pointing at the design-doc tracker as authoritative is more durable than pointing at a manually-maintained summary. - Splendor maintenance: source forget src-9d9759e5ad... --apply. Removes the orphan source manifest, wiki summary page, and wiki index entry. Residual cross-references in 5 planning tasks and 5 wiki pages remain — splendor surfaces them but doesn't auto-clean; they'll regenerate on next ingest of those sources. Nothing added to AGENTS.md: the cross-cutting rules from .agent-plan.md are already there. The remaining "invariants" (5-tuple mixer API post-M8a, audible_* timeline use, MixMode no-audio-deps, _peak_limit vs _normalize_peak naming) are mixer-internal details that belong in module docstrings, not global agent rules — and AGENTS.md has its own M8a drift bug (line 71 still says "4-tuple") that's better fixed in a dedicated PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(agents): fix M3a/M8a/#74 drift in mixer Segment API description AGENTS.md "TTS" section claimed `SceneMixer.mix_sequential` took a 4-tuple `(wav_bytes, pause_s, speaker_id, rms_target_dbfs)`. The actual current API is the `Segment` dataclass in `synthbanshee/tts/mixer.py` with six named fields (`wav_bytes`, `amount_s`, `speaker_id`, `rms_target_dbfs`, `mix_mode`, `intensity`). The doc has been wrong since at least M8a (added `mix_mode`); #74's Lombard tilt then added `intensity`. Per the dataclass docstring, the named-fields move from positional tuple was deliberate so call sites and reviewers can't transpose args silently. Surfaced during PR #96 review (delete-.agent-plan.md) — the original .agent-plan.md and AGENTS.md disagreed about the segment API; turns out they were both stale, and the design-doc tracker (#74 row line 219) shows the dataclass move that neither AGENTS.md nor .agent-plan.md caught. Splendor re-ingest of AGENTS.md (and the opportunistic re-ingest of docs/spec.md that ingest --changed picked up) is intentionally NOT in this commit — it's 16 state/wiki/planning files of churn, scoped for a dedicated splendor maintenance follow-up rather than mixed into a doc fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(agents): cross-reference Lombard tilt as #65 (issue), not #74 (PR) Copilot's review of commit b20eaac caught a cross-reference convention break: the codebase consistently uses #65 (the issue) when referencing the Lombard high-shelf — see synthbanshee/tts/mixer.py lines 6, 80, 84, 88, 108, 183, 322, and docs/audio_generation_v3_design.md §4.2c (line 215). PR #74 is the implementation that closed the issue, but the cross-link convention is to point at the issue. One-character fix: #74 → #65 in the M3a TTS bullet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 5, 2026 10:55

shaypal5 added this to the M17 milestone May 5, 2026

shaypal5 added bugfix type: fix Bug fix type: docs Documentation only labels May 5, 2026

Copilot started reviewing on behalf of shaypal5 May 5, 2026 10:56 View session

Copilot AI reviewed May 5, 2026

View reviewed changes

shaypal5 merged commit 8bdb28d into main May 5, 2026
10 checks passed

shaypal5 deleted the fix/m17-spike-paired-utmos-and-doc-refs branch May 5, 2026 10:58

This was referenced May 5, 2026

investigate(tts): Whisper WER regression on Tier A clips is NOT loudness-driven — bisect M15/#70/#71 prosody changes #83

Closed

docs: delete .agent-plan.md + fix AGENTS.md M3a/M8a/#74 mixer-API drift #96

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): #M17 spike — paired UTMOS comparison + stable doc refs#79

fix(eval): #M17 spike — paired UTMOS comparison + stable doc refs#79
shaypal5 merged 1 commit into
mainfrom
fix/m17-spike-paired-utmos-and-doc-refs

shaypal5 commented May 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented May 5, 2026

Summary

Changes

Why

Verification

What this PR is NOT

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants