fix(eval): #M17 spike — paired UTMOS comparison + stable doc refs#79
Merged
Conversation
Tiny follow-up to PR #77. Addresses two of the five post-merge Copilot threads on that PR; the third (torch.hub.load trust_repo=True) is a fourth recurrence of an already-treated concern (pinned to immutable commit + documented in report Limitations §8) and is intentionally not re-addressed here. UTMOS sample-population mismatch (3 of the 5 threads, all the same issue): The script computed `clean_mean` over all 10 clips but each degradation's `mean_utmos` over only the first 5 (DEGRADATION_SAMPLE_N). The resulting `primary_separation` and `any_separation_passes` fields in results.json were therefore comparing different populations. The canonical PR #77 report tables already used the paired same-5 comparison (so the shipped conclusions are unaffected), but the script's emitted JSON + auto-template were inconsistent with the report. Now: introduce `sample_clean_mean` (mean over the sample clips only), use it for the gate computations, and expose both `clean_mean` and `sample_clean_mean` in the JSON along with the new `degradation_sample_clip_ids` list so a reader can verify the populations match. Auto-template report uses the sample mean for the clean − deg column. Verified against existing JSON: paired separations are +0.067 / +0.133 / +0.136 / +0.207 / +0.152 (white-noise -20/-10/0/+10 plus 2 kHz lowpass), matching the canonical report's published numbers within rounding (the report cites +0.067 / +0.135 / +0.137 / +0.208 / +0.153). Doc-refs (1 thread): swap "lines 418–422" reference in the report's intro for the stable section heading "§Acceptance Criteria". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR is a small follow-up to the M17 Phase A validation spike script/report, fixing an unpaired UTMOS separation calculation by switching the gate/auto-template to use a clean mean computed over the same degradation-sample clip subset, and stabilizing a design-doc reference in the narrative report.
Changes:
- Compute
sample_clean_meanover the degradation-sample clips and use it for UTMOS separations/gates (primary_separation,any_separation_passes) and the auto-generated report table. - Expose
sample_clean_meananddegradation_sample_clip_idsin the emitted JSON for transparency/reproducibility. - Replace a brittle line-range reference in the report with a stable section reference to the design doc’s “Acceptance Criteria”.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
scripts/m17_phase_a_validation.py |
Switches UTMOS separation math to a paired (same-clip) clean baseline; adds JSON fields to make the pairing explicit and updates the auto-report accordingly. |
docs/m17_phase_a_validation_report.md |
Updates the design-doc citation to a stable section reference instead of line numbers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #79 in repository https://github.com/DataHackIL/SynthBanshee. Treat this PR as all clear unless new signals appear.Run metadata: |
shaypal5
added a commit
that referenced
this pull request
May 7, 2026
…ft (#96) * docs: sync .agent-plan.md to current state (M16 done, M17 in flight) The state tracker was last updated 2026-04-22 and claimed the active task was M8b. Since then M8b, M9a, M9b, M10a, M10b, M11, M13, M14, M15, M16 have all merged, and M17 (automated evaluation) has had its design (#73), Phase A spike (#77, #79), and a wave of bug-fix PRs (#82, #85, #86, #90) land — anyone reading this file got an 8-milestone-stale picture. Rewrote: - Current system state — list every merged V3 milestone with one-line summaries; promoted the loudness contract (#78) and effective-prosody cap (#87) to the architectural-invariants list since both are load-bearing for any agent editing TTS or preprocessing code. - Active / next task — replaced "M8b" with the M17 ASR regression thread, noted PR #90 as the partial fix and #91 (rate-floor lift R) as the queued next experiment, including the WER ≤ 0.10 + listening-test pass criterion. - Open threads table — new section listing the threads agents most often need to know about (M12 gate, M17 full automation, #62 word merging, #72 SSML parse, #88 CI ASR deferral). - Context pointers — added splendor-brief as the orientation entrypoint and the .venv-vs-~/.local PATH trap. - CI / Workflow notes — added the Tier-3 ASR local-only policy summary so PRs touching audio-rendering files don't merge without it. This file is a quick-orientation summary — added a header line marking the design-doc tracker as authoritative when details disagree, so the next drift gets caught in a tracker diff rather than a stale summary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: delete .agent-plan.md (Option A; supersedes prior commit) Self-review of commit 4361497 (the .agent-plan.md rewrite) caught that the file fundamentally fails the duplication test: milestone status duplicates docs/audio_generation_v3_design.md → Implementation Tracker; open-thread state duplicates GitHub Issues + splendor brief; pointers duplicate AGENTS.md; and the only non-duplicate section (architectural invariants) is already covered in AGENTS.md (FLOAT subtype line 72, MixedScene shift line 37, validate_audio peak line 85, full #78 loudness contract line 35, full #87 effective-prosody cap line 73). The rewrite also reproduced the "if details disagree, the tracker wins" disclaimer antipattern — the same one PR #94's review removed from docs/spec.md §3.1 just one PR ago. A docs file whose explicit charter is "I am allowed to be wrong relative to the canonical source" is structured drift bait. The original .agent-plan.md got 15 days stale; "rewrite more carefully" was the wrong response. Changes: - Delete .agent-plan.md (75-line summary that duplicated load-bearing state held authoritatively elsewhere). - Update .claude/skills/open-feature-pr.md step 2 to drop the ".agent-plan.md" fallback for milestone-ID inference; the branch name + parent issue's milestone field already cover it, and pointing at the design-doc tracker as authoritative is more durable than pointing at a manually-maintained summary. - Splendor maintenance: source forget src-9d9759e5ad... --apply. Removes the orphan source manifest, wiki summary page, and wiki index entry. Residual cross-references in 5 planning tasks and 5 wiki pages remain — splendor surfaces them but doesn't auto-clean; they'll regenerate on next ingest of those sources. Nothing added to AGENTS.md: the cross-cutting rules from .agent-plan.md are already there. The remaining "invariants" (5-tuple mixer API post-M8a, audible_* timeline use, MixMode no-audio-deps, _peak_limit vs _normalize_peak naming) are mixer-internal details that belong in module docstrings, not global agent rules — and AGENTS.md has its own M8a drift bug (line 71 still says "4-tuple") that's better fixed in a dedicated PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(agents): fix M3a/M8a/#74 drift in mixer Segment API description AGENTS.md "TTS" section claimed `SceneMixer.mix_sequential` took a 4-tuple `(wav_bytes, pause_s, speaker_id, rms_target_dbfs)`. The actual current API is the `Segment` dataclass in `synthbanshee/tts/mixer.py` with six named fields (`wav_bytes`, `amount_s`, `speaker_id`, `rms_target_dbfs`, `mix_mode`, `intensity`). The doc has been wrong since at least M8a (added `mix_mode`); #74's Lombard tilt then added `intensity`. Per the dataclass docstring, the named-fields move from positional tuple was deliberate so call sites and reviewers can't transpose args silently. Surfaced during PR #96 review (delete-.agent-plan.md) — the original .agent-plan.md and AGENTS.md disagreed about the segment API; turns out they were both stale, and the design-doc tracker (#74 row line 219) shows the dataclass move that neither AGENTS.md nor .agent-plan.md caught. Splendor re-ingest of AGENTS.md (and the opportunistic re-ingest of docs/spec.md that ingest --changed picked up) is intentionally NOT in this commit — it's 16 state/wiki/planning files of churn, scoped for a dedicated splendor maintenance follow-up rather than mixed into a doc fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(agents): cross-reference Lombard tilt as #65 (issue), not #74 (PR) Copilot's review of commit b20eaac caught a cross-reference convention break: the codebase consistently uses #65 (the issue) when referencing the Lombard high-shelf — see synthbanshee/tts/mixer.py lines 6, 80, 84, 88, 108, 183, 322, and docs/audio_generation_v3_design.md §4.2c (line 215). PR #74 is the implementation that closed the issue, but the cross-link convention is to point at the issue. One-character fix: #74 → #65 in the M3a TTS bullet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Tiny follow-up to PR #77 (M17 Phase A validation spike). Addresses 4 of the 5 unresolved Copilot threads that landed post-merge on that PR (3 are duplicates of the same underlying issue).
Changes
scripts/m17_phase_a_validation.pysample_clean_meanover the degradation-sample clips and use it forprimary_separation/any_separation_passesinstead of the all-clipclean_mean. Expose both means + the sample clip IDs in the JSON. Auto-template uses paired comparison too.docs/m17_phase_a_validation_report.mdlines 418–422reference with§"Acceptance Criteria"for stability against design-doc edits.Why
The canonical report tables in PR #77 already used the paired same-5 comparison (so its published conclusions are unaffected), but the script's JSON output + auto-template computed
primary_separation = clean_mean − degradation_meanwhereclean_meanwas over all 10 clips anddegradation_meanover only 5. That's an unpaired comparison and yields a different number from the one the report cites.Three Copilot threads on the merged PR flagged this as the same issue; this PR resolves all three.
Verification
Recomputed paired separations from the merged JSON match what the canonical report publishes within rounding:
Differences are <= 0.002 and explained by the canonical report rounding to 3 decimals at a different intermediate step. No conclusion in PR #77 changes.
What this PR is NOT
torch.hub.load(trust_repo=True)). That's the fourth recurrence of the same supply-chain concern; it was already pinned to an immutable commit + acknowledged in the report's Limitations §8 in PR spike(eval): M17 Phase A validation — Whisper + UTMOS on Hebrew clips #77. Adding an opt-in env-var gate would be pure ceremony for a one-shot spike script.Test plan
pytest -qclean — 1707 passed, 1 skipped (unrelated).🤖 Generated with Claude Code