Skip to content

fix(eval): #M17 spike — paired UTMOS comparison + stable doc refs#79

Merged
shaypal5 merged 1 commit into
mainfrom
fix/m17-spike-paired-utmos-and-doc-refs
May 5, 2026
Merged

fix(eval): #M17 spike — paired UTMOS comparison + stable doc refs#79
shaypal5 merged 1 commit into
mainfrom
fix/m17-spike-paired-utmos-and-doc-refs

Conversation

@shaypal5
Copy link
Copy Markdown
Member

@shaypal5 shaypal5 commented May 5, 2026

Summary

Tiny follow-up to PR #77 (M17 Phase A validation spike). Addresses 4 of the 5 unresolved Copilot threads that landed post-merge on that PR (3 are duplicates of the same underlying issue).

Changes

File Purpose
scripts/m17_phase_a_validation.py Compute sample_clean_mean over the degradation-sample clips and use it for primary_separation / any_separation_passes instead of the all-clip clean_mean. Expose both means + the sample clip IDs in the JSON. Auto-template uses paired comparison too.
docs/m17_phase_a_validation_report.md Replace lines 418–422 reference with §"Acceptance Criteria" for stability against design-doc edits.

Why

The canonical report tables in PR #77 already used the paired same-5 comparison (so its published conclusions are unaffected), but the script's JSON output + auto-template computed primary_separation = clean_mean − degradation_mean where clean_mean was over all 10 clips and degradation_mean over only 5. That's an unpaired comparison and yields a different number from the one the report cites.

Three Copilot threads on the merged PR flagged this as the same issue; this PR resolves all three.

Verification

Recomputed paired separations from the merged JSON match what the canonical report publishes within rounding:

Degradation Report cites Now in JSON
white noise −20 dB SNR +0.067 +0.067
white noise −10 dB SNR +0.135 +0.133
white noise 0 dB SNR +0.137 +0.136
white noise +10 dB SNR +0.208 +0.207
2 kHz lowpass +0.153 +0.152

Differences are <= 0.002 and explained by the canonical report rounding to 3 decimals at a different intermediate step. No conclusion in PR #77 changes.

What this PR is NOT

  • Not addressing the 4th unresolved Copilot thread (torch.hub.load(trust_repo=True)). That's the fourth recurrence of the same supply-chain concern; it was already pinned to an immutable commit + acknowledged in the report's Limitations §8 in PR spike(eval): M17 Phase A validation — Whisper + UTMOS on Hebrew clips #77. Adding an opt-in env-var gate would be pure ceremony for a one-shot spike script.

Test plan

  • pytest -q clean — 1707 passed, 1 skipped (unrelated).
  • Script compiles + ruff clean.
  • Paired separations from existing JSON match canonical report numbers within rounding.

🤖 Generated with Claude Code

Tiny follow-up to PR #77.  Addresses two of the five post-merge Copilot
threads on that PR; the third (torch.hub.load trust_repo=True) is a
fourth recurrence of an already-treated concern (pinned to immutable
commit + documented in report Limitations §8) and is intentionally
not re-addressed here.

UTMOS sample-population mismatch (3 of the 5 threads, all the same issue):
The script computed `clean_mean` over all 10 clips but each degradation's
`mean_utmos` over only the first 5 (DEGRADATION_SAMPLE_N).  The resulting
`primary_separation` and `any_separation_passes` fields in results.json
were therefore comparing different populations.  The canonical PR #77
report tables already used the paired same-5 comparison (so the
shipped conclusions are unaffected), but the script's emitted JSON +
auto-template were inconsistent with the report.

Now: introduce `sample_clean_mean` (mean over the sample clips only),
use it for the gate computations, and expose both `clean_mean` and
`sample_clean_mean` in the JSON along with the new
`degradation_sample_clip_ids` list so a reader can verify the
populations match.  Auto-template report uses the sample mean for the
clean − deg column.

Verified against existing JSON: paired separations are
+0.067 / +0.133 / +0.136 / +0.207 / +0.152 (white-noise -20/-10/0/+10
plus 2 kHz lowpass), matching the canonical report's published numbers
within rounding (the report cites +0.067 / +0.135 / +0.137 / +0.208 /
+0.153).

Doc-refs (1 thread): swap "lines 418–422" reference in the report's
intro for the stable section heading "§Acceptance Criteria".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 10:55
@shaypal5 shaypal5 added this to the M17 milestone May 5, 2026
@shaypal5 shaypal5 added bugfix type: fix Bug fix type: docs Documentation only labels May 5, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a small follow-up to the M17 Phase A validation spike script/report, fixing an unpaired UTMOS separation calculation by switching the gate/auto-template to use a clean mean computed over the same degradation-sample clip subset, and stabilizing a design-doc reference in the narrative report.

Changes:

  • Compute sample_clean_mean over the degradation-sample clips and use it for UTMOS separations/gates (primary_separation, any_separation_passes) and the auto-generated report table.
  • Expose sample_clean_mean and degradation_sample_clip_ids in the emitted JSON for transparency/reproducibility.
  • Replace a brittle line-range reference in the report with a stable section reference to the design doc’s “Acceptance Criteria”.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
scripts/m17_phase_a_validation.py Switches UTMOS separation math to a paired (same-clip) clean baseline; adds JSON fields to make the pairing explicit and updates the auto-report accordingly.
docs/m17_phase_a_validation_report.md Updates the design-doc citation to a stable section reference instead of line numbers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@shaypal5 shaypal5 merged commit 8bdb28d into main May 5, 2026
10 checks passed
@shaypal5 shaypal5 deleted the fix/m17-spike-paired-utmos-and-doc-refs branch May 5, 2026 10:58
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #79 in repository https://github.com/DataHackIL/SynthBanshee. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: pull request opened
Workflow run: 25372254592 attempt 1
Comment timestamp: 2026-05-05T10:57:45.081252+00:00
PR head commit: 3edc2250d7c54846db142ed246e79bd5e5ac9e59

shaypal5 added a commit that referenced this pull request May 7, 2026
…ft (#96)

* docs: sync .agent-plan.md to current state (M16 done, M17 in flight)

The state tracker was last updated 2026-04-22 and claimed the active task
was M8b. Since then M8b, M9a, M9b, M10a, M10b, M11, M13, M14, M15, M16
have all merged, and M17 (automated evaluation) has had its design (#73),
Phase A spike (#77, #79), and a wave of bug-fix PRs (#82, #85, #86, #90)
land — anyone reading this file got an 8-milestone-stale picture.

Rewrote:
- Current system state — list every merged V3 milestone with one-line
  summaries; promoted the loudness contract (#78) and effective-prosody
  cap (#87) to the architectural-invariants list since both are
  load-bearing for any agent editing TTS or preprocessing code.
- Active / next task — replaced "M8b" with the M17 ASR regression thread,
  noted PR #90 as the partial fix and #91 (rate-floor lift R) as the
  queued next experiment, including the WER ≤ 0.10 + listening-test pass
  criterion.
- Open threads table — new section listing the threads agents most often
  need to know about (M12 gate, M17 full automation, #62 word merging,
  #72 SSML parse, #88 CI ASR deferral).
- Context pointers — added splendor-brief as the orientation entrypoint
  and the .venv-vs-~/.local PATH trap.
- CI / Workflow notes — added the Tier-3 ASR local-only policy summary
  so PRs touching audio-rendering files don't merge without it.

This file is a quick-orientation summary — added a header line marking
the design-doc tracker as authoritative when details disagree, so the
next drift gets caught in a tracker diff rather than a stale summary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: delete .agent-plan.md (Option A; supersedes prior commit)

Self-review of commit 4361497 (the .agent-plan.md rewrite) caught that
the file fundamentally fails the duplication test: milestone status
duplicates docs/audio_generation_v3_design.md → Implementation Tracker;
open-thread state duplicates GitHub Issues + splendor brief; pointers
duplicate AGENTS.md; and the only non-duplicate section
(architectural invariants) is already covered in AGENTS.md (FLOAT
subtype line 72, MixedScene shift line 37, validate_audio peak line 85,
full #78 loudness contract line 35, full #87 effective-prosody cap
line 73).

The rewrite also reproduced the "if details disagree, the tracker wins"
disclaimer antipattern — the same one PR #94's review removed from
docs/spec.md §3.1 just one PR ago. A docs file whose explicit charter
is "I am allowed to be wrong relative to the canonical source" is
structured drift bait. The original .agent-plan.md got 15 days stale;
"rewrite more carefully" was the wrong response.

Changes:
- Delete .agent-plan.md (75-line summary that duplicated load-bearing
  state held authoritatively elsewhere).
- Update .claude/skills/open-feature-pr.md step 2 to drop the
  ".agent-plan.md" fallback for milestone-ID inference; the branch
  name + parent issue's milestone field already cover it, and pointing
  at the design-doc tracker as authoritative is more durable than
  pointing at a manually-maintained summary.
- Splendor maintenance: source forget src-9d9759e5ad... --apply.
  Removes the orphan source manifest, wiki summary page, and wiki
  index entry. Residual cross-references in 5 planning tasks and 5
  wiki pages remain — splendor surfaces them but doesn't auto-clean;
  they'll regenerate on next ingest of those sources.

Nothing added to AGENTS.md: the cross-cutting rules from .agent-plan.md
are already there. The remaining "invariants" (5-tuple mixer API
post-M8a, audible_* timeline use, MixMode no-audio-deps, _peak_limit vs
_normalize_peak naming) are mixer-internal details that belong in
module docstrings, not global agent rules — and AGENTS.md has its own
M8a drift bug (line 71 still says "4-tuple") that's better fixed in a
dedicated PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): fix M3a/M8a/#74 drift in mixer Segment API description

AGENTS.md "TTS" section claimed `SceneMixer.mix_sequential` took a
4-tuple `(wav_bytes, pause_s, speaker_id, rms_target_dbfs)`. The actual
current API is the `Segment` dataclass in `synthbanshee/tts/mixer.py`
with six named fields (`wav_bytes`, `amount_s`, `speaker_id`,
`rms_target_dbfs`, `mix_mode`, `intensity`). The doc has been wrong
since at least M8a (added `mix_mode`); #74's Lombard tilt then added
`intensity`. Per the dataclass docstring, the named-fields move from
positional tuple was deliberate so call sites and reviewers can't
transpose args silently.

Surfaced during PR #96 review (delete-.agent-plan.md) — the original
.agent-plan.md and AGENTS.md disagreed about the segment API; turns out
they were both stale, and the design-doc tracker (#74 row line 219)
shows the dataclass move that neither AGENTS.md nor .agent-plan.md
caught.

Splendor re-ingest of AGENTS.md (and the opportunistic re-ingest of
docs/spec.md that ingest --changed picked up) is intentionally NOT in
this commit — it's 16 state/wiki/planning files of churn, scoped for a
dedicated splendor maintenance follow-up rather than mixed into a doc
fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): cross-reference Lombard tilt as #65 (issue), not #74 (PR)

Copilot's review of commit b20eaac caught a cross-reference convention
break: the codebase consistently uses #65 (the issue) when referencing
the Lombard high-shelf — see synthbanshee/tts/mixer.py lines 6, 80, 84,
88, 108, 183, 322, and docs/audio_generation_v3_design.md §4.2c
(line 215). PR #74 is the implementation that closed the issue, but
the cross-link convention is to point at the issue.

One-character fix: #74#65 in the M3a TTS bullet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix type: docs Documentation only type: fix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants