fix(preprocessing): #78 define loudness contract + metadata trail (does NOT recover Whisper; see #83) #82
…n target peak (M3c)

M3a per-turn RMS targeting + M3b limiter-only normalization left M3a-shaped Tier A clips peaking ~6 dB below the −1 dBFS ceiling. Whisper under-transcribed (WER 0.04 → 0.28, length-ratio 0.76) and UTMOS read the loosely-limited clips ~0.9 MOS higher, dominating M17 evaluation.

Fix is a *single global gain* post-mix loudness step. It preserves per-turn RMS ratios exactly (M3a's whole point), unlike the legacy per-segment peak normalizer M3b removed. Configurable target with a sane default.

Changes:
- synthbanshee/augment/preprocessing.py: new peak_normalize_to_target() helper; preprocess() runs it before the existing −1 dBFS safety limiter
- synthbanshee/config/preprocessing_config.py: target_peak_dbfs field, default −2.0 dBFS, range [−12.0, −1.0]
- synthbanshee/cli.py: Stage 3b post-augment normalize uses the same helper + same config so Tier A and Tier B/C exit at the same peak, fixing a pre-existing asymmetry that hid the regression on Tier A
- tests/unit/test_preprocessing.py: invert the M3b-era "quiet clip not scaled up" assertion, add direct unit tests for the helper, validate config range constraints
- tests/integration/test_loudness_regression.py: end-to-end guard asserting peak lands in [−3, −1] dBFS and per-turn RMS contrast survives a single global gain
- AGENTS.md, docs/spec.md §3 + §3.1, docs/audio_generation_v3_design.md §4.7 + tracker M3c row: spec text in sync with the two-stage policy

Verification (sp_neu_a_0001 same-config A/B):
  2026-04-15 reference    peak −1.00 dBFS   rms −20.37 dBFS
  2026-05-05 before fix   peak −5.62 dBFS   rms −26.68 dBFS
  2026-05-05 after fix    peak −2.00 dBFS   rms −23.06 dBFS

Closes #78

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
This PR addresses issue #78 by restoring absolute clip loudness after M3a/M3b changes, while preserving M3a’s per-turn RMS contrast. It introduces a single global post-mix gain stage that targets a configurable peak level, followed by the existing safety peak limiter.
Changes:
- Add `peak_normalize_to_target()` and apply it in `preprocess()` before the existing `-1.0 dBFS` peak limiter.
- Introduce `PreprocessingConfig.target_peak_dbfs` (default `-2.0`, bounded `[-12, -1]`) and wire it through CLI Tier A and Tier B/C paths.
- Add/adjust unit + integration tests and update docs to reflect the two-stage loudness policy.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| synthbanshee/augment/preprocessing.py | Adds global peak-target normalization step (5a) before the existing safety limiter (5b). |
| synthbanshee/config/preprocessing_config.py | Adds target_peak_dbfs config with validation bounds and documentation. |
| synthbanshee/cli.py | Ensures a concrete preprocessing config exists and reuses the shared normalization helper for Tier B/C post-augmentation. |
| tests/unit/test_preprocessing.py | Updates assertions for new behavior; adds direct unit tests for peak_normalize_to_target() and config validation. |
| tests/integration/test_loudness_regression.py | Adds end-to-end regression tests for peak range and per-turn RMS contrast preservation. |
| docs/spec.md | Updates amplitude row and splits loudness step into 5a/5b. |
| docs/audio_generation_v3_design.md | Documents M3c and the single-gain invariant/history. |
| AGENTS.md | Updates “Audio format” loudness invariant to the new two-stage policy. |
…metadata trail (lever probe killed Whisper hypothesis)

Self-review of the prior commit found the Whisper-fix framing empirically false. Reframed the PR around the actual deliverables (loudness contract clarity + metadata trail) and addressed all blocking review points.

Empirical finding (lever probe, 2026-05-05, openai/whisper-large-v3): across 8 single-global-gain variants of the same source audio at peaks from −6 to +7 dBFS and RMS from −27 to −14 dBFS, Whisper produced byte-identical hypotheses in 7/8 cases (WER 0.286, length-ratio 0.762). Loudness in the spec range is invisible to Whisper's log-mel extractor. The actual ASR regression is content-driven (likely M15 / #70 prosody changes) and is now tracked separately in #83.

Review fixes applied:
- #4 Metadata trail (the structural fix that prevents #78 recurring): GenerationMetadata gains `loudness_target_peak_dbfs: float | None` and bumps `normalization_strategy` to "per_turn_rms_v2_target_peak". Without this field, the original regression hid for three weeks behind unchanged generator_version.
- #5 Pydantic upper bound −1.0 → −1.5. 0.5 dB headroom over the safety limiter ceiling guarantees the two stages cannot collide under float-arithmetic noise.
- #6 Step names flattened to literal tokens (peak_normalize, peak_limit, silence_pad; no embedded numeric parameters). Configured value moves to PreprocessingResult.target_peak_dbfs as structured data so QA tooling reads policy via field access, not regex on a step string.
- #7 Renamed misleading `test_loud_clip_clamped_below_ceiling` to `test_loud_input_lands_at_target_not_at_ceiling`: the limiter is now a guaranteed no-op in normal flow; the target step is what does the work.
- #8 Replaced 440 Hz sine fixture with bandpass-filtered Gaussian noise targeting 18 dB crest, mirroring real Hebrew TTS post-M3a behaviour. Sine's 3 dB crest let regressions slip past peak-anchored asserts.
- #9 Tightened tolerances 0.5 → 0.1 dB (peak) and 0.5 → 0.2 dB (RMS contrast). Original #78 deviation was 4.6 dB so 0.5 dB only caught it by luck; 0.1 dB catches subtle regressions.
- #10 Tier B/C cli.py path coverage: two new tests in test_tier_b_pipeline.py asserting (a) post-augment peak lands at the configured target and (b) GenerationMetadata carries the new normalization_strategy + loudness_target_peak_dbfs fields.
- #11 Dropped fictional "M3c" milestone tag everywhere. Tracker row in audio_generation_v3_design.md is now indexed by #78. peak_normalize_to_target docstring expanded to mention dual use (Stage 3a + Stage 3b post-augment).
- #3 Tier B/C −1.0 → −2.0 dBFS behaviour change called out explicitly in PR body, AGENTS.md, and the §4.7 spec change. This is incidentally better for Tier B/C (gives 1 dB inter-sample-peak headroom for room IR / noise mixing), but it IS a change.

Verification:
- pytest 1720 passed / 1 skipped (added 4 tests, removed 1 obsolete)
- ruff: clean
- mypy: 36 unique errors, identical to main (zero new)
- Live re-render of sp_neu_a_0001 produces peak = -2.000 dBFS exactly, with metadata fields populated as expected.

Follow-up issue filed: #83, investigate the actual Whisper WER cause (prosody/duration, not loudness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
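The crest-factor point in review fix #8 can be checked numerically. The sketch below is illustrative only: `crest_factor_db` is a hypothetical helper, not part of the repository, and the noise is plain white Gaussian noise rather than the band-passed fixture the commit describes, so its crest factor will not match the 18 dB target. It only shows that a pure sine sits at roughly 3 dB while noise-like signals sit far higher, which is why peak-anchored assertions on a sine say little about RMS.

```python
import math
import random


def crest_factor_db(samples):
    """Peak-to-RMS ratio of a signal, in dB (hypothetical helper)."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(peak / rms)


SAMPLE_RATE = 16_000

# One second of a 440 Hz sine: peak 1.0, RMS 1/sqrt(2).
sine = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
sine_crest = crest_factor_db(sine)

# One second of white Gaussian noise (assumption: NOT band-passed, unlike the
# real fixture). Seeded so the run is repeatable.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]
noise_crest = crest_factor_db(noise)

print(f"sine crest:  {sine_crest:.2f} dB")
print(f"noise crest: {noise_crest:.2f} dB")
```

With a low-crest signal, peak and RMS move together almost one-to-one, so a test that only pins the peak cannot distinguish a clip whose RMS has drifted. A high-crest fixture decouples the two measurements.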
pr-agent-context report: This run includes unresolved review comments on PR #82 in repository https://github.com/DataHackIL/SynthBanshee
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: synthbanshee/cli.py:461
URL: https://github.com/DataHackIL/SynthBanshee/pull/82#discussion_r3190003807
Root author: copilot-pull-request-reviewer
Comment:
Stage 3b re-normalizes the augmented samples with peak_normalize_to_target() but then writes PCM_16 directly without applying the −1.0 dBFS safety limiter used in preprocess(). If target_peak_dbfs is ever set to −1.0 (allowed by config) or if PCM_16 quantization nudges the peak upward, Tier B/C outputs could end up slightly above the ceiling and diverge from the documented “5a then 5b” policy. Consider applying the same safety limiting step after normalization (or sharing a small helper that does normalize+limit) before writing clip_wav.
## COPILOT-2
Location: docs/spec.md
URL: https://github.com/DataHackIL/SynthBanshee/pull/82#discussion_r3190003868
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The updated loudness-normalization step is documented here, but steps 1–4 in this same pipeline list still describe an older implementation (e.g., torchaudio/SoX resample, 7.5 kHz low-pass, Wiener spectral subtraction defaults). The current preprocessing implementation in synthbanshee/augment/preprocessing.py uses scipy resample_poly, an 80 Hz high-pass, and optional Wiener denoise. To avoid the spec contradicting the actual pipeline, please update steps 1–4 (or add an explicit note that §3.1 is historical and not the current implementation).
#82 COPILOT-1)

cli.py Stage 3b previously ran only the target-peak step (5a), not the safety limiter (5b), giving the documented two-stage policy a paper-vs-reality gap. The limiter is provably a no-op given Pydantic bounds (target ≤ −1.5, ceiling = −1.0, PCM_16 quantisation cannot bridge the 0.5 dB margin), but applying it uniformly closes the asymmetry between Tier A (preprocess()) and Tier B/C (cli.py) and prevents a future spec/code drift from silently violating §4.7.

PR #82 review thread COPILOT-2 (docs/spec.md §3.1 steps 1-4 stale since M14) is correct but pre-existing and unrelated to #78. Filed as #84 and resolved out-of-scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
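The "PCM_16 quantisation cannot bridge the 0.5 dB margin" claim comes down to a short piece of arithmetic. The sketch below works it through under one stated assumption: that PCM_16 encoding rounds to the nearest of 32768 steps per unit of full scale, so the worst-case rounding error is half a step. The helper name `dbfs_to_linear` is illustrative and not taken from the codebase.

```python
import math


def dbfs_to_linear(dbfs):
    """Convert a dBFS level to linear amplitude (1.0 = full scale)."""
    return 10.0 ** (dbfs / 20.0)


# Highest target the config bounds allow, and the safety-limiter ceiling.
target_max = dbfs_to_linear(-1.5)
ceiling = dbfs_to_linear(-1.0)
margin = ceiling - target_max

# Assumption: round-to-nearest 16-bit quantisation, 32768 steps per unit,
# so a sample can move by at most half a step.
max_quant_error = 0.5 / 32768.0

# Worst case: a peak sitting exactly at the highest target is rounded upward.
worst_case_peak_dbfs = 20.0 * math.log10(target_max + max_quant_error)

print(f"margin in linear amplitude: {margin:.4f}")
print(f"max quantisation error:     {max_quant_error:.2e}")
print(f"worst-case peak after PCM:  {worst_case_peak_dbfs:.4f} dBFS")
```

The margin is several thousand times larger than the rounding error, which is consistent with the commit's statement that the limiter is a no-op for in-range targets and is kept only as defence in depth.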
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #82 in repository https://github.com/DataHackIL/SynthBanshee. Treat this PR as all clear unless new signals appear.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| duration_seconds: float | ||
| peak_dbfs: float | ||
| """Measured peak in the written PCM_16 file.""" | ||
|
|
||
| target_peak_dbfs: float | None = None | ||
| """Target peak the loudness step aimed for (#78), in dBFS. Pair this | ||
| with ``peak_dbfs`` to compute deviation; QA tooling consumes this as | ||
| structured data instead of regexing a step-name string.""" |
| """Ordered list of step *names* (literal tokens like ``peak_normalize`` | ||
| and ``peak_limit``). Numeric parameters are *not* embedded in the | ||
| step name — they live in dedicated fields like ``target_peak_dbfs`` | ||
| and ``peak_dbfs`` so QA grep contracts don't break under config | ||
| overrides.""" | ||
|
|
| from synthbanshee.augment.pipeline import augment_scene | ||
| from synthbanshee.augment.preprocessing import ( | ||
| _PEAK_DBFS, | ||
| _peak_limit, | ||
| peak_normalize_to_target, | ||
| ) |
| correlating against ``preprocessing_applied.normalized_dbfs`` (which is | ||
| *measured*, not *targeted*).""" |
| ### 3.1 Preprocessing Pipeline (ordered) | ||
|
|
||
| All clips must pass through this pipeline before delivery. The "dirty" pre-pipeline file must be retained in `assets/` for robustness testing. | ||
|
|
||
| 1. **Resample** — convert to 16,000 Hz (SoX `rate` with VHQ quality, or `torchaudio.functional.resample`) | ||
| 2. **Downmix** — stereo → mono (average channels) | ||
| 3. **Spectral filter** — low-pass at 7,500 Hz to remove irrelevant high-frequency noise from budget sensors (Butterworth order 4) | ||
| 4. **Denoising** — spectral subtraction (Wiener filtering) to remove electrical hum; parameterize noise profile from silent leading segment | ||
| 5. **Peak limit** — attenuate to ≤ −1.0 dBFS if the signal exceeds that ceiling; never scale up. This preserves the within-scene loudness trajectory established by per-turn RMS gain (M3a). A forced scale-up would collapse intensity-level amplitude differences. | ||
| 5. **Loudness normalization** (#78) — two stages: | ||
| - **5a. Peak-normalize to target.** Apply a single global gain so the absolute peak lands at `PreprocessingConfig.target_peak_dbfs` (default −2.0 dBFS). A *single* gain preserves per-turn RMS *ratios* exactly, so the within-scene loudness trajectory established by per-turn RMS gain (M3a) survives — only the absolute level shifts. This step replaces the M3b "limiter only, never scale up" behaviour: pre-#78 the spec had only an upper bound on peak, leaving the absolute level unspecified; two clips could legitimately sit 6 dB apart and both be in-spec. | ||
| - **5b. Safety limiter.** Attenuate any sample exceeding −1.0 dBFS. For in-spec target values (`target_peak_dbfs ∈ [−12.0, −1.5]`) this is a guaranteed no-op (0.5 dB margin); it remains as defence-in-depth against upstream over-range samples. |
| clips with peaks well below the −1.0 dBFS ceiling (~−6 dBFS), which | ||
| confused downstream Whisper/UTMOS scoring. ``preprocess()`` now | ||
| peak-normalizes the mixed scene to this target before the safety | ||
| limiter, while the per-turn RMS contrast within the clip is preserved | ||
| by applying a single global gain. |
…oor + helium range (#90)

The bisect on PR #86 showed the residual sp_it_a_0001 WER regression (0.322 vs 04-15's 0.056) is caused by M7 SpeakerState drift compounding with #51's M15 style_map values, producing effective pitch +14 % to +17 % and rate 1.27-1.33x at high-intensity turns. That range simultaneously sounds cartoonish to listeners (May-3 listening test "helium / oompa-loompa") and trips Whisper-large-v3's silence-detection heuristic, the classic length-ratio collapse to ~0.7 that hid the bug for weeks.

This PR ships a partial fix: a runtime effective-prosody cap that addresses the canonical Whisper-backdoor fingerprint and the helium-range pitch concern, plus the two detection layers Shay asked for to catch this class of regression in the future. It does NOT fully restore high-intensity WER to the pre-#51 baseline; see #89 for the follow-up workstream.

## Tier-3 Whisper validation (`sp_it_a_0001`)

| variant | dur | WER | length_ratio | hyp / ref |
|---|---:|---:|---:|---:|
| 04-15 reference | 155.9 s | 0.056 | 1.009 | 236 / 234 |
| post-#86 main (no cap) | 146.6 s | 0.322 | 0.709 | 166 / 234 |
| this PR (cap active) | 149.1 s | **0.129** | **0.906** | 212 / 234 |

- Length-ratio recovers above the qa-report --asr 0.85 threshold.
- WER reduced 2.5x (0.322 -> 0.129) but still above the 04-15 baseline of 0.056. Failure mode shifts from silence-detector trip (~30 % of words missing) to substitution noise, a distinct mechanism requiring a paired listening test to fix without breaking M15 naturalness calibration. Tracked in #89 with insights and four proposed approaches.

## The fix: effective-prosody runtime cap

`synthbanshee/tts/renderer._apply_effective_prosody_cap` clamps post-state, post-randomization prosody before SSML emission:

- pitch in [-3.0, +2.0] st (~ +/- 12 % Azure)
- rate in [0.85, 1.20]
- volume left to the existing +/-50 % Azure clamp (Whisper internally normalizes loudness, per #82's lever probe, so it is not a Whisper-trip dimension).

Caps are anchored to the pre-#51 effective envelope, which produced the 04-15 reference clips with WER 0.04-0.08. Tighter caps would diverge further from M15 listening-test calibration; looser caps would re-trip Whisper. Each cap activation logs a warning and is recorded per turn.

## Detection layer 1: static prosody-cap activations in metadata

- `DialogueTurn.effective_prosody_caps` carries per-turn cap events.
- `cli.py` rolls them up into `ClipMetadata.generation_metadata.effective_prosody_caps` (new `EffectiveProsodyCapEvent` model in labels/schema.py).
- `qa-report` surfaces a new "Effective-Prosody Cap Activations (#87)" table per clip; it runs on every batch, no Azure / Whisper required.

Tier-3 render of sp_it_a_0001 recorded 14 cap activations across 7 high-intensity turns; metadata example in PR description.

## Detection layer 2: `qa-report --asr` Whisper backdoor check

New `synthbanshee/package/asr_sanity.py` provides a lazy-loaded `WhisperRunner` and `compute_asr_metrics`. `qa-report --asr` runs Whisper-large-v3 on every clip in a directory, flags clips whose length-ratio falls below `--asr-min-length-ratio` (default 0.85; the #87 fingerprint sat at ~0.71). Heavy dependencies isolated in the new `eval-asr` optional extra so normal generation/QA stays light.

Per the policy decision documented in CLAUDE.md ("ASR sanity check policy"), Tier-3 ASR sanity is local-only (not in CI) for now; see GH issue #88 for the deferred CI re-evaluation triggers.

## Tests

- tests/unit/test_effective_prosody_cap.py: 11 tests covering the helper unit, render_utterance integration, and render_scene event propagation to DialogueTurn.
- tests/unit/test_qa.py::TestProsodyCapRollup: 3 tests verifying cap-event aggregation in qa-report.
- tests/unit/test_asr_sanity.py: 11 tests covering normalize_for_wer, AsrMetrics threshold semantics, and bracket-line stripping in the reference parser. Heavy Whisper inference is exercised by the Tier-3 local run, not these tests.
- 1687 unit tests pass (1662 baseline + 25 new); ruff + mypy clean.

## Docs

- CLAUDE.md: new "ASR sanity check policy" section + "What NOT to do" bullets pinning the cap thresholds and the Tier-3 local-only policy.
- pyproject.toml: new `eval-asr` optional extra.

Reduces #87 (does not fully close; see #89 for the residual WER work).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
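The effective-prosody cap described in the commit above is, at its core, a clamp that records when it fires. The following is a minimal sketch under stated assumptions, not the body of `_apply_effective_prosody_cap`: the names `CapEvent`, `clamp`, and `apply_prosody_cap_sketch` are hypothetical, and only the numeric ranges (pitch in [-3.0, +2.0] st, rate in [0.85, 1.20]) are taken from the commit message.

```python
from dataclasses import dataclass

# Ranges quoted in the commit message.
PITCH_RANGE_ST = (-3.0, 2.0)
RATE_RANGE = (0.85, 1.20)


@dataclass(frozen=True)
class CapEvent:
    """Hypothetical record of one cap activation (one per clamped parameter)."""
    parameter: str
    requested: float
    applied: float


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_prosody_cap_sketch(pitch_st, rate):
    """Clamp pitch and rate, returning the capped values plus any cap events."""
    events = []
    capped_pitch = clamp(pitch_st, *PITCH_RANGE_ST)
    if capped_pitch != pitch_st:
        events.append(CapEvent("pitch_st", pitch_st, capped_pitch))
    capped_rate = clamp(rate, *RATE_RANGE)
    if capped_rate != rate:
        events.append(CapEvent("rate", rate, capped_rate))
    return capped_pitch, capped_rate, events


# A high-intensity turn outside both ranges triggers two events.
pitch, rate, events = apply_prosody_cap_sketch(2.8, 1.30)
# A neutral turn passes through untouched.
_, _, neutral_events = apply_prosody_cap_sketch(0.0, 1.0)
```

Returning the events alongside the clamped values, rather than only logging, is what would let a per-turn list be rolled up into clip metadata and inspected without re-rendering.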
* docs(spec): #84 sync §3.1 steps 1-4 with post-M14 preprocessing

  §3.1 still described the pre-M14 pipeline (SoX/torchaudio resample, 7.5 kHz low-pass, on-by-default Wiener), so anyone reading spec.md to understand the pipeline got pre-M14 information. PR #48 (M14, 2026-05-01) replaced step 3 with an 80 Hz high-pass and made step 4 (Wiener) opt-in by default; step 5 (loudness) was already synced by #78/#82.

  Rewrote steps 1-4 to match `synthbanshee/augment/preprocessing.py:preprocess()` and added a one-liner pointing readers at the code as the authoritative source so future drift surfaces faster.

  Closes #84.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): #84 review revisions: drop M14 narrative, implementation refs, drift disclaimer

  Self-review of the first commit on this PR identified three structural problems that recreated the kind of drift the issue was about:

  1. The "if this section disagrees with the code, the code wins" sentence licensed future drift instead of forbidding it. Replaced with a non-absolving pointer plus an explicit constraint that PRs touching either side MUST update the other in the same change.
  2. Embedded M14 changelog narrative ("Note: M14 (PR #48, 2026-05-01) removed the legacy 7,500 Hz low-pass filter ..."). Specs describe current contracts, not their evolution. The full rationale already lives in wiki/topics/research-synthesis.md (cited from preprocessing.py line 4) and in the M14 PR description; duplicating it here would rot on the next pipeline change. Removed.
  3. Implementation references (scipy.signal.resample_poly, PreprocessingConfig.wiener_denoise, "torchaudio is forbidden, see AGENTS.md") couple the spec to library and field-name choices, exactly the drift coupling #84 was about. Replaced library-name with "polyphase filter"; replaced field name with "Configuration: `PreprocessingConfig`"; removed the agent-rule reference (belongs in AGENTS.md, not in the audio contract spec).

  Other small fixes: relabelled step 4 as "Conditional Wiener denoising" since "optional" inside an "(ordered)" pipeline is a contradiction; documented the resample-skip behaviour (preprocessing.py line 221); fixed inaccurate HPF rationale ("small phone microphones cannot capture": they over-represent sub-bass, not fail to capture it); restored the "16,000 Hz" unit convention from §3 line 97.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): #84 review revisions v2: Copilot suggestions on §3.1

  Three constructive style suggestions from copilot-pull-request-reviewer on PR #94, all valid. Applied as suggested with one twist on the third.

  1. Header line was packing 4 logical statements into one paragraph (pipeline requirement + dirty-file retention + implementation pointer + drift constraint). Split into a lead sentence and a bolded "Implementation:" block on its own line. Also clarified "either" by spelling out "the implementation or this section".
  2. Step 3: expanded "sos form" → "in second-order-sections (SOS) form" so a non-DSP reader can parse the spec without external lookup.
  3. Step 4: replaced the underspecified "applied only when explicitly enabled" with the concrete handle "toggle via a boolean flag on PreprocessingConfig". Subsumes the trailing "Configuration: PreprocessingConfig" sentence (the same information now reads inline). Still avoids naming the field, preserving the implementation-decoupling from the v1 revision.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…ft (#96)

* docs: sync .agent-plan.md to current state (M16 done, M17 in flight)

  The state tracker was last updated 2026-04-22 and claimed the active task was M8b. Since then M8b, M9a, M9b, M10a, M10b, M11, M13, M14, M15, M16 have all merged, and M17 (automated evaluation) has had its design (#73), Phase A spike (#77, #79), and a wave of bug-fix PRs (#82, #85, #86, #90) land, so anyone reading this file got an 8-milestone-stale picture.

  Rewrote:
  - Current system state: list every merged V3 milestone with one-line summaries; promoted the loudness contract (#78) and effective-prosody cap (#87) to the architectural-invariants list since both are load-bearing for any agent editing TTS or preprocessing code.
  - Active / next task: replaced "M8b" with the M17 ASR regression thread, noted PR #90 as the partial fix and #91 (rate-floor lift R) as the queued next experiment, including the WER ≤ 0.10 + listening-test pass criterion.
  - Open threads table: new section listing the threads agents most often need to know about (M12 gate, M17 full automation, #62 word merging, #72 SSML parse, #88 CI ASR deferral).
  - Context pointers: added splendor-brief as the orientation entrypoint and the .venv-vs-~/.local PATH trap.
  - CI / Workflow notes: added the Tier-3 ASR local-only policy summary so PRs touching audio-rendering files don't merge without it.

  This file is a quick-orientation summary. Added a header line marking the design-doc tracker as authoritative when details disagree, so the next drift gets caught in a tracker diff rather than a stale summary.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: delete .agent-plan.md (Option A; supersedes prior commit)

  Self-review of commit 4361497 (the .agent-plan.md rewrite) caught that the file fundamentally fails the duplication test: milestone status duplicates docs/audio_generation_v3_design.md → Implementation Tracker; open-thread state duplicates GitHub Issues + splendor brief; pointers duplicate AGENTS.md; and the only non-duplicate section (architectural invariants) is already covered in AGENTS.md (FLOAT subtype line 72, MixedScene shift line 37, validate_audio peak line 85, full #78 loudness contract line 35, full #87 effective-prosody cap line 73).

  The rewrite also reproduced the "if details disagree, the tracker wins" disclaimer antipattern, the same one PR #94's review removed from docs/spec.md §3.1 just one PR ago. A docs file whose explicit charter is "I am allowed to be wrong relative to the canonical source" is structured drift bait. The original .agent-plan.md got 15 days stale; "rewrite more carefully" was the wrong response.

  Changes:
  - Delete .agent-plan.md (75-line summary that duplicated load-bearing state held authoritatively elsewhere).
  - Update .claude/skills/open-feature-pr.md step 2 to drop the ".agent-plan.md" fallback for milestone-ID inference; the branch name + parent issue's milestone field already cover it, and pointing at the design-doc tracker as authoritative is more durable than pointing at a manually-maintained summary.
  - Splendor maintenance: source forget src-9d9759e5ad... --apply. Removes the orphan source manifest, wiki summary page, and wiki index entry. Residual cross-references in 5 planning tasks and 5 wiki pages remain; splendor surfaces them but doesn't auto-clean, and they'll regenerate on next ingest of those sources.

  Nothing added to AGENTS.md: the cross-cutting rules from .agent-plan.md are already there. The remaining "invariants" (5-tuple mixer API post-M8a, audible_* timeline use, MixMode no-audio-deps, _peak_limit vs _normalize_peak naming) are mixer-internal details that belong in module docstrings, not global agent rules. AGENTS.md also has its own M8a drift bug (line 71 still says "4-tuple") that's better fixed in a dedicated PR.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): fix M3a/M8a/#74 drift in mixer Segment API description

  AGENTS.md "TTS" section claimed `SceneMixer.mix_sequential` took a 4-tuple `(wav_bytes, pause_s, speaker_id, rms_target_dbfs)`. The actual current API is the `Segment` dataclass in `synthbanshee/tts/mixer.py` with six named fields (`wav_bytes`, `amount_s`, `speaker_id`, `rms_target_dbfs`, `mix_mode`, `intensity`). The doc has been wrong since at least M8a (added `mix_mode`); #74's Lombard tilt then added `intensity`. Per the dataclass docstring, the named-fields move from positional tuple was deliberate so call sites and reviewers can't transpose args silently.

  Surfaced during PR #96 review (delete-.agent-plan.md): the original .agent-plan.md and AGENTS.md disagreed about the segment API; turns out they were both stale, and the design-doc tracker (#74 row line 219) shows the dataclass move that neither AGENTS.md nor .agent-plan.md caught.

  Splendor re-ingest of AGENTS.md (and the opportunistic re-ingest of docs/spec.md that ingest --changed picked up) is intentionally NOT in this commit. It's 16 state/wiki/planning files of churn, scoped for a dedicated splendor maintenance follow-up rather than mixed into a doc fix.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): cross-reference Lombard tilt as #65 (issue), not #74 (PR)

  Copilot's review of commit b20eaac caught a cross-reference convention break: the codebase consistently uses #65 (the issue) when referencing the Lombard high-shelf; see synthbanshee/tts/mixer.py lines 6, 80, 84, 88, 108, 183, 322, and docs/audio_generation_v3_design.md §4.2c (line 215). PR #74 is the implementation that closed the issue, but the cross-link convention is to point at the issue.

  One-character fix: #74 → #65 in the M3a TTS bullet.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Closes #78. Splits off #83 (the actual Whisper WER cause).
What this PR is now
A loudness-contract + metadata-trail fix. Not an ASR fix.
The original framing claimed this would recover Whisper WER (the bug report cited 0.04 → 0.28 on 2026-05-05 clips). An empirical lever probe done before merging falsified that hypothesis cleanly (see Empirical finding below). The reframe is honest about what's actually fixed: the spec previously had no language to call "clip peak floats wherever per-turn RMS lands" wrong, and there was no metadata field carrying the loudness policy used for a clip, so the original regression hid for three weeks behind unchanged `generator_version`.

This PR closes both gaps.
Empirical finding (the original ASR claim is wrong)
Lever probe — same source audio, single global gain to 8 different peak/RMS targets, openai/whisper-large-v3 greedy:
Seven of eight rows produce byte-identical Whisper hypotheses. Whisper's log-mel extractor internally normalizes — peak/RMS in the spec range is invisible to it. The Whisper drop is content-driven (prosody / duration), not level-driven; #78 saw two regressions land in the same window of PRs and treated their correlation as causation. The actual Whisper cause is now tracked separately in #83 (most likely #70's inter-word breaks + M15 rate tuning).
What this PR actually delivers
1. Clip-level loudness contract
`PreprocessingConfig.target_peak_dbfs: float = -2.0` (range `[-12.0, -1.5]`). `preprocess()` runs a single global gain to land the absolute peak at this target before the existing −1.0 dBFS safety limiter. The 0.5 dB margin between target upper bound and limiter ceiling guarantees the limiter is a no-op in normal flow, so the two stages can no longer collide.

The single-gain step is mathematically incapable of compressing per-turn RMS ratios; M3a's within-clip loudness trajectory survives untouched, only absolute level shifts.
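To make the single-gain invariant concrete, here is a minimal stdlib-only sketch. It is not the project's `peak_normalize_to_target()`: `LoudnessConfigSketch` is a hypothetical dataclass stand-in for the Pydantic `PreprocessingConfig`, `normalize_peak_sketch` is an illustrative function, and the two synthetic "turns" are made-up sine bursts. Only the default (−2.0 dBFS) and the range (`[-12.0, -1.5]`) come from the description above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoudnessConfigSketch:
    """Hypothetical stand-in for PreprocessingConfig (the real one is Pydantic)."""
    target_peak_dbfs: float = -2.0

    def __post_init__(self):
        # Range from the PR description: [-12.0, -1.5].
        if not -12.0 <= self.target_peak_dbfs <= -1.5:
            raise ValueError("target_peak_dbfs must be in [-12.0, -1.5]")


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def to_dbfs(amplitude):
    return 20.0 * math.log10(amplitude)


def normalize_peak_sketch(samples, target_peak_dbfs):
    """Apply ONE gain to the whole clip so its absolute peak hits the target."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = 10.0 ** (target_peak_dbfs / 20.0) / peak
    return [s * gain for s in samples]


cfg = LoudnessConfigSketch()

# Two synthetic turns with different levels (illustrative data only).
quiet_turn = [0.05 * math.sin(0.1 * n) for n in range(2000)]
loud_turn = [0.40 * math.sin(0.1 * n) for n in range(2000)]
scene = quiet_turn + loud_turn

out = normalize_peak_sketch(scene, cfg.target_peak_dbfs)

contrast_before = to_dbfs(rms(loud_turn)) - to_dbfs(rms(quiet_turn))
contrast_after = to_dbfs(rms(out[2000:])) - to_dbfs(rms(out[:2000]))
peak_after = to_dbfs(max(abs(s) for s in out))

print(f"peak after:      {peak_after:.3f} dBFS")
print(f"contrast before: {contrast_before:.3f} dB")
print(f"contrast after:  {contrast_after:.3f} dB")
```

Because every sample is multiplied by the same factor, the RMS of each turn scales by that factor too, so the difference between turns in dB is unchanged. A per-segment normalizer would use a different gain per turn and erase that difference.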
2. Metadata trail — the structural fix
`GenerationMetadata` adds `loudness_target_peak_dbfs: float | None` and bumps `normalization_strategy` from `per_turn_rms_v1` → `per_turn_rms_v2_target_peak`. Future loudness drift is diagnosable from `{clip_id}.json` alone, without git archaeology; this addresses the structural pattern that let #78 hide for three weeks behind unchanged metadata.

`PreprocessingResult` carries `target_peak_dbfs` as structured data so QA tooling consumes it via field access rather than regex on a step-name string.

3. Tier A / Tier B/C consolidation
`cli.py:449-453` previously hardcoded a separate post-augment peak-normalize at −1.0 dBFS for Tier B/C: a private constant, no headroom, no metadata trail. Now routed through the same `peak_normalize_to_target` helper with the same `PreprocessingConfig.target_peak_dbfs`. All tiers exit at the same absolute peak.

Behaviour change for Tier B/C: clips previously normalized to −1.0 dBFS now normalize to −2.0 dBFS (the new default). This is incidentally an improvement (1 dB headroom for inter-sample peaks from room IR + noise mixing where there was none), but it IS a change. Per-project profile overrides remain available.
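With the target recorded per clip and shared across tiers, a QA script could compare a measured peak against the recorded target without consulting git history. The sketch below is an assumption-heavy illustration: the JSON layout (a top-level `generation_metadata` object) is guessed from the field names mentioned in this PR, and `loudness_deviation_db` is a hypothetical helper, not existing tooling.

```python
import json


def loudness_deviation_db(clip_metadata_json, measured_peak_dbfs):
    """Return measured peak minus recorded target, or None if no target exists.

    Assumption: the clip JSON nests the fields under "generation_metadata".
    """
    meta = json.loads(clip_metadata_json)
    gen = meta.get("generation_metadata", {})
    target = gen.get("loudness_target_peak_dbfs")
    if target is None:
        return None  # clip rendered before the field existed
    return measured_peak_dbfs - target


# Hypothetical metadata for a clip rendered after this PR.
new_clip = json.dumps({
    "generation_metadata": {
        "normalization_strategy": "per_turn_rms_v2_target_peak",
        "loudness_target_peak_dbfs": -2.0,
    }
})
# Hypothetical metadata for an older clip without the field.
old_clip = json.dumps({
    "generation_metadata": {"normalization_strategy": "per_turn_rms_v1"}
})

on_target = loudness_deviation_db(new_clip, -2.0)
# A peak of -5.62 dBFS (the "before fix" measurement quoted in this PR)
# checked against a -2.0 dBFS target.
drifted = loudness_deviation_db(new_clip, -5.62)
untracked = loudness_deviation_db(old_clip, -5.62)
```

Returning `None` for clips without the field keeps "no policy recorded" distinguishable from "on target", which is the distinction the metadata trail exists to provide.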
4. Step-name hygiene
`steps_applied` strings are now literal tokens (`peak_normalize`, `peak_limit`, `silence_pad`) with no embedded numeric parameters. QA tooling that greps these strings no longer breaks under config overrides.

Files changed
- `synthbanshee/augment/preprocessing.py`: `peak_normalize_to_target()` helper; `preprocess()` step 5a; `PreprocessingResult.target_peak_dbfs` field; literal step-name tokens
- `synthbanshee/config/preprocessing_config.py`: `target_peak_dbfs` field with Pydantic range `[-12.0, -1.5]`
- `synthbanshee/labels/schema.py`: `GenerationMetadata.loudness_target_peak_dbfs` + bumped `normalization_strategy`
- `synthbanshee/cli.py`
- `tests/unit/test_preprocessing.py`
- `tests/integration/test_loudness_regression.py`
- `tests/integration/test_tier_b_pipeline.py`
- `AGENTS.md`: § Audio format
- `docs/spec.md`: §3 + §3.1
- `docs/audio_generation_v3_design.md`: §4.7 + tracker

Test plan
- `pytest`: 1720 passed / 1 skipped (added 4 new tests, removed 1 obsolete)
- `pytest tests/unit/test_preprocessing.py tests/integration/test_loudness_regression.py tests/integration/test_tier_b_pipeline.py`: 47 passed
- `ruff check synthbanshee tests`: clean
- `mypy synthbanshee tests`: 36 unique errors, identical to main (zero new)
- `sp_neu_a_0001` on this branch produces:
  - `normalization_strategy = "per_turn_rms_v2_target_peak"` in metadata
  - `loudness_target_peak_dbfs = -2.0` in metadata

What this PR does NOT do
Honest framing note
The original commit on this PR (8bdb28d's child commit) claimed this fix would recover Whisper. The follow-up commit on this branch (7260d5a) reflects the lever-probe finding and reframes the PR around what's actually delivered. Both commits are kept in the history rather than amended so the review trail stays honest about how the framing changed mid-flight.