
fix(preprocessing): #78 define loudness contract + metadata trail (does NOT recover Whisper — see #83)#82

Merged
shaypal5 merged 3 commits into main from fix/78-loudness-target-peak
May 5, 2026
Conversation

@shaypal5
Member

@shaypal5 shaypal5 commented May 5, 2026

Closes #78. Splits off #83 (the actual Whisper WER cause).

What this PR is now

A loudness-contract + metadata-trail fix. Not an ASR fix.

The original framing claimed this would recover Whisper WER (the bug report cited 0.04 → 0.28 on 2026-05-05 clips). An empirical lever probe done before merging falsified that hypothesis cleanly — see Empirical finding below. The reframe is honest about what's actually fixed: the spec previously had no language to call "clip peak floats wherever per-turn RMS lands" wrong, and there was no metadata field carrying the loudness policy used for a clip — so the original regression hid for three weeks behind unchanged generator_version.

This PR closes both gaps.

Empirical finding (the original ASR claim is wrong)

Lever probe — same source audio, single global gain to 8 different peak/RMS targets, openai/whisper-large-v3 greedy:

| variant | peak (dBFS) | RMS (dBFS) | WER | length-ratio |
|---|---:|---:|---:|---:|
| 04-15 reference (untouched) | −1.00 | −20.37 | 0.079 | 1.005 |
| post-fix as is | −2.00 | −23.06 | 0.286 | 0.762 |
| scaled to peak −1 dBFS | −1.00 | −22.06 | 0.286 | 0.762 |
| scaled to peak −3 dBFS | −3.00 | −24.06 | 0.286 | 0.762 |
| scaled to peak −6 dBFS | −6.00 | −27.06 | 0.286 | 0.762 |
| scaled to RMS −20.4 dBFS (= 04-15) | +0.69 (clipped) | −20.37 | 0.286 | 0.762 |
| scaled to RMS −16 dBFS | +5.06 (clipped) | −16.00 | 0.286 | 0.762 |
| scaled to RMS −14 dBFS | +7.06 (clipped) | −14.00 | 0.180 | 0.862 |

Seven of eight rows produce byte-identical Whisper hypotheses. Whisper's log-mel extractor internally normalizes — peak/RMS in the spec range is invisible to it. The Whisper drop is content-driven (prosody / duration), not level-driven; #78 saw two regressions land in the same window of PRs and treated their correlation as causation. The actual Whisper cause is now tracked separately in #83 (most likely #70's inter-word breaks + M15 rate tuning).
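One way to picture why a pure gain can be nearly invisible to the front end: the tail of Whisper's log-mel extraction, as recalled from openai/whisper's `log_mel_spectrogram`, takes log10 of the mel power, floors everything at 8 decades below the spectrogram's own maximum, and rescales. The sketch below is a stdlib-only model of that tail. The function name, the toy power values, and the exact constants are assumptions and should be checked against the Whisper source. Under this model, a gain of g moves every normalized value by the same constant, log10(g)/2, and leaves the spectral shape untouched.

```python
import math

def whisper_style_log_norm(power):
    """Model of the log / floor / rescale tail applied to mel-band power.

    Assumption: the 1e-10 clamp, the 8-decade floor and the (x + 4) / 4
    rescale follow openai/whisper's log_mel_spectrogram as recalled.
    """
    log_spec = [math.log10(max(p, 1e-10)) for p in power]
    floor = max(log_spec) - 8.0          # floor is relative to the clip's own max
    log_spec = [max(v, floor) for v in log_spec]
    return [(v + 4.0) / 4.0 for v in log_spec]

# Toy mel-band power values for one frame (hypothetical numbers).
power = [1e-2, 3e-4, 5e-6, 2e-3, 7e-5]

gain_db = 6.0
g = 10 ** (gain_db / 20.0)               # amplitude gain
louder = [p * g * g for p in power]      # power scales with g squared

base = whisper_style_log_norm(power)
shifted = whisper_style_log_norm(louder)
offsets = [b - a for a, b in zip(base, shifted)]

expected_offset = math.log10(g) / 2.0    # (2 * log10(g)) / 4
print(offsets, expected_offset)
```

The offset is uniform across bins, so only a constant bias reaches the encoder. Whether the model is insensitive to that bias is what the lever probe measured empirically; the sketch does not prove it.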

What this PR actually delivers

1. Clip-level loudness contract

PreprocessingConfig.target_peak_dbfs: float = -2.0 (range [-12.0, -1.5]). preprocess() runs a single global gain to land the absolute peak at this target before the existing −1.0 dBFS safety limiter. The 0.5 dB margin between target upper bound and limiter ceiling guarantees the limiter is a no-op in normal flow — two stages can no longer collide.
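A minimal sketch of the two-stage policy described above, using plain Python lists instead of the project's audio arrays. `normalize_then_limit` is a hypothetical name, not the real `peak_normalize_to_target()` or `preprocess()` signature, and the limiter is modelled as a global attenuation, which is an assumption about how the safety stage works.

```python
import math

def dbfs_to_linear(dbfs):
    return 10.0 ** (dbfs / 20.0)

def peak_dbfs(samples):
    peak = max(abs(s) for s in samples)
    return 20.0 * math.log10(peak) if peak > 0 else float("-inf")

def normalize_then_limit(samples, target_peak_dbfs=-2.0, ceiling_dbfs=-1.0):
    """Stage 5a: one global gain to the target peak. Stage 5b: safety limiter.

    Returns (samples, limiter_fired). Hypothetical helper for illustration.
    """
    # Mirrors the documented config range; the real check lives in Pydantic.
    if not -12.0 <= target_peak_dbfs <= -1.5:
        raise ValueError("target_peak_dbfs must lie in [-12.0, -1.5]")
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples), False      # silence: nothing to scale
    gain = dbfs_to_linear(target_peak_dbfs) / peak
    normalized = [s * gain for s in samples]
    # Limiter only attenuates, and only when the peak exceeds the ceiling.
    ceiling = dbfs_to_linear(ceiling_dbfs)
    new_peak = max(abs(s) for s in normalized)
    if new_peak > ceiling:
        scale = ceiling / new_peak
        return [s * scale for s in normalized], True
    return normalized, False

samples = [0.05, -0.2, 0.13, -0.08]
out, limited = normalize_then_limit(samples)
print(round(peak_dbfs(out), 3), limited)
```

With any in-range target the peak after stage 5a sits at least 0.5 dB under the ceiling, so the limiter branch never fires in this sketch.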

The single-gain step is mathematically incapable of compressing per-turn RMS ratios; M3a's within-clip loudness trajectory survives untouched, only absolute level shifts.
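The invariance claim can be checked numerically. The sketch below builds a toy two-turn clip (the amplitudes and turn lengths are made-up values), applies one global gain to reach −2.0 dBFS peak, and compares the per-turn RMS gap before and after. Because both turns are multiplied by the same factor, the gap in dB is unchanged up to float rounding.

```python
import math

def rms_dbfs(samples):
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq)

# Toy clip: a quiet turn followed by a loud turn (hypothetical values).
quiet_turn = [0.02 * math.sin(0.05 * n) for n in range(2000)]
loud_turn = [0.20 * math.sin(0.07 * n) for n in range(2000)]
clip = quiet_turn + loud_turn

gap_before = rms_dbfs(loud_turn) - rms_dbfs(quiet_turn)

# Single global gain so the absolute peak lands at -2.0 dBFS.
target_linear = 10 ** (-2.0 / 20)
gain = target_linear / max(abs(s) for s in clip)
scaled = [s * gain for s in clip]

gap_after = rms_dbfs(scaled[2000:]) - rms_dbfs(scaled[:2000])
print(gap_before, gap_after)
```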

2. Metadata trail — the structural fix

GenerationMetadata adds loudness_target_peak_dbfs: float | None and bumps normalization_strategy from per_turn_rms_v1 to per_turn_rms_v2_target_peak. Future loudness drift is diagnosable from {clip_id}.json alone, without git archaeology, which addresses the structural pattern that let #78 hide for three weeks behind unchanged metadata. PreprocessingResult carries target_peak_dbfs as structured data so QA tooling consumes it via field access rather than regex on a step-name string.
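As an illustration of what "diagnosable from the JSON alone" could look like, the sketch below reads the two fields named above from a metadata document. `describe_loudness_policy` is a hypothetical helper, and the flat layout of the sample JSON is an assumption; the real {clip_id}.json may nest these fields under a generation-metadata object.

```python
import json

def describe_loudness_policy(metadata_json):
    """Summarise the loudness policy recorded for a clip (hypothetical helper)."""
    meta = json.loads(metadata_json)
    strategy = meta.get("normalization_strategy")
    target = meta.get("loudness_target_peak_dbfs")
    if target is None:
        return f"{strategy}: no target peak recorded"
    return f"{strategy}: target peak {target:+.1f} dBFS"

# Assumed flat layout, for illustration only.
new_clip = json.dumps({
    "normalization_strategy": "per_turn_rms_v2_target_peak",
    "loudness_target_peak_dbfs": -2.0,
})
old_clip = json.dumps({"normalization_strategy": "per_turn_rms_v1"})

print(describe_loudness_policy(new_clip))
print(describe_loudness_policy(old_clip))
```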

3. Tier A / Tier B/C consolidation

cli.py:449-453 previously hardcoded a separate post-augment peak-normalize at −1.0 dBFS for Tier B/C — a private constant, no headroom, no metadata trail. Now routed through the same peak_normalize_to_target helper with the same PreprocessingConfig.target_peak_dbfs. All tiers exit at the same absolute peak.

Behaviour change for Tier B/C: clips previously normalized to −1.0 dBFS now normalize to −2.0 dBFS (the new default). This is incidentally an improvement (1 dB headroom for inter-sample peaks from room IR + noise mixing where there was none), but it IS a change. Per-project profile overrides remain available.

4. Step-name hygiene

steps_applied strings are now literal tokens — peak_normalize, peak_limit, silence_pad — with no embedded numeric parameters. QA tooling that greps these strings no longer breaks under config overrides.
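A sketch of how QA tooling might consume the literal tokens together with the structured field. `ResultSketch` is a hypothetical stand-in that carries only the fields named in this PR, not the real PreprocessingResult, and `peak_deviation_db` is an illustrative helper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ResultSketch:
    """Hypothetical stand-in for PreprocessingResult."""
    peak_dbfs: float
    target_peak_dbfs: Optional[float] = None
    steps_applied: List[str] = field(default_factory=list)

def peak_deviation_db(result):
    """Measured minus targeted peak, or None if the step did not run."""
    if "peak_normalize" not in result.steps_applied:
        return None
    if result.target_peak_dbfs is None:
        return None
    return result.peak_dbfs - result.target_peak_dbfs

default_run = ResultSketch(
    peak_dbfs=-2.05,
    target_peak_dbfs=-2.0,
    steps_applied=["peak_normalize", "peak_limit", "silence_pad"],
)
# A config override changes the field value, not the step token.
override_run = ResultSketch(
    peak_dbfs=-6.0,
    target_peak_dbfs=-6.0,
    steps_applied=["peak_normalize", "peak_limit", "silence_pad"],
)
print(peak_deviation_db(default_run), peak_deviation_db(override_run))
```

The membership test works for both runs because the token stays `peak_normalize` regardless of the configured target.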

Files changed

| File | Change |
|---|---|
| synthbanshee/augment/preprocessing.py | New peak_normalize_to_target() helper; preprocess() step 5a; PreprocessingResult.target_peak_dbfs field; literal step-name tokens |
| synthbanshee/config/preprocessing_config.py | target_peak_dbfs field with Pydantic range [-12.0, -1.5] |
| synthbanshee/labels/schema.py | GenerationMetadata.loudness_target_peak_dbfs + bumped normalization_strategy |
| synthbanshee/cli.py | Stage 3b uses shared helper + same config; metadata wiring for new fields |
| tests/unit/test_preprocessing.py | Inverted M3b-era assertion; helper unit tests; range validation including new −1.5 boundary |
| tests/integration/test_loudness_regression.py | Realistic-crest fixture (bandpass Gaussian, 18 dB crest); 0.1 dB peak / 0.2 dB RMS-contrast tolerances; structured-field + literal-step-name assertions |
| tests/integration/test_tier_b_pipeline.py | New: post-augment peak lands at target + metadata fields wired |
| AGENTS.md § Audio format | Loudness invariant + Tier B/C behaviour-change callout + empirical Whisper note |
| docs/spec.md §3 + §3.1 | Two-stage policy with margin between target and ceiling |
| docs/audio_generation_v3_design.md §4.7 + tracker | Tracker row indexed by #78 (no fictional milestone tag); §4.7 rewritten to capture the lever-probe finding and the loudness/Whisper independence |

Test plan

  • pytest — 1720 passed / 1 skipped (added 4 new tests, removed 1 obsolete)
  • pytest tests/unit/test_preprocessing.py tests/integration/test_loudness_regression.py tests/integration/test_tier_b_pipeline.py — 47 passed
  • ruff check synthbanshee tests — clean
  • mypy synthbanshee tests — 36 unique errors, identical to main (zero new)
  • Live re-render of sp_neu_a_0001 on this branch produces:
    • peak = −2.000 dBFS exactly
    • normalization_strategy = "per_turn_rms_v2_target_peak" in metadata
    • loudness_target_peak_dbfs = -2.0 in metadata
  • Lever probe confirms loudness independence from Whisper — see investigate(tts): Whisper WER regression on Tier A clips is NOT loudness-driven — bisect M15/#70/#71 prosody changes #83 for kickoff data.
  • Per-turn RMS contrast preserved (regression test asserts gap within 0.2 dB of input gap).

What this PR does NOT do

Honest framing note

The original commit on this PR (8bdb28d's child commit) claimed this fix would recover Whisper. The follow-up commit on this branch (7260d5a) reflects the lever-probe finding and reframes the PR around what's actually delivered. Both commits are kept in the history rather than amended so the review trail stays honest about how the framing changed mid-flight.

…n target peak (M3c)

M3a per-turn RMS targeting + M3b limiter-only normalization left M3a-shaped
Tier A clips peaking ~6 dB below the −1 dBFS ceiling.  Whisper under-
transcribed (WER 0.04 → 0.28, length-ratio 0.76) and UTMOS read the
loosely-limited clips ~0.9 MOS higher, dominating M17 evaluation.

Fix is a *single global gain* post-mix loudness step — preserves per-turn
RMS ratios exactly (M3a's whole point), unlike the legacy per-segment peak
normalizer M3b removed.  Configurable target with a sane default.

Changes:
- synthbanshee/augment/preprocessing.py: new peak_normalize_to_target()
  helper; preprocess() runs it before the existing −1 dBFS safety limiter
- synthbanshee/config/preprocessing_config.py: target_peak_dbfs field,
  default −2.0 dBFS, range [−12.0, −1.0]
- synthbanshee/cli.py: Stage 3b post-augment normalize uses the same
  helper + same config so Tier A and Tier B/C exit at the same peak,
  fixing a pre-existing asymmetry that hid the regression on Tier A
- tests/unit/test_preprocessing.py: invert the M3b-era "quiet clip not
  scaled up" assertion, add direct unit tests for the helper, validate
  config range constraints
- tests/integration/test_loudness_regression.py: end-to-end guard
  asserting peak lands in [−3, −1] dBFS and per-turn RMS contrast
  survives a single global gain
- AGENTS.md, docs/spec.md §3 + §3.1, docs/audio_generation_v3_design.md
  §4.7 + tracker M3c row: spec text in sync with the two-stage policy

Verification (sp_neu_a_0001 same-config A/B):
  2026-04-15 reference          peak −1.00 dBFS  rms −20.37 dBFS
  2026-05-05 before fix         peak −5.62 dBFS  rms −26.68 dBFS
  2026-05-05 after fix          peak −2.00 dBFS  rms −23.06 dBFS

Closes #78

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 16:17
@shaypal5 shaypal5 added this to the M17 milestone May 5, 2026
@shaypal5 shaypal5 added the bugfix, comp: mixer (SceneMixer, MixMode, gap controller), and comp: preprocessing (Audio preprocessing and augmentation) labels May 5, 2026

Copilot AI left a comment


Pull request overview

This PR addresses issue #78 by restoring absolute clip loudness after M3a/M3b changes, while preserving M3a’s per-turn RMS contrast. It introduces a single global post-mix gain stage that targets a configurable peak level, followed by the existing safety peak limiter.

Changes:

  • Add peak_normalize_to_target() and apply it in preprocess() before the existing -1.0 dBFS peak limiter.
  • Introduce PreprocessingConfig.target_peak_dbfs (default -2.0, bounded [-12, -1]) and wire it through CLI Tier A and Tier B/C paths.
  • Add/adjust unit + integration tests and update docs to reflect the two-stage loudness policy.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| synthbanshee/augment/preprocessing.py | Adds global peak-target normalization step (5a) before the existing safety limiter (5b). |
| synthbanshee/config/preprocessing_config.py | Adds target_peak_dbfs config with validation bounds and documentation. |
| synthbanshee/cli.py | Ensures a concrete preprocessing config exists and reuses the shared normalization helper for Tier B/C post-augmentation. |
| tests/unit/test_preprocessing.py | Updates assertions for new behavior; adds direct unit tests for peak_normalize_to_target() and config validation. |
| tests/integration/test_loudness_regression.py | Adds end-to-end regression tests for peak range and per-turn RMS contrast preservation. |
| docs/spec.md | Updates amplitude row and splits loudness step into 5a/5b. |
| docs/audio_generation_v3_design.md | Documents M3c and the single-gain invariant/history. |
| AGENTS.md | Updates “Audio format” loudness invariant to the new two-stage policy. |


Comment thread synthbanshee/cli.py
Comment thread docs/spec.md Outdated
…metadata trail (lever probe killed Whisper hypothesis)

Self-review of the prior commit found the Whisper-fix framing empirically
false.  Reframed the PR around the actual deliverables (loudness contract
clarity + metadata trail) and addressed all blocking review points.

Empirical finding (lever probe, 2026-05-05, openai/whisper-large-v3):
across 8 single-global-gain variants of the same source audio at peaks
from −6 to +7 dBFS and RMS from −27 to −14 dBFS, Whisper produced
byte-identical hypotheses in 7/8 cases (WER 0.286, length-ratio 0.762).
Loudness in the spec range is invisible to Whisper's log-mel extractor.
The actual ASR regression is content-driven (likely M15 / #70 prosody
changes) and is now tracked separately in #83.

Review fixes applied:
- #4 Metadata trail (the structural fix that prevents #78 recurring):
  GenerationMetadata gains `loudness_target_peak_dbfs: float | None` and
  bumps `normalization_strategy` to "per_turn_rms_v2_target_peak".
  Without this field, the original regression hid for three weeks
  behind unchanged generator_version.
- #5 Pydantic upper bound −1.0 → −1.5.  0.5 dB headroom over the safety
  limiter ceiling guarantees the two stages cannot collide under
  float-arithmetic noise.
- #6 Step names flattened to literal tokens (peak_normalize, peak_limit,
  silence_pad — no embedded numeric parameters).  Configured value moves
  to PreprocessingResult.target_peak_dbfs as structured data so QA
  tooling reads policy via field access, not regex on a step string.
- #7 Renamed misleading `test_loud_clip_clamped_below_ceiling` to
  `test_loud_input_lands_at_target_not_at_ceiling` — the limiter is now
  a guaranteed no-op in normal flow; the target step is what does the
  work.
- #8 Replaced 440 Hz sine fixture with bandpass-filtered Gaussian noise
  targeting 18 dB crest, mirroring real Hebrew TTS post-M3a behaviour.
  Sine's 3 dB crest let regressions slip past peak-anchored asserts.
- #9 Tightened tolerances 0.5 → 0.1 dB (peak) and 0.5 → 0.2 dB (RMS
  contrast).  Original #78 deviation was 4.6 dB so 0.5 dB only caught
  it by luck; 0.1 dB catches subtle regressions.
- #10 Tier B/C cli.py path coverage: two new tests in
  test_tier_b_pipeline.py asserting (a) post-augment peak lands at the
  configured target and (b) GenerationMetadata carries the new
  normalization_strategy + loudness_target_peak_dbfs fields.
- #11 Dropped fictional "M3c" milestone tag everywhere.  Tracker row in
  audio_generation_v3_design.md is now indexed by #78.  peak_normalize_to_target
  docstring expanded to mention dual use (Stage 3a + Stage 3b post-augment).
- #3 Tier B/C −1.0 → −2.0 dBFS behaviour change called out explicitly in
  PR body, AGENTS.md, and the §4.7 spec change.  This is incidentally
  better for Tier B/C (gives 1 dB inter-sample-peak headroom for room IR
  / noise mixing), but it IS a change.
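The crest-factor point behind the fixture change in the list above can be illustrated with a few lines of stdlib Python. This is only a sketch: it uses plain white Gaussian noise with a fixed seed, not the band-passed fixture from the test suite, so it does not reproduce the 18 dB target. It does show that a full-scale sine has a crest factor of 20·log10(√2), about 3.01 dB, while a noise-like signal sits well above that.

```python
import math
import random

def crest_factor_db(samples):
    """Peak-to-RMS ratio in dB."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(peak / rms)

SAMPLE_RATE = 16000

# One second of a 440 Hz sine, as in the replaced fixture.
sine = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# White Gaussian noise as a crude stand-in for a speech-like, high-crest signal.
rng = random.Random(0)                   # fixed seed so the sketch is repeatable
noise = [rng.gauss(0.0, 0.1) for _ in range(SAMPLE_RATE)]

sine_crest = crest_factor_db(sine)
noise_crest = crest_factor_db(noise)
print(f"sine crest  {sine_crest:.2f} dB")
print(f"noise crest {noise_crest:.2f} dB")
```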

Verification:
- pytest 1720 passed / 1 skipped (added 4 tests, removed 1 obsolete)
- ruff: clean
- mypy: 36 unique errors, identical to main (zero new)
- Live re-render of sp_neu_a_0001 produces peak = -2.000 dBFS exactly,
  with metadata fields populated as expected.

Follow-up issue filed: #83 — investigate the actual Whisper WER cause
(prosody/duration, not loudness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5 shaypal5 changed the title fix(preprocessing): #78 restore absolute clip loudness via single-gain target peak (M3c) fix(preprocessing): #78 define loudness contract + metadata trail (does NOT recover Whisper — see #83) May 5, 2026

github-actions Bot commented May 5, 2026

pr-agent-context report:

This run includes unresolved review comments on PR #82 in repository https://github.com/DataHackIL/SynthBanshee

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: synthbanshee/cli.py:461
URL: https://github.com/DataHackIL/SynthBanshee/pull/82#discussion_r3190003807
Root author: copilot-pull-request-reviewer

Comment:
    Stage 3b re-normalizes the augmented samples with peak_normalize_to_target() but then writes PCM_16 directly without applying the −1.0 dBFS safety limiter used in preprocess(). If target_peak_dbfs is ever set to −1.0 (allowed by config) or if PCM_16 quantization nudges the peak upward, Tier B/C outputs could end up slightly above the ceiling and diverge from the documented “5a then 5b” policy. Consider applying the same safety limiting step after normalization (or sharing a small helper that does normalize+limit) before writing clip_wav.

## COPILOT-2
Location: docs/spec.md
URL: https://github.com/DataHackIL/SynthBanshee/pull/82#discussion_r3190003868
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The updated loudness-normalization step is documented here, but steps 1–4 in this same pipeline list still describe an older implementation (e.g., torchaudio/SoX resample, 7.5 kHz low-pass, Wiener spectral subtraction defaults). The current preprocessing implementation in synthbanshee/augment/preprocessing.py uses scipy resample_poly, an 80 Hz high-pass, and optional Wiener denoise. To avoid the spec contradicting the actual pipeline, please update steps 1–4 (or add an explicit note that §3.1 is historical and not the current implementation).

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25391058831 attempt 1
Comment timestamp: 2026-05-05T17:15:01.591232+00:00
PR head commit: 7260d5a220222ae74ec0404debff3881eb909faf

#82 COPILOT-1)

cli.py Stage 3b previously ran only the target-peak step (5a), not the
safety limiter (5b), giving the documented two-stage policy a paper-vs-
reality gap.  The limiter is provably a no-op given Pydantic bounds
(target ≤ −1.5, ceiling = −1.0, PCM_16 quantisation cannot bridge the
0.5 dB margin), but applying it uniformly closes the asymmetry between
Tier A (preprocess()) and Tier B/C (cli.py) and prevents a future
spec/code drift from silently violating §4.7.
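The claim that PCM_16 quantisation cannot bridge the 0.5 dB margin can be checked with simple arithmetic. The sketch below assumes the common convention of scaling floats by 32768 and rounding to the nearest integer step; the actual writer used by the project may scale slightly differently. In the worst case rounding moves a sample half a step away from zero, and the resulting rise in peak level is compared with the margin.

```python
import math

def quantize_pcm16(x):
    """Round a float sample to the nearest 16-bit step and back (assumed convention)."""
    q = max(-32768, min(32767, round(x * 32768.0)))
    return q / 32768.0

target_dbfs = -1.5                        # upper bound allowed by the config
ceiling_dbfs = -1.0                       # safety limiter ceiling
peak = 10 ** (target_dbfs / 20)

# Worst case: rounding pushes the peak sample half a step outward.
worst_peak = peak + 0.5 / 32768.0
worst_rise_db = 20 * math.log10(worst_peak / peak)
margin_db = ceiling_dbfs - target_dbfs

quantized_peak_dbfs = 20 * math.log10(quantize_pcm16(peak))
print(f"worst-case rise {worst_rise_db:.6f} dB vs margin {margin_db:.1f} dB")
```

Under these assumptions the worst-case rise is several orders of magnitude smaller than the margin.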

PR #82 review thread COPILOT-2 (docs/spec.md §3.1 steps 1-4 stale since
M14) is correct but pre-existing and unrelated to #78 — filed as #84
and resolved out-of-scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 17:22

github-actions Bot commented May 5, 2026

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #82 in repository https://github.com/DataHackIL/SynthBanshee. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25391532942 attempt 1
Comment timestamp: 2026-05-05T17:25:04.832855+00:00
PR head commit: 2b5bdef6f33e2f69c1214b1b21c04a6e8835709e

@shaypal5 shaypal5 merged commit 94086a8 into main May 5, 2026
8 checks passed
@shaypal5 shaypal5 deleted the fix/78-loudness-target-peak branch May 5, 2026 17:26

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.



Comment on lines 59 to +66
duration_seconds: float
peak_dbfs: float
"""Measured peak in the written PCM_16 file."""

target_peak_dbfs: float | None = None
"""Target peak the loudness step aimed for (#78), in dBFS. Pair this
with ``peak_dbfs`` to compute deviation; QA tooling consumes this as
structured data instead of regexing a step-name string."""
Comment on lines +70 to +75
"""Ordered list of step *names* (literal tokens like ``peak_normalize``
and ``peak_limit``). Numeric parameters are *not* embedded in the
step name — they live in dedicated fields like ``target_peak_dbfs``
and ``peak_dbfs`` so QA grep contracts don't break under config
overrides."""

Comment thread synthbanshee/cli.py
Comment on lines 426 to +431
from synthbanshee.augment.pipeline import augment_scene
from synthbanshee.augment.preprocessing import (
_PEAK_DBFS,
_peak_limit,
peak_normalize_to_target,
)
Comment on lines +158 to +159
correlating against ``preprocessing_applied.normalized_dbfs`` (which is
*measured*, not *targeted*)."""
Comment thread docs/spec.md
Comment on lines 106 to +116
### 3.1 Preprocessing Pipeline (ordered)

All clips must pass through this pipeline before delivery. The "dirty" pre-pipeline file must be retained in `assets/` for robustness testing.

1. **Resample** — convert to 16,000 Hz (SoX `rate` with VHQ quality, or `torchaudio.functional.resample`)
2. **Downmix** — stereo → mono (average channels)
3. **Spectral filter** — low-pass at 7,500 Hz to remove irrelevant high-frequency noise from budget sensors (Butterworth order 4)
4. **Denoising** — spectral subtraction (Wiener filtering) to remove electrical hum; parameterize noise profile from silent leading segment
5. **Peak limit** — attenuate to ≤ −1.0 dBFS if the signal exceeds that ceiling; never scale up. This preserves the within-scene loudness trajectory established by per-turn RMS gain (M3a). A forced scale-up would collapse intensity-level amplitude differences.
5. **Loudness normalization** (#78) — two stages:
- **5a. Peak-normalize to target.** Apply a single global gain so the absolute peak lands at `PreprocessingConfig.target_peak_dbfs` (default −2.0 dBFS). A *single* gain preserves per-turn RMS *ratios* exactly, so the within-scene loudness trajectory established by per-turn RMS gain (M3a) survives — only the absolute level shifts. This step replaces the M3b "limiter only, never scale up" behaviour: pre-#78 the spec had only an upper bound on peak, leaving the absolute level unspecified; two clips could legitimately sit 6 dB apart and both be in-spec.
- **5b. Safety limiter.** Attenuate any sample exceeding −1.0 dBFS. For in-spec target values (`target_peak_dbfs ∈ [−12.0, −1.5]`) this is a guaranteed no-op (0.5 dB margin); it remains as defence-in-depth against upstream over-range samples.
Comment on lines +20 to +24
clips with peaks well below the −1.0 dBFS ceiling (~−6 dBFS), which
confused downstream Whisper/UTMOS scoring. ``preprocess()`` now
peak-normalizes the mixed scene to this target before the safety
limiter, while the per-turn RMS contrast within the clip is preserved
by applying a single global gain.
shaypal5 added a commit that referenced this pull request May 6, 2026
…oor + helium range (#90)

The bisect on PR #86 showed the residual sp_it_a_0001 WER regression
(0.322 vs 04-15's 0.056) is caused by M7 SpeakerState drift compounding
with #51's M15 style_map values, producing effective pitch +14 % to +17 %
and rate 1.27-1.33x at high-intensity turns.  That range simultaneously
sounds cartoonish to listeners (May-3 listening test "helium / oompa-
loompa") and trips Whisper-large-v3's silence-detection heuristic — the
classic length-ratio collapse to ~0.7 that hid the bug for weeks.

This PR ships a partial fix: a runtime effective-prosody cap that
addresses the canonical Whisper-backdoor fingerprint and the helium-
range pitch concern, plus the two detection layers Shay asked for to
catch this class of regression in the future.  It does NOT fully
restore high-intensity WER to the pre-#51 baseline — see #89 for the
follow-up workstream.

## Tier-3 Whisper validation (`sp_it_a_0001`)

| variant | dur | WER | length_ratio | hyp / ref |
|---|---:|---:|---:|---:|
| 04-15 reference | 155.9 s | 0.056 | 1.009 | 236 / 234 |
| post-#86 main (no cap) | 146.6 s | 0.322 | 0.709 | 166 / 234 |
| this PR (cap active) | 149.1 s | **0.129** | **0.906** | 212 / 234 |

  - Length-ratio recovers above the qa-report --asr 0.85 threshold.
  - WER reduced 2.5x (0.322 -> 0.129) but still above the 04-15 baseline
    of 0.056.  Failure mode shifts from silence-detector trip
    (~30 % of words missing) to substitution noise — distinct mechanism
    requiring a paired listening test to fix without breaking M15
    naturalness calibration.  Tracked in #89 with insights and four
    proposed approaches.

## The fix — effective-prosody runtime cap

`synthbanshee/tts/renderer._apply_effective_prosody_cap` clamps post-
state, post-randomization prosody before SSML emission:

  - pitch in [-3.0, +2.0] st  (~ +/- 12 % Azure)
  - rate  in [0.85, 1.20]
  - volume left to the existing +/-50 % Azure clamp (Whisper internally
    normalizes loudness, per #82's lever probe — not a Whisper-trip
    dimension).

Caps are anchored to the pre-#51 effective envelope, which produced the
04-15 reference clips with WER 0.04-0.08.  Tighter caps would diverge
further from M15 listening-test calibration; looser caps would re-trip
Whisper.  Each cap activation logs a warning and is recorded per turn.
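A minimal sketch of the clamp described above. The function name, argument units (pitch in semitones, rate as a multiplier), and the shape of the returned events are assumptions for illustration; the real `_apply_effective_prosody_cap` and `EffectiveProsodyCapEvent` may differ.

```python
import logging

logger = logging.getLogger("prosody_cap_sketch")

PITCH_RANGE_ST = (-3.0, 2.0)              # semitones, from the commit message
RATE_RANGE = (0.85, 1.20)                 # rate multiplier, from the commit message

def cap_effective_prosody(pitch_st, rate):
    """Clamp pitch and rate; return (pitch, rate, events). Hypothetical helper."""
    events = []
    limits = (("pitch", pitch_st, PITCH_RANGE_ST), ("rate", rate, RATE_RANGE))
    capped = {}
    for name, requested, (lo, hi) in limits:
        applied = min(max(requested, lo), hi)
        if applied != requested:
            # Each activation is logged and recorded, as the commit describes.
            logger.warning("%s capped: %s -> %s", name, requested, applied)
            events.append({"param": name, "requested": requested, "applied": applied})
        capped[name] = applied
    return capped["pitch"], capped["rate"], events

print(cap_effective_prosody(2.8, 1.30))   # both values exceed the caps
print(cap_effective_prosody(0.5, 1.00))   # within range, no events
```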

## Detection layer 1 — static prosody-cap activations in metadata

  - `DialogueTurn.effective_prosody_caps` carries per-turn cap events.
  - `cli.py` rolls them up into
    `ClipMetadata.generation_metadata.effective_prosody_caps`
    (new `EffectiveProsodyCapEvent` model in labels/schema.py).
  - `qa-report` surfaces a new "Effective-Prosody Cap Activations (#87)"
    table per clip — runs on every batch, no Azure / Whisper required.
    Tier-3 render of sp_it_a_0001 recorded 14 cap activations across
    7 high-intensity turns; metadata example in PR description.

## Detection layer 2 — `qa-report --asr` Whisper backdoor check

New `synthbanshee/package/asr_sanity.py` provides a lazy-loaded
`WhisperRunner` and `compute_asr_metrics`.  `qa-report --asr` runs
Whisper-large-v3 on every clip in a directory, flags clips whose
length-ratio falls below `--asr-min-length-ratio` (default 0.85 — the
#87 fingerprint sat at ~0.71).  Heavy dependencies isolated in the new
`eval-asr` optional extra so normal generation/QA stays light.
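The length-ratio check can be sketched as below, assuming the ratio is hypothesis word count divided by reference word count. That reading is an assumption, but it matches the `hyp / ref` column of the validation table above. The function names are hypothetical and are not the `compute_asr_metrics` API.

```python
def length_ratio(hyp_words, ref_words):
    """Hypothesis length over reference length, in words (assumed definition)."""
    return hyp_words / ref_words

def below_length_ratio_threshold(hyp_words, ref_words, min_length_ratio=0.85):
    """True when a clip would be flagged by the length-ratio check."""
    return length_ratio(hyp_words, ref_words) < min_length_ratio

# Word counts taken from the sp_it_a_0001 validation table.
variants = {
    "04-15 reference": (236, 234),
    "post-#86 main (no cap)": (166, 234),
    "this PR (cap active)": (212, 234),
}
for name, (hyp, ref) in variants.items():
    ratio = length_ratio(hyp, ref)
    flagged = below_length_ratio_threshold(hyp, ref)
    print(f"{name}: ratio {ratio:.3f}, flagged={flagged}")
```

Only the uncapped variant falls under the 0.85 default, consistent with the ~0.71 fingerprint cited for #87.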

Per the policy decision documented in CLAUDE.md ("ASR sanity check
policy"), Tier-3 ASR sanity is local-only (not in CI) for now — see
GH issue #88 for the deferred CI re-evaluation triggers.

## Tests

  - tests/unit/test_effective_prosody_cap.py: 11 tests covering the
    helper unit, render_utterance integration, and render_scene event
    propagation to DialogueTurn.
  - tests/unit/test_qa.py::TestProsodyCapRollup: 3 tests verifying
    cap-event aggregation in qa-report.
  - tests/unit/test_asr_sanity.py: 11 tests covering normalize_for_wer,
    AsrMetrics threshold semantics, and bracket-line stripping in the
    reference parser.  Heavy Whisper inference is exercised by the
    Tier-3 local run, not these tests.
  - 1687 unit tests pass (1662 baseline + 25 new); ruff + mypy clean.

## Docs

  - CLAUDE.md: new "ASR sanity check policy" section + "What NOT to do"
    bullets pinning the cap thresholds and the Tier-3 local-only policy.
  - pyproject.toml: new `eval-asr` optional extra.

Reduces #87 (does not fully close — see #89 for the residual WER work).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 7, 2026
* docs(spec): #84 sync §3.1 steps 1-4 with post-M14 preprocessing

§3.1 still described the pre-M14 pipeline (SoX/torchaudio resample, 7.5 kHz
low-pass, on-by-default Wiener) — anyone reading spec.md to understand the
pipeline got pre-M14 information. PR #48 (M14, 2026-05-01) replaced step 3
with an 80 Hz high-pass and made step 4 (Wiener) opt-in by default; step 5
(loudness) was already synced by #78/#82.

Rewrote steps 1-4 to match `synthbanshee/augment/preprocessing.py:preprocess()`
and added a one-liner pointing readers at the code as the authoritative
source so future drift surfaces faster.

Closes #84.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): #84 review revisions — drop M14 narrative, implementation refs, drift disclaimer

Self-review of the first commit on this PR identified three structural
problems that recreated the kind of drift the issue was about:

1. The "if this section disagrees with the code, the code wins" sentence
   licensed future drift instead of forbidding it. Replaced with a
   non-absolving pointer plus an explicit constraint that PRs touching
   either side MUST update the other in the same change.
2. Embedded M14 changelog narrative ("Note: M14 (PR #48, 2026-05-01)
   removed the legacy 7,500 Hz low-pass filter ...") — specs describe
   current contracts, not their evolution. The full rationale already
   lives in wiki/topics/research-synthesis.md (cited from preprocessing.py
   line 4) and in the M14 PR description; duplicating it here would rot
   on the next pipeline change. Removed.
3. Implementation references (scipy.signal.resample_poly,
   PreprocessingConfig.wiener_denoise, "torchaudio is forbidden, see
   AGENTS.md") couple the spec to library and field-name choices —
   exactly the drift coupling #84 was about. Replaced library-name with
   "polyphase filter"; replaced field name with "Configuration:
   `PreprocessingConfig`"; removed the agent-rule reference (belongs
   in AGENTS.md, not in the audio contract spec).

Other small fixes: relabelled step 4 as "Conditional Wiener denoising"
since "optional" inside an "(ordered)" pipeline is a contradiction;
documented the resample-skip behaviour (preprocessing.py line 221);
fixed inaccurate HPF rationale ("small phone microphones cannot
capture" — they over-represent sub-bass, not fail to capture it);
restored the "16,000 Hz" unit convention from §3 line 97.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): #84 review revisions v2 — Copilot suggestions on §3.1

Three constructive style suggestions from copilot-pull-request-reviewer
on PR #94, all valid. Applied as suggested with one twist on the third.

1. Header line was packing 4 logical statements into one paragraph
   (pipeline requirement + dirty-file retention + implementation pointer
   + drift constraint). Split into a lead sentence and a bolded
   "Implementation:" block on its own line. Also clarified "either" by
   spelling out "the implementation or this section".
2. Step 3: expanded "sos form" → "in second-order-sections (SOS) form"
   so a non-DSP reader can parse the spec without external lookup.
3. Step 4: replaced the underspecified "applied only when explicitly
   enabled" with the concrete handle "toggle via a boolean flag on
   PreprocessingConfig". Subsumes the trailing "Configuration:
   PreprocessingConfig" sentence (the same information now reads
   inline). Still avoids naming the field, preserving the
   implementation-decoupling from the v1 revision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 7, 2026
…ft (#96)

* docs: sync .agent-plan.md to current state (M16 done, M17 in flight)

The state tracker was last updated 2026-04-22 and claimed the active task
was M8b. Since then M8b, M9a, M9b, M10a, M10b, M11, M13, M14, M15, M16
have all merged, and M17 (automated evaluation) has had its design (#73),
Phase A spike (#77, #79), and a wave of bug-fix PRs (#82, #85, #86, #90)
land — anyone reading this file got an 8-milestone-stale picture.

Rewrote:
- Current system state — list every merged V3 milestone with one-line
  summaries; promoted the loudness contract (#78) and effective-prosody
  cap (#87) to the architectural-invariants list since both are
  load-bearing for any agent editing TTS or preprocessing code.
- Active / next task — replaced "M8b" with the M17 ASR regression thread,
  noted PR #90 as the partial fix and #91 (rate-floor lift R) as the
  queued next experiment, including the WER ≤ 0.10 + listening-test pass
  criterion.
- Open threads table — new section listing the threads agents most often
  need to know about (M12 gate, M17 full automation, #62 word merging,
  #72 SSML parse, #88 CI ASR deferral).
- Context pointers — added splendor-brief as the orientation entrypoint
  and the .venv-vs-~/.local PATH trap.
- CI / Workflow notes — added the Tier-3 ASR local-only policy summary
  so PRs touching audio-rendering files don't merge without it.

This file is a quick-orientation summary — added a header line marking
the design-doc tracker as authoritative when details disagree, so the
next drift gets caught in a tracker diff rather than a stale summary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: delete .agent-plan.md (Option A; supersedes prior commit)

Self-review of commit 4361497 (the .agent-plan.md rewrite) caught that
the file fundamentally fails the duplication test: milestone status
duplicates docs/audio_generation_v3_design.md → Implementation Tracker;
open-thread state duplicates GitHub Issues + splendor brief; pointers
duplicate AGENTS.md; and the only non-duplicate section
(architectural invariants) is already covered in AGENTS.md (FLOAT
subtype line 72, MixedScene shift line 37, validate_audio peak line 85,
full #78 loudness contract line 35, full #87 effective-prosody cap
line 73).

The rewrite also reproduced the "if details disagree, the tracker wins"
disclaimer antipattern — the same one PR #94's review removed from
docs/spec.md §3.1 just one PR ago. A docs file whose explicit charter
is "I am allowed to be wrong relative to the canonical source" is
structured drift bait. The original .agent-plan.md got 15 days stale;
"rewrite more carefully" was the wrong response.

Changes:
- Delete .agent-plan.md (75-line summary that duplicated load-bearing
  state held authoritatively elsewhere).
- Update .claude/skills/open-feature-pr.md step 2 to drop the
  ".agent-plan.md" fallback for milestone-ID inference; the branch
  name + parent issue's milestone field already cover it, and pointing
  at the design-doc tracker as authoritative is more durable than
  pointing at a manually-maintained summary.
- Splendor maintenance: source forget src-9d9759e5ad... --apply.
  Removes the orphan source manifest, wiki summary page, and wiki
  index entry. Residual cross-references in 5 planning tasks and 5
  wiki pages remain — splendor surfaces them but doesn't auto-clean;
  they'll regenerate on next ingest of those sources.

Nothing added to AGENTS.md: the cross-cutting rules from .agent-plan.md
are already there. The remaining "invariants" (5-tuple mixer API
post-M8a, audible_* timeline use, MixMode no-audio-deps, _peak_limit vs
_normalize_peak naming) are mixer-internal details that belong in
module docstrings, not global agent rules — and AGENTS.md has its own
M8a drift bug (line 71 still says "4-tuple") that's better fixed in a
dedicated PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): fix M3a/M8a/#74 drift in mixer Segment API description

AGENTS.md "TTS" section claimed `SceneMixer.mix_sequential` took a
4-tuple `(wav_bytes, pause_s, speaker_id, rms_target_dbfs)`. The actual
current API is the `Segment` dataclass in `synthbanshee/tts/mixer.py`
with six named fields (`wav_bytes`, `amount_s`, `speaker_id`,
`rms_target_dbfs`, `mix_mode`, `intensity`). The doc has been wrong
since at least M8a (added `mix_mode`); #74's Lombard tilt then added
`intensity`. Per the dataclass docstring, the move from a positional
tuple to named fields was deliberate, so call sites and reviewers can't
transpose args silently.

Surfaced during PR #96 review (delete-.agent-plan.md) — the original
.agent-plan.md and AGENTS.md disagreed about the segment API; turns out
they were both stale, and the design-doc tracker (#74 row line 219)
shows the dataclass move that neither AGENTS.md nor .agent-plan.md
caught.

Splendor re-ingest of AGENTS.md (and the opportunistic re-ingest of
docs/spec.md that ingest --changed picked up) is intentionally NOT in
this commit — it's 16 state/wiki/planning files of churn, scoped for a
dedicated splendor maintenance follow-up rather than mixed into a doc
fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): cross-reference Lombard tilt as #65 (issue), not #74 (PR)

Copilot's review of commit b20eaac caught a cross-reference convention
break: the codebase consistently uses #65 (the issue) when referencing
the Lombard high-shelf — see synthbanshee/tts/mixer.py lines 6, 80, 84,
88, 108, 183, 322, and docs/audio_generation_v3_design.md §4.2c
(line 215). PR #74 is the implementation that closed the issue, but
the cross-link convention is to point at the issue.

One-character fix: #74 → #65 in the M3a TTS bullet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Labels

- bugfix
- comp: mixer (SceneMixer, MixMode, gap controller)
- comp: preprocessing (Audio preprocessing and augmentation)

Development

Successfully merging this pull request may close these issues.

bug(tts): rendered clip loudness regressed by ~6 dB between 2026-04-15 and 2026-05-05

2 participants