Merged
10 changes: 5 additions & 5 deletions docs/audio_generation_v3_design.md
@@ -46,11 +46,11 @@ All P0–P2 work is tracked here. Milestones that span multiple PRs are listed o
| **M9b** | P2 | ✅ Done | ≥ 3 voice variants per gender across dataset | ≥ 4 additional speaker YAMLs in `configs/speakers/` (≥ 2 male, ≥ 2 female, including ≥ 1 Google variant per gender); update scene configs to distribute speaker IDs across scenes; `voice_family` column in manifest CSV; `generate-batch` distributes across voice variants | Small | No |
| **M10a** | P2 | ✅ Done | Catches clip-level prosody regressions | Extend `qa.py` with per-clip acoustic metrics: F0 median/std by speaker by intensity (librosa pyin), RMS and LUFS by turn (pyloudnorm); per-clip warning flags: `vic_f0_high`, `agg_no_escalation`; `SegmentMeasurement` NamedTuple for typed `_measure_segment` return | Medium | No |
| **M10b** | P2 | ✅ Done | Catches systematic run-level biases | Run-level aggregation in `QAReport`: F0/RMS/LUFS distributions by role/typology/project; voice and backend diversity counts; `mix_mode` distribution; outlier clips; `qa-report --run-summary` CLI table; `WARN_NO_OVERLAP` and `WARN_EMOTION_DOWNGRADE` run-level flags | Medium | No |
-| **M11** | P2 | 🔲 Not started | Enables ablation studies and debugging across pipeline versions | `GenerationMetadata` dataclass (§4.11); written to `{clip_id}.json` under `generation_metadata` key; backward-compatible — existing V1 clips not invalidated | Small | No |
+| **M11** | P2 | ✅ Done | Enables ablation studies and debugging across pipeline versions | `GenerationMetadata` Pydantic `BaseModel` (§4.11); written to `{clip_id}.json` under `generation_metadata` key; backward-compatible — existing V1 clips not invalidated | Small | No |
| **M12** | P2 | 🧪 Experimental | Reduces VIC HNR at I3–I5 if listening-test gate passes | `add_breathiness()` in `synthbanshee/augment/voice_texture.py`; wire for VIC turns via `SpeakerState.breathiness_level`; listening-test gate — 20 clips, native Hebrew listener, blind A/B (§8.3); `breathiness_applied` flag in `GenerationMetadata`; disabled by default if gate fails | Medium | No |
-| **M13** | P2 | 🔲 Not started | She-Proves / Elephant generate audio appropriate to their distinct acoustic regimes | `project_profile` field in `RunConfig`; gap, overlap probability, loudness targets, preprocessing config, and augmentation config all carry project-specific defaults; two profile YAML files in `configs/run_configs/`; new profiles addable without code changes | Small | No |
+| **M13** | P2 | ✅ Done | She-Proves / Elephant generate audio appropriate to their distinct acoustic regimes | `project_profile` field in `RunConfig`; gap, overlap probability, loudness targets, preprocessing config, and augmentation config all carry project-specific defaults; two profile YAML files in `configs/run_configs/`; new profiles addable without code changes | Small | No |
| **M14** | P0 | ✅ Done | Fixes muffled audio, click artifacts, and voice identity shifts | Replace 7.5 kHz LPF with 80 Hz HPF in `preprocessing.py`; default `wiener_denoise=False` in `PreprocessingConfig`; add 10ms (160 sample) edge fades at turn boundaries in `mixer.py`; set `supports_style_tags=False` in `AzureProvider` capabilities (disables `express-as` for he-IL voices); update unit tests | Small | No |
-| **M15** | P1 | 🔲 Not started | Tunes SSML prosody to research-validated Hebrew parameters | Update `style_map` values in speaker YAMLs per research consensus table (rate, pitch, volume, F0 range by intensity); update `SpeakerState` drift bounds (max 2.0 st unexplained drift); add turn-level quality gates: sustained-vowel detection (>2.8 s reject), F0 guardrails (male [80,180] Hz, female [150,290] Hz), click detection | Medium | No |
+| **M15** | P1 | ✅ Done | Tunes SSML prosody to research-validated Hebrew parameters | Update `style_map` values in speaker YAMLs per research consensus table (rate, pitch, volume, F0 range by intensity); update `SpeakerState` drift bounds (max 2.0 st unexplained drift); add turn-level quality gates: sustained-vowel detection (>2.8 s reject), F0 guardrails (male [80,180] Hz, female [150,290] Hz), click detection | Medium | No |
| **M16** | P2 | 🔲 Not started | Adds realistic acoustic environments to Tier B clips | Implement `room_sim.py` with pyroomacoustics (RT60 0.25–0.7 s, shoebox rooms, phone-on-table early reflection model); implement `device_profiles.py` (phone EQ: 80 Hz HPF, presence boost +2–4 dB @ 2.5–3.5 kHz, gentle high shelf above 6.5 kHz); implement `noise_mixer.py` (SNR distribution: 50% 18–30 dB, 30% 10–18 dB, 10% 5–10 dB, 10% 30–40 dB); optional codec simulation (Opus/AMR-NB) | Large | No |
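
The M14 row calls for 10 ms (160-sample) edge fades at turn boundaries, which implies a 16 kHz sample rate. A minimal sketch of such a fade follows, assuming a linear ramp and a plain list of float samples; the function name `apply_edge_fades` and the fade shape are illustrative assumptions, not the actual `mixer.py` code.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000  # implied by "10ms (160 sample)" in the M14 row
FADE_SAMPLES = 160       # 10 ms at 16 kHz


def apply_edge_fades(samples: List[float], fade_len: int = FADE_SAMPLES) -> List[float]:
    """Return a copy of `samples` with a linear fade-in and fade-out.

    Hypothetical helper: the linear ramp is an assumption, since the
    design doc only specifies the fade length.
    """
    n = len(samples)
    # Keep the two fades from overlapping on very short turns.
    fade_len = min(fade_len, n // 2)
    out = list(samples)
    for i in range(fade_len):
        gain = i / fade_len  # 0.0 at the outermost sample, rising toward 1.0
        out[i] *= gain          # fade-in at the start of the turn
        out[n - 1 - i] *= gain  # mirrored fade-out at the end of the turn
    return out
```

Starting and ending each turn at zero amplitude is what removes the step discontinuity that is heard as a click when turns are concatenated.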

---
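
The M15 row lists F0 guardrails (male [80,180] Hz, female [150,290] Hz) and a 2.0-semitone drift bound. As a rough sketch of how such a turn-level gate could be checked, the example below converts a frequency ratio to semitones with 12·log2(f/f_ref). The function names and flag strings are hypothetical, and the sketch treats all drift from the reference as unexplained, which is a simplification of what `SpeakerState` presumably tracks.

```python
import math
from typing import List

# Bounds taken from the M15 row of the milestone table.
F0_BOUNDS_HZ = {"male": (80.0, 180.0), "female": (150.0, 290.0)}
MAX_DRIFT_SEMITONES = 2.0


def semitone_distance(f0_hz: float, reference_hz: float) -> float:
    """Signed distance in semitones from `reference_hz` to `f0_hz`."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def check_turn_f0(median_f0_hz: float, gender: str, reference_f0_hz: float) -> List[str]:
    """Return warning flags for one turn (flag names are illustrative only)."""
    flags = []
    low, high = F0_BOUNDS_HZ[gender]
    if not (low <= median_f0_hz <= high):
        flags.append("f0_out_of_range")
    # Assumption: any drift beyond the bound counts as unexplained.
    if abs(semitone_distance(median_f0_hz, reference_f0_hz)) > MAX_DRIFT_SEMITONES:
        flags.append("f0_drift")
    return flags
```

For instance, `check_turn_f0(170.0, "male", 120.0)` stays inside the male range but sits roughly six semitones above the reference, so only the drift flag is raised.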
@@ -72,7 +72,7 @@ The scripts, label taxonomy, pipeline stage decomposition, and cache system are
- **P1 (realism core):** Add stateful conversational dynamics; preserve escalation cues
- **P2 (diversity and observability):** Widen voice diversity; harden QA; add release gates

-**V3.1 additions (2026-04-30, post-research):** Three new milestones (M14–M16) based on cross-referenced findings from three independent research reports on Hebrew synthetic speech naturalness (Gemini, GPT-5.2 thinking, GPT-5.5 Pro). See `wiki/topics/research-synthesis.md` for full parameter tables and citations. Recommended order: M14 → M11 → M15 → M13 → M16 → M12.
+**V3.1 additions (2026-04-30, post-research):** Three new milestones (M14–M16) based on cross-referenced findings from three independent research reports on Hebrew synthetic speech naturalness (Gemini, GPT-5.2 thinking, GPT-5.5 Pro). See `wiki/topics/research-synthesis.md` for full parameter tables and citations. Original recommended order: M14 → M11 → M15 → M13 → M16 → M12. As of 2026-05-01, M14, M11, M15, and M13 are all merged; remaining: M16 (Tier B augmentation) and M12 (breathiness, gated).

---

@@ -612,7 +612,7 @@ Per-clip warning flags:
#### M11: Dataset Provenance Metadata
**Impact:** Enables ablation studies and debugging across pipeline versions.
**Scope:**
-- `GenerationMetadata` dataclass (§4.11)
+- `GenerationMetadata` Pydantic `BaseModel` (§4.11)
- Write to `{clip_id}.json` under `generation_metadata` key
- Backward compatible (existing V1 clips are not invalidated)
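
The scope above can be illustrated with a small sketch of the read and write path, assuming the metadata has already been serialized to a plain dict (for a Pydantic v2 model, `model_dump()` would produce one). The helper names and the example field values are hypothetical; only the `generation_metadata` key and the `breathiness_applied` flag come from the design doc.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

METADATA_KEY = "generation_metadata"  # key named in the M11 scope


def write_generation_metadata(clip_json: Path, metadata: Dict[str, Any]) -> None:
    """Store `metadata` under METADATA_KEY, keeping all other keys intact."""
    record: Dict[str, Any] = {}
    if clip_json.exists():
        record = json.loads(clip_json.read_text(encoding="utf-8"))
    record[METADATA_KEY] = metadata
    clip_json.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def read_generation_metadata(clip_json: Path) -> Optional[Dict[str, Any]]:
    """Return the stored metadata, or None for a V1 clip without the key."""
    record = json.loads(clip_json.read_text(encoding="utf-8"))
    return record.get(METADATA_KEY)
```

Returning `None` for a missing key, rather than raising, is one way to keep existing V1 clip files readable without rewriting them.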

4 changes: 1 addition & 3 deletions wiki/topics/audio-quality-issues.md
@@ -4,13 +4,11 @@ kind: topic
title: Audio Quality Issues — Known Problems and Root Causes
page_id: topic-audio-quality-issues
status: active
-review_state: human-authored
+review_state: human-reviewed
source_refs:
- src-b98ddf4495b0916b6c9cdc3804daf32554f40b26dc0ba217c5e3d1664ee5ddf4
- src-42711de5f7b0e5a5b70b02d8fcd01097a6901aa5b5d14d61b351b6bbbd94ab89
tags: [audio-quality, tts, ssml, preprocessing, feedback]
-created: '2026-04-30'
-updated: '2026-04-30'
---

# Audio Quality Issues — Known Problems and Root Causes
4 changes: 1 addition & 3 deletions wiki/topics/preprocessing-pipeline.md
@@ -4,13 +4,11 @@ kind: topic
title: Preprocessing Pipeline — Filters, Denoising, and Audio Quality Impact
page_id: topic-preprocessing-pipeline
status: active
-review_state: human-authored
+review_state: human-reviewed
source_refs:
- src-5184b8f4c65ccf204f748e694ad677f7ae3792c2e91a861e636ede81b7ba3c1f
- src-42711de5f7b0e5a5b70b02d8fcd01097a6901aa5b5d14d61b351b6bbbd94ab89
tags: [preprocessing, audio, filter, denoiser, muffled, quality]
-created: '2026-04-30'
-updated: '2026-04-30'
---

# Preprocessing Pipeline — Filters, Denoising, and Audio Quality Impact
4 changes: 1 addition & 3 deletions wiki/topics/research-synthesis.md
@@ -4,14 +4,12 @@ kind: topic
title: Research Synthesis — Cross-Referenced Findings from Three Independent Reports
page_id: topic-research-synthesis
status: active
-review_state: human-authored
+review_state: human-reviewed
source_refs:
- src-d7392e10f92c3d658171300d2475c96103de715667bc17b429910da9ba8cea40
- src-0032ce484cf3a12b55535ba22c39eba309352f6564765fe5f6c2c87431294996
- src-5539b824c2a485f676fe6097f25b5de407229f8c05d3632e58860a815bb28963
tags: [research, synthesis, tts, prosody, preprocessing, room-acoustics, quality-gates]
-created: '2026-04-30'
-updated: '2026-04-30'
---

# Research Synthesis — Cross-Referenced Findings
4 changes: 1 addition & 3 deletions wiki/topics/ssml-prosody-parameters.md
@@ -4,14 +4,12 @@ kind: topic
title: SSML Prosody Parameters — Current Settings and Tuning
page_id: topic-ssml-prosody-params
status: active
-review_state: human-authored
+review_state: human-reviewed
source_refs:
- src-3d18f514d3e6bae4424eb2a3a8ff318976d5d5834a19fe6ddc16e9e8e6baebc4
- src-bdb2f9f939e8892e2c60a517264027e1e7b86c33cfce2f41015c8f0a90c463eb
- src-42711de5f7b0e5a5b70b02d8fcd01097a6901aa5b5d14d61b351b6bbbd94ab89
tags: [ssml, prosody, tts, azure, pitch, rate, volume]
-created: '2026-04-30'
-updated: '2026-04-30'
---

# SSML Prosody Parameters — Current Settings and Tuning