feat(tts): parallel streaming pipeline with prestart workers and overlap viz #4

Merged
dqwang122 merged 8 commits into main from ytli_417 on May 15, 2026

@WinstonLiyt
Collaborator

Description

Overhaul of the streaming TTS chunk pipeline so that one chunk can be refined by multiple workers concurrently, and add a richer overlap visualizer to inspect what actually overlaps on the wall clock.

Highlights:

  • Shared candidate pool (_ChunkRefineContext): a normal worker and an optional prestart worker push TTS candidates into one pool; the first to confirm an in-range candidate via try_adopt() wins.
  • Ratio-based prestart: at iter i, if chars(i+2)/chars(i+1) ≥ 2.0 or chars(i+2) ≥ 1000, kick off a prestart worker for chunk i+2 so it has two extra chunks of audio time to refine. The existing last-chunk prestart is now expressed in the same framework (prestart_kind ∈ {"", "ratio", "last"}).
  • Early-cut + speed adjust: a chunk whose fastspeech estimate exceeds 1.25× target is split mid-flight, and a chosen candidate outside tolerance is re-TTSed with clamped speed ∈ [0.85, 1.15] if that closes the gap.
  • Word-based chunk merging: MIN_CHUNK_WORDS=30 (replaces MIN_CHUNK_CHARS=50); an undersized last chunk is folded into its predecessor.
  • EnvConfig flag: streaming_tts is now a real EnvConfig field instead of a free-floating flag.
  • Overlap viz: per-worker fs/llm/tts streams produce two parallel lanes (normal + prestart) instead of one, the playback bar is labelled with the prestart kind, and a stack_pngs.py helper plus overlap_viz.sh updates make multi-run comparison easier.
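
The numeric rules in the highlights above can be illustrated with a small sketch. This is not the code from this PR: the function names `should_prestart` and `speed_adjustment`, the `tol` parameter, and the assumption that audio duration scales inversely with TTS speed are all hypothetical. Only the thresholds (2.0, 1000, and the [0.85, 1.15] clamp) come from the description.

```python
from typing import Optional, Sequence

RATIO_TRIGGER = 2.0               # chars(i+2) / chars(i+1) threshold from the description
ABS_TRIGGER = 1000                # absolute chars(i+2) threshold from the description
SPEED_MIN, SPEED_MAX = 0.85, 1.15  # speed clamp from the description


def should_prestart(chunk_chars: Sequence[int], i: int) -> bool:
    """Decide at iteration i whether chunk i+2 deserves a prestart worker."""
    if i + 2 >= len(chunk_chars):
        return False  # there is no chunk i+2 to prestart
    nxt, target = chunk_chars[i + 1], chunk_chars[i + 2]
    if target >= ABS_TRIGGER:
        return True
    return nxt > 0 and target / nxt >= RATIO_TRIGGER


def speed_adjustment(actual_s: float, target_s: float, tol: float = 0.10) -> Optional[float]:
    """Return a clamped speed that brings the clip within tolerance, else None.

    Assumes (hypothetically) that re-synthesizing at speed s yields a clip
    of length actual_s / s, and that tolerance is a fraction of the target.
    """
    if abs(actual_s - target_s) <= tol * target_s:
        return None  # already in range, nothing to adjust
    speed = min(SPEED_MAX, max(SPEED_MIN, actual_s / target_s))
    adjusted_s = actual_s / speed
    if abs(adjusted_s - target_s) <= tol * target_s:
        return speed  # the clamped speed closes the gap
    return None       # even the clamped speed is not enough
```

Under these assumptions, `should_prestart([100, 200, 450], 0)` is true because 450/200 = 2.25, while a clip twice as long as its target is left alone by `speed_adjustment` because a speed of 1.15 cannot close that gap.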

Motivation and Context

Streaming TTS was previously serial within a chunk: refine → TTS, with only the very last chunk getting a head start. Long mid-stream chunks that needed many refines still blew the audio budget of the chunk playing before them, causing silence gaps. Adopting a shared candidate pool lets us start refining a long chunk two iterations early and keep the normal worker running, so we adopt whichever lands first. The visualizer changes were needed because the old single-lane plot could not show two workers racing against the same playback deadline.
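
To make the "adopt whichever lands first" idea concrete, here is a minimal sketch of a lock-protected candidate pool. It is a hypothetical stand-in for `_ChunkRefineContext`, not the class from this PR: the `Candidate` and `ChunkRefinePool` types, the field names, and the duration-based range check are assumptions. Only the `try_adopt()` name and the first-in-range-wins rule come from the description.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    worker: str        # "normal" or "prestart"
    duration_s: float  # duration of the synthesized audio


@dataclass
class ChunkRefinePool:
    """Hypothetical stand-in for _ChunkRefineContext."""
    lo_s: float  # shortest acceptable duration
    hi_s: float  # longest acceptable duration
    candidates: List[Candidate] = field(default_factory=list)
    adopted: Optional[Candidate] = None
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def push(self, cand: Candidate) -> None:
        """Record a candidate from either worker."""
        with self._lock:
            self.candidates.append(cand)

    def try_adopt(self, cand: Candidate) -> bool:
        """Adopt cand if it is in range and nothing was adopted yet.

        Returns True only for the single winning call.
        """
        with self._lock:
            if self.adopted is not None:
                return False  # another worker already won
            if not (self.lo_s <= cand.duration_s <= self.hi_s):
                return False  # out of range, keep refining
            self.adopted = cand
            return True
```

Both workers would call `push()` after each TTS attempt and then `try_adopt()`; the loser sees `False` and can stop refining. Holding one lock around both the check and the assignment is what keeps two in-range candidates from both being adopted.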

How Has This Been Tested?

  • Ran the streaming pipeline end-to-end on representative debate configs and verified the generated overlap timeline matches expectations (prestart lane visible, chosen candidate marked, no gaps where a prestart was adopted).

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

…nk pre-start

Add early-cut splitting for oversized chunks, TTS speed adjustment post-processing,
fast initial compression for short-budget rounds, pre-start of last chunk, and
next-chunk context in LLM rewrites. Also fix overlap_viz.sh to accept multiple args.

- remove long-chunk splitting and fast-initial-compress paths
- chunk 0 always uses sequential TTS without refinement
- track prep_start_lead_s on ChunkProfile for adopted pre-started chunks
- default enable_early_cut to False

CompareEnv now reads streaming_tts from its own config instead of
inheriting from each debater's config, so baseline and test runs share
the same streaming setting.

- overlap_viz_par.py renders pre-started last chunks on a separate
  lane using prep_start_lead_s, so they don't visually collide with
  the main thread bars of preceding chunks
- overlap_viz.sh stacks per-chunk-dir PNGs into a single combined
  overlap_timeline_combined.png when multiple dirs match
- add stack_pngs.py helper that vertically stacks PNGs

- ouragents: fall back to self.config.model when helper_model is None,
  not just when the attribute is missing
- utils/model: pass num_retries=3 to litellm.completion
- utils/tool: include traceback in extraction-retry warning

Refactor TTS refinement around a shared _ChunkRefineContext where a
prestart worker (kicked off two iters early on ratio or absolute size
triggers, plus the existing last-chunk variant) and the normal worker
contribute candidates to the same pool. First worker to confirm an
in-range candidate wins via try_adopt(). Per-worker fs/llm/tts streams
are recorded so overlap_viz_par renders both lanes in parallel with the
chosen kind labelled on playback. Chunk merging now uses word count
(MIN_CHUNK_WORDS=30) and also folds an undersized last chunk into its
predecessor.
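
The word-based merging mentioned in the commit message above could look roughly like the following. This is a sketch under stated assumptions, not the function from this PR: `merge_small_chunks` is a hypothetical name, and merging an undersized non-final chunk forward into the next one is a guess. Only `MIN_CHUNK_WORDS=30` and the fold-last-chunk-into-predecessor rule come from the source.

```python
from typing import List

MIN_CHUNK_WORDS = 30  # threshold named in the commit message


def merge_small_chunks(chunks: List[str], min_words: int = MIN_CHUNK_WORDS) -> List[str]:
    """Merge chunks so that each has at least min_words words where possible."""
    merged: List[str] = []
    pending = ""
    for chunk in chunks:
        # Assumption: an undersized chunk is merged forward into the next one.
        pending = f"{pending} {chunk.strip()}".strip()
        if len(pending.split()) >= min_words:
            merged.append(pending)
            pending = ""
    if pending:
        if merged:
            # Undersized last chunk: fold it into its predecessor.
            merged[-1] = f"{merged[-1]} {pending}"
        else:
            # Nothing to fold into, so keep the short text as the only chunk.
            merged.append(pending)
    return merged
```

With chunks of 10, 25, 40 and 5 words, this sketch yields two chunks of 35 and 45 words: the first two merge to pass the threshold, and the trailing 5-word chunk is folded into the 40-word one.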
@dqwang122 dqwang122 merged commit f56f18b into main May 15, 2026
3 checks passed
@dqwang122 dqwang122 deleted the ytli_417 branch May 15, 2026 21:41