json2vtt.py turns machine-transcribed JSON (ElevenLabs, Avanegar, Speechmatics, or a flat word list) into broadcast-grade WebVTT or SRT captions.
It enforces hard timing/layout limits while offering smart heuristics for natural phrase breaks.
- Multiple input schemas – supports:
  - ElevenLabs diarisation (`segments[] → words[]`).
  - Generic `{"words": [...]}` arrays (e.g. Avanegar).
  - Flat list of word objects – `[{"word": "hello", "start": 0.0, "end": 0.5}, …]`.
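A minimal sketch of how the three schemas could be normalised into one flat word list (hypothetical `normalize` helper, not the script's actual loader):

```python
# Sketch of normalising the three supported input schemas into one flat
# word list. Hypothetical helper - json2vtt.py's real loader may differ.
def normalize(data):
    if isinstance(data, list):
        # Flat list of word objects: [{"word": ..., "start": ..., "end": ...}]
        return [{"text": w.get("word") or w.get("text"),
                 "start": w["start"], "end": w["end"]} for w in data]
    if "segments" in data:
        # ElevenLabs diarisation: segments[] -> words[], with speaker names
        return [{"text": w["text"], "start": w["start"], "end": w["end"],
                 "speaker": seg.get("speaker", {}).get("name")}
                for seg in data["segments"] for w in seg["words"]]
    # Generic {"words": [...]} array (e.g. Avanegar)
    return [{"text": w["text"], "start": w["start"], "end": w["end"]}
            for w in data["words"]]
```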
- Quality guardrails –
  - Max 7 s per cue / 2 lines / 42 chars per line.
  - CPS limit, dedupe window, speaker-change gap.
- Natural phrase splitting – soft flush on punctuation, standalone dashes, or long pauses.
- Adaptive gap strategy – default "shrink" avoids cumulative drift; legacy "shift" available.
- Validation pass – flags overlaps, gap violations, and timeline drift (strict mode aborts on failure).
- Frame-accurate snap – optional `--fps` to align cue timestamps to video frame boundaries.
- Clean CLI – every heuristic can be tuned or disabled via flags.
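The validation pass listed above can be sketched as follows; the cue shape `(start, end, text)` and the message strings are illustrative assumptions, not the script's actual output:

```python
# Sketch of the validation pass: flag overlapping cues and gaps below the
# minimum. Cue shape and messages are assumptions for illustration only.
def validate(cues, min_gap=0.08):
    issues = []
    for i in range(1, len(cues)):
        prev_end, start = cues[i - 1][1], cues[i][0]
        if start < prev_end:
            issues.append(f"overlap at cue {i}: {start:.3f} < {prev_end:.3f}")
        elif start - prev_end < min_gap:
            issues.append(f"gap violation at cue {i}: {start - prev_end:.3f}s")
    return issues
```

In strict mode the real script aborts when this list is non-empty.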
```
python json2vtt.py my_transcript.json my_subs.vtt
```

Defaults: WebVTT output, speaker prefixes off (change in code), smart gap handling, pause flush = 0.8 s.

```
python json2vtt.py my_transcript.json my_subs.srt --srt
```

Quality knobs:
```
--dedupe-window SEC       Merge duplicate words within SEC (default 0.30)
--min-cue SEC             Merge cues shorter than SEC (0 disables)
--max-cps NUM             Characters-per-second limit (0 disables)
--gap SEC                 Min gap at speaker change (default 0.08)
--gap-strategy MODE       shrink|shift (default shrink)
--pause-flush SEC         Flush cue after silence ≥ SEC (0 disables)
```
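As an illustration of the `--max-cps` knob, a reading-speed check is only a few lines; the 17 cps ceiling below is an assumed example value, not the script's documented default:

```python
# Sketch of a characters-per-second check. 17 cps is a common broadcast
# ceiling used here as an assumed default; max_cps=0 disables the check.
def cps_ok(text, start, end, max_cps=17.0):
    if not max_cps:
        return True
    duration = max(end - start, 1e-6)   # guard against zero-length cues
    return len(text) / duration <= max_cps
```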
Style toggles:
```
--no-soft-flush           Disable early flush at punctuation
--no-dash-flush           Ignore standalone dashes as boundaries
--soft-flush-threshold F  Fraction of max-dur before commas trigger flush
--no-speaker              Suppress ">> Speaker:" prefixes
--srt                     Output SRT instead of VTT
--fps FPS                 Frame-snap timestamps
```
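The soft-flush and dash-flush toggles above could combine into a decision like this (hypothetical helper; the 0.6 comma threshold is an assumed default):

```python
# Sketch of the natural-boundary decision: flush on standalone dashes,
# sentence-final punctuation, or commas once the cue is long enough.
def should_soft_flush(word, elapsed, max_dur=7.0, threshold=0.6,
                      soft_flush=True, dash_flush=True):
    if dash_flush and word in ("-", "–", "—"):
        return True                        # standalone dash as a boundary
    if not soft_flush:
        return False
    if word.endswith((".", "!", "?")):
        return True                        # sentence-final punctuation
    # commas only flush once the cue reaches a fraction of max duration
    return word.endswith(",") and elapsed >= threshold * max_dur
```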
Validation & logging:
```
--strict                  Abort if validation fails
--verbose                 Extra debug output
```
Run `python json2vtt.py --help` for the full list.
- ElevenLabs diarisation

```json
{
  "segments": [
    {
      "speaker": {"name": "Alice"},
      "words": [{"text": "Hello", "start": 0.0, "end": 0.5}, ...]
    }
  ]
}
```

- Generic words array

```json
{"words": [{"text": "Hello", "start": 0.0, "end": 0.5}]}
```

- Avanegar / Flat list

```json
[{"word": "Hello", "start": 0.0, "end": 0.5}, {"word": "world", ...}]
```

- Lower the pause threshold & snap to 25 fps
```
python json2vtt.py talk.json talk.vtt --pause-flush 0.5 --fps 25
```

- Aggressive splitting (flush on every dash)

```
python json2vtt.py doc.json out.srt --srt --dash-gap 0 --min-cue 0.4
```

- CI-safe run (abort on any issue)

```
python json2vtt.py episode.json episode.vtt --strict --verbose
```

Q: My cues drift late over time.
A: Use the default --gap-strategy shrink or reduce --gap.
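Why shrink avoids drift can be seen in a small sketch (hypothetical `apply_gap` helper): shrink trims the previous cue's end so later timestamps never move, while shift pushes the next cue later and the delay accumulates across the file:

```python
# Sketch of the two gap strategies. "shrink" trims the previous cue's end;
# "shift" moves the next cue later, which can accumulate timeline drift.
def apply_gap(prev_end, start, gap=0.08, strategy="shrink"):
    if start - prev_end >= gap:
        return prev_end, start             # gap already respected
    if strategy == "shrink":
        return start - gap, start          # trim prev_end, keep the timeline
    return prev_end, prev_end + gap        # shift: next cue starts later
```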
Q: I don't want dashes to break cues.
A: Add --no-dash-flush.
Q: Can I keep speaker prefixes?
A: Prefixes are off by default; enable them by uncommenting SPEAKER_PREFIX_FMT in the source. Pass --no-speaker to suppress them again.
Q: Frame accuracy?
A: Use --fps with your video's frame rate; timestamps snap to frame boundaries.
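Frame snapping is just rounding each timestamp to the nearest frame boundary; a minimal sketch:

```python
# Round a timestamp (seconds) to the nearest frame boundary at fps.
# fps=0 (or None) leaves the timestamp untouched.
def snap_to_frame(t, fps):
    if not fps:
        return t
    return round(t * fps) / fps
```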
License: MIT