Audio-Visual Speech Recognition pipeline with speaker diarization, LLM-powered fusion, and a fully swappable component registry.
Video input
    │
    ▼
[Speaker Diarization]                 ← swap: config.diar_backend
    │  (segments: speaker, start, end)
    ├──────────────────────────┐
    ▼                          ▼
[ASR Engine]              [VSR Engine]        ← swap: config.asr_backend / vsr_backend
faster-whisper            VALLR / AV-HuBERT
    │                          │
    └─────────────┬────────────┘
                  ▼
         [Sentence Embedder]  all-MiniLM-L6-v2
          cosine similarity
                  │
      ┌───────────┼──────────────┐
      ▼           ▼              ▼
  sim≥0.75    0.5–0.75        sim<0.5
  ASR wins   LLM recon.   conflict + flag
                  │
                  ▼
        [LLM Reconciliation]          ← swap: config.llm_provider
        GPT-4o-mini / Claude
      context-aware correction
                  │
                  ▼
          TranscriptSegment
    (text, strategy, sim, flagged)
                  │
          JSON + SRT output
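The similarity gate in the diagram compares the ASR and VSR transcripts as embedding vectors. As a minimal sketch of that comparison, the function below computes cosine similarity in plain Python. The toy vectors are made up for illustration; in the pipeline they would be sentence embeddings from all-MiniLM-L6-v2, and the function name is an assumption rather than the project's actual API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real sentence embeddings (illustrative only).
asr_vec = [0.2, 0.7, 0.1]
vsr_vec = [0.25, 0.6, 0.2]
sim = cosine_similarity(asr_vec, vsr_vec)
```

The resulting `sim` value is what gets compared against the 0.75 and 0.5 thresholds shown in the diagram.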
moduler_AVSR/
├── main.py # CLI entry point
├── pipeline.py # Orchestrator (Diarize → ASR/VSR → Fuse → Output)
├── avsr_config.yaml # Example YAML config
├── requirements.txt
│
├── core/
│ ├── config.py # PipelineConfig dataclass (all knobs in one place)
│ └── registry.py # Abstract interfaces + plugin registry
│
├── engines/
│ ├── asr/asr_engines.py # Whisper, Parakeet, MoonShine, Mock
│ ├── vsr/vsr_engines.py # VALLR, AV-HuBERT, Auto-AVSR, Mock
│ └── diarization/diarizers.py # pyannote, NeMo, Mock
│
├── fusion/
│ └── fusion.py # FusionModule + LLMClient
│
├── utils/
│ ├── preprocessing.py # Audio/video slicing, mouth ROI extraction
│ └── output.py # JSON + SRT export
│
├── benchmarks/
│ ├── benchmark.py # Benchmarking CLI (WER/CER vs SOTA)
│ ├── datasets.py # Loaders: LRS2/3, LibriSpeech, VoxCeleb2, custom
│ ├── metrics.py # WER, CER, RTF + published baseline numbers
│ ├── noise.py # SNR noise injection (white/pink/babble)
│ ├── results/ # CSV + JSON benchmark outputs (gitignored)
│ └── sample_data/sample.csv # Minimal CSV for smoke-testing
│
└── tests/
    ├── test_fusion.py # Unit tests: FusionModule strategies (no downloads)
    └── test_pipeline_integration.py # Integration tests: full pipeline with mock backends
python -m venv .venv && .venv/Scripts/activate # Windows
# source .venv/bin/activate # Linux/Mac
pip install -r requirements.txt
# Minimal run (pyannote diarizer + Whisper ASR + VALLR VSR + GPT-4o-mini fusion)
python main.py \
--video input.mp4 \
--hf_token hf_xxx \
--llm_api_key sk-xxx
# Swap ASR to Parakeet, use a local Ollama LLM
python main.py \
--video input.mp4 \
--asr parakeet \
--llm local \
--llm_base_url http://localhost:11434/v1 \
--llm_model llama3
# No LLM (uses token_merge fallback for mid-similarity conflicts)
python main.py --video input.mp4 --llm none
# YAML config file
python main.py --video input.mp4 --config avsr_config.yaml
# Parallel ASR+VSR threads (1.5–2x faster on GPU)
python main.py --video input.mp4 --parallel
# List all registered backends
python main.py --list_backends

| Slot | Key | Model | Notes |
|---|---|---|---|
| ASR | whisper | faster-whisper large-v3-turbo | Default. Best multilingual |
| ASR | parakeet | nvidia/parakeet-tdt-1.1b | Fastest English-only (RTF ~0.02) |
| ASR | moonshine | Useful Sensors Moonshine | Low-VRAM edge devices |
| ASR | mock | — | Testing, no download needed |
| VSR | vallr | VALLR | Default. Subprocess CLI |
| VSR | av_hubert | Meta AV-HuBERT | Best accuracy; requires face crop |
| VSR | auto_avsr | Auto-AVSR | Clean Python API, SOTA on LRS2/3 |
| VSR | mock | — | Testing, no download needed |
| Diar | pyannote | pyannote 3.1 | Default. Requires HF token |
| Diar | nemo | NeMo MSDD | Better on overlapping speech |
| Diar | mock | — | Testing, returns 3 fixed segments |
| LLM | openai | gpt-4o-mini (default) | Reconciliation via OpenAI API |
| LLM | anthropic | claude-3-haiku | Alternative |
| LLM | local | any OpenAI-compat | Ollama, vLLM, LM Studio, etc. |
| LLM | none | — | Token-merge fallback, no API key |

Each output segment includes a strategy field explaining how the final text was chosen:
| Strategy | Condition | Meaning |
|---|---|---|
| asr_dominant | sim ≥ 0.75 | Both sources agree; ASR text used (cleaner) |
| llm_reconciled | 0.5–0.75, LLM on | LLM resolved the ambiguity |
| token_merge | 0.5–0.75, no LLM | Diff-based merge; conflicts shown as [word_a\|word_b] |
| conflict_asr_fallback | sim < 0.5 | Genuine disagreement; ASR used, segment flagged for review |
| single_modality | one source empty | Only one input available |
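
The conditions in the table can be read as a small decision function. The sketch below is one possible reading, not the actual FusionModule code: `fuse` and `token_merge` are hypothetical names, the diff merge uses Python's standard `difflib`, the `llm` argument is any callable you supply, and leaving `single_modality` segments unflagged is an assumption.

```python
import difflib
from typing import Callable, Optional, Tuple

SIM_HIGH = 0.75  # mirrors sim_threshold_high
SIM_MED = 0.50   # mirrors sim_threshold_med


def token_merge(asr: str, vsr: str) -> str:
    """Word-level diff merge; disagreements are rendered as [asr_words|vsr_words]."""
    a, b = asr.split(), vsr.split()
    merged = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(a[i1:i2])
        else:
            # replace / delete / insert: show both sides, either may be empty
            merged.append("[" + " ".join(a[i1:i2]) + "|" + " ".join(b[j1:j2]) + "]")
    return " ".join(merged)


def fuse(
    asr: str,
    vsr: str,
    sim: float,
    llm: Optional[Callable[[str, str], str]] = None,
) -> Tuple[str, str, bool]:
    """Return (text, strategy, flagged) following the conditions in the table."""
    if not asr or not vsr:
        return (asr or vsr, "single_modality", False)  # unflagged: an assumption
    if sim >= SIM_HIGH:
        return (asr, "asr_dominant", False)
    if sim >= SIM_MED:
        if llm is not None:
            return (llm(asr, vsr), "llm_reconciled", False)
        return (token_merge(asr, vsr), "token_merge", False)
    return (asr, "conflict_asr_fallback", True)


text, strategy, flagged = fuse("he took the pill", "he took the bill", sim=0.63)
```

With no LLM supplied, the mid-similarity example falls through to the diff merge and keeps both candidate words for a human to resolve.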
The registry is decorator-based — no other file needs to change.
# engines/asr/asr_engines.py
from core.registry import ASREngine, register_asr  # assumed location; adjust if the names live elsewhere

@register_asr("my_model")
class MyASR(ASREngine):
    def __init__(self, config):
        # load_my_model is a placeholder for your own model loader
        self.model = load_my_model(config.device)

    def transcribe(self, audio_path: str) -> str:
        return self.model.infer(audio_path)

Then use it immediately:
python main.py --video input.mp4 --asr my_model
The same pattern applies to @register_vsr and @register_diar.
# Unit tests only (fast, ~2s, no model downloads)
.venv/Scripts/python -m pytest tests/test_fusion.py -v
# Integration tests (loads sentence-transformers once, ~26s)
.venv/Scripts/python -m pytest tests/test_pipeline_integration.py -v
# All tests
.venv/Scripts/python -m pytest tests/ -v
Evaluate WER/CER against standard datasets and compare against published systems (Whisper-Flamingo, AV-HuBERT, Auto-AVSR, etc.).
# Smoke test with mock backends (no models, no real data)
python benchmarks/benchmark.py \
--dataset custom_csv \
--data_root benchmarks/sample_data/sample.csv \
--asr mock --vsr mock --llm none
# Real evaluation on LRS3 with noise sweep
python benchmarks/benchmark.py \
--dataset lrs3 \
--data_root /data/lrs3 \
--asr whisper --vsr vallr \
--hf_token hf_xxx \
--snr_levels clean 20dB 10dB 0dB \
--noise_types white babble
# LibriSpeech ASR-only
python benchmarks/benchmark.py \
--dataset librispeech \
--data_root /data/LibriSpeech \
--split test-clean \
--asr whisper --vsr mock --llm none \
--max_samples 500
Supported datasets: lrs2, lrs3, librispeech, voxceleb2, custom_csv, custom_json
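
The `--snr_levels` option in the LRS3 command above evaluates at fixed signal-to-noise ratios. As a rough sketch of what mixing white noise at a target SNR involves, the snippet below scales Gaussian noise so that the power ratio matches the requested value. It works on plain Python lists, and the function names are assumptions; benchmarks/noise.py may be implemented differently.

```python
import math
import random
from typing import List, Sequence


def add_white_noise(signal: Sequence[float], snr_db: float, seed: int = 0) -> List[float]:
    """Mix Gaussian white noise into `signal` at the requested SNR (in dB)."""
    if len(signal) == 0:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(P_signal / P_noise)
    # => target P_noise = P_signal / 10^(SNR_dB / 10)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean: Sequence[float], noisy: Sequence[float]) -> float:
    """Recover the SNR from a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# A 440 Hz sine wave sampled at 16 kHz stands in for real speech audio.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(1600)]
noisy = add_white_noise(clean, snr_db=10.0)
```

Pink and babble noise would follow the same scaling step with a different noise source.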
Custom CSV format:
video_path,audio_path,transcript,split,id,duration
/data/001.mp4,,hello world,test,001,3.5
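
As an illustration of how this format can be consumed, the sketch below parses the sample row with the standard `csv` module. The `Sample` dataclass and `load_samples` function are hypothetical names, not the loader in benchmarks/datasets.py, and mapping an empty `audio_path` to `None` is an assumption about how the blank column is meant to be read.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, TextIO


@dataclass
class Sample:
    id: str
    video_path: str
    audio_path: Optional[str]  # None when the column is left empty
    transcript: str
    split: str
    duration: float  # seconds


def load_samples(handle: TextIO) -> List[Sample]:
    """Read rows with the header video_path,audio_path,transcript,split,id,duration."""
    samples = []
    for row in csv.DictReader(handle):
        samples.append(
            Sample(
                id=row["id"],
                video_path=row["video_path"],
                audio_path=row["audio_path"] or None,
                transcript=row["transcript"],
                split=row["split"],
                duration=float(row["duration"]),
            )
        )
    return samples


# The two lines from the example above, held in memory instead of a file.
demo = io.StringIO(
    "video_path,audio_path,transcript,split,id,duration\n"
    "/data/001.mp4,,hello world,test,001,3.5\n"
)
samples = load_samples(demo)
```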
Results are saved to benchmarks/results/<dataset>_<timestamp>.csv and _detail.json.
The comparison table is printed automatically at the end of each run:
======================================================================
COMPARISON TABLE (WER %)
======================================================================
System                     Modality   WER %
-------------------------------------------------------
This pipeline (A)          A            3.1   <- YOUR PIPELINE
This pipeline (V)          V           28.4   <- YOUR PIPELINE
This pipeline (AV)         AV           2.8   <- YOUR PIPELINE
-------------------------------------------------------
Whisper-Flamingo (AV)      AV           2.3
AV-HuBERT (AV)             AV           1.4
Auto-AVSR (AV)             AV           0.9
Whisper large-v3 (A)       A            2.7
======================================================================
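
For reference, the WER and CER figures in a table like this are edit-distance ratios. The sketch below shows the standard definition in plain Python: word-level (or character-level) Levenshtein distance divided by the reference length. It is a teaching sketch with no text normalization, not the code in benchmarks/metrics.py.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance; substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


example_wer = wer("he took the pill", "he took the bill")  # 1 of 4 words differs -> 0.25
example_cer = cer("he took the pill", "he took the bill")  # 1 of 16 characters differs -> 0.0625
```

Multiply by 100 to get the percentages shown in the comparison table.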
All settings live in core/config.py (or override via CLI / YAML):
# avsr_config.yaml
asr_backend: whisper # whisper | parakeet | moonshine | mock
vsr_backend: vallr # vallr | av_hubert | auto_avsr | mock
diar_backend: pyannote # pyannote | nemo | mock
hf_token: "hf_xxx" # required for pyannote
whisper_model: large-v3-turbo
whisper_language: en
llm_provider: openai # openai | anthropic | local | none
llm_model: gpt-4o-mini
llm_api_key: "sk-xxx"
sim_threshold_high: 0.75 # above -> ASR wins outright
sim_threshold_med: 0.50 # between -> LLM reconciliation
# below -> conflict, ASR fallback + flagged
output_json: output_fusion.json
output_srt: output_fusion.srt
JSON (output_fusion.json):
[
{
"speaker": "SPEAKER_00",
"start": 0.0,
"end": 3.2,
"text": "he took the pill",
"strategy": "llm_reconciled",
"similarity": 0.63,
"confidence": 0.73,
"flagged": false,
"asr_raw": "he took the pill",
"vsr_raw": "he took the bill"
}
]
SRT (output_fusion.srt): standard subtitle format with speaker labels; conflicted segments marked with ⚠.
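
To make the SRT export concrete, here is a small sketch that turns segments shaped like the JSON example into SRT text. The timestamp format (`HH:MM:SS,mmm`) is standard SRT; the placement of the speaker label and the ⚠ marker is an assumption, and `to_srt` is a hypothetical name rather than the function in utils/output.py.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        marker = "⚠ " if seg.get("flagged") else ""  # marker placement is an assumption
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{marker}{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(cues)


segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.2,
     "text": "he took the pill", "flagged": False},
    {"speaker": "SPEAKER_01", "start": 3.2, "end": 5.0,
     "text": "which one", "flagged": True},
]
srt_text = to_srt(segments)
```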
Core: torch, faster-whisper, pyannote.audio, sentence-transformers, pydub, moviepy, pyyaml
Optional by backend:
- Parakeet / NeMo diarizer: nemo_toolkit[asr]
- Mouth ROI (AV-HuBERT, Auto-AVSR): mediapipe, opencv-python
- OpenAI LLM: openai
- Anthropic LLM: anthropic
- Benchmarking: jiwer