moduler_AVSR

Audio-Visual Speech Recognition pipeline with speaker diarization, LLM-powered fusion, and a fully swappable component registry.

Architecture

Video input
    │
    ▼
[Speaker Diarization]          ← swap: config.diar_backend
    │  (segments: speaker, start, end)
    ├──────────────────────────┐
    ▼                          ▼
[ASR Engine]             [VSR Engine]      ← swap: config.asr_backend / vsr_backend
faster-whisper           VALLR / AV-HuBERT
    │                          │
    └─────────────┬────────────┘
                  ▼
     [Sentence Embedder]  all-MiniLM-L6-v2
         cosine similarity
                  │
      ┌───────────┼──────────────┐
      ▼           ▼              ▼
   sim≥0.75    0.5–0.75       sim<0.5
   ASR wins    LLM recon.     conflict + flag
                  │
                  ▼
     [LLM Reconciliation]       ← swap: config.llm_provider
     GPT-4o-mini / Claude
     context-aware correction
                  │
                  ▼
     TranscriptSegment
     (text, strategy, sim, flagged)
                  │
           JSON + SRT output
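
The similarity gate in the diagram can be sketched in a few lines. This is a minimal illustration, not the code in fusion/fusion.py: the function names `cosine_similarity` and `route` are hypothetical, the vectors stand in for the all-MiniLM-L6-v2 sentence embeddings, and the thresholds are the 0.75 / 0.5 values shown above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(similarity: float, high: float = 0.75, med: float = 0.5,
          llm_enabled: bool = True) -> str:
    """Pick a fusion branch from the ASR/VSR similarity (hypothetical helper)."""
    if similarity >= high:
        return "asr_dominant"            # both sources agree, ASR text is used
    if similarity >= med:
        # Mid band: LLM reconciliation, or the token-merge fallback without an LLM
        return "llm_reconciled" if llm_enabled else "token_merge"
    return "conflict_asr_fallback"       # genuine disagreement, segment is flagged
```

Keeping the gate as a pure function of one number makes the branch boundaries easy to unit-test without loading any model.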

Project structure

moduler_AVSR/
├── main.py                           # CLI entry point
├── pipeline.py                       # Orchestrator (Diarize → ASR/VSR → Fuse → Output)
├── avsr_config.yaml                  # Example YAML config
├── requirements.txt
│
├── core/
│   ├── config.py                     # PipelineConfig dataclass (all knobs in one place)
│   └── registry.py                   # Abstract interfaces + plugin registry
│
├── engines/
│   ├── asr/asr_engines.py            # Whisper, Parakeet, MoonShine, Mock
│   ├── vsr/vsr_engines.py            # VALLR, AV-HuBERT, Auto-AVSR, Mock
│   └── diarization/diarizers.py      # pyannote, NeMo, Mock
│
├── fusion/
│   └── fusion.py                     # FusionModule + LLMClient
│
├── utils/
│   ├── preprocessing.py              # Audio/video slicing, mouth ROI extraction
│   └── output.py                     # JSON + SRT export
│
├── benchmarks/
│   ├── benchmark.py                  # Benchmarking CLI (WER/CER vs SOTA)
│   ├── datasets.py                   # Loaders: LRS2/3, LibriSpeech, VoxCeleb2, custom
│   ├── metrics.py                    # WER, CER, RTF + published baseline numbers
│   ├── noise.py                      # SNR noise injection (white/pink/babble)
│   ├── results/                      # CSV + JSON benchmark outputs (gitignored)
│   └── sample_data/sample.csv        # Minimal CSV for smoke-testing
│
└── tests/
    ├── test_fusion.py                # Unit tests: FusionModule strategies (no downloads)
    └── test_pipeline_integration.py  # Integration tests: full pipeline with mock backends

Quick start

python -m venv .venv && .venv/Scripts/activate  # Windows
# source .venv/bin/activate                     # Linux/Mac

pip install -r requirements.txt

# Minimal run (pyannote diarizer + Whisper ASR + VALLR VSR + GPT-4o-mini fusion)
python main.py \
  --video input.mp4 \
  --hf_token hf_xxx \
  --llm_api_key sk-xxx

# Swap ASR to Parakeet, use a local Ollama LLM
python main.py \
  --video input.mp4 \
  --asr parakeet \
  --llm local \
  --llm_base_url http://localhost:11434/v1 \
  --llm_model llama3

# No LLM (uses token_merge fallback for mid-similarity conflicts)
python main.py --video input.mp4 --llm none

# YAML config file
python main.py --video input.mp4 --config avsr_config.yaml

# Parallel ASR+VSR threads (1.5–2x faster on GPU)
python main.py --video input.mp4 --parallel

# List all registered backends
python main.py --list_backends
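
For context on the --parallel option above: one plausible way to run the two engines concurrently is a two-worker thread pool. The sketch below is an assumption about the approach, not the pipeline's actual code; `transcribe_parallel`, `asr_fn`, and `vsr_fn` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def transcribe_parallel(asr_fn: Callable[[str], str],
                        vsr_fn: Callable[[str], str],
                        audio_path: str,
                        video_path: str) -> Tuple[str, str]:
    """Run the ASR and VSR callables on separate threads and wait for both."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(asr_fn, audio_path)
        vsr_future = pool.submit(vsr_fn, video_path)
        # result() re-raises any exception raised inside the worker thread
        return asr_future.result(), vsr_future.result()


# Usage with stand-in engines (real engines would wrap model inference)
asr_text, vsr_text = transcribe_parallel(
    lambda path: f"asr:{path}",
    lambda path: f"vsr:{path}",
    "segment.wav",
    "segment.mp4",
)
```

Threads tend to help here only when the engines spend their time in native or GPU code that releases the GIL, which is usually the case for model inference.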

Available backends

| Slot | Key | Model | Notes |
|------|-----|-------|-------|
| ASR | `whisper` | faster-whisper large-v3-turbo | Default. Best multilingual |
| ASR | `parakeet` | nvidia/parakeet-tdt-1.1b | Fastest English-only (RTF ~0.02) |
| ASR | `moonshine` | Useful Sensors MoonShine | Low-VRAM edge devices |
| ASR | `mock` | — | Testing, no download needed |
| VSR | `vallr` | VALLR | Default. Subprocess CLI |
| VSR | `av_hubert` | Meta AV-HuBERT | Best accuracy; requires face crop |
| VSR | `auto_avsr` | Auto-AVSR | Clean Python API, SOTA on LRS2/3 |
| VSR | `mock` | — | Testing, no download needed |
| Diar | `pyannote` | pyannote 3.1 | Default. Requires HF token |
| Diar | `nemo` | NeMo MSDD | Better on overlapping speech |
| Diar | `mock` | — | Testing, returns 3 fixed segments |
| LLM | `openai` | gpt-4o-mini (default) | Reconciliation via OpenAI API |
| LLM | `anthropic` | claude-3-haiku | Alternative |
| LLM | `local` | any OpenAI-compat | Ollama, vLLM, LM Studio, etc. |
| LLM | `none` | — | Token-merge fallback, no API key |

Fusion strategies

Each output segment includes a strategy field explaining how the final text was chosen:

| Strategy | Condition | Meaning |
|----------|-----------|---------|
| `asr_dominant` | sim ≥ 0.75 | Both sources agree; ASR text used (cleaner) |
| `llm_reconciled` | 0.5 ≤ sim < 0.75, LLM on | LLM resolved the ambiguity |
| `token_merge` | 0.5 ≤ sim < 0.75, no LLM | Diff-based merge; conflicts shown as `[word_a\|word_b]` |
| `conflict_asr_fallback` | sim < 0.5 | Genuine disagreement; ASR used, segment flagged for review |
| `single_modality` | one source empty | Only one input available |
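
To make the `token_merge` row concrete, here is a minimal sketch of a diff-based merge built on the standard library's `difflib`. It is an illustration under assumptions: the function name is hypothetical, and the real FusionModule may tokenize or align differently. Words that both sources share are kept once, and each disagreement is emitted as `[asr_words|vsr_words]`.

```python
import difflib


def token_merge(asr_text: str, vsr_text: str) -> str:
    """Merge two transcripts word by word, marking conflicts as [asr|vsr]."""
    asr_words = asr_text.split()
    vsr_words = vsr_text.split()
    matcher = difflib.SequenceMatcher(a=asr_words, b=vsr_words, autojunk=False)

    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(asr_words[i1:i2])
        else:
            # replace / delete / insert: show both sides, one may be empty
            left = " ".join(asr_words[i1:i2])
            right = " ".join(vsr_words[j1:j2])
            merged.append(f"[{left}|{right}]")
    return " ".join(merged)


print(token_merge("he took the pill", "he took the bill"))
# he took the [pill|bill]
```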

Adding a new backend

The registry is decorator-based — no other file needs to change.

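# engines/asr/asr_engines.py
# Import path assumed from the project layout (core/registry.py holds the interfaces and registry)
from core.registry import ASREngine, register_asr

@register_asr("my_model")
class MyASR(ASREngine):
    def __init__(self, config):
        # load_my_model is a placeholder for your own model loader
        self.model = load_my_model(config.device)

    def transcribe(self, audio_path: str) -> str:
        # Return the transcript for one audio file
        return self.model.infer(audio_path)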

Then use it immediately:

python main.py --video input.mp4 --asr my_model

Same pattern applies to @register_vsr and @register_diar.

Running tests

# Unit tests only (fast, ~2s, no model downloads)
.venv/Scripts/python -m pytest tests/test_fusion.py -v

# Integration tests (loads sentence-transformers once, ~26s)
.venv/Scripts/python -m pytest tests/test_pipeline_integration.py -v

# All tests
.venv/Scripts/python -m pytest tests/ -v

Benchmarking

Evaluate WER/CER against standard datasets and compare against published systems (Whisper-Flamingo, AV-HuBERT, Auto-AVSR, etc.).

# Smoke test with mock backends (no models, no real data)
python benchmarks/benchmark.py \
  --dataset custom_csv \
  --data_root benchmarks/sample_data/sample.csv \
  --asr mock --vsr mock --llm none

# Real evaluation on LRS3 with noise sweep
python benchmarks/benchmark.py \
  --dataset lrs3 \
  --data_root /data/lrs3 \
  --asr whisper --vsr vallr \
  --hf_token hf_xxx \
  --snr_levels clean 20dB 10dB 0dB \
  --noise_types white babble

# LibriSpeech ASR-only
python benchmarks/benchmark.py \
  --dataset librispeech \
  --data_root /data/LibriSpeech \
  --split test-clean \
  --asr whisper --vsr mock --llm none \
  --max_samples 500

Supported datasets: lrs2, lrs3, librispeech, voxceleb2, custom_csv, custom_json

Custom CSV format:

video_path,audio_path,transcript,split,id,duration
/data/001.mp4,,hello world,test,001,3.5
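
A file in this format can be read with the standard `csv` module. The loader below is a hedged sketch rather than the code in benchmarks/datasets.py: the `Sample` dataclass and `load_samples` are hypothetical names, and treating an empty `audio_path` as "not provided" is an assumption based on the example row.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, TextIO


@dataclass
class Sample:
    id: str
    video_path: str
    audio_path: Optional[str]   # None when the column is left empty
    transcript: str
    split: str
    duration: float             # seconds


def load_samples(csv_file: TextIO) -> List[Sample]:
    """Parse rows with the header video_path,audio_path,transcript,split,id,duration."""
    samples = []
    for row in csv.DictReader(csv_file):
        samples.append(Sample(
            id=row["id"],
            video_path=row["video_path"],
            audio_path=row["audio_path"] or None,
            transcript=row["transcript"],
            split=row["split"],
            duration=float(row["duration"]),
        ))
    return samples


EXAMPLE = (
    "video_path,audio_path,transcript,split,id,duration\n"
    "/data/001.mp4,,hello world,test,001,3.5\n"
)
samples = load_samples(io.StringIO(EXAMPLE))
```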

Results are saved to benchmarks/results/ as <dataset>_<timestamp>.csv, together with a companion _detail.json file.

The comparison table is printed automatically at the end of each run:

======================================================================
  COMPARISON TABLE (WER %)
======================================================================
  System                                Modality    WER %
  -------------------------------------------------------
  This pipeline (A)                            A      3.1  <- YOUR PIPELINE
  This pipeline (V)                            V     28.4  <- YOUR PIPELINE
  This pipeline (AV)                          AV      2.8  <- YOUR PIPELINE
  -------------------------------------------------------
  Whisper-Flamingo (AV)                       AV      2.3
  AV-HuBERT (AV)                             AV      1.4
  Auto-AVSR (AV)                             AV      0.9
  Whisper large-v3 (A)                        A      2.7
======================================================================
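
The WER figures in the table are word-level edit distances divided by the reference length. The project lists jiwer for benchmarking; the pure-Python sketch below only illustrates the metric itself and is not taken from benchmarks/metrics.py. The function name is hypothetical, and no text normalization (casing, punctuation) is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("he took the pill", "he took the bill"))
# 0.25
```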

Configuration reference

All settings are defined in core/config.py and can be overridden with CLI flags or a YAML file:

# avsr_config.yaml
asr_backend: whisper          # whisper | parakeet | moonshine | mock
vsr_backend: vallr            # vallr | av_hubert | auto_avsr | mock
diar_backend: pyannote        # pyannote | nemo | mock

hf_token: "hf_xxx"           # required for pyannote
whisper_model: large-v3-turbo
whisper_language: en

llm_provider: openai          # openai | anthropic | local | none
llm_model: gpt-4o-mini
llm_api_key: "sk-xxx"

sim_threshold_high: 0.75      # above -> ASR wins outright
sim_threshold_med: 0.50       # between -> LLM reconciliation
                              # below  -> conflict, ASR fallback + flagged
output_json: output_fusion.json
output_srt:  output_fusion.srt
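
As an illustration of how defaults and overrides might be layered, here is a small dataclass sketch. It is not the actual PipelineConfig: `PipelineConfigSketch` and `apply_overrides` are hypothetical names, only a subset of the keys above is modeled, and the threshold ordering check is an added assumption.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict


@dataclass(frozen=True)
class PipelineConfigSketch:
    # Defaults copied from the YAML example above
    asr_backend: str = "whisper"
    vsr_backend: str = "vallr"
    diar_backend: str = "pyannote"
    llm_provider: str = "openai"
    llm_model: str = "gpt-4o-mini"
    sim_threshold_high: float = 0.75
    sim_threshold_med: float = 0.50


def apply_overrides(config: PipelineConfigSketch,
                    overrides: Dict[str, Any]) -> PipelineConfigSketch:
    """Return a copy of `config` with non-None overrides applied."""
    # Unset CLI flags typically arrive as None and should not clobber defaults
    overrides = {k: v for k, v in overrides.items() if v is not None}
    known = {f.name for f in fields(config)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    updated = replace(config, **overrides)
    if not 0.0 <= updated.sim_threshold_med <= updated.sim_threshold_high <= 1.0:
        raise ValueError("expected 0 <= sim_threshold_med <= sim_threshold_high <= 1")
    return updated


# Usage: YAML values first, then CLI flags on top
cfg = apply_overrides(PipelineConfigSketch(), {"asr_backend": "parakeet"})
cfg = apply_overrides(cfg, {"llm_provider": "none", "llm_model": None})
```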

Output format

JSON (output_fusion.json):

[
  {
    "speaker": "SPEAKER_00",
    "start": 0.0,
    "end": 3.2,
    "text": "he took the pill",
    "strategy": "llm_reconciled",
    "similarity": 0.63,
    "confidence": 0.73,
    "flagged": false,
    "asr_raw": "he took the pill",
    "vsr_raw": "he took the bill"
  }
]

SRT (output_fusion.srt): standard subtitle format with speaker labels; conflicted segments marked with ⚠.
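
The sketch below shows how segments shaped like the JSON above could be rendered as SRT cues. It is an illustration, not the code in utils/output.py: the function names are hypothetical, and the exact placement of the speaker label and the ⚠ marker is an assumption. The `HH:MM:SS,mmm` timestamp layout is the standard SubRip one.

```python
from typing import Any, Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SubRip timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Dict[str, Any]]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        marker = "⚠ " if seg.get("flagged") else ""   # assumed marker position
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{marker}[{seg['speaker']}] {seg['text']}\n"
        )
    return "\n".join(cues)


print(to_srt([{
    "speaker": "SPEAKER_00", "start": 0.0, "end": 3.2,
    "text": "he took the pill", "flagged": False,
}]))
```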

Requirements

Core: torch, faster-whisper, pyannote.audio, sentence-transformers, pydub, moviepy, pyyaml

Optional by backend:

Parakeet / NeMo diarizer: nemo_toolkit[asr]
Mouth ROI (AV-HuBERT, Auto-AVSR): mediapipe, opencv-python
OpenAI LLM: openai
Anthropic LLM: anthropic
Benchmarking: jiwer
