moduler_AVSR

Audio-Visual Speech Recognition pipeline with speaker diarization, LLM-powered fusion, and a fully swappable component registry.

Architecture

Video input
    │
    ▼
[Speaker Diarization]          ← swap: config.diar_backend
    │  (segments: speaker, start, end)
    ├──────────────────────────┐
    ▼                          ▼
[ASR Engine]             [VSR Engine]      ← swap: config.asr_backend / vsr_backend
faster-whisper           VALLR / AV-HuBERT
    │                          │
    └─────────────┬────────────┘
                  ▼
     [Sentence Embedder]  all-MiniLM-L6-v2
         cosine similarity
                  │
      ┌───────────┼──────────────┐
      ▼           ▼              ▼
   sim≥0.75    0.5–0.75       sim<0.5
   ASR wins    LLM recon.     conflict + flag
                  │
                  ▼
     [LLM Reconciliation]       ← swap: config.llm_provider
     GPT-4o-mini / Claude
     context-aware correction
                  │
                  ▼
     TranscriptSegment
     (text, strategy, sim, flagged)
                  │
           JSON + SRT output
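The three-way branch in the diagram can be sketched in a few lines of plain Python. This is only an illustration, not the code in fusion/fusion.py: the function names and branch labels are hypothetical, and in the real pipeline the vectors would come from the all-MiniLM-L6-v2 embedder rather than being passed in by hand. The default thresholds are the 0.75 and 0.5 values shown above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as "no agreement"
    return dot / (norm_a * norm_b)


def route(sim, high=0.75, med=0.50):
    """Pick one of the three branches from the diagram (labels are illustrative)."""
    if sim >= high:
        return "asr_wins"
    if sim >= med:
        return "llm_reconciliation"
    return "conflict_flagged"


# Toy 2-d "embeddings" standing in for the ASR and VSR sentence vectors
sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
branch = route(sim)
```

Here the two toy vectors are 45 degrees apart, so the similarity is about 0.71 and the segment would go to the LLM reconciliation branch.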

Project structure

moduler_AVSR/
├── main.py                           # CLI entry point
├── pipeline.py                       # Orchestrator (Diarize → ASR/VSR → Fuse → Output)
├── avsr_config.yaml                  # Example YAML config
├── requirements.txt
│
├── core/
│   ├── config.py                     # PipelineConfig dataclass (all knobs in one place)
│   └── registry.py                   # Abstract interfaces + plugin registry
│
├── engines/
│   ├── asr/asr_engines.py            # Whisper, Parakeet, MoonShine, Mock
│   ├── vsr/vsr_engines.py            # VALLR, AV-HuBERT, Auto-AVSR, Mock
│   └── diarization/diarizers.py      # pyannote, NeMo, Mock
│
├── fusion/
│   └── fusion.py                     # FusionModule + LLMClient
│
├── utils/
│   ├── preprocessing.py              # Audio/video slicing, mouth ROI extraction
│   └── output.py                     # JSON + SRT export
│
├── benchmarks/
│   ├── benchmark.py                  # Benchmarking CLI (WER/CER vs SOTA)
│   ├── datasets.py                   # Loaders: LRS2/3, LibriSpeech, VoxCeleb2, custom
│   ├── metrics.py                    # WER, CER, RTF + published baseline numbers
│   ├── noise.py                      # SNR noise injection (white/pink/babble)
│   ├── results/                      # CSV + JSON benchmark outputs (gitignored)
│   └── sample_data/sample.csv        # Minimal CSV for smoke-testing
│
└── tests/
    ├── test_fusion.py                # Unit tests: FusionModule strategies (no downloads)
    └── test_pipeline_integration.py  # Integration tests: full pipeline with mock backends

Quick start

python -m venv .venv
.venv\Scripts\activate                          # Windows (cmd or PowerShell)
# source .venv/bin/activate                     # Linux/macOS

pip install -r requirements.txt

# Minimal run (pyannote diarizer + Whisper ASR + VALLR VSR + GPT-4o-mini fusion)
python main.py \
  --video input.mp4 \
  --hf_token hf_xxx \
  --llm_api_key sk-xxx

# Swap ASR to Parakeet, use a local Ollama LLM
python main.py \
  --video input.mp4 \
  --asr parakeet \
  --llm local \
  --llm_base_url http://localhost:11434/v1 \
  --llm_model llama3

# No LLM (uses token_merge fallback for mid-similarity conflicts)
python main.py --video input.mp4 --llm none

# YAML config file
python main.py --video input.mp4 --config avsr_config.yaml

# Parallel ASR+VSR threads (1.5–2x faster on GPU)
python main.py --video input.mp4 --parallel

# List all registered backends
python main.py --list_backends
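For readers curious what the --parallel option implies, here is a minimal sketch of running the ASR and VSR calls for one segment on two threads. It is an assumption about the general approach, not a copy of pipeline.py; transcribe_parallel, asr_fn and vsr_fn are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_parallel(asr_fn, vsr_fn, audio_path, video_path):
    """Run the ASR and VSR callables concurrently and return both transcripts."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(asr_fn, audio_path)
        vsr_future = pool.submit(vsr_fn, video_path)
        # result() blocks until the call finishes and re-raises any exception
        return asr_future.result(), vsr_future.result()


# Stand-in engines so the sketch runs without any model downloads
asr_text, vsr_text = transcribe_parallel(
    lambda path: f"asr:{path}",
    lambda path: f"vsr:{path}",
    "segment.wav",
    "segment.mp4",
)
```

Threads only help when the engines spend their time in native or GPU code that releases the Python GIL, which is typically the case for model inference.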

Available backends

Slot  Key        Model                          Notes
----  ---------  -----------------------------  ---------------------------------
ASR   whisper    faster-whisper large-v3-turbo  Default. Best multilingual
ASR   parakeet   nvidia/parakeet-tdt-1.1b       Fastest English-only (RTF ~0.02)
ASR   moonshine  Useful Sensors Moonshine       Low-VRAM edge devices
ASR   mock       -                              Testing, no download needed
VSR   vallr      VALLR                          Default. Subprocess CLI
VSR   av_hubert  Meta AV-HuBERT                 Best accuracy; requires face crop
VSR   auto_avsr  Auto-AVSR                      Clean Python API, SOTA on LRS2/3
VSR   mock       -                              Testing, no download needed
Diar  pyannote   pyannote 3.1                   Default. Requires HF token
Diar  nemo       NeMo MSDD                      Better on overlapping speech
Diar  mock       -                              Testing, returns 3 fixed segments
LLM   openai     gpt-4o-mini (default)          Reconciliation via OpenAI API
LLM   anthropic  claude-3-haiku                 Alternative
LLM   local      any OpenAI-compatible          Ollama, vLLM, LM Studio, etc.
LLM   none       -                              Token-merge fallback, no API key

Fusion strategies

Each output segment includes a strategy field explaining how the final text was chosen:

Strategy               Condition                 Meaning
---------------------  ------------------------  ---------------------------------------------------------
asr_dominant           sim ≥ 0.75                Both sources agree; ASR text used (cleaner)
llm_reconciled         0.5 ≤ sim < 0.75, LLM on  LLM resolved the ambiguity
token_merge            0.5 ≤ sim < 0.75, no LLM  Diff-based merge; conflicts shown as [word_a|word_b]
conflict_asr_fallback  sim < 0.5                 Genuine disagreement; ASR used, segment flagged for review
single_modality        one source empty          Only one input available
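To make the token_merge output format concrete, here is a small word-level merge built on the standard library's difflib. It is a sketch of one way to get the [word_a|word_b] notation, not the FusionModule implementation. Keeping words that appear in only one transcript is an assumption made for this example.

```python
from difflib import SequenceMatcher


def token_merge(asr_text, vsr_text):
    """Merge two transcripts word by word, marking disagreements as [asr|vsr]."""
    a, b = asr_text.split(), vsr_text.split()
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(a[i1:i2])
        elif tag == "replace":
            # Both sources have words here but they differ: show both
            merged.append("[" + " ".join(a[i1:i2]) + "|" + " ".join(b[j1:j2]) + "]")
        elif tag == "delete":
            merged.extend(a[i1:i2])  # present only in ASR (assumed policy: keep)
        elif tag == "insert":
            merged.extend(b[j1:j2])  # present only in VSR (assumed policy: keep)
    return " ".join(merged)


merged = token_merge("he took the pill", "he took the bill")
# merged == "he took the [pill|bill]"
```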

Adding a new backend

The registry is decorator-based — no other file needs to change.

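# engines/asr/asr_engines.py
# Import path assumed from the project structure (core/registry.py);
# inside asr_engines.py these names are normally already imported.
from core.registry import ASREngine, register_asr

@register_asr("my_model")
class MyASR(ASREngine):
    def __init__(self, config):
        # load_my_model is a placeholder for your own model loader
        self.model = load_my_model(config.device)

    def transcribe(self, audio_path: str) -> str:
        # Return the transcript for the given audio file
        return self.model.infer(audio_path)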

Then use it immediately:

python main.py --video input.mp4 --asr my_model

The same pattern applies to @register_vsr (VSR engines) and @register_diar (diarizers).

Running tests

# The commands below use the Windows venv path; on Linux/macOS use .venv/bin/python instead.
# Unit tests only (fast, ~2s, no model downloads)
.venv/Scripts/python -m pytest tests/test_fusion.py -v

# Integration tests (loads sentence-transformers once, ~26s)
.venv/Scripts/python -m pytest tests/test_pipeline_integration.py -v

# All tests
.venv/Scripts/python -m pytest tests/ -v

Benchmarking

Evaluate WER/CER against standard datasets and compare against published systems (Whisper-Flamingo, AV-HuBERT, Auto-AVSR, etc.).

# Smoke test with mock backends (no models, no real data)
python benchmarks/benchmark.py \
  --dataset custom_csv \
  --data_root benchmarks/sample_data/sample.csv \
  --asr mock --vsr mock --llm none

# Real evaluation on LRS3 with noise sweep
python benchmarks/benchmark.py \
  --dataset lrs3 \
  --data_root /data/lrs3 \
  --asr whisper --vsr vallr \
  --hf_token hf_xxx \
  --snr_levels clean 20dB 10dB 0dB \
  --noise_types white babble

# LibriSpeech ASR-only
python benchmarks/benchmark.py \
  --dataset librispeech \
  --data_root /data/LibriSpeech \
  --split test-clean \
  --asr whisper --vsr mock --llm none \
  --max_samples 500

Supported datasets: lrs2, lrs3, librispeech, voxceleb2, custom_csv, custom_json

Custom CSV format:

video_path,audio_path,transcript,split,id,duration
/data/001.mp4,,hello world,test,001,3.5
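A custom CSV in this shape can be checked before a benchmark run with a few lines of standard-library Python. The loader below is a hypothetical helper, not the one in benchmarks/datasets.py. It assumes all six columns must be present and treats an empty audio_path as "no separate audio file", which is a guess based on the sample row.

```python
import csv
import io

REQUIRED_COLUMNS = ("video_path", "audio_path", "transcript", "split", "id", "duration")


def load_samples(fileobj):
    """Read benchmark rows from an open CSV file and normalise empty fields."""
    reader = csv.DictReader(fileobj)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    samples = []
    for row in reader:
        row["audio_path"] = row["audio_path"] or None  # empty cell -> None
        row["duration"] = float(row["duration"]) if row["duration"] else None
        samples.append(row)
    return samples


# The sample row from above, read from memory instead of a file on disk
demo = io.StringIO(
    "video_path,audio_path,transcript,split,id,duration\n"
    "/data/001.mp4,,hello world,test,001,3.5\n"
)
samples = load_samples(demo)
```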

Results are saved under benchmarks/results/ as <dataset>_<timestamp>.csv, together with a companion file ending in _detail.json.

The comparison table is printed automatically at the end of each run:

======================================================================
  COMPARISON TABLE (WER %)
======================================================================
  System                                Modality    WER %
  -------------------------------------------------------
  This pipeline (A)                            A      3.1  <- YOUR PIPELINE
  This pipeline (V)                            V     28.4  <- YOUR PIPELINE
  This pipeline (AV)                          AV      2.8  <- YOUR PIPELINE
  -------------------------------------------------------
  Whisper-Flamingo (AV)                       AV      2.3
  AV-HuBERT (AV)                              AV      1.4
  Auto-AVSR (AV)                              AV      0.9
  Whisper large-v3 (A)                        A      2.7
======================================================================
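The WER figures in a table like this are word-level edit distance divided by the number of reference words. The function below is a dependency-free sketch of that definition for readers who want to sanity-check a number by hand. The project lists jiwer for benchmarking, so treat this as an illustration rather than the code in benchmarks/metrics.py; it also does no text normalisation (casing, punctuation).

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


score = wer("he took the pill", "he took the bill")
# score == 0.25 (one substitution over four reference words)
```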

Configuration reference

All settings are defined in core/config.py and can be overridden from the CLI or a YAML file:

# avsr_config.yaml
asr_backend: whisper          # whisper | parakeet | moonshine | mock
vsr_backend: vallr            # vallr | av_hubert | auto_avsr | mock
diar_backend: pyannote        # pyannote | nemo | mock

hf_token: "hf_xxx"           # required for pyannote
whisper_model: large-v3-turbo
whisper_language: en

llm_provider: openai          # openai | anthropic | local | none
llm_model: gpt-4o-mini
llm_api_key: "sk-xxx"

sim_threshold_high: 0.75      # sim >= high       -> ASR wins outright
sim_threshold_med: 0.50       # med <= sim < high -> LLM reconciliation (token_merge if llm_provider is none)
                              # sim < med         -> conflict, ASR fallback + flagged
output_json: output_fusion.json
output_srt:  output_fusion.srt
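One common way to layer overrides on top of dataclass defaults is sketched below. The field subset, the default values and the apply_overrides helper are assumptions made for illustration; the real PipelineConfig in core/config.py has more knobs and may merge CLI and YAML values differently. The overrides dict stands in for values parsed from the YAML file or the command line.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PipelineConfig:
    # Illustrative subset of the keys shown in avsr_config.yaml
    asr_backend: str = "whisper"
    vsr_backend: str = "vallr"
    diar_backend: str = "pyannote"
    llm_provider: str = "openai"
    sim_threshold_high: float = 0.75
    sim_threshold_med: float = 0.50


def apply_overrides(config, overrides):
    """Return a new config with `overrides` applied, rejecting unknown keys."""
    known = {f.name for f in fields(config)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    updated = replace(config, **overrides)
    if updated.sim_threshold_med > updated.sim_threshold_high:
        raise ValueError("sim_threshold_med must not exceed sim_threshold_high")
    return updated


cfg = apply_overrides(PipelineConfig(), {"asr_backend": "parakeet", "llm_provider": "none"})
```

Rejecting unknown keys catches typos such as asr_backed early, instead of silently running with the default backend.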

Output format

JSON (output_fusion.json):

[
  {
    "speaker": "SPEAKER_00",
    "start": 0.0,
    "end": 3.2,
    "text": "he took the pill",
    "strategy": "llm_reconciled",
    "similarity": 0.63,
    "confidence": 0.73,
    "flagged": false,
    "asr_raw": "he took the pill",
    "vsr_raw": "he took the bill"
  }
]

SRT (output_fusion.srt): standard subtitle format with speaker labels; conflicted segments marked with ⚠.
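As a rough illustration of how JSON segments map to SRT cues, the sketch below formats timestamps as HH:MM:SS,mmm and prefixes flagged segments with ⚠. The exact cue layout (speaker label in brackets, marker position) is an assumption and may differ from what utils/output.py writes.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 3.2 -> 00:00:03,200."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render a list of segment dicts (as in the JSON output) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        marker = "⚠ " if seg.get("flagged") else ""
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{marker}[{seg['speaker']}] {seg['text']}\n"
        )
    return "\n".join(blocks)  # cues are separated by a blank line


srt_text = to_srt([
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.2,
     "text": "he took the pill", "flagged": False},
])
```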

Requirements

Core: torch, faster-whisper, pyannote.audio, sentence-transformers, pydub, moviepy, pyyaml

Optional by backend:

  • Parakeet / NeMo diarizer: nemo_toolkit[asr]
  • Mouth ROI (AV-HuBERT, Auto-AVSR): mediapipe, opencv-python
  • OpenAI LLM: openai
  • Anthropic LLM: anthropic
  • Benchmarking: jiwer
