A modular framework for evaluating Arabic Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems.
This framework provides:
- TTS Generation: Generate audio from text using various TTS models
- ASR Transcription: Transcribe audio using ASR models
- Evaluation: Calculate WER/CER metrics with Arabic text normalization
- Audio Quality Metrics: STOI, PESQ, Duration Error, and MCD
Installation:

```bash
# Core dependencies
pip install torch transformers soundfile pandas jiwer tqdm python-dotenv

# Audio quality metrics (optional but recommended)
pip install pystoi pesq librosa scipy
```

Usage:

```bash
# Full TTS-ASR pipeline with audio quality metrics
python main.py --dataset clArTTS --tts-model mms-tts-ara --asr-model whisper-large-v3

# ASR-only evaluation
python main.py --dataset everyayah --asr-model whisper-large-v3

# Skip TTS (use existing audio)
python main.py --dataset clArTTS --tts-model mms-tts-ara --asr-model whisper-large-v3 --skip-tts

# Skip audio quality metrics (faster evaluation)
python main.py --dataset clArTTS --tts-model mms-tts-ara --asr-model whisper-large-v3 --skip-audio-metrics
```

Supported TTS models:
- mms-tts-ara: Meta MMS-TTS Arabic
- openaudio-s1-mini: OpenAudio S1-mini (Fish Speech)
- elevenlabs-multilingual-v2: ElevenLabs API
- minimax-speech-02-hd: MiniMax API
Supported ASR models:
- whisper-large-v3: OpenAI Whisper Large V3
- qwen3-omni: Qwen3-Omni 30B
- conformer-ctc: NeMo Conformer-CTC
Supported datasets:
- clArTTS: Classical Arabic TTS dataset (205 samples)
- everyayah: Quran recitation dataset (~6,000 samples)
- arvoice: Arabic voice dataset
- Ruisheng_TTS: Ruisheng TTS dataset (68 samples)
Results are saved in results/{dataset}/{tts_model}_to_{asr_model}/:
- generated_audio/: Generated WAV files
- transcriptions.jsonl: ASR transcriptions
- evaluation_results.csv: Per-sample metrics (WER, CER, STOI, PESQ, DE, MCD)
- evaluation_summary.csv: Overall metrics with averages
- timing.json: Performance metrics
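The per-sample CSV aggregates easily with pandas. A minimal sketch, assuming the run from the usage examples above and lowercase metric column names (check the actual CSV header):

```python
import pandas as pd

# Hypothetical run directory following the
# results/{dataset}/{tts_model}_to_{asr_model}/ pattern.
df = pd.read_csv("results/clArTTS/mms-tts-ara_to_whisper-large-v3/evaluation_results.csv")

# Column names are assumed; adjust to match the CSV header.
print(df[["wer", "cer", "stoi", "pesq"]].mean())
```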
Text Metrics:
- WER (Word Error Rate): Word-level transcription accuracy
- CER (Character Error Rate): Character-level transcription accuracy
Audio Quality Metrics:
- STOI (Short-Time Objective Intelligibility): Speech intelligibility (0-1, higher is better)
- PESQ (Perceptual Evaluation of Speech Quality): Speech quality (-0.5 to 4.5, higher is better)
- DE (Duration Error): Relative duration difference (0 to inf, lower is better)
- MCD (Mel-Cepstral Distortion): Spectral distance (lower is better, <6.0 is good)
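For reference, the text metrics can be reproduced with jiwer and two of the audio metrics with pystoi and pesq. A minimal sketch, assuming 16 kHz mono WAV files; the framework's own Arabic normalization (not shown here) is applied before WER/CER in the actual pipeline:

```python
import jiwer
import soundfile as sf
from pesq import pesq
from pystoi import stoi

def text_metrics(reference: str, hypothesis: str) -> dict:
    # The real pipeline normalizes Arabic text first (e.g. stripping
    # diacritics); plain jiwer scores are shown here for brevity.
    return {"WER": jiwer.wer(reference, hypothesis),
            "CER": jiwer.cer(reference, hypothesis)}

def audio_metrics(ref_path: str, gen_path: str) -> dict:
    ref, sr = sf.read(ref_path)
    gen, sr2 = sf.read(gen_path)
    assert sr == sr2 == 16000, "resample both files to 16 kHz before scoring"
    n = min(len(ref), len(gen))  # STOI/PESQ expect equal-length signals
    return {
        "STOI": stoi(ref[:n], gen[:n], sr),         # 0-1, higher is better
        "PESQ": pesq(sr, ref[:n], gen[:n], "wb"),   # -0.5 to 4.5, higher is better
        "DE": abs(len(gen) - len(ref)) / len(ref),  # relative duration error
    }
```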
To add a new dataset:

- Prepare the dataset structure:

```
datasets/my_dataset/
├── metadata.csv
└── wav/
    ├── 00000.wav
    └── ...
```
- Create metadata.csv:

```
id,file,text
0,00000.wav,النص العربي هنا
1,00001.wav,نص آخر
```
- Register it in src/benchmark/config/dataset_config.py:

```python
"my_dataset": DatasetConfig(
    name="my_dataset",
    metadata_file="datasets/my_dataset/metadata.csv",
    audio_dir="datasets/my_dataset/wav",
    id_column="id",
    text_column="text",
    audio_column="file"
),
```
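Before running, it is worth checking that every metadata row resolves to an audio file. A small sketch, with paths and column names taken from the config above:

```python
from pathlib import Path

import pandas as pd

df = pd.read_csv("datasets/my_dataset/metadata.csv")
audio_dir = Path("datasets/my_dataset/wav")

# Flag any rows whose audio file is missing on disk.
missing = [f for f in df["file"] if not (audio_dir / f).exists()]
print(f"{len(df)} samples, {len(missing)} missing audio files")
```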
To add a new TTS model:

- Create a TTS module in src/benchmark/modules/tts/my_tts.py:
```python
from .base_tts import BaseTTS

class MyTTS(BaseTTS):
    def load(self):
        # Load your model
        pass

    def synthesize(self, text: str, output_path: str) -> tuple[float, float]:
        # Generate audio and save it to output_path
        # Return (generation_time, audio_duration)
        pass
```

- Register it in src/benchmark/modules/tts/__init__.py:
```python
from .my_tts import MyTTS
```

- Add a config in src/benchmark/config/model_config.py:
"my-tts": TTSModelConfig(
model_name="my-tts",
model_type="my_tts",
model_path="models/my-tts",
device="cuda",
sampling_rate=16000
),- Create ASR module in
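For a concrete illustration, here is what a module wrapping the Hugging Face MMS-TTS Arabic checkpoint might look like. Only the BaseTTS interface comes from the skeleton above; the class name is hypothetical, and the framework's bundled mms-tts-ara module may be implemented differently:

```python
import time

import soundfile as sf
import torch
from transformers import AutoTokenizer, VitsModel

from .base_tts import BaseTTS

class MmsArabicTTS(BaseTTS):
    def load(self):
        # MMS-TTS Arabic is a VITS model that generates 16 kHz audio.
        self.tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-ara")
        self.model = VitsModel.from_pretrained("facebook/mms-tts-ara").eval()

    def synthesize(self, text: str, output_path: str) -> tuple[float, float]:
        start = time.time()
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            waveform = self.model(**inputs).waveform[0].cpu().numpy()
        generation_time = time.time() - start
        sr = self.model.config.sampling_rate
        sf.write(output_path, waveform, sr)
        return generation_time, len(waveform) / sr
```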
To add a new ASR model:

- Create an ASR module in src/benchmark/modules/asr/my_asr.py:
```python
from .base_asr import BaseASR

class MyASR(BaseASR):
    def load(self):
        # Load your model
        pass

    def transcribe(self, audio_path: str) -> tuple[str, float]:
        # Transcribe the audio
        # Return (transcription, transcription_time)
        pass
```

- Register it in src/benchmark/modules/asr/__init__.py:
```python
from .my_asr import MyASR
```

- Add a config in src/benchmark/config/model_config.py:
"my-asr": ASRModelConfig(
model_name="my-asr",
model_type="my_asr",
model_path="models/my-asr",
device="cuda",
language="ar"
),
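And a matching ASR sketch built on the transformers pipeline for Whisper. Again, only the BaseASR interface comes from the skeleton above, and the framework's bundled whisper-large-v3 module may differ:

```python
import time

from transformers import pipeline

from .base_asr import BaseASR

class WhisperArabicASR(BaseASR):
    def load(self):
        # device=0 selects the first GPU; use device=-1 for CPU.
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3",
            device=0,
        )

    def transcribe(self, audio_path: str) -> tuple[str, float]:
        start = time.time()
        # Forcing the language avoids Whisper's auto-detection drifting
        # on short clips; accepted values vary by transformers version.
        result = self.pipe(audio_path, generate_kwargs={"language": "arabic"})
        return result["text"].strip(), time.time() - start
```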