Bilingual Kazakh–English target-speaker ASR for overlapping speech.
Persona-ASR transcribes an enrolled target speaker from a multi-talker overlapping mixture and explicitly rejects utterances in which the target speaker is absent. It couples an enrollment-conditioned recognizer with a target-presence gate. The recognizer uses a frozen ECAPA-TDNN speaker embedding to modulate a WavLM-Base+ encoder through FiLM, followed by language-specific CTC heads. Enrollment audio may be in the same language as the target speech (same-language enrollment) or in the other language (cross-language enrollment).
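To make the conditioning idea concrete, here is a minimal NumPy sketch of FiLM modulation, per-language head selection, and a presence gate. It is an illustration only, not the released implementation: the dimensions (192 for the speaker embedding, 768 for the encoder), the vocabulary sizes, the 0.5 threshold, the language keys, and all function names are assumptions, and the point in the encoder at which FiLM is applied is not specified here.

```python
import numpy as np

# Assumed sizes: T frames, D encoder width, E speaker-embedding width.
T, D, E = 50, 768, 192
rng = np.random.default_rng(0)

def film(frames, spk_emb, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift every frame with speaker-dependent gamma and beta."""
    gamma = spk_emb @ w_gamma + b_gamma   # (D,)
    beta = spk_emb @ w_beta + b_beta      # (D,)
    return frames * gamma + beta          # broadcasts over the T frames

def select_logits(hidden, heads, lang):
    """Apply the CTC output layer that belongs to the requested language."""
    weight, bias = heads[lang]
    return hidden @ weight + bias

def gate(transcript, presence_prob, threshold=0.5):
    """Return the transcript only if the target speaker is judged present."""
    return transcript if presence_prob >= threshold else None

frames = rng.standard_normal((T, D))   # stand-in for encoder features
spk_emb = rng.standard_normal(E)       # stand-in for the enrollment embedding

# With zero weights, gamma = 1 and beta = 0, so FiLM reduces to the identity.
w_gamma, b_gamma = np.zeros((E, D)), np.ones(D)
w_beta, b_beta = np.zeros((E, D)), np.zeros(D)
conditioned = film(frames, spk_emb, w_gamma, b_gamma, w_beta, b_beta)

# Hypothetical vocabulary sizes for the two language-specific heads.
heads = {"kk": (np.zeros((D, 40)), np.zeros(40)),
         "en": (np.zeros((D, 30)), np.zeros(30))}
logits_kk = select_logits(conditioned, heads, "kk")
```

In this sketch the gate returns `None` for a rejected utterance, so a caller can tell "target absent" apart from an empty transcript.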
| Resource | Link |
|---|---|
| Model checkpoints (backbone + presence gate) | https://huggingface.co/issai/Persona-ASR |
| KazMix-3 (Kazakh 3-speaker TS-ASR dataset) | https://huggingface.co/datasets/issai/KazMix-3 |
| PersonaMix (controlled bilingual benchmark) | https://huggingface.co/datasets/issai/PersonaMix |
KazMix-3 ships mixture manifests and generation scripts; the mixture audio is regenerated with those scripts from the Kazakh Speech Dataset (OpenSLR 140). PersonaMix provides the full controlled benchmark.
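The resources listed above can likely be fetched with the `huggingface_hub` client. The following is a sketch under assumptions: the repository IDs are read off the URLs in the table, the short keys in `RESOURCES` and the `fetch` helper are hypothetical names, and the file layout inside each repository is not described here, so the snippet simply downloads a full snapshot.

```python
# Hub repository IDs taken from the resource table; repo types inferred from the URLs.
RESOURCES = {
    "checkpoints": ("issai/Persona-ASR", "model"),
    "kazmix3": ("issai/KazMix-3", "dataset"),
    "personamix": ("issai/PersonaMix", "dataset"),
}

def fetch(name):
    """Download a full snapshot of one repository and return its local directory."""
    # Imported lazily so the mapping above is usable without the dependency
    # (install with `pip install huggingface_hub`).
    from huggingface_hub import snapshot_download

    repo_id, repo_type = RESOURCES[name]
    return snapshot_download(repo_id=repo_id, repo_type=repo_type)

# Example (requires network access):
#   local_dir = fetch("checkpoints")
```

For KazMix-3, the snapshot would contain manifests and scripts rather than mixture audio, so the Kazakh Speech Dataset still has to be obtained separately before regenerating the mixtures.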
Training and evaluation code will be released in this repository.
Released under CC BY 4.0.