-
I am working on an ASR model for English spoken by non-native speakers from different accent origins, such as Arabic, Asian, Spanish, and Indian speakers. Thanks in advance!
-
Hi @Omarnabk, great questions!
The model stt_en_conformer_ctc_large (https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large) is a Conformer, not a Citrinet. You might instead want to use stt_en_citrinet_1024_gamma_0_25 (https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_1024_gamma_0_25) if you want a Citrinet.
You do not need a separate model per accent if there is sufficient data per accent (all accents should have roughly the same amount of speech). If one accent has less data than the others, it might benefit from a separate model.
To segment audio, NeMo provides a CTC segmentation toolkit that you could try for your case: https://github.com/NVIDIA/NeMo/tree/main/tools/ctc_segmentation
We do not (yet) support KenLM integration during buffered/streaming inference. Because running beam search is costly, I would suggest first finishing the transcription and only then applying KenLM beam search once the entire sentence has been transcribed.
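The two-pass idea above (transcribe first, rescore with the LM afterwards) can be sketched in pure Python. The `ToyBigramLM` below is only a stand-in for KenLM so the example is self-contained; in practice you would load a real KenLM binary (e.g. via the `kenlm` Python package) and the n-best list would come from your ASR decoder. The hypotheses, scores, and `lm_weight` are illustrative values, not from the thread.

```python
import math

class ToyBigramLM:
    """Toy bigram LM with add-one smoothing, standing in for KenLM."""

    def __init__(self, corpus):
        self.unigrams = {}
        self.bigrams = {}
        for sentence in corpus:
            words = ["<s>"] + sentence.split()
            for w in words:
                self.unigrams[w] = self.unigrams.get(w, 0) + 1
            for a, b in zip(words, words[1:]):
                self.bigrams[(a, b)] = self.bigrams.get((a, b), 0) + 1
        self.vocab_size = len(self.unigrams)

    def score(self, sentence):
        # Sum of smoothed bigram log-probabilities, like kenlm.Model.score.
        words = ["<s>"] + sentence.split()
        logp = 0.0
        for a, b in zip(words, words[1:]):
            num = self.bigrams.get((a, b), 0) + 1
            den = self.unigrams.get(a, 0) + self.vocab_size
            logp += math.log(num / den)
        return logp

def rescore(nbest, lm, lm_weight=0.5):
    # nbest: list of (hypothesis, acoustic_log_score) from the first pass.
    # Pick the hypothesis with the best combined acoustic + LM score.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm.score(h[0]))[0]

# Domain text containing the names we care about.
lm = ToyBigramLM(["call doctor smith", "call doctor jones", "see doctor smith"])

nbest = [
    ("call doctor smyth", -4.0),  # slightly better acoustic score
    ("call doctor smith", -4.2),  # preferred by the domain LM
]
print(rescore(nbest, lm))  # the LM flips the ranking to "call doctor smith"
```

This also illustrates the point below about names: the acoustically best hypothesis loses to the one the domain LM has actually seen.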
ASR models would not have enough training speech data to generalize to special keywords or names, since these are usually unique. A domain-specific KenLM model built from text that contains these special keywords and people's names would be easier to integrate.
-
Many thanks for your reply. For my dataset, I found that Citrinet works better than Conformer. I also tried different versions of Citrinet: stt_en_citrinet_256, stt_en_citrinet_512, stt_en_citrinet_1024, and stt_en_citrinet_1024_gamma_0_25. The WERs I obtained were 17%, 15%, 14%, and 40%, respectively. Based on your experiments, stt_en_citrinet_1024_gamma_0_25 is supposed to be the best, but for me it was the worst. Even with KenLM or neural rescoring, the error is still very high. I tried to fine-tune all the versions, and again stt_en_citrinet_1024_gamma_0_25 was the worst. Any opinion, or is a particular configuration needed to fine-tune it? Thanks!
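For anyone comparing numbers like the WERs above, here is a minimal word-error-rate sketch (word-level Levenshtein distance over reference length); NeMo also ships its own WER utilities, so this is just to make the metric concrete. The example sentences are made up for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```

Note that WER depends heavily on text normalization (casing, punctuation, numbers), so make sure all models are scored against identically normalized references before comparing checkpoints.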