-
I am working on an ASR model for English spoken by non-native speakers from different accent origins, such as Arabic, Asian, Spanish, and Indian speakers. Thanks in advance!
-
Hi @Omarnabk, great questions!
The model stt_en_conformer_ctc_large (https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large) is a Conformer, not a Citrinet. You might instead want to use stt_en_citrinet_1024_gamma_0_25 (https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_1024_gamma_0_25) if you want a Citrinet.
You do not need a separate model per accent if there is sufficient data per accent (all accents should have roughly the same amount of speech). If one accent has less data than the others, it might benefit from a separate model.
To segment audio, NeMo provides a CTC segmentation toolkit that you could try for your case: https://github.com/NVIDIA/NeMo/tree/main/tools/ctc_segmentation
We do not (yet) support KenLM integration during buffered/streaming inference. Because running beam search is costly, I would suggest first finishing the transcription and only then applying KenLM beam search once the entire sentence has been transcribed.
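The two-pass idea above (transcribe first, rescore with the LM afterwards) can be sketched in pure Python. The `ToyBigramLM` below is only a stand-in for KenLM so the example is self-contained; in practice you would load a real KenLM binary (e.g. via the `kenlm` Python package) and the n-best list would come from your ASR decoder. The hypotheses, scores, and `lm_weight` are illustrative values, not from the thread.

```python
import math

class ToyBigramLM:
    """Toy bigram LM with add-one smoothing, standing in for KenLM."""

    def __init__(self, corpus):
        self.unigrams = {}
        self.bigrams = {}
        for sentence in corpus:
            words = ["<s>"] + sentence.split()
            for w in words:
                self.unigrams[w] = self.unigrams.get(w, 0) + 1
            for a, b in zip(words, words[1:]):
                self.bigrams[(a, b)] = self.bigrams.get((a, b), 0) + 1
        self.vocab_size = len(self.unigrams)

    def score(self, sentence):
        # Sum of smoothed bigram log-probabilities, like kenlm.Model.score.
        words = ["<s>"] + sentence.split()
        logp = 0.0
        for a, b in zip(words, words[1:]):
            num = self.bigrams.get((a, b), 0) + 1
            den = self.unigrams.get(a, 0) + self.vocab_size
            logp += math.log(num / den)
        return logp

def rescore(nbest, lm, lm_weight=0.5):
    # nbest: list of (hypothesis, acoustic_log_score) from the first pass.
    # Pick the hypothesis with the best combined acoustic + LM score.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm.score(h[0]))[0]

# Domain text containing the names we care about.
lm = ToyBigramLM(["call doctor smith", "call doctor jones", "see doctor smith"])

nbest = [
    ("call doctor smyth", -4.0),  # slightly better acoustic score
    ("call doctor smith", -4.2),  # preferred by the domain LM
]
print(rescore(nbest, lm))  # the LM flips the ranking to "call doctor smith"
```

This also illustrates the point below about names: the acoustically best hypothesis loses to the one the domain LM has actually seen.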
ASR models would not have enough training speech data to generalize to special keywords or names, since these are usually unique. A domain-specific KenLM model built from text that contains these special keywords and people's names would be easier to integrate.
-
Many thanks for your reply. For my dataset, I found that Citrinet works better than Conformer. I also tried different versions of Citrinet: stt_en_citrinet_256, stt_en_citrinet_512, stt_en_citrinet_1024, and stt_en_citrinet_1024_gamma_0_25. The WERs I obtained were 17%, 15%, 14%, and 40%, respectively. Based on your experiments, stt_en_citrinet_1024_gamma_0_25 is supposed to be the best, but for me it was the worst. Even with KenLM or neural rescoring, the error is still very high. I tried to fine-tune all the versions, and again stt_en_citrinet_1024_gamma_0_25 was the worst. Any opinion, or is a particular configuration needed to fine-tune it? Thanks!
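For anyone comparing numbers like the WERs above, here is a minimal word-error-rate sketch (word-level Levenshtein distance over reference length); NeMo also ships its own WER utilities, so this is just to make the metric concrete. The example sentences are made up for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```

Note that WER depends heavily on text normalization (casing, punctuation, numbers), so make sure all models are scored against identically normalized references before comparing checkpoints.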