
Citrinet model with LM to reduce the WER for microphone recorded audio #2039

Closed
kruthikakr opened this issue Apr 9, 2021 · 7 comments

@kruthikakr

Hi,
I am using the stt_en_citrinet_1024 model and am able to get good transcripts. The audio is recorded with a microphone, and the WER varies from 3.5% to 15%. The recordings contain names of people and places; how can I include these words in the model?
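For reference, a minimal sketch of the setup in use here (the audio file name is a placeholder):

```python
# Minimal sketch of the inference setup described above; the file name is a placeholder.
import nemo.collections.asr as nemo_asr

# Load the pretrained Citrinet-1024 (BPE/CTC) model from NGC.
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_citrinet_1024")

# Greedy CTC transcription of a microphone recording (mono 16 kHz WAV).
transcripts = asr_model.transcribe(["mic_recording_01.wav"])
print(transcripts[0])
```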

Any suggestions on the following aspects would be appreciated:

  1. Preprocessing the audio
  2. Using an LM during decoding (is there an implementation in NeMo for Citrinet?)
  3. Post-processing steps, e.g. spell correction

Looking forward to your inputs.

@titu1994
Collaborator

titu1994 commented Apr 9, 2021

You could fine-tune Citrinet on the specific domain using the same tokenizer (if there is sufficient data).
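A minimal sketch of that kind of fine-tuning, assuming a NeMo-style JSON manifest of in-domain recordings (the manifest path, batch size, and trainer settings are placeholders):

```python
# Hedged sketch: fine-tune the pretrained Citrinet on in-domain data while keeping
# the original tokenizer. Manifest path and trainer settings are placeholders.
import pytorch_lightning as pl
from omegaconf import DictConfig
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024")

train_cfg = DictConfig({
    "manifest_filepath": "domain_train_manifest.json",  # in-domain audio + transcripts
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_training_data(train_data_config=train_cfg)

# A reduced learning rate is usually appropriate when starting from a pretrained checkpoint.
trainer = pl.Trainer(gpus=1, max_epochs=50)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```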

If you have some noise files, noise-robust training can be applied to Citrinet with the same method used for QuartzNet.
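If you do have noise recordings, the perturbation can be attached to the training dataloader through its augmentor section; a rough sketch, with placeholder paths, probability, and SNR range:

```python
# Hedged sketch: on-the-fly noise augmentation in the training dataloader config,
# following the same pattern used for QuartzNet. All values are placeholders.
from omegaconf import DictConfig
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024")

train_cfg = DictConfig({
    "manifest_filepath": "domain_train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
    "augmentor": {
        "noise": {
            "manifest_path": "noise_manifest.json",  # manifest pointing to the noise files
            "prob": 0.5,                             # fraction of samples that get noise mixed in
            "min_snr_db": 0,
            "max_snr_db": 15,
        },
    },
})
asr_model.setup_training_data(train_data_config=train_cfg)
```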

For preprocessing, the inputs should be mono-channel 16 kHz WAV files. We find that attempting signal denoising before inference generally does not help much, and sometimes hurts because of the artifacts it introduces.
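For example, a recording can be downmixed and resampled before inference roughly like this (librosa and soundfile are used only for illustration; any resampler works):

```python
# Hedged sketch: convert an arbitrary recording to a mono 16 kHz, 16-bit PCM WAV file.
import librosa
import soundfile as sf

audio, sr = librosa.load("mic_recording_01.m4a", sr=16000, mono=True)  # resample + downmix
sf.write("mic_recording_01_16k.wav", audio, sr, subtype="PCM_16")
```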

For language modelling with Citrinet (and BPE models in general), we plan to release code snippets to build a custom KenLM model and run beam search through steps similar to the offline ASR notebook. However, there are some significant differences, and we have not compiled a clean script for this task yet. I will try to prioritize that in the coming weeks.
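For context, the main difference from the character-level pipeline is that the n-gram LM has to be trained on the model's BPE tokens rather than on raw characters. A rough, hypothetical sketch of that encoding step (the token offset and file names are illustrative assumptions, not the exact scheme of the released scripts):

```python
# Hedged sketch: encode a text corpus with the ASR model's tokenizer so that KenLM can be
# trained on BPE token sequences. The offset and file names are illustrative assumptions.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024")
TOKEN_OFFSET = 100  # shift token ids into a printable unicode range (assumption)

with open("lm_corpus.txt") as fin, open("lm_corpus_encoded.txt", "w") as fout:
    for line in fin:
        token_ids = asr_model.tokenizer.text_to_ids(line.strip().lower())
        fout.write("".join(chr(i + TOKEN_OFFSET) for i in token_ids) + "\n")

# KenLM's lmplz can then build the n-gram model from lm_corpus_encoded.txt.
```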

There is also Transformer-based rescoring that can further reduce offline WER, though that pipeline is not ready yet.

@AlexGrinch, is there an ETA (within a few months?) for when you expect the Transformer-based rescoring pipeline to be available?

@kruthikakr
Author

Thank you for the response.
We will try to write the script for LM decoding with BPE models. Any inputs or leads are much appreciated.

@titu1994
Collaborator

@VahidooX If you have a rough draft, could you create a gist and share it here when it's ready? We can clean it up in the actual PR.

@kruthikakr
Author

Can someone please share some details on this? Still waiting for a response.

@VahidooX
Collaborator

VahidooX commented Apr 19, 2021

Created a PR to add support for training and evaluating an n-gram KenLM on top of BPE-based ASR models. It still needs documentation: #2066

@VahidooX
Collaborator

The PR to support N-gram LMs for ASR models has been merged: #2066
It can run a grid search over the beam search decoder's hyperparameters to tune them. The scripts support both character-level and BPE-level models. You can read more here: https://github.com/NVIDIA/NeMo/blob/main/docs/source/asr/asr_language_modelling.rst

You need to install the beam search decoders and KenLM to use this feature.

@kruthikakr
Author

Thank you very much.

@NVIDIA NVIDIA locked and limited conversation to collaborators Apr 23, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
