This tutorial uses the following binaries with the following capabilities:
- `fl_asr_tutorial_inference_ctc`: perform inference with a pretrained model trained with CTC loss
- `fl_asr_tutorial_finetune_ctc`: finetune a pretrained CTC model with additional data
- `fl_asr_align`: force-align audio and transcriptions using a CTC model
- `fl_asr_voice_activity_detection_ctc`: detect speech and perform audio analysis
The wav2letter Robust ASR (RASR) recipe contains robust pre-trained models and resources for finetuning, some of which are used in the Colab tutorials above.
See the full documentation for more general training or decoding instructions.
The outline below describes the end-to-end process of finetuning a pretrained acoustic model in several steps:

1. Preprocess the audio.
   a. Most audio formats are supported and are automatically detected.
   b. All audio used in training or inference must have the same sample rate, so up/downsampling may be necessary. The provided pretrained models were trained on 16 kHz audio and require that sample rate for finetuning.
2. Force-align the labeled audio.
   a. Using the existing transcriptions, generate audio-text alignments with the `fl_asr_align` binary. See the full alignment documentation.
   b. Based on the alignments, trim the existing audio to the sections containing speech. Doing so typically increases training speed.
3. Generate final list files for the training and validation sets using the trimmed audio and transcriptions. See the list file documentation for more details.
4. Use the `fl_asr_tutorial_finetune_ctc` binary to finetune the pretrained model (or train your own from scratch). List files can be passed to finetuning or inference binaries using the `train`/`valid` or `test` flags, respectively.
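Step 1b above notes that the provided pretrained models expect 16 kHz audio. A tool such as sox or ffmpeg is the usual way to resample; the sketch below only assembles and prints a standard sox command (file names are placeholders, not files from this tutorial):

```shell
# Step 1b: resample placeholder input.wav to the 16 kHz rate the
# pretrained models expect. The command string is printed rather than
# executed, since input.wav is a placeholder.
resample_cmd="sox input.wav -r 16000 input_16k.wav"
echo "$resample_cmd"
# An equivalent ffmpeg invocation: ffmpeg -i input.wav -ar 16000 input_16k.wav
```

Run `soxi -r input.wav` first if you want to check whether resampling is needed at all.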
See this colab notebook for a step-by-step tutorial.
The `fl_asr_tutorial_inference_ctc` binary provides a way to perform inference with CTC-trained acoustic models. To perform inference, you'll need the following components (with their corresponding flags):
- An acoustic model (AM) (`am_path`)
- A token set with which the AM was trained (`tokens_path`)
- A lexicon (`lexicon_path`)
- A language model for decoding (`lm_path`)
The following parameters are also configurable when performing inference:
- The sample rate of the input audio (`sample_rate`)
- The beam size for decoding (`beam_size`)
- The token beam size for decoding (`beam_size_token`)
- The beam threshold for decoding (`beam_threshold`)
- The LM weight for decoding (`lm_weight`)
- The word score for decoding (`word_score`)
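Putting the flags above together, an invocation might look like the following sketch; every path and numeric value is a placeholder illustrating flag syntax, not a recommended setting, so the script only assembles and prints the command:

```shell
# Sketch of a full inference invocation; all paths and values are placeholders.
cmd="./fl_asr_tutorial_inference_ctc \
  --am_path=am.bin \
  --tokens_path=tokens.txt \
  --lexicon_path=lexicon.txt \
  --lm_path=lm.bin \
  --sample_rate=16000 \
  --beam_size=50 \
  --beam_size_token=30 \
  --beam_threshold=100 \
  --lm_weight=1.5 \
  --word_score=0"
echo "$cmd"
```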
See the complete ASR app documentation for a more detailed explanation of each of these flags. See the aforementioned colab tutorial for sensible values used in a demo.
See this colab notebook for a step-by-step tutorial.
The `fl_asr_tutorial_finetune_ctc` binary provides a means of finetuning a pretrained acoustic model on additional labeled audio. Usage of the binary is as follows:

```
./fl_asr_tutorial_finetune_ctc [path to directory containing model] [...flags]
```

To finetune, you'll need the following components (with their corresponding flags):
- An acoustic model (AM) to finetune (the first argument to the binary invocation, e.g. `fl_asr_tutorial_finetune_ctc [path] [...flags]`)
- A token set with which the AM was trained (`tokens`)\*
- A lexicon (`lexicon`)
- Validation sets to use for finetuning (`valid`)
- Train sets with data on which to finetune (`train`)
- Other training flags for flashlight training or audio processing as per the ASR documentation.

\* Should be identical to that with which the original AM was trained. Will be provided with the AM in recipes/tutorials.
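As a minimal sketch tying the pieces together: the snippet below writes a hypothetical two-utterance list file in a `[sample id] [audio path] [duration] [transcription]` layout (verify the exact column conventions against the list file documentation), then shows how a finetuning run would reference it. The invocation is commented out because the model directory and supporting files here are placeholders:

```shell
# Hypothetical list file; columns assumed to be
# [sample id] [audio path] [duration] [transcription].
cat > train.lst <<'EOF'
utt001 /data/audio/utt001_trimmed.flac 2430.0 the quick brown fox
utt002 /data/audio/utt002_trimmed.flac 1810.5 jumped over the lazy dog
EOF
wc -l < train.lst  # line count = number of utterances

# The finetuning run then points its flags at the list files
# (commented out: model_dir, tokens.txt, lexicon.txt, valid.lst are placeholders):
# ./fl_asr_tutorial_finetune_ctc model_dir \
#     --tokens=tokens.txt --lexicon=lexicon.txt \
#     --train=train.lst --valid=valid.lst
```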