Perform cleaning and normalization to standardize speech transcripts (train and test) across datasets.
We perform the following steps:
- Remove punctuation
- Normalize the characters
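
As a rough illustration of the two steps above, here is a minimal sketch using only the Python standard library. It is not the code in cleaning.py: the function names are hypothetical, and Unicode NFC normalization is used as a stand-in for whatever language-specific normalization the script applies (for example via indic-nlp-library).

```python
import unicodedata


def normalize_characters(text: str) -> str:
    # Assumption: NFC composition stands in for the script's real
    # language-specific normalization, which may do more than this.
    return unicodedata.normalize("NFC", text)


def remove_punctuation(text: str) -> str:
    # Drop every character whose Unicode general category starts with "P"
    # (punctuation). This also covers script-specific marks such as the
    # Devanagari danda. Punctuation is deleted, not replaced by a space.
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse any runs of whitespace left behind and trim the ends.
    return " ".join(kept.split())


def clean_transcript(text: str) -> str:
    # Normalize first so that equivalent character sequences are unified
    # before punctuation is stripped.
    return remove_punctuation(normalize_characters(text))


if __name__ == "__main__":
    print(clean_transcript("Hello,  world!"))
    print(clean_transcript("नमस्ते।"))
```

Deleting punctuation outright means that "don't" becomes "dont"; replacing punctuation with a space instead is an equally plausible choice, depending on how the transcripts will be used.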
Note: We do not apply the following steps to speech transcripts, but they may be useful when preparing text for language models (LMs) used with speech models.
- Remove sentences containing out-of-dictionary characters
- Convert numbers to words
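
The out-of-dictionary filter can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the dictionary is passed in as a plain set of characters, and the space character is always treated as allowed. The format of the dictionaries shipped with this repository may differ.

```python
from typing import Iterable, List, Set


def filter_sentences(sentences: Iterable[str], allowed_chars: Set[str]) -> List[str]:
    # Assumption: the space character is always permitted, in addition to
    # whatever the language-specific dictionary contains.
    allowed = set(allowed_chars) | {" "}
    # Keep a sentence only if every one of its characters is in the dictionary;
    # a single out-of-dictionary character drops the whole sentence.
    return [s for s in sentences if all(ch in allowed for ch in s)]


if __name__ == "__main__":
    # Toy dictionary for illustration: lowercase ASCII letters only.
    toy_dictionary = set("abcdefghijklmnopqrstuvwxyz")
    print(filter_sentences(["hello world", "héllo", "abc1"], toy_dictionary))
```

Dropping whole sentences, instead of deleting the offending characters, avoids leaving behind text that no longer matches what was actually written.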
We also provide language-specific character dictionaries to remove out-of-dictionary (OOD) characters for each language. We recommend using the same dictionaries when training ASR models.
pip install pandas
pip install indic-nlp-library
python cleaning.py <input_file> <output_file> <lang_code>
input_file: file containing sentences (one sentence per line)
output_file: path to output file
lang_code: language code in ISO format
Released under the MIT license