
End-to-End Automatic Speech Recognition For Gujarati

ICON 2020: 17th International Conference on Natural Language Processing

Deepang Raval | Vyom Pathak | Muktan Patel | Brijesh Bhatt

Dharmsinh Desai University, Nadiad

We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we propose different analysis methods. These insights help to understand our ASR system for a particular language (Gujarati) and can also guide ASR systems to improve performance for low resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11% decrease in Word Error Rate (WER) with respect to base-model WER.
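
For a concrete picture of the acoustic model described above, the following is a minimal, illustrative TensorFlow/Keras sketch of a CNN + BiLSTM + Dense model trained with CTC loss. The layer counts, sizes, feature dimension, and vocabulary size are assumptions for illustration only; they are not the exact configuration used in the paper or this repository.

# Illustrative CNN + BiLSTM + Dense acoustic model trained with CTC loss.
# All sizes below are placeholder assumptions, not the paper's configuration.
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 40   # acoustic features per frame (assumed)
NUM_CLASSES = 80    # Gujarati characters + CTC blank (assumed)

inputs = layers.Input(shape=(None, NUM_FEATURES), name="audio_features")
x = layers.Conv1D(256, kernel_size=11, strides=2, padding="same", activation="relu")(inputs)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Dense(256, activation="relu")(x)
logits = layers.Dense(NUM_CLASSES, name="char_logits")(x)  # per-frame character logits
model = tf.keras.Model(inputs, logits)

# CTC aligns the unsegmented character sequence with the per-frame logits.
def ctc_loss(labels, logits, label_length, logit_length):
    return tf.reduce_mean(
        tf.nn.ctc_loss(
            labels=labels,
            logits=logits,
            label_length=label_length,
            logit_length=logit_length,
            logits_time_major=False,
            blank_index=NUM_CLASSES - 1,
        )
    )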

If you find this work useful, please cite this work using the following BibTeX:

@inproceedings{raval-etal-2020-end,
    title = "End-to-End Automatic Speech Recognition for {G}ujarati",
    author = "Raval, Deepang  and
      Pathak, Vyom  and
      Patel, Muktan  and
      Bhatt, Brijesh",
    booktitle = "Proceedings of the 17th International Conference on Natural Language Processing (ICON)",
    month = dec,
    year = "2020",
    address = "Indian Institute of Technology Patna, Patna, India",
    publisher = "NLP Association of India (NLPAI)",
    url = "https://aclanthology.org/2020.icon-main.56",
    pages = "409--419",
    abstract = "We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we proposed different analysis methods. These insights help to understand our ASR system based on a particular language (Gujarati) as well as can govern ASR systems{'} to improve the performance for low resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11{\%} decrease in Word Error Rate (WER) with respect to base-model WER.",
}

Setup

System & Requirements

  • Linux OS
  • Python-3.6
  • TensorFlow-2.2.0
  • CUDA-11.1
  • CUDNN-7.6.5

Setting up repository

git clone https://github.com/01-vyom/End_2_End_Automatic_Speech_Recognition_For_Gujarati.git
python -m venv asr_env
source $PWD/asr_env/bin/activate

Installing Dependencies

Change directory to the root of the repository.

pip install --upgrade pip
pip install -r requirements.txt

Running Code

Change directory to the root of the repository.

Training

To train the model in the paper, run this command:

python ./Train/train.py

Note:

  • If required, change the variables PathDataAudios and PathDataTranscripts in the Train/feature_extractor.py file to point to the audio files directory and the transcript file.
  • If required, change the variable currmodel in the Train/train.py file to change the name under which the trained model is saved (see the sketch after these notes).
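
For reference, these configuration edits amount to assigning module-level variables; the paths and the model name below are placeholders, not the repository's defaults:

# In Train/feature_extractor.py (placeholder paths; point them at your dataset)
PathDataAudios = "/path/to/gu-in-Train/Audios/"
PathDataTranscripts = "/path/to/gu-in-Train/transcription.txt"

# In Train/train.py (placeholder name under which the trained model is saved)
currmodel = "gujarati_asr_cnn_bilstm"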

Evaluation

Inference

To run inference with the trained model, run:

python ./Eval/inference.py

Note:

  • Change the variables PathDataAudios and PathDataTranscripts to point to the audio files directory and the transcript file used for testing.
  • To change the model used for inference, change the model variable; to change the test file, change the test_data variable.
  • The output will be a .pickle file of references and hypotheses, stored under a model-specific name in the ./Eval/ folder (see the sketch below).
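
A quick way to inspect the inference output is to load the pickle and print a few reference/hypothesis pairs. The file name and the structure of the pickled object below are assumptions; adjust them to match what ./Eval/inference.py actually writes:

# Minimal sketch for inspecting the inference output (structure assumed).
import pickle

with open("./Eval/<model_name>.pickle", "rb") as f:  # placeholder file name
    references, hypotheses = pickle.load(f)

for ref, hyp in list(zip(references, hypotheses))[:5]:
    print("REF:", ref)
    print("HYP:", hyp)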

Decoding

To decode the inferred output, run:

python ./Eval/decode.py

Note:

  • To select a model-specific .pickle file, change the model variable.
  • The output will be stored in ./Eval/ as a model-specific file containing the actual text and all decoding types (see the sketch below).
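
The decoded output can be inspected with pandas; the file name and column names below are assumptions and should be adjusted to match what ./Eval/decode.py writes:

# Minimal sketch for inspecting the decoded output (file/column names assumed).
import pandas as pd

df = pd.read_csv("./Eval/<model_name>_decoded.csv")  # placeholder file name
print(df.columns.tolist())  # e.g. actual text plus one column per decoding type
print(df.head())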

Post-Processing

For post-processing the decoded output, follow the steps mentioned in this README.

System Analysis

To perform the system analysis, run:

python "./System Analysis/system_analysis.py"

Note:

  • To select a model-specific decoding .csv file to analyze, change the model variable.

  • To select a specific column (hypothesis type) to analyze, change the type variable. The output files will be saved in ./System Analysis/, specific to a model and type of decoding.

Results

Our algorithm achieves the following performance:

Technique                                  WER reduction (%)
Prefix with LMs                            2.42
Prefix with LMs + Spell Corrector BERT     5.11

Note:

  • These reductions in WER are relative to greedy decoding (see the sketch below).
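
WER itself can be recomputed from any reference/hypothesis pair with a standard word-level edit distance. The following is a generic sketch, not the repository's own evaluation code:

# Word error rate (WER) via Levenshtein distance over words (generic sketch).
def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref_words), 1)

print(wer("આ એક પરીક્ષણ છે", "આ એક પરીક્ષા છે"))  # 0.25: one of four words substituted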

Acknowledgement

The prefix decoding code is based on the open-source implementations 1 and 2. The code for the BERT-based spell corrector is adapted from this open-source implementation.

Licensed under the MIT License.