This is a tensorflow implementation of end-to-end ASR. Though there are several fantastic github repos in tensorflow, I tried to implement it without using
tf.contrib.seq2seq API in this repo. In addition, the performance on LibriSpeech dev/test sets is evaluated and the evaluation results are reasonable.
- Char/Subword text encoding.
- MFCC/fbank acoustic features with per utterance CMVN.
- LAS training (visualized with tensorboard: loss, sample text outputs, features, alignments).
- TFrecord dataset pipline.
- Batch testing using greedy decoder.
- Beam search decoder.
Note that this project is still in progress.
- Currently, I only test this model on MFCC 39 (13+delta+accelerate) features.
- CTC related parts are not yet fully tested.
- Volume augmentation is currently commented out because it shows little improvements.
- CNN-based Listener.
- Location-aware attention.
- Augmentation include speed perturbation.
- Label smoothing. (IMPORANT)
- Scheduled learning rate.
- Scheduled sampling.
- Generally, LibriSpeech-100 is not large enough to train a well-perform LAS.
- In my experience, adding more data is the best policy.
- A better way to check if your model is learning in a right way is to monitor the speech-text aligments in tensorboard or set verbosity=1.
pip3 install virtualenv virtualenv --python=python3 venv source venv/bin/activate pip3 install -r requirements.txt
The definitions of the args are described in las/arguments.py. You can modify all args there before preprocessing, training, testing and decoding.
0) Prepare data
- Libirspeech train/dev/test data
1) Preprocess Audios & Texts. / Training pipeline.
I include dataset pipeline and training pipeline in run.sh.
Train RNNLM: (NOT READY)
Testing with gready decoder.
python3 test.py --split SPLIT \ # test or dev --unit UNIT \ --feat_dim FEAT_DIM \ --feat_dir FEAT_DIR \ --save_dir SAVE_DIR
Beam search decoder.
python3 decode.py --split SPLIT \ # test or dev --unit UNIT \ --beam_size BEAM_SIZE \ --convert_rate 0.24 \ # 0.24 is large enough. --apply_lm APPLY_LM \ --lm_weight LM_WEIGHT \ --feat_dim FEAT_DIM \ --feat_dir FEAT_DIR \ --save_dir SAVE_DIR
tensorboard --logdir ./summary
Results trained on LibriSpeech-360 (WER)
- Listen, Attend and Spell
- Attention-Based Models for Speech Recognition
- On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
- Char RNNLM: TensorFlow-Char-RNN
- A nice Pytorch version: End-to-end-ASR-Pytorch.
- Add scheduled sampling.
- Add location-aware attention.
- Evaluate the performance with subword unit: Subword las training.
- Decoding with subword-based RNNLM.
- Evaluate the performance on joint CTC training, decoding.