
Almost State-of-the-art Automatic Speech Recognition in TensorFlow 2

TensorFlowASR implements some automatic speech recognition architectures such as DeepSpeech2, Jasper, RNN Transducer, ContextNet, Conformer, etc. These models can be converted to TFLite to reduce memory and computation for deployment 😄

What's New?

  • (02/16/2021) Added support for TPU training
  • (12/27/2020) Added naive token-level timestamps; see the demo with the --timestamp flag
  • (12/17/2020) Added support for ContextNet
  • (12/12/2020) Added support for masking
  • (11/14/2020) Added gradient accumulation for training with larger batch sizes


😋 Supported Models


  • CTCModel (end-to-end models trained with CTC loss; DeepSpeech2 and Jasper are currently supported)
  • Transducer Models (end-to-end models trained with RNNT loss; Conformer, ContextNet and Streaming Transducer are currently supported)



Installation

Install tensorflow>=2.3.0 or tf-nightly.

For training and testing, install from a git clone so that the necessary packages from other authors (ctc_decoders, rnnt_loss, etc.) can be installed as well.

Installing via PyPI

Run pip3 install -U TensorFlowASR

Installing from source

git clone
cd TensorFlowASR
pip3 install .

For anaconda3:

conda create -y -n tfasr tensorflow-gpu python=3.8 # tensorflow if using CPU
conda activate tfasr
pip install -U tensorflow-gpu # upgrade to latest version of tensorflow
git clone
cd TensorFlowASR
pip install .

Running in a container

docker-compose up -d

Setup training and testing

  • For datasets, see datasets

  • For training, testing and using CTC Models, run ./scripts/

  • For training Transducer Models with RNNT Loss from warp-transducer, run export CUDA_HOME=/usr/local/cuda && ./scripts/ (Note: only export CUDA_HOME when you have CUDA)

  • For training Transducer Models with RNNT Loss in TF, make sure that warp-transducer is not installed (simply run pip3 uninstall warprnnt-tensorflow)

  • For mixed precision training, use the --mxp flag when running the Python scripts from examples

  • To enable XLA, run TF_XLA_FLAGS=--tf_xla_auto_jit=2 python3 $path_to_py_script

  • For hiding warnings, run export TF_CPP_MIN_LOG_LEVEL=2 before running any examples
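
The two environment variables in the list above can also be set from Python instead of the shell. The following is a minimal sketch, not part of this repository: the helper name configure_tf_env and its parameters are hypothetical, and it assumes the variables are set before TensorFlow is imported so that they are in place when TensorFlow first reads them.

```python
import os


def configure_tf_env(enable_xla=True, hide_warnings=True, env=None):
    """Set the TensorFlow environment variables mentioned above.

    `env` defaults to os.environ; pass a plain dict to inspect the
    values without touching the real process environment.
    """
    env = os.environ if env is None else env
    if enable_xla:
        # Same value as the TF_XLA_FLAGS=... prefix on the command line.
        env["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"
    if hide_warnings:
        # Level 2 filters out INFO and WARNING messages from the C++ backend.
        env["TF_CPP_MIN_LOG_LEVEL"] = "2"
    return env


# A plain dict is used here so the example leaves os.environ untouched.
settings = configure_tf_env(env={})
print(settings)
```

In a real script, calling configure_tf_env() with no arguments before the first import tensorflow line would apply the same settings to the current process.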

TFLite Conversion

After conversion, the TFLite model acts as a single function that maps an audio signal directly to Unicode code points, which can then be converted to a string.

  1. Install tf-nightly using pip install tf-nightly
  2. Build a model with the same architecture as the trained model (if the model has a tflite argument, you must set it to True), then load the weights from the trained model into the built model
  3. Add TFSpeechFeaturizer and TextFeaturizer to the model using the function add_featurizers
  4. Convert the model's function to TFLite as follows:
import os
import tensorflow as tf

func = model.make_tflite_function(**options) # options are the arguments of the function
concrete_func = func.get_concrete_function()
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# SELECT_TF_OPS is an assumed second entry (the original list is cut off here);
# it allows TensorFlow ops that have no TFLite builtin equivalent
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
  5. Save the converted TFLite model as follows:
if not os.path.exists(os.path.dirname(tflite_path)):
    os.makedirs(os.path.dirname(tflite_path)) # create the output folder first
with open(tflite_path, "wb") as tflite_out:
    tflite_out.write(tflite_model)
  6. The .tflite model is then ready to be deployed
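
Since the converted model returns Unicode code points rather than text, the caller still has to turn them into a string. A minimal pure-Python sketch of that last step follows. The function name upoints_to_text is hypothetical, and treating 0 as a padding value that carries no character is an assumption, not something this README states.

```python
def upoints_to_text(upoints, pad_value=0):
    """Turn a sequence of Unicode code points into a Python string.

    pad_value is an assumption: positions holding it are skipped
    because they are taken to carry no character.
    """
    return "".join(chr(int(p)) for p in upoints if int(p) != pad_value)


# Hypothetical model output for the word "hello", padded with zeros.
example_output = [104, 101, 108, 108, 111, 0, 0]
print(upoints_to_text(example_output))  # hello
```

The int() conversion lets the same function accept plain Python integers or integer values read from an array returned by an interpreter.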

Features Extraction

See features_extraction


Augmentations

See augmentations

Training & Testing Tutorial

  1. Define a config YAML file; see the config.yml files in the example folder for reference (you can copy them and modify values such as parameters and paths to match your local machine configuration)
  2. Download your corpus (a.k.a. dataset) and create a script to generate transcripts.tsv files from it (this is the general format used in this project, because each dataset has a different format). For more detail, see datasets. Note: Make sure your data contains only characters from your language; for example, English has a to z and '. Do not use the cache if your dataset does not fit in RAM.
  3. [Optional] Generate TFRecords for better performance by using the script
  4. Create a vocabulary file (characters or subwords/wordpieces) by defining language.characters or by using the scripts. There are predefined ones in vocabularies
  5. [Optional] Generate a metadata file for your dataset by using the script. This metadata file contains the maximum lengths calculated with your config.yml and the total number of elements in each dataset, which are used for static shape training and precalculated steps per epoch.
  6. For training, see the train_*.py files in the example folder for the available options
  7. For testing, see the test_*.py files in the example folder for the available options. Note: Testing is currently not supported on TPUs. Testing prints nothing to the console other than the progress bar, but it stores the predicted transcripts in the file output_name.tsv in the outdir defined in the config YAML file. After testing is done, the metrics (WER and CER) are calculated from output_name.tsv. If you reuse the same output_name, testing resumes from the last tested batch, so if testing had already finished, only the metrics are calculated. To run a new test, define a new output_name whose file does not exist or contains only the header.
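
To make the WER and CER metrics from the testing step concrete, here is a minimal sketch of how such error rates are commonly computed from reference and predicted transcripts, using a standard Levenshtein edit distance. It is an illustration only, not this project's implementation: the function names are hypothetical, and the pairs are given inline rather than read from output_name.tsv because the column layout of that file is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete an item of ref
                cur[j - 1] + 1,          # insert an item of hyp
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs, unit="word"):
    """Total edit distance divided by total reference length.

    unit="word" gives WER, unit="char" gives CER.
    """
    errors = total = 0
    for ref, hyp in pairs:
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return errors / total if total else 0.0


# Hypothetical (reference, prediction) pairs.
pairs = [("the cat sat", "the cat sit"), ("hello world", "hello world")]
wer = error_rate(pairs, "word")  # 1 wrong word out of 5
cer = error_rate(pairs, "char")  # 1 wrong character out of 22
print(f"WER={wer:.3f} CER={cer:.3f}")
```

Summing errors and lengths over the whole test set, rather than averaging per-utterance rates, keeps long utterances from being under-weighted.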

Recommendation: For better performance, use the Keras built-in training functions, as in the train_keras_*.py files, and/or TFRecords. Keras built-in training uses an infinite dataset, which avoids a potential partial last batch.
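
The effect of an infinite dataset on the last batch can be illustrated without TensorFlow. The sketch below is a plain-Python analogy under stated assumptions: the dataset size and batch size are hypothetical, and rounding the steps per epoch up with math.ceil is an assumption about how the precalculated value might be derived, not a statement about this project's code.

```python
import itertools
import math


def batches_from_infinite(items, batch_size, steps):
    """Yield `steps` full batches by cycling over `items` endlessly."""
    stream = itertools.cycle(items)
    for _ in range(steps):
        yield [next(stream) for _ in range(batch_size)]


total_elements = 10  # hypothetical dataset size
batch_size = 4
# Precalculated steps per epoch (rounded up, by assumption).
steps_per_epoch = math.ceil(total_elements / batch_size)

batches = list(
    batches_from_infinite(range(total_elements), batch_size, steps_per_epoch)
)
for batch in batches:
    print(batch)
```

With a finite dataset the third batch would hold only two elements; here it is filled by wrapping around to the start, so every batch has the same static shape.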

See examples for some predefined ASR models and results

Corpus Sources and Pretrained Models

For pretrained models, go to drive


English

Name | Source | Hours
LibriSpeech | LibriSpeech | 970h
Common Voice | | 1932h


Vietnamese

Name | Source | Hours
Vivos | | 15h
InfoRe Technology 1 | InfoRe1 (passwd: BroughtToYouByInfoRe) | 25h
InfoRe Technology 2 (used in VLSP2019) | InfoRe2 (passwd: BroughtToYouByInfoRe) | 415h


German

Name | Source | Hours
Common Voice | | 750h

References & Credits

  1. NVIDIA OpenSeq2Seq Toolkit
  3. Sequence Transduction with Recurrent Neural Network
  4. End-to-End Speech Processing Toolkit in PyTorch


Huy Le Nguyen