This repository has been archived by the owner on Aug 3, 2021. It is now read-only.

S2T rename (#295)
* rename w2l encoder to tdnnencoder

Signed-off-by: Jason <jasoli@nvidia.com>

* add jasper docs

Signed-off-by: Jason <jasoli@nvidia.com>
blisc authored and vsl9 committed Nov 29, 2018
1 parent 0b21e2a commit c6315c7
Showing 18 changed files with 146 additions and 78 deletions.
2 changes: 1 addition & 1 deletion docs/sources/source/in-depth-tutorials.rst
@@ -1,6 +1,6 @@
.. _in_depth:

In-depth tutorials
In-depth Tutorials
==================

.. toctree::
2 changes: 1 addition & 1 deletion docs/sources/source/index.rst
@@ -24,7 +24,7 @@ OpenSeq2Seq
OpenSeq2Seq is a TensorFlow-based toolkit for training sequence-to-sequence models:

* :ref:`machine translation <machine_translation>` (GNMT, Transformer, ConvS2S, ...)
* :ref:`speech recognition <speech_recognition>` (DeepSpeech2, Wave2Letter, ...)
* :ref:`speech recognition <speech_recognition>` (DeepSpeech2, Wave2Letter, Jasper, ...)
* :ref:`speech synthesis <speech_synthesis>` (Tacotron2, ...)
* :ref:`language model <language_model>` (LSTM, ...)
* :ref:`sentiment analysis <sentiment_analysis>` (SST, IMDB, ...)
2 changes: 1 addition & 1 deletion docs/sources/source/installation.rst
@@ -1,6 +1,6 @@
.. _installation:

Installation instructions
Installation Instructions
=========================

Pre-built docker container
2 changes: 1 addition & 1 deletion docs/sources/source/mixed-precision.rst
@@ -1,6 +1,6 @@
.. _mixed_precision:

Mixed precision training
Mixed Precision Training
========================

.. epigraph::
11 changes: 6 additions & 5 deletions docs/sources/source/speech-recognition.rst
@@ -28,14 +28,14 @@ Currently we support following models:
- `w2l_plus_large_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_
- `link <https://drive.google.com/file/d/10EYe040qVW6cfygSZz6HwGQDylahQNSa/view?usp=sharing>`_

* - :doc:`Wave2Letter+-34 </speech-recognition/wave2letter>`
* - :doc:`Jasper 10x3 </speech-recognition/jasper>`
- 5.10
- `w2lplus_xlarge_34_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_34_8gpus_mp.py>`_
- `jasper_10x3_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/jasper_10x3_8gpus_mp.py>`_
- `link <https://drive.google.com/a/nvidia.com/file/d/1hI9Rv_px5vqpuWQOCwfKmZzRVXMPiTtT/view?usp=sharing>`_

* - :doc:`Wave2Letter+-54-syn </speech-recognition/wave2letter>`
* - :doc:`Jasper 10x5 syn </speech-recognition/jasper>`
- 4.32
- `w2lplus_xlarge_54_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_54_8gpus_mp.py>`_
- `jasper_10x5_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/jasper_10x5_8gpus_mp.py>`_
- `link <https://drive.google.com/a/nvidia.com/file/d/1b9CHczABFG4TRgtZg_jSaRQ-8oCjay76/view?usp=sharing>`_


@@ -53,6 +53,7 @@ have a look at the `configuration files <https://github.com/NVIDIA/OpenSeq2Seq/b

speech-recognition/deepspeech2
speech-recognition/wave2letter
speech-recognition/jasper


################
@@ -131,7 +132,7 @@ To train with Horovod on <N> GPUs, use the following command::
mpiexec --allow-run-as-root -np <N> python run.py --config_file=... --mode=train_eval --use_horovod=True

##############
Synthetic Data
Synthetic data
##############

Our current best model was trained using synthetic data. The creation of the synthetic data and the training process are described :ref:`here <synthetic_data>`.
69 changes: 69 additions & 0 deletions docs/sources/source/speech-recognition/jasper.rst
@@ -0,0 +1,69 @@
.. _jasper:

Jasper
=======

Model
~~~~~~

Jasper (Just Another Speech Recognizer) is a deep time delay neural network (TDNN) composed of blocks of 1D-convolutional layers. Jasper is a family of models where each model has a different number of layers. Jasper models are denoted as Jasper bxr, where b and r represent:

- b: the number of blocks
- r: the number of repetitions of each convolutional layer within a block

.. image:: jasper.png

All models have 4 common layers. There is an initial convolutional layer with stride 2 to decrease the time dimension of the speech. The other 3 layers are at the end of the network: the first of these has a dilation of 2 to increase the model's receptive field, and the last two are fully connected layers that project the final output to a distribution over characters.
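
A Jasper bxr model therefore has b × r convolutional block layers plus these 4 common layers, for b × r + 4 layers in total; for example, Jasper 10x3 has 10 × 3 + 4 = 34 layers and Jasper 10x5 has 10 × 5 + 4 = 54 layers, matching the 34- and 54-layer models referenced below.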

Each 1D-convolutional layer consists of a convolution, batch normalization, a clipped ReLU activation, and dropout, as shown on the left of the figure below.

There is a residual connection around each block, which consists of a projection layer followed by batch normalization. The residual is added to the output of the last 1D-convolutional layer in the block, before the final clipped ReLU activation and dropout, as shown on the right.

.. image:: jasper_layers.png
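
The sub-block and residual structure can be sketched as follows. This is a hand-written TF 1.x illustration, not the actual ``TDNNEncoder`` code; the function names and the ReLU clip value of 20 are our assumptions:

.. code-block:: python

    import tensorflow as tf

    def conv_bn_relu_dropout(x, filters, kernel_size, training,
                             dropout_rate, activate=True):
        """Conv1D -> batch norm -> clipped ReLU -> dropout (left diagram)."""
        x = tf.layers.conv1d(x, filters, kernel_size, padding='same',
                             use_bias=False)
        x = tf.layers.batch_normalization(x, training=training)
        if activate:
            x = tf.minimum(tf.nn.relu(x), 20.0)  # clipped ReLU (clip assumed)
            x = tf.layers.dropout(x, rate=dropout_rate, training=training)
        return x

    def jasper_block(x, filters, kernel_size, repeat, training, dropout_rate):
        """`repeat` conv layers plus a projected residual (right diagram)."""
        inputs = x
        for i in range(repeat):
            x = conv_bn_relu_dropout(x, filters, kernel_size, training,
                                     dropout_rate, activate=(i < repeat - 1))
        # Residual: 1x1 projection of the block input followed by batch norm,
        # added before the final clipped ReLU and dropout.
        res = tf.layers.conv1d(inputs, filters, 1, use_bias=False)
        res = tf.layers.batch_normalization(res, training=training)
        x = tf.minimum(tf.nn.relu(x + res), 20.0)
        return tf.layers.dropout(x, rate=dropout_rate, training=training)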

We preprocess the speech signal by framing the raw audio waveform with a sliding window of 20 ms and a stride of 10 ms. From each frame we extract 64 log-mel filterbank energies, which are the input features to the model.
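
As a rough illustration of this featurization (``librosa`` is our stand-in here, not the toolkit's own data layer; the file name and sample rate are placeholders):

.. code-block:: python

    import librosa
    import numpy as np

    audio, sr = librosa.load("sample.wav", sr=16000)  # 16 kHz speech (assumed)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=int(0.020 * sr),        # 20 ms window
        hop_length=int(0.010 * sr),   # 10 ms stride
        n_mels=64)                    # 64 mel filterbanks
    features = np.log(mel + 1e-10).T  # (time, 64) log-mel energies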

We train the model with the Connectionist Temporal Classification (CTC) loss. The output of the model is a sequence of letters corresponding to the speech input. The vocabulary consists of the lowercase letters a-z, the space, and the apostrophe, which together with the blank symbol used by the CTC loss gives a total of 29 symbols.
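
A minimal TF 1.x sketch of the vocabulary and loss (tensor names and shapes are illustrative, not the toolkit's):

.. code-block:: python

    import tensorflow as tf

    vocab = list("abcdefghijklmnopqrstuvwxyz") + [" ", "'"]  # 28 output labels
    num_classes = len(vocab) + 1                             # + CTC blank = 29

    # logits: [time, batch, num_classes] from the final fully connected layer
    logits = tf.placeholder(tf.float32, [None, None, num_classes])
    labels = tf.sparse_placeholder(tf.int32)       # sparse character ids
    seq_len = tf.placeholder(tf.int32, [None])     # valid frames per utterance

    loss = tf.reduce_mean(
        tf.nn.ctc_loss(labels=labels, inputs=logits, sequence_length=seq_len))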

Training
~~~~~~~~

Our current best model is the 54-layer Jasper 10x5, trained using synthetic data. It achieved a WER of 4.32% on the LibriSpeech test-clean dataset using greedy decoding:

+---------------------+-----------------------------------------------------------------------+
| Model | LibriSpeech Dataset |
+ +-----------------+-----------------+-----------------+-----------------+
| | Dev-Clean | Dev-Other | Test-Clean | Test-Other |
+ +--------+--------+--------+--------+--------+--------+--------+--------+
| | Greedy | Beam | Greedy | Beam | Greedy | Beam | Greedy | Beam |
+=====================+========+========+========+========+========+========+========+========+
| Jasper 10x3 | 5.10 | 4.37 | 15.49 | 13.46 | 5.10 | 5.14 | 16.21 | 14.35 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| Jasper 10x5 | 4.51 | 3.77 | 13.88 | 12.20 | 4.59 | 4.46 | 14.34 | 12.79 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| Jasper 10x5 syn | 4.32 | 3.74 | 13.74 | 11.57 | 4.32 | 4.39 | 14.08 | 12.21 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+


We used the Open SLR language model when decoding with beam search with a beam width of 128.
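
Without the language model, plain greedy and beam-search decoding of the CTC logits look like this in TF 1.x (a sketch reusing ``logits`` and ``seq_len`` from the CTC sketch above; the LM-rescoring decoder that produced the beam numbers in the table is a separate component):

.. code-block:: python

    import tensorflow as tf

    # logits: [time, batch, 29], seq_len: [batch], as in the Model section.
    decoded_greedy, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
    decoded_beam, _ = tf.nn.ctc_beam_search_decoder(
        logits, seq_len, beam_width=128, top_paths=1)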

The models were trained for 400 epochs (200 for the syn model) on 8 GPUs. We use the following hyperparameters (a config sketch follows this list):

* SGD with momentum = 0.9
* a polynomially decaying learning rate with an initial value of 0.05
* Layer-wise Adaptive Rate Control (LARC) with eta = 0.001
* weight decay = 0.001
* dropout (variable per layer: 0.2-0.4)
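
These settings appear roughly as follows in an OpenSeq2Seq config such as ``jasper_10x5_8gpus_mp.py``. The keys below paraphrase the list above; the ``power`` value is our assumption, so check the shipped config for exact values:

.. code-block:: python

    import tensorflow as tf
    from open_seq2seq.optimizers.lr_policies import poly_decay

    base_params = {
        "num_epochs": 400,                       # 200 for the "syn" model
        "optimizer": "Momentum",                 # SGD with momentum
        "optimizer_params": {"momentum": 0.90},
        "lr_policy": poly_decay,                 # polynomial LR decay
        "lr_policy_params": {"learning_rate": 0.05, "power": 2.0},
        "larc_params": {"larc_eta": 0.001},      # layer-wise adaptive rate control
        "regularizer": tf.contrib.layers.l2_regularizer,
        "regularizer_params": {"scale": 0.001},  # weight decay
        # dropout rates (0.2-0.4) are set per layer inside "encoder_params"
    }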

Synthetic Data
~~~~~~~~~~~~~~
All models with "syn" in their name are trained using a combined dataset of LibriSpeech and synthetic data.

The training details can be found :ref:`here <synthetic_data>`.

Mixed Precision
~~~~~~~~~~~~~~~

To use mixed precision (float16) during training, we made a few minor changes to the model. TensorFlow by default calls the Keras batch normalization on 3D input (BxTxC) and cuDNN batch normalization on 4D input (BxHxWxC). In order to use cuDNN's batch normalization, we added an extra dimension to the 3D input to make it a 4D tensor (BxTx1xC).
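
A sketch of that reshape trick (ours, not the toolkit's exact code):

.. code-block:: python

    import tensorflow as tf

    def batch_norm_1d_via_cudnn(x, training):
        """x: [batch, time, channels]; batch norm through the fused 4D path."""
        x = tf.expand_dims(x, axis=2)   # BxTxC -> BxTx1xC
        x = tf.layers.batch_normalization(x, training=training, fused=True)
        return tf.squeeze(x, axis=2)    # back to BxTxC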

We also use backoff loss scaling.
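
Backoff loss scaling keeps float16 gradients representable by scaling the loss up, halving the scale whenever gradients overflow, and cautiously raising it again after a run of clean steps. A sketch of the policy, with illustrative constants:

.. code-block:: python

    class BackoffLossScaler(object):
        """Sketch of backoff loss scaling; constants are illustrative."""

        def __init__(self, scale=2.0 ** 15, window=2000):
            self.scale, self.window, self.good_steps = scale, window, 0

        def update(self, grads_are_finite):
            """Returns True if the (unscaled) update should be applied."""
            if not grads_are_finite:            # overflow: skip step, back off
                self.scale = max(self.scale / 2.0, 1.0)
                self.good_steps = 0
                return False
            self.good_steps += 1
            if self.good_steps >= self.window:  # stable: try a larger scale
                self.scale *= 2.0
                self.good_steps = 0
            return True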
61 changes: 26 additions & 35 deletions docs/sources/source/speech-recognition/wave2letter.rst
@@ -7,7 +7,7 @@ Wave2Letter+
Model
~~~~~

This is a fully convolutional model, based on Facebook's `Wave2Letter <https://arxiv.org/abs/1609.03193>`_ and `Wave2LetterV2 <https://arxiv.org/abs/1712.09444>`_ papers. The base model (`Wave2Letter+ <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_) consists of 17 1D-Convolutional Layers and 2 Fully Connected Layers for a total of 19 layers:
This is a fully convolutional model, based on Facebook's `Wave2Letter <https://arxiv.org/abs/1609.03193>`_ and `Wave2LetterV2 <https://arxiv.org/abs/1712.09444>`_ papers. The model consists of 17 1D-Convolutional Layers and 2 Fully Connected Layers:

.. image:: wave2letter.png

@@ -26,57 +26,48 @@ In addition to this, we use stride 2 in the first convolutional layer. This decr
We have also observed a slight improvement after adding a dilation 2 for the last convolutional layer to increase the receptive-field of the model.
Both striding and dilation improved the WER from 7.17% to 6.67%.

X Large Model
~~~~~~~~~~~~~~
The xlarge models, `Wave2Letter+-34 <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_34_8gpus_mp.py>`_ and `Wave2Letter+-54-syn <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_54_8gpus_mp.py>`_, are larger models with 34 and 54 layers. The base model contains 15 convolutional layers arranged as 5 blocks of 3 repeating convolutional layers. For the xlarge models, we double the number of blocks to 10 for the 34-layer model; for the 54-layer model, we further increase the number of repeating convolutional layers inside each block from 3 to 5.

Synthetic Data
~~~~~~~~~~~~~~
All models with "syn" in their name are trained using a combined dataset of Librispeech and synthetic data.

The training details can be found :ref:`here <synthetic_data>`.

Training
~~~~~~~~

Our current best model is a 54-layer model trained using synthetic data. It achieved a WER of 4.32% on the LibriSpeech test-clean dataset using greedy decoding:

+---------------------+-----------------------------------------------------------------------+
| Model | LibriSpeech Dataset |
+ +-----------------+-----------------+-----------------+-----------------+
| | Dev-Clean | Dev-Other | Test-Clean | Test-Other |
+ +--------+--------+--------+--------+--------+--------+--------+--------+
| | Greedy | Beam | Greedy | Beam | Greedy | Beam | Greedy | Beam |
+=====================+========+========+========+========+========+========+========+========+
| W2L+ | 6.67 | 4.77 | 18.68 | 13.88 | 6.58 | 4.92 | 19.61 | 15.01 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| W2L+-34 | 5.10 | - | 15.49 | - | 5.10 | - | 16.21 | - |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| W2L+-54-syn | 4.32 | - | 13.74 | - | 4.32 | - | 14.08 | - |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+

We achieved a WER of 6.58 (the WER in the paper is 6.7) on the LibriSpeech test-clean dataset using greedy decoding:

.. list-table::
:widths: 1 1 1
:header-rows: 1

* - LibriSpeech Dataset
- WER %, Greedy Decoding
- WER %, Beam Search: 2048
* - dev-clean
- 6.67%
- 4.77%
* - test-clean
- 6.58%
- 4.92%
* - dev-other
- 18.68%
- 13.88%
* - test-other
- 19.61%
- 15.01%

We used the Open SLR language model when decoding with beam search with a beam width of 2048.

The checkpoint for the model trained using the configuration `w2l_plus_large_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_ can be found at `Checkpoint <https://drive.google.com/file/d/10EYe040qVW6cfygSZz6HwGQDylahQNSa/view?usp=sharing>`_.
The checkpoint for the model trained using the configuration `w2l_plus_large_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/18.09/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_ can be found at `Checkpoint <https://drive.google.com/file/d/10EYe040qVW6cfygSZz6HwGQDylahQNSa/view?usp=sharing>`_.

The base model was trained for 200 epochs on 8 GPUs. We use:
Our best model was trained for 200 epochs on 8 GPUs. We use:

* SGD with momentum = 0.9
* a polynomially decaying learning rate with an initial value of 0.05
* Layer-wise Adaptive Rate Control (LARC) with eta = 0.001
* weight decay = 0.001
* dropout (variable per layer: 0.2-0.4)
* batch size of 32 per GPU for float32 and 64 for mixed-precision.

The xlarge models are trained for 400 epochs on 8 GPUs. All other parameters are kept the same as the base model except:

* we add residual connections between each convolutional block
* batch size of 32 per GPU for float32 and 64 for mixed-precision.


Mixed Precision
~~~~~~~~~~~~~~~

To use mixed precision (float16) during training, we made a few minor changes to the model. TensorFlow by default calls the Keras batch normalization on 3D input (BxTxC) and cuDNN batch normalization on 4D input (BxHxWxC). In order to use cuDNN's batch normalization, we added an extra dimension to the 3D input to make it a 4D tensor (BxTx1xC).

The mixed precision model reached the same WER for the same number of steps as float32. The training time decreased by ~1.5x on an 8-GPU DGX-1 system, and by ~3x on 1 GPU and 4 GPUs when using Horovod.
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data.speech2text.speech2text import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -50,7 +50,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
@@ -1,12 +1,15 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data.speech2text.speech2text import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
from open_seq2seq.optimizers.lr_policies import poly_decay

### If training with synthetic data, don't forget to add your synthetic CSV
### to the dataset files

base_model = Speech2Text

base_params = {
@@ -50,7 +53,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
@@ -173,21 +176,24 @@
"loss_params": {},
}

# train_params = {
# "data_layer": Speech2TextDataLayer,
# "data_layer_params": {
# "num_audio_features": 64,
# "input_type": "logfbank",
# "vocab_file": "open_seq2seq/test_utils/toy_speech_data/vocab.txt",
# "dataset_files": [
# "/data/librispeech/librivox-train-clean-100.csv",
# "/data/librispeech/librivox-train-clean-360.csv",
# "/data/librispeech/librivox-train-other-500.csv",
# ],
# "max_duration": 16.7,
# "shuffle": True,
# },
# }
train_params = {
"data_layer": Speech2TextDataLayer,
"data_layer_params": {
"num_audio_features": 64,
"input_type": "logfbank",
"vocab_file": "open_seq2seq/test_utils/toy_speech_data/vocab.txt",
"dataset_files": [
"/data/librispeech/librivox-train-clean-100.csv",
"/data/librispeech/librivox-train-clean-360.csv",
"/data/librispeech/librivox-train-other-500.csv",
# Add synthetic csv here
],
"syn_enable": False, # Change to True if using synthetic data
"syn_subdirs": [], # Add subdirs of synthetic data
"max_duration": 16.7,
"shuffle": True,
},
}

eval_params = {
"data_layer": Speech2TextDataLayer,
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2l_large_8gpus.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -49,7 +49,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2l_large_8gpus_mp.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -50,7 +50,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2lplus_large_8gpus.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -49,7 +49,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2lplus_large_8gpus_mp.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -50,7 +50,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
2 changes: 1 addition & 1 deletion open_seq2seq/encoders/__init__.py
@@ -12,7 +12,7 @@
from .ds2_encoder import DeepSpeech2Encoder
from .resnet_encoder import ResNetEncoder
from .tacotron2_encoder import Tacotron2Encoder
from .w2l_encoder import Wave2LetterEncoder
from .tdnn_encoder import TDNNEncoder
from .las_encoder import ListenAttendSpellEncoder
from .convs2s_encoder import ConvS2SEncoder
from .lm_encoders import LMEncoder
