This repository has been archived by the owner on Aug 3, 2021. It is now read-only.

S2T rename (#295)
* rename w2l encoder to tdnnencoder

Signed-off-by: Jason <jasoli@nvidia.com>

* add jasper docs

Signed-off-by: Jason <jasoli@nvidia.com>
blisc authored and vsl9 committed Nov 29, 2018
1 parent 0b21e2a commit c6315c7
Showing 18 changed files with 146 additions and 78 deletions.
2 changes: 1 addition & 1 deletion docs/sources/source/in-depth-tutorials.rst
@@ -1,6 +1,6 @@
.. _in_depth:

In-depth tutorials
In-depth Tutorials
==================

.. toctree::
2 changes: 1 addition & 1 deletion docs/sources/source/index.rst
@@ -24,7 +24,7 @@ OpenSeq2Seq
OpenSeq2Seq is a TensorFlow-based toolkit for training sequence-to-sequence models:

* :ref:`machine translation <machine_translation>` (GNMT, Transformer, ConvS2S, ...)
* :ref:`speech recognition <speech_recognition>` (DeepSpeech2, Wave2Letter, ...)
* :ref:`speech recognition <speech_recognition>` (DeepSpeech2, Wave2Letter, Jasper, ...)
* :ref:`speech synthesis <speech_synthesis>` (Tacotron2, ...)
* :ref:`language model <language_model>` (LSTM, ...)
* :ref:`sentiment analysis <sentiment_analysis>` (SST, IMDB, ...)
2 changes: 1 addition & 1 deletion docs/sources/source/installation.rst
@@ -1,6 +1,6 @@
.. _installation:

Installation instructions
Installation Instructions
=========================

Pre-built docker container
2 changes: 1 addition & 1 deletion docs/sources/source/mixed-precision.rst
@@ -1,6 +1,6 @@
.. _mixed_precision:

Mixed precision training
Mixed Precision Training
========================

.. epigraph::
11 changes: 6 additions & 5 deletions docs/sources/source/speech-recognition.rst
@@ -28,14 +28,14 @@ Currently we support following models:
- `w2l_plus_large_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_
- `link <https://drive.google.com/file/d/10EYe040qVW6cfygSZz6HwGQDylahQNSa/view?usp=sharing>`_

* - :doc:`Wave2Letter+-34 </speech-recognition/wave2letter>`
* - :doc:`Jasper 10x3 </speech-recognition/jasper>`
- 5.10
- `w2lplus_xlarge_34_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_34_8gpus_mp.py>`_
- `jasper_10x3_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/jasper_10x3_8gpus_mp.py>`_
- `link <https://drive.google.com/a/nvidia.com/file/d/1hI9Rv_px5vqpuWQOCwfKmZzRVXMPiTtT/view?usp=sharing>`_

* - :doc:`Wave2Letter+-54-syn </speech-recognition/wave2letter>`
* - :doc:`Jasper 10x5 syn </speech-recognition/jasper>`
- 4.32
- `w2lplus_xlarge_54_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_54_8gpus_mp.py>`_
- `jasper_10x5_8gpus_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/jasper_10x5_8gpus_mp.py>`_
- `link <https://drive.google.com/a/nvidia.com/file/d/1b9CHczABFG4TRgtZg_jSaRQ-8oCjay76/view?usp=sharing>`_


@@ -53,6 +53,7 @@ have a look at the `configuration files <https://github.com/NVIDIA/OpenSeq2Seq/b

speech-recognition/deepspeech2
speech-recognition/wave2letter
speech-recognition/jasper


################
@@ -131,7 +132,7 @@ To train with Horovod on <N> GPUs, use the following command::
mpiexec --allow-run-as-root -np <N> python run.py --config_file=... --mode=train_eval --use_horovod=True

##############
Synthetic Data
Synthetic data
##############

Our current best model was trained using synthetic data. The creation of the synthetic data and the training process are described :ref:`here <synthetic_data>`.
69 changes: 69 additions & 0 deletions docs/sources/source/speech-recognition/jasper.rst
@@ -0,0 +1,69 @@
.. _jasper:

Jasper
=======

Model
~~~~~~

Jasper (Just Another Speech Recognizer) is a deep time delay neural network (TDNN) composed of blocks of 1D-convolutional layers. Jasper is a family of models where each model has a different number of layers. Jasper models are denoted as Jasper bxr, where b and r represent:

- b: the number of blocks
- r: the number of repetitions of each convolutional layer within a block

.. image:: jasper.png

All models have 4 common layers. There is an initial convolutional layer with stride 2 to decrease the time dimension of the speech. The other 3 layers are at the end of the network: the first of these has a dilation of 2 to increase the model's receptive field, and the last two are fully connected layers that project the final output to a distribution over characters.
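
A Jasper bxr model therefore has b × r convolutional block layers plus these 4 common layers, for b × r + 4 layers in total; for example, Jasper 10x3 has 10 × 3 + 4 = 34 layers and Jasper 10x5 has 10 × 5 + 4 = 54 layers, matching the 34- and 54-layer models referenced below.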

Each 1D-convolutional layer consists of a convolution, batch normalization, a clipped ReLU activation, and dropout, as shown on the left of the figure below.

There is a residual connection around each block, which consists of a projection layer followed by batch normalization. The residual is added to the output of the last 1D-convolutional layer in the block, before the final clipped ReLU activation and dropout, as shown on the right.

.. image:: jasper_layers.png
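
The sub-block and residual structure can be sketched as follows. This is a hand-written TF 1.x illustration, not the actual ``TDNNEncoder`` code; the function names and the ReLU clip value of 20 are our assumptions:

.. code-block:: python

    import tensorflow as tf

    def conv_bn_relu_dropout(x, filters, kernel_size, training,
                             dropout_rate, activate=True):
        """Conv1D -> batch norm -> clipped ReLU -> dropout (left diagram)."""
        x = tf.layers.conv1d(x, filters, kernel_size, padding='same',
                             use_bias=False)
        x = tf.layers.batch_normalization(x, training=training)
        if activate:
            x = tf.minimum(tf.nn.relu(x), 20.0)  # clipped ReLU (clip assumed)
            x = tf.layers.dropout(x, rate=dropout_rate, training=training)
        return x

    def jasper_block(x, filters, kernel_size, repeat, training, dropout_rate):
        """`repeat` conv layers plus a projected residual (right diagram)."""
        inputs = x
        for i in range(repeat):
            x = conv_bn_relu_dropout(x, filters, kernel_size, training,
                                     dropout_rate, activate=(i < repeat - 1))
        # Residual: 1x1 projection of the block input followed by batch norm,
        # added before the final clipped ReLU and dropout.
        res = tf.layers.conv1d(inputs, filters, 1, use_bias=False)
        res = tf.layers.batch_normalization(res, training=training)
        x = tf.minimum(tf.nn.relu(x + res), 20.0)
        return tf.layers.dropout(x, rate=dropout_rate, training=training)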

We preprocess the speech signal by framing the raw audio waveform with a sliding window of 20 ms and a stride of 10 ms. From each frame we extract 64 log-mel filterbank energies, which are the input features to the model.
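
As a rough illustration of this featurization (``librosa`` is our stand-in here, not the toolkit's own data layer; the file name and sample rate are placeholders):

.. code-block:: python

    import librosa
    import numpy as np

    audio, sr = librosa.load("sample.wav", sr=16000)  # 16 kHz speech (assumed)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=int(0.020 * sr),        # 20 ms window
        hop_length=int(0.010 * sr),   # 10 ms stride
        n_mels=64)                    # 64 mel filterbanks
    features = np.log(mel + 1e-10).T  # (time, 64) log-mel energies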

We train the model with the Connectionist Temporal Classification (CTC) loss. The output of the model is a sequence of letters corresponding to the speech input. The vocabulary consists of the lowercase letters a-z, the space, and the apostrophe, which together with the blank symbol used by the CTC loss gives a total of 29 symbols.
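
A minimal TF 1.x sketch of the vocabulary and loss (tensor names and shapes are illustrative, not the toolkit's):

.. code-block:: python

    import tensorflow as tf

    vocab = list("abcdefghijklmnopqrstuvwxyz") + [" ", "'"]  # 28 output labels
    num_classes = len(vocab) + 1                             # + CTC blank = 29

    # logits: [time, batch, num_classes] from the final fully connected layer
    logits = tf.placeholder(tf.float32, [None, None, num_classes])
    labels = tf.sparse_placeholder(tf.int32)       # sparse character ids
    seq_len = tf.placeholder(tf.int32, [None])     # valid frames per utterance

    loss = tf.reduce_mean(
        tf.nn.ctc_loss(labels=labels, inputs=logits, sequence_length=seq_len))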

Training
~~~~~~~~

Our current best model is the 54-layer Jasper 10x5, trained using synthetic data. It achieved a WER of 4.32% on the LibriSpeech test-clean dataset using greedy decoding:

+---------------------+-----------------------------------------------------------------------+
| Model | LibriSpeech Dataset |
+ +-----------------+-----------------+-----------------+-----------------+
| | Dev-Clean | Dev-Other | Test-Clean | Test-Other |
+ +--------+--------+--------+--------+--------+--------+--------+--------+
| | Greedy | Beam | Greedy | Beam | Greedy | Beam | Greedy | Beam |
+=====================+========+========+========+========+========+========+========+========+
| Jasper 10x3 | 5.10 | 4.37 | 15.49 | 13.46 | 5.10 | 5.14 | 16.21 | 14.35 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| Jasper 10x5 | 4.51 | 3.77 | 13.88 | 12.20 | 4.59 | 4.46 | 14.34 | 12.79 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| Jasper 10x5 syn | 4.32 | 3.74 | 13.74 | 11.57 | 4.32 | 4.39 | 14.08 | 12.21 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+


We used the Open SLR language model when decoding with beam search with a beam width of 128.
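
Without the language model, plain greedy and beam-search decoding of the CTC logits look like this in TF 1.x (a sketch reusing ``logits`` and ``seq_len`` from the CTC sketch above; the LM-rescoring decoder that produced the beam numbers in the table is a separate component):

.. code-block:: python

    import tensorflow as tf

    # logits: [time, batch, 29], seq_len: [batch], as in the Model section.
    decoded_greedy, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
    decoded_beam, _ = tf.nn.ctc_beam_search_decoder(
        logits, seq_len, beam_width=128, top_paths=1)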

The models were trained for 400 epochs (200 for the syn model) on 8 GPUs. We use the following hyperparameters (a config sketch follows this list):

* SGD with momentum = 0.9
* a polynomially decaying learning rate with an initial value of 0.05
* Layer-wise Adaptive Rate Control (LARC) with eta = 0.001
* weight decay = 0.001
* dropout (variable per layer: 0.2-0.4)
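
These settings appear roughly as follows in an OpenSeq2Seq config such as ``jasper_10x5_8gpus_mp.py``. The keys below paraphrase the list above; the ``power`` value is our assumption, so check the shipped config for exact values:

.. code-block:: python

    import tensorflow as tf
    from open_seq2seq.optimizers.lr_policies import poly_decay

    base_params = {
        "num_epochs": 400,                       # 200 for the "syn" model
        "optimizer": "Momentum",                 # SGD with momentum
        "optimizer_params": {"momentum": 0.90},
        "lr_policy": poly_decay,                 # polynomial LR decay
        "lr_policy_params": {"learning_rate": 0.05, "power": 2.0},
        "larc_params": {"larc_eta": 0.001},      # layer-wise adaptive rate control
        "regularizer": tf.contrib.layers.l2_regularizer,
        "regularizer_params": {"scale": 0.001},  # weight decay
        # dropout rates (0.2-0.4) are set per layer inside "encoder_params"
    }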

Synthetic Data
~~~~~~~~~~~~~~
All models with "syn" in their name are trained using a combined dataset of LibriSpeech and synthetic data.

The training details can be found :ref:`here <synthetic_data>`.

Mixed Precision
~~~~~~~~~~~~~~~

To use mixed precision (float16) during training, we made a few minor changes to the model. TensorFlow by default calls the Keras batch normalization on 3D input (BxTxC) and cuDNN batch normalization on 4D input (BxHxWxC). In order to use cuDNN's batch normalization, we added an extra dimension to the 3D input to make it a 4D tensor (BxTx1xC).
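
A sketch of that reshape trick (ours, not the toolkit's exact code):

.. code-block:: python

    import tensorflow as tf

    def batch_norm_1d_via_cudnn(x, training):
        """x: [batch, time, channels]; batch norm through the fused 4D path."""
        x = tf.expand_dims(x, axis=2)   # BxTxC -> BxTx1xC
        x = tf.layers.batch_normalization(x, training=training, fused=True)
        return tf.squeeze(x, axis=2)    # back to BxTxC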

We also use backoff loss scaling.
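
Backoff loss scaling keeps float16 gradients representable by scaling the loss up, halving the scale whenever gradients overflow, and cautiously raising it again after a run of clean steps. A sketch of the policy, with illustrative constants:

.. code-block:: python

    class BackoffLossScaler(object):
        """Sketch of backoff loss scaling; constants are illustrative."""

        def __init__(self, scale=2.0 ** 15, window=2000):
            self.scale, self.window, self.good_steps = scale, window, 0

        def update(self, grads_are_finite):
            """Returns True if the (unscaled) update should be applied."""
            if not grads_are_finite:            # overflow: skip step, back off
                self.scale = max(self.scale / 2.0, 1.0)
                self.good_steps = 0
                return False
            self.good_steps += 1
            if self.good_steps >= self.window:  # stable: try a larger scale
                self.scale *= 2.0
                self.good_steps = 0
            return True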
61 changes: 26 additions & 35 deletions docs/sources/source/speech-recognition/wave2letter.rst
@@ -7,7 +7,7 @@ Wave2Letter+
Model
~~~~~

This is a fully convolutional model, based on Facebook's `Wave2Letter <https://arxiv.org/abs/1609.03193>`_ and `Wave2LetterV2 <https://arxiv.org/abs/1712.09444>`_ papers. The base model (`Wave2Letter+ <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_) consists of 17 1D-Convolutional Layers and 2 Fully Connected Layers for a total of 19 layers:
This is a fully convolutional model, based on Facebook's `Wave2Letter <https://arxiv.org/abs/1609.03193>`_ and `Wave2LetterV2 <https://arxiv.org/abs/1712.09444>`_ papers. The model consists of 17 1D-Convolutional Layers and 2 Fully Connected Layers:

.. image:: wave2letter.png

@@ -26,57 +26,48 @@ In addition to this, we use stride 2 in the first convolutional layer. This decr
We have also observed a slight improvement after adding a dilation 2 for the last convolutional layer to increase the receptive-field of the model.
Both striding and dilation improved the WER from 7.17% to 6.67%.

X Large Model
~~~~~~~~~~~~~~
The xlarge models, `Wave2Letter+-34 <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_34_8gpus_mp.py>`_ and `Wave2Letter+-54-syn <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_xlarge_54_8gpus_mp.py>`_, are larger models with 34 and 54 layers. The base model contains 15 convolutional layers arranged as 5 blocks of 3 repeating convolutional layers. For the xlarge models, we double the number of blocks to 10 for the 34-layer model; for the 54-layer model, we further increase the number of repeating convolutional layers inside each block from 3 to 5.

Synthetic Data
~~~~~~~~~~~~~~
All models with "syn" in their name are trained using a combined dataset of Librispeech and synthetic data.

The training details can be found :ref:`here <synthetic_data>`.

Training
~~~~~~~~

Our current best model is a 54-layer model trained using synthetic data. It achieved a WER of 4.32% on the LibriSpeech test-clean dataset using greedy decoding:

+---------------------+-----------------------------------------------------------------------+
| Model | LibriSpeech Dataset |
+ +-----------------+-----------------+-----------------+-----------------+
| | Dev-Clean | Dev-Other | Test-Clean | Test-Other |
+ +--------+--------+--------+--------+--------+--------+--------+--------+
| | Greedy | Beam | Greedy | Beam | Greedy | Beam | Greedy | Beam |
+=====================+========+========+========+========+========+========+========+========+
| W2L+ | 6.67 | 4.77 | 18.68 | 13.88 | 6.58 | 4.92 | 19.61 | 15.01 |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| W2L+-34 | 5.10 | - | 15.49 | - | 5.10 | - | 16.21 | - |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+
| W2L+-54-syn | 4.32 | - | 13.74 | - | 4.32 | - | 14.08 | - |
+---------------------+--------+--------+--------+--------+--------+--------+--------+--------+

We achieved a WER of 6.58 (the WER in the paper is 6.7) on the LibriSpeech test-clean dataset using greedy decoding:

.. list-table::
:widths: 1 1 1
:header-rows: 1

* - LibriSpeech Dataset
- WER %, Greedy Decoding
- WER %, Beam Search: 2048
* - dev-clean
- 6.67%
- 4.77%
* - test-clean
- 6.58%
- 4.92%
* - dev-other
- 18.68%
- 13.88%
* - test-other
- 19.61%
- 15.01%

We used the Open SLR language model when decoding with beam search with a beam width of 2048.

The checkpoint for the model trained using the configuration `w2l_plus_large_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_ can be found at `Checkpoint <https://drive.google.com/file/d/10EYe040qVW6cfygSZz6HwGQDylahQNSa/view?usp=sharing>`_.
The checkpoint for the model trained using the configuration `w2l_plus_large_mp <https://github.com/NVIDIA/OpenSeq2Seq/blob/18.09/example_configs/speech2text/w2lplus_large_8gpus_mp.py>`_ can be found at `Checkpoint <https://drive.google.com/file/d/10EYe040qVW6cfygSZz6HwGQDylahQNSa/view?usp=sharing>`_.

The base model was trained for 200 epochs on 8 GPUs. We use:
Our best model was trained for 200 epochs on 8 GPUs. We use:

* SGD with momentum = 0.9
* a polynomially decaying learning rate with an initial value of 0.05
* Layer-wise Adaptive Rate Control (LARC) with eta = 0.001
* weight decay = 0.001
* dropout (variable per layer: 0.2-0.4)
* batch size of 32 per GPU for float32 and 64 for mixed-precision.

The xlarge models are trained for 400 epochs on 8 GPUs. All other parameters are kept the same as the base model except:

* we add residual connections between each convolutional block
* batch size of 32 per GPU for float32 and 64 for mixed-precision.


Mixed Precision
~~~~~~~~~~~~~~~

To use mixed precision (float16) during training, we made a few minor changes to the model. TensorFlow by default calls the Keras batch normalization on 3D input (BxTxC) and cuDNN batch normalization on 4D input (BxHxWxC). In order to use cuDNN's batch normalization, we added an extra dimension to the 3D input to make it a 4D tensor (BxTx1xC).

The mixed precision model reached the same WER for the same number of steps as float32. The training time decreased by ~1.5x on an 8-GPU DGX-1 system, and by ~3x on 1 GPU and 4 GPUs when using Horovod.
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data.speech2text.speech2text import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -50,7 +50,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
@@ -1,12 +1,15 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data.speech2text.speech2text import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
from open_seq2seq.optimizers.lr_policies import poly_decay

### If training with synthetic data, don't forget to add your synthetic CSV
### to the dataset files

base_model = Speech2Text

base_params = {
@@ -50,7 +53,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
@@ -173,21 +176,24 @@
"loss_params": {},
}

# train_params = {
# "data_layer": Speech2TextDataLayer,
# "data_layer_params": {
# "num_audio_features": 64,
# "input_type": "logfbank",
# "vocab_file": "open_seq2seq/test_utils/toy_speech_data/vocab.txt",
# "dataset_files": [
# "/data/librispeech/librivox-train-clean-100.csv",
# "/data/librispeech/librivox-train-clean-360.csv",
# "/data/librispeech/librivox-train-other-500.csv",
# ],
# "max_duration": 16.7,
# "shuffle": True,
# },
# }
train_params = {
"data_layer": Speech2TextDataLayer,
"data_layer_params": {
"num_audio_features": 64,
"input_type": "logfbank",
"vocab_file": "open_seq2seq/test_utils/toy_speech_data/vocab.txt",
"dataset_files": [
"/data/librispeech/librivox-train-clean-100.csv",
"/data/librispeech/librivox-train-clean-360.csv",
"/data/librispeech/librivox-train-other-500.csv",
# Add synthetic csv here
],
"syn_enable": False, # Change to True if using synthetic data
"syn_subdirs": [], # Add subdirs of synthetic data
"max_duration": 16.7,
"shuffle": True,
},
}

eval_params = {
"data_layer": Speech2TextDataLayer,
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2l_large_8gpus.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -49,7 +49,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2l_large_8gpus_mp.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -50,7 +50,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2lplus_large_8gpus.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -49,7 +49,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
4 changes: 2 additions & 2 deletions example_configs/speech2text/w2lplus_large_8gpus_mp.py
@@ -1,7 +1,7 @@
# pylint: skip-file
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import Wave2LetterEncoder
from open_seq2seq.encoders import TDNNEncoder
from open_seq2seq.decoders import FullyConnectedCTCDecoder
from open_seq2seq.data import Speech2TextDataLayer
from open_seq2seq.losses import CTCLoss
@@ -50,7 +50,7 @@
"summaries": ['learning_rate', 'variables', 'gradients', 'larc_summaries',
'variable_norm', 'gradient_norm', 'global_gradient_norm'],

"encoder": Wave2LetterEncoder,
"encoder": TDNNEncoder,
"encoder_params": {
"convnet_layers": [
{
2 changes: 1 addition & 1 deletion open_seq2seq/encoders/__init__.py
@@ -12,7 +12,7 @@
from .ds2_encoder import DeepSpeech2Encoder
from .resnet_encoder import ResNetEncoder
from .tacotron2_encoder import Tacotron2Encoder
from .w2l_encoder import Wave2LetterEncoder
from .tdnn_encoder import TDNNEncoder
from .las_encoder import ListenAttendSpellEncoder
from .convs2s_encoder import ConvS2SEncoder
from .lm_encoders import LMEncoder
