## OpenSeq2Seq Documentation (changes are welcome!)
Link: https://nvidia.github.io/OpenSeq2Seq/html/index.html

OpenSeq2Seq is a TensorFlow-based toolkit for sequence-to-sequence models:

        - machine translation (GNMT, Transformer, ConvS2S, …)
        - speech recognition (DeepSpeech2, Wave2Letter, Jasper, …)
        - speech commands (RN-50, Jasper)
        - speech synthesis (Tacotron2, WaveNet…)
        - language model (LSTM, …)
        - sentiment analysis (SST, IMDB, …)
        - image classification (ResNet-50)

Main features:

        - modular architecture that allows assembling of new models from available components
        - support for mixed-precision training, which utilizes Tensor Cores in NVIDIA Volta/Turing GPUs
        - fast Horovod-based distributed training supporting both multi-GPU and multi-node modes

### General Installation

In [0]:
# Clone the repo and install the dependencies
!git clone https://github.com/NVIDIA/OpenSeq2Seq
%cd OpenSeq2Seq
!pip install -r requirements.txt

In [0]:
# Install Tensorflow GPU
!pip install tensorflow-gpu==1.15.0

In [0]:
# Install the Baidu CTC Decoder
!scripts/install_decoders.sh

In [0]:
# Test the installation (if you get an error here, just ignore it)
!python scripts/ctc_decoders_test.py

In [0]:
# Build a custom native TF op for CTC decoder w/ language model
# Install boost
!apt-get install libboost-all-dev

# Build kenlm
!apt-get install cmake
!./scripts/install_kenlm.sh

In [0]:
# Validate TensorFlow installation
!python -c "import tensorflow as tf; print(tf.__version__)"

In [0]:
# Download a language model for a CTC decoder
!./scripts/download_lm.sh

In [0]:
# Run speech2text example with enabled CTC beam search decoder and save the output to logs
!python run.py --config_file=example_configs/speech2text/ds2_toy_config.py --mode=train_eval --enable_logs

### Horovod installation

In [0]:
# For multi-GPU and distributed training we install Horovod
!pip install mpi4py
!pip install horovod

In [0]:
# To check that everything is installed correctly
!bash scripts/run_all_tests.sh

In [0]:
# When training with Horovod, use the following commands (update parameters as needed)
!mpiexec --allow-run-as-root -np <num_gpus> python run.py --config_file=... --mode=train_eval --use_horovod=True --enable_logs

In [0]:
# Run inference to dump logits to a pickle file (update parameters as needed)
!python run.py --mode=infer --config="MODEL_CONFIG" --logdir="MODEL_CHECKPOINT_DIR" --num_gpus=1 --use_horovod=False --decoder_params/use_language_model=False --infer_output_file=model_output.pickle

In [0]:
# Run beam search decoder (update parameters as needed)
!python scripts/decode.py --logits=model_output.pickle --labels="CSV_FILE" --lm="LM_BINARY"  --vocab="ALPHABET_FILE" --alpha=ALPHA --beta=BETA --beam_width=BEAM_WIDTH

### Multi-GPU and Distributed Training


In [0]:
# For multi-GPU training with the native distributed TensorFlow approach,
# set use_horovod: False and set num_gpus to the number of GPUs in the configuration file.

# To start training use run.py script (update parameters as needed):
!python run.py --config_file=... --mode=train_eval

In [0]:
# To use Horovod you will need to set use_horovod: True in the config and use mpirun (update parameters as needed):
!mpiexec -np <num_gpus> python run.py --config_file=... --mode=train_eval --use_horovod=True --enable_logs

        You can use Horovod both for multi-GPU and for multi-node training.
        
        Note: num_gpus parameter will be ignored when use_horovod is set to True. In that case, the number of GPUs 
        to use is specified in the command line with mpirun arguments.

### Mixed Precision Training

Enabling mixed precision with existing models in OpenSeq2Seq is simple: change the dtype parameter of model_params to "mixed". You might also need to enable loss scaling, either statically, by setting the loss_scale parameter inside model_params to the desired number, or dynamically, by setting the automatic_loss_scaling parameter to "Backoff" or "LogMax":

            base_params = {
            ...
            "dtype": "mixed",
            # enabling static or dynamic loss scaling might improve model convergence

            # "loss_scale": 10.0,
            # "automatic_loss_scaling": "Backoff",
            ...
            }
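
To see why loss scaling can help convergence, here is a minimal sketch that is independent of OpenSeq2Seq. It uses Python's `struct` module to round values to IEEE half precision and shows a small gradient underflowing to zero unless it is multiplied by a loss scale first. The helper `to_fp16`, the gradient value and the scale of 1024 are illustrative assumptions, not values taken from the toolkit.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8     # hypothetical gradient, too small for float16
loss_scale = 1024.0  # hypothetical static loss scale

# Without scaling, the gradient is rounded to zero in float16.
unscaled = to_fp16(tiny_grad)

# With scaling, the gradient survives the float16 round trip; it is divided
# by the same factor in higher precision before the weight update.
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale

print(unscaled)   # 0.0
print(recovered)  # close to 1e-08
```

Dynamic loss scaling adjusts the scale during training instead of fixing it up front.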

### Optimizers (LARC and NovoGrad)

The key idea of LARC is to adjust the learning rate (LR) for each layer in such a way that the magnitude of the weight updates stays small compared to the norm of the weights.

To use LARC you should add the following lines to model configuration:

        "larc_params": {
        "larc_eta": 0.002,
        }
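
The layer-wise adjustment can be sketched in a few lines of plain Python. This is a simplified illustration, not the OpenSeq2Seq implementation: the function names are hypothetical, and clipping the local rate to the global one is assumed here as one common variant.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def larc_learning_rate(weights, grads, global_lr, larc_eta=0.002, eps=1e-12):
    """Layer-wise LR: larc_eta times the ratio of weight norm to gradient norm.

    Clipping to global_lr is an assumption of this sketch.
    """
    local_lr = larc_eta * l2_norm(weights) / (l2_norm(grads) + eps)
    return min(local_lr, global_lr)


weights = [3.0, 4.0]  # norm 5
grads = [0.6, 0.8]    # norm 1
lr = larc_learning_rate(weights, grads, global_lr=0.1)
update_norm = lr * l2_norm(grads)

# The update magnitude is about larc_eta times the weight norm.
print(lr, update_norm / l2_norm(weights))
```

With these toy values the update norm is roughly 0.2% of the weight norm, which is the ratio `larc_eta` asks for.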

NovoGrad is a first-order SGD-based algorithm, which computes second moments per layer instead of per weight as in Adam. Compared to Adam, NovoGrad takes less memory, and we find it to be more numerically stable.

To use NovoGrad you should turn off the standard regularization and add the following lines to the model configuration:

        "optimizer": NovoGrad,
        "optimizer_params": {
            "beta1": 0.95,
            "beta2": 0.98,
            "epsilon": 1e-08,
            "weight_decay": 0.001,
        },
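
To make the "second moments per layer" idea concrete, here is a simplified single-layer update in plain Python. It follows the published description of NovoGrad as far as we understand it and is not the OpenSeq2Seq code. The function name, the `state` dictionary and the toy numbers are assumptions made for illustration.

```python
import math


def novograd_step(weights, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  epsilon=1e-8, weight_decay=0.001):
    """One simplified NovoGrad update for a single layer.

    `state` keeps ONE scalar second moment `v` for the whole layer and a
    first-moment list `m` with one entry per weight.
    """
    grad_sq_norm = sum(g * g for g in grads)
    if state["v"] is None:
        state["v"] = grad_sq_norm  # first step: initialise from the gradient
    else:
        state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad_sq_norm
    denom = math.sqrt(state["v"]) + epsilon
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        # Normalise by the layer-wise moment, then add weight decay.
        state["m"][i] = beta1 * state["m"][i] + (g / denom + weight_decay * w)
        new_weights.append(w - lr * state["m"][i])
    return new_weights


state = {"v": None, "m": [0.0, 0.0]}
weights = novograd_step([1.0, -2.0], [3.0, 4.0], state, lr=0.1, weight_decay=0.0)
print(weights, state["v"])
```

Because `v` is a single number per layer rather than one value per weight, the optimizer state is smaller than Adam's.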

### Speech Recognition

Automatic speech recognition (ASR) systems can be built using a number of approaches depending on the input data type, intermediate representation, model type and output post-processing. OpenSeq2Seq is currently focused on end-to-end CTC-based models (like the original DeepSpeech model). These models are called end-to-end because they take speech samples and transcripts without any additional information. CTC allows finding an alignment between audio and text.

Training pipeline consists of the following blocks:

        1) audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)
        2) neural acoustic model (which predicts a probability distribution P_t(c) over vocabulary characters c per each time step t given input features per each timestep)
        3) CTC loss function

Inference pipeline is different for block #3:

        decoder (which transforms a probability distribution into actual transcript)

We support different options for these steps. The recommended pipeline is the following (in order to get the best accuracy, the lowest WER):

        1) Mel scale log spectrograms for audio features (using librosa backend)
        2) Jasper as a neural acoustic model
        3) Baidu’s CTC beam search decoder with N-gram language model rescoring
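
WER (word error rate), the metric mentioned above, is commonly computed as the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration with a hypothetical function name. It assumes a non-empty reference and is not the scoring code used by the toolkit.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# A single wrong character ("sit" vs "sat") counts as a whole wrong word.
print(word_error_rate("the cat sat on the mat", "the cat sit on the mat"))
```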

#### Decoders

In order to get words out of a trained model one needs to use a decoder. A decoder converts a probability distribution over characters into text. There are two types of decoders that are usually employed with CTC-based models: a greedy decoder and a beam search decoder with language model re-scoring.

        A greedy decoder outputs the most probable character at each time step. It is very fast and it can produce transcripts that 
        are very close to the original pronunciation. But it may introduce many small misspelling errors. Due to the nature of 
        WER metric, even one character error makes a whole word incorrect.

        A beam search decoder with language model re-scoring allows checking many possible decodings (beams) at once 
        with assigning a higher score for more probable N-grams according to a given language model.
        The language model helps to correct misspelling errors. The downside is that it is significantly slower than a greedy decoder.
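
A greedy CTC decoder is short enough to sketch in plain Python: take the most probable symbol at each time step, merge consecutive repeats, and drop the blank symbol. The function name, the toy alphabet and the position of the blank are assumptions made for this example.

```python
def ctc_greedy_decode(probs, alphabet, blank_index):
    """Pick the most probable symbol per time step, merge repeats, drop blanks."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]
    chars = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)


alphabet = ["a", "b", "_"]  # "_" stands for the CTC blank (assumed last)
probs = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a (repeat, merged)
    [0.1, 0.1, 0.8],  # blank
    [0.6, 0.3, 0.1],  # a (new character after a blank)
    [0.2, 0.7, 0.1],  # b
    [0.1, 0.8, 0.1],  # b (repeat, merged)
]
print(ctc_greedy_decode(probs, alphabet, blank_index=2))  # aab
```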

There are two implementations of beam search decoder in OpenSeq2Seq:

        1) native TensorFlow operation (./ctc_decoder_with_lm/). It is rather a deprecated decoder due to its slowness 
           (it works in a single CPU thread only). We keep it for backward compatibility. You have to build it 
           (or use pre-built version in NVIDIA TensorFlow container). In order to enable it, you’ll need to define its parameters 
           "beam_width", "alpha", "beta", "decoder_library_path", "lm_path", "trie_path", "alphabet_config_path" and 
           add "use_language_model": True line in "decoder_params" section of the config file.

        2) Baidu decoder (as a separate Python script). It is parallelized across batch on multiple CPU cores, so it is significantly faster. 
           It doesn’t require a separate trie file as an input. It is the recommended decoder for ASR models. 
           In order to use it, please:
               - make sure that "decoder_params" section has 'infer_logits_to_pickle': True line and 
                 that "dataset_files" field of "infer_params" section contains a target CSV file
               - run inference (to dump logits to a pickle file)
               - run beam search decoder (with specific ALPHA, BETA and BEAM_WIDTH hyperparameters)
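
The role of the ALPHA and BETA hyperparameters can be illustrated with a toy rescoring function. The sketch assumes the scoring form described for Baidu's Deep Speech 2 decoder: acoustic log-probability plus ALPHA times the language model log-probability plus BETA times the word count. The candidate list and all numbers are invented, and `scripts/decode.py` may compute its scores differently.

```python
def rescore(candidates, alpha, beta):
    """Return the transcript with the highest combined score.

    Each candidate is (text, acoustic_log_prob, lm_log_prob).
    """
    def score(candidate):
        text, acoustic_log_prob, lm_log_prob = candidate
        return acoustic_log_prob + alpha * lm_log_prob + beta * len(text.split())

    return max(candidates, key=score)[0]


# Toy numbers, invented for illustration only.
candidates = [
    ("i red a book", -3.0, -12.0),  # favoured by the acoustic model
    ("i read a book", -3.5, -6.0),  # favoured by the language model
]
print(rescore(candidates, alpha=0.0, beta=0.0))  # i red a book
print(rescore(candidates, alpha=1.0, beta=0.5))  # i read a book
```

With ALPHA set to zero the language model is ignored and the misspelled candidate wins. A positive ALPHA lets the language model correct it.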

Now let’s consider a relatively lightweight version of DeepSpeech2 based model for English speech recognition on LibriSpeech dataset. Download and preprocess LibriSpeech dataset:

In [0]:
# First, make the LibriSpeech directory
!mkdir -p data
!mkdir data/librispeech

In [0]:
# Download the dataset (this will take a lot of time)
!apt-get -y install sox libsox-dev
!pip install sox

!python scripts/import_librivox.py data/librispeech 

In [0]:
# Everything should be setup to train the model
!python run.py --config_file=example_configs/speech2text/ds2_small_1gpu.py --mode=train_eval

In [0]:
# Build your own language model
# (%env is used because a variable set with !export does not persist between commands)
%env LS_DIR=/data/speech/LibriSpeech/
!python scripts/build_lm.py --n 5 $LS_DIR/librivox-train-clean-100.csv $LS_DIR/librivox-train-clean-360.csv $LS_DIR/librivox-train-other-500.csv

### Speech Synthesis

The current Tacotron 2 implementation supports the LJSpeech dataset and the MAILABS dataset. 
    
        For more details about the model including hyperparameters and tips, see Tacotron-2. 
        The current WaveNet implementation only supports LJSpeech.

        First, you need to download and extract the dataset into a directory of your choice. The extracted file should consist 
        of a metadata.csv file and a directory of wav files. metadata.csv lists all the wav filename and their corresponding transcripts 
        delimited by the ‘|’ character.
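
A quick way to sanity-check a metadata file in this format is to read it with Python's `csv` module. The sketch below is a hypothetical helper, not the toolkit's data layer. It assumes the first field is the wav filename and the last field is the transcript, and it uses an in-memory sample instead of a real metadata.csv.

```python
import csv
import io


def read_metadata(lines):
    """Yield (wav_name, transcript) pairs from '|'-delimited metadata lines."""
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            # Assumption: first field is the wav name, last is the transcript.
            yield row[0].strip(), row[-1].strip()


sample = io.StringIO(
    "sample_0001|Hello world.\n"
    "sample_0002|A second example sentence.\n"
)
for wav_name, transcript in read_metadata(sample):
    print(wav_name, "->", transcript)
```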

In [0]:
# To start training Tacotron
# If your GPU does not have enough memory, reduce the batch_size_per_gpu parameter.
!python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=train

In [0]:
# To start training WaveNet
# If your GPU does not have enough memory, reduce the batch_size_per_gpu parameter.
!python run.py --config_file=example_configs/text2speech/wavenet_float.py --mode=train

Once training is done (this can take a while on a single GPU), you can run inference. 
To do so, first create a csv file named test.csv in the same location as train.csv with lines 
in the following format:

        UNUSED | UNUSED | This is an example sentence that I want to generate.

You can put as many lines inside the csv as you want. 
The model will produce one audio sample per line and save the audio sample inside your log_dir. 
Lastly, run:

In [0]:
!python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=infer --infer_output_file=unused

For WaveNet, only interactive infer is supported. First, replace the contents of the first box of the notebook with tacotron_save_spec.py. This will save the spectrogram generated 
by Tacotron as a numpy array in spec.npy. 

Next, replace the contents of the first box with wavenet_naive_infer.py 
and re-run the notebook. The generated audio will be saved to result/sample_step0_infer.wav every 1000 steps. 
Note that this will take some time.

        The Tacotron-GST model extends Tacotron 2 with Global Style Tokens (see also the paper). 
        We differ from the published paper in that we use Tacotron 2 from OpenSeq2Seq as opposed to Tacotron.

Training Instructions:

        1) Extract the dataset to a directory
        2) Change data_root inside tacotron_gst_combine_csv.py to point to where the dataset was extracted.
        3) Run tacotron_gst_combine_csv.py inside the scripts directory. 
           The script will merge all the metadata csv files into one large train csv file.
        4) Change line 15 of tacotron_gst.py so that dataset_location points to where the dataset was extracted
        5) Train the model by running:

In [0]:
!python run.py --config_file=example_configs/text2speech/tacotron_gst.py --mode=train

Inference is similar to Tacotron infer, except that tacotron-gst additionally requires a style wav inside the infer csv. The infer csv should contain lines in the following format: 

        path/to/style.wav | UNUSED | This is an example sentence that I want to generate. 

### Machine Translation

Next let’s build a small English-German translation model. This model should train in a reasonable time on a single GPU.

In [0]:
# Download (this will take some time)
!scripts/get_en_de.sh

        This script will download English-German training data from WMT, clean it, and tokenize using Google’s Sentencepiece library. 
        By default, the vocabulary size we use is 32,768 for both English and German.

        To train a small English-German model, change data_root inside en-de-nmt-small.py to the WMT data location and adjust 
        num_gpus to train on more than one GPU (if available).    

In [0]:
# Start training
!python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=train_eval

In [0]:
# Once training is done (this can take a while on a single GPU), you can run inference:
!python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=infer --infer_output_file=raw.txt --num_gpus=1

        Note that the model output is tokenized. In our case it will output BPE segments instead of words.
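
To give an idea of what detokenization involves, here is a minimal sketch that does not use the SentencePiece library. SentencePiece marks the start of each word with the character "▁" (U+2581), so joining the pieces and replacing that marker with a space recovers plain text. The function name and the example pieces are assumptions. In practice, use tokenizer_wrapper.py.

```python
WORD_BOUNDARY = "\u2581"  # the marker SentencePiece uses in place of a space


def join_pieces(pieces):
    """Concatenate subword pieces and turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


pieces = ["\u2581Das", "\u2581ist", "\u2581ein", "\u2581Bei", "spiel", "."]
print(join_pieces(pieces))  # Das ist ein Beispiel.
```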

In [0]:
# The next step is to detokenize
!python tokenizer_wrapper.py --mode=detokenize --model_prefix=.../Data/wmt16_de_en/m_common --decoded_output=result.txt --text_input=raw.txt

In [0]:
# We measure BLEU scores using SacreBLEU package: (A Call for Clarity in Reporting BLEU Scores) 
# Run SacreBleu on detokenized data:
!cat result.txt | sacrebleu -t wmt14 -l en-de > result.txt.BLEU

All models have been trained with a specific version of the tokenizer, so the first step is to copy m_common.model and m_common.vocab to the current folder.

To translate your English text source.txt to German you should:

In [0]:
# 1. tokenize source.txt into source.tok:
!python tokenizer_wrapper.py --mode=encode --model_prefix=m_common  --text_input=source.txt --tokenized_output=source.tok --vocab_size=32768

In [0]:
# 2. modify model config.py

        base_params = {
        "use_horovod": False,
        "num_gpus": 1,
        ...
        "logdir": "checkpoint/model",
        }
        ...
        infer_params = {
        "batch_size_per_gpu": 256,
        "data_layer": ParallelTextDataLayer,
        "data_layer_params": {
            "src_vocab_file": "m_common.vocab",
            "tgt_vocab_file": "m_common.vocab",
            "source_file": "source.tok",
            "target_file": "source.tok", # this line will be ignored
            "delimiter":   " ",
            "shuffle":     False,
            "repeat":      False,
            "max_length":  1024,
        },
        }
        ...

In [0]:
# 3. translate source.tok into output.tok:
!python run.py --config_file=config.py --mode=infer --logdir=checkpoint/model  --infer_output_file=output.tok --num_gpus=1

In [0]:
# 4. detokenize output.tok:
!python tokenizer_wrapper.py --mode=detokenize --model_prefix=m_common --text_input=output.tok --decoded_output=output.txt

Transformer model is based solely on attention mechanisms, without any recurrent or convolutional layers. 

A common source and target vocabulary is used to share the input/output embedding. Tokenization of input and output is done with SentencePiece.

It performs very well on neural machine translation tasks: the base configuration achieves a SacreBLEU score of 26.4 on the WMT 2014 English-to-German translation task (checkpoint), while the big model gets around 27.5 (checkpoint).

In [0]:
# This model is based on the Google Transformer, which was introduced in "Attention Is All You Need" by A. Vaswani et al.

# Here is an example command of how to train such model on a 4-GPU machine:
!mpirun --allow-run-as-root --mca orte_base_help_aggregate 0 -mca btl ^openib -np 4 -H localhost:4 -bind-to none --map-by slot -x LD_LIBRARY_PATH python run.py --config_file=example_configs/text2text/en-de/transformer-bp-fp32.py --mode=train

In [0]:
# Then run inference like this
!python run.py --config_file=example_configs/text2text/en-de/transformer-bp-fp32.py --mode=infer --infer_output_file=raw_fp32.txt --num_gpus=1 --use_horovod=False

In [0]:
# De-tokenize output
!python tokenizer_wrapper.py --mode=detokenize --model_prefix=wmt16_de_en/m_common --decoded_output=fp32.txt --text_input=raw_fp32.txt

In [0]:
# And compute BLEU score
!cat fp32.txt | sacrebleu -t wmt14 -l en-de > fp32.BLEU

        You should get around 26.4 after 300K iterations for the base model.

### Language Model

The WikiText-103 dataset, developed by Salesforce, contains over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. It has 267,340 unique tokens that appear at least 3 times in the dataset. Since it has full-length Wikipedia articles, the dataset is well-suited for tasks that can benefit from long-term dependencies, such as language modeling.
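
The "appear at least 3 times" threshold is a standard way to bound a word-level vocabulary. A minimal sketch with `collections.Counter` follows. The function name and the toy sentence are assumptions, and this is not the preprocessing used to build WikiText-103.

```python
from collections import Counter


def build_vocab(tokens, min_count=3):
    """Keep tokens that occur at least `min_count` times, most frequent first."""
    counts = Counter(tokens)
    return [tok for tok, n in counts.most_common() if n >= min_count]


text = "the cat saw the dog and the dog saw the cat while a bird saw nothing"
print(build_vocab(text.split()))  # ['the', 'saw']
```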

You can download the datasets here <https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/> and extract them to the location of your choice. The dataset should consist of 3 files for train, validation, and test. Don’t forget to update the data_root parameter in your config file to point to the location of your dataset.

Next let’s create a simple LSTM language model by defining a config file for it or using one of the config files defined in example_configs/lstmlm.

        1) change data_root to point to the directory containing the raw dataset used to train your 
           language model, for example, your WikiText dataset downloaded above.
        2) change processed_data_folder to point to the location where you want to store the processed dataset.
           If the dataset has been pre-processed before, the data layer can just load the data from this location.
        3) update other hyper parameters such as number of layers, number of hidden units, cell type, 
           loss function, learning rate, optimizer, etc. to meet your needs.
        4) choose dtype to be "mixed" if you want to use mixed-precision training, 
           or tf.float32 to train only in FP32.

In [0]:
# For example, your config file is lstm-wkt103-mixed.py. To train without Horovod, 
# update use_horovod to False in the config file and run:
!python run.py --config_file=example_configs/lstmlm/lstm-wkt103-mixed.py --mode=train_eval --enable_logs

In [0]:
# When training with Horovod, use the following command:
!mpiexec --allow-run-as-root -np <num_gpus> python run.py --config_file=example_configs/lstmlm/lstm-wkt103-mixed.py --mode=train_eval --use_horovod=True --enable_logs

Some things to keep in mind:

        1) Don’t forget to update num_gpus to the number of GPUs you want to use.
        2) If the vocabulary is large (the word-level vocabulary for WikiText-103 is 267,000+), you might want to use 
           BasicSampledSequenceLoss, which uses sampled softmax, instead of BasicSequenceLoss, which uses full softmax.
        3) If your GPUs still run out of memory, reduce the batch_size_per_gpu

In [0]:
# Even if your training is done using sampled softmax, evaluation and text generation will always
# be done using full softmax. Running in the mode eval will evaluate your model on the evaluation set:
!python run.py --config_file=example_configs/lstmlm/lstm-wkt103-mixed.py --mode=eval --enable_logs

In [0]:
# Running in the mode infer will generate text from the seed tokens, defined in the config file under the parameter name seed_tokens, each seed token should be separated by space. 
# [TODO: make seed_tokens take a list of strings instead]:
!python run.py --config_file=example_configs/lstmlm/lstm-wkt103-mixed.py --mode=infer --enable_logs

### Sentiment Analysis

The model we use for sentiment analysis is the same one we use for the LSTM language model, except that the last output dimension is the number of sentiment classes instead of the vocabulary size. This sameness allows the sentiment analysis model to use the model pretrained on the language model for this task. You can choose to train the sentiment analysis task from scratch, or from the pretrained language model.

In this model, each source sentence is run through the LSTM cells. The last hidden state at the end of the sequence is then passed into the output projection layer before softmax is performed to get the predicted sentiment. If the parameter use_cell_state is set to True, the last cell state at the end of the sequence is concatenated to the last hidden state.
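
The classification head described above can be sketched in plain Python. The function names, the weight matrix and the state vectors are toy assumptions. The real model learns its projection and runs inside TensorFlow.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted for numerical stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(last_hidden, last_cell, weights, bias, use_cell_state=False):
    """Project the final LSTM state onto class logits and apply softmax."""
    # With use_cell_state, the cell state is concatenated to the hidden state.
    features = last_hidden + last_cell if use_cell_state else last_hidden
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


last_hidden = [0.5, -1.0]
last_cell = [2.0, 0.0]
# One weight row per sentiment class; 4 inputs because use_cell_state=True.
weights = [[1.0, 0.0, 0.5, 0.0],
           [0.0, 1.0, 0.0, 0.5]]
bias = [0.0, 0.0]
probs = classify(last_hidden, last_cell, weights, bias, use_cell_state=True)
print(probs)
```

Note that concatenating the cell state doubles the input size of the projection layer.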



        The IMDB Dataset contains 50,000 labeled samples of much longer length. The median length is 205 tokens. 
        Half of them are deemed positive and the other half negative. The train set, which contains 25,000 samples, is 
        separated into a train set of 24,000 samples and a validation set of 1,000 samples. The data layer used to process 
        this dataset is called SSTDataLayer. The dataset can be downloaded here <http://ai.stanford.edu/~amaas/data/sentiment/>.

Next let’s create a simple LSTM language model by defining a config file for it or using one of the config files defined in example_configs/transfer.

        - if you want to use a pretrained language model specify the location of the pretrained language model 
          using the parameter load_model.

        - change data_root to point to the directory containing the raw dataset used to train your language model, for example, 
          the IMDB dataset downloaded above.

        - change processed_data_folder to point to the location where you want to store the processed dataset. 
          If the dataset has been pre-processed before, the data layer can just load the data from this location.

        - update other hyper parameters such as number of layers, number of hidden units, cell type, loss function, learning rate, 
          optimizer, etc. to meet your needs.

        - choose dtype to be "mixed" if you want to use mixed-precision training, or tf.float32 to train only in FP32.

In [0]:
# For example, your config file is imdb-wkt2.py. 
# To train without Horovod, update use_horovod to False in the config file and run:
!python run.py --config_file=example_configs/transfer/imdb-wkt2.py --mode=train_eval --enable_logs

In [0]:
# When training with Horovod, use the following command:
!mpiexec --allow-run-as-root -np <num_gpus> python run.py --config_file=example_configs/transfer/imdb-wkt2.py --mode=train_eval --use_horovod=True --enable_logs

Some things to keep in mind:

    - Don’t forget to update num_gpus to the number of GPUs you want to use.
    - If your GPUs run out of memory, reduce the batch_size_per_gpu parameter.

In [0]:
# Running in the mode eval will evaluate your model on the evaluation set:
!python run.py --config_file=example_configs/transfer/imdb-wkt2.py --mode=eval --enable_logs

In [0]:
# Running in the mode infer will evaluate your model on the test set:
!python run.py --config_file=example_configs/transfer/imdb-wkt2.py --mode=infer --enable_logs

        The performance of the model is reported on accuracy and F1 scores.
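
For reference, accuracy and F1 for binary sentiment labels can be computed as in the sketch below. The function name and the toy labels are assumptions, and the toolkit's own metric code may differ in detail (for example in how F1 is averaged).

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and F1 of the positive class for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return correct / len(pairs), f1


acc, f1 = accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0])
print(acc, f1)
```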

### Image Classification

Our ResNet-50 v2 model is a mixed-precision replica of TensorFlow ResNet-50, which corresponds to the model defined in the paper Identity Mappings in Deep Residual Networks by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Jul 2016.

This model was trained with different optimizers to state-of-the-art accuracy for the ResNet-50 model. Our best model reached top-1=77.63%, top-5=93.73% accuracy on the ImageNet classification task.
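
Top-1 and top-5 accuracy count a prediction as correct when the true label is among the 1 or 5 highest-scoring classes. The following plain-Python sketch illustrates the idea with a hypothetical helper and a three-class toy example, so it uses k=1 and k=2.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],  # true class 1: correct at top-1
    [0.5, 0.2, 0.3],  # true class 2: only correct at top-2
    [0.6, 0.3, 0.1],  # true class 2: wrong even at top-2
]
labels = [1, 2, 2]
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))
```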

        You will need to download the ImageNet dataset and convert it to TFRecord format as described in
        TensorFlow ResNet <https://github.com/tensorflow/models/tree/master/official/resnet>.

In [0]:
# Let’s train a model using SGD with momentum. To train model with 1 GPU with float precision:
!python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval

In [0]:
# If your GPU does not have enough memory, you can reduce the batch_size_per_gpu:
!python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --batch_size_per_gpu=32

In [0]:
# If you have 2 GPUs, then you can use “native” Tensorflow multi-GPU training by setting num_gpus:
!python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --use_horovod=False --num_gpus=2

In [0]:
# or you can use Horovod (-np flag defines number of GPUs):
!mpirun --allow-run-as-root --mca orte_base_help_aggregate 0 -mca btl ^openib -np 2 -H localhost:8 -bind-to none --map-by slot -x LD_LIBRARY_PATH python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --use_horovod=True

In [0]:
# If you have Volta or Turing GPU which supports float16, you can speed-up training by using mixed precision:
!python run.py --config_file=example_configs/image2label/resnet-50-v2-mp.py --mode=train_eval --use_horovod=False --num_gpus=2