# Training En-Hi NMT model

We've pre-processed the dataset and can now proceed to train the NMT model.

## Defining paths

Let's define the paths to the pre-processed dataset and the directories where the trained models and intermediate artifacts will be saved.

In [None]:
# Dataset root
dataset_root = '../dataset/en-hi'
%env dataset_root = {dataset_root}

# English and Hindi training and validation sets
%env english_set_tokenized_train = {dataset_root}/final_train_norm_lf_1000_tk_moses.en
%env english_set_tokenized_val = {dataset_root}/final_val_norm_lf_1000_tk_moses.en
%env hindi_set_tokenized_train = {dataset_root}/final_train_norm_lf_1000_tk_indicnlp.hi
%env hindi_set_tokenized_val = {dataset_root}/final_val_norm_lf_1000_tk_indicnlp.hi

# Output directory
%env outdir = /preproc/
%env results = results
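Before training, it can be worth confirming that each source/target pair exists and is line-aligned, since a parallel corpus needs one target sentence per source sentence. The following is a minimal sketch; the helper names `count_lines` and `check_parallel` are hypothetical and not part of NeMo.

```python
from pathlib import Path


def count_lines(path):
    """Count lines in a file without loading it into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if a file is missing or the pair is misaligned."""
    for p in (src_path, tgt_path):
        if not Path(p).is_file():
            raise FileNotFoundError(f"Missing dataset file: {p}")
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"Line count mismatch: {n_src} (source) vs {n_tgt} (target)")
    return n_src
```

In this notebook, the paths could be read back from the environment, for example `check_parallel(os.environ["english_set_tokenized_train"], os.environ["hindi_set_tokenized_train"])` after `import os`.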

## Training shared BPE tokenizer and creating tarred dataset

Traditional tokenization methods are particularly problematic when dealing with misspellings and rare words. There have been many suggestions to tackle this problem, and one such strategy is sub-word tokenization.

Sub-word tokenization breaks words into smaller units, which lets the model handle words it has not seen before by composing them from known pieces. A sub-word unit can be a single character or a string of characters. Byte Pair Encoding (BPE) is a widely used sub-word technique: it repeatedly counts the most frequent pair of adjacent symbols in the data and merges that pair into a new symbol in the learned vocabulary. Other sub-word approaches include WordPiece and the unigram language model; the SentencePiece library implements both BPE and unigram.
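To make the merge procedure concrete, here is a toy character-level BPE learner. It is only an illustrative sketch, not the YouTokenToMe implementation; the function names `merge_pair`, `learn_bpe`, and `apply_bpe` are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe(["hug", "hug", "hug", "pug", "pun"], num_merges=2)
print(merges)                    # [('u', 'g'), ('h', 'ug')]
print(apply_bpe("bug", merges))  # ['b', 'ug']
```

The unseen word `bug` is still tokenized into known pieces, which is the property that makes sub-word vocabularies robust to rare words and misspellings.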

To facilitate sub-word tokenization, we will train a shared sub-word BPE tokenizer to tokenize both the English and Hindi sentences with a vocabulary size of 32k. NVIDIA NeMo currently supports the [YouTokenToMe](https://github.com/VKCOM/YouTokenToMe) BPE tokenizer.

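In addition to training the sub-word BPE tokenizer, we will also create a tarred version of the parallel training dataset. A tarred dataset stores the tokenized, pre-batched data in tar archives that are streamed during training, which can substantially increase training throughput on large corpora.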


Feel free to change the following parameters to suit your system configuration:
1. `--tokens_in_batch` (set to `12500` here)
2. `--n_preproc_jobs` (set to `2` here)
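The `--tokens_in_batch` option sets a token budget per batch rather than a fixed number of sentences. The sketch below illustrates the general idea of token-budget batching; it is not NeMo's actual batching code, and the function `batch_by_tokens` and its padded-size cost model are assumptions made for illustration.

```python
def batch_by_tokens(lengths, tokens_in_batch):
    """Group sentence indices so that n_sentences * longest_sentence stays within the budget."""
    # Sorting by length keeps similarly sized sentences together and reduces padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, max_len = [], [], 0
    for i in order:
        new_max = max(max_len, lengths[i])
        # Start a new batch if adding this sentence would exceed the padded token budget.
        if current and new_max * (len(current) + 1) > tokens_in_batch:
            batches.append(current)
            current, new_max = [], lengths[i]
        current.append(i)
        max_len = new_max
    if current:
        batches.append(current)
    return batches


print(batch_by_tokens([5, 3, 8, 2, 7], tokens_in_batch=16))  # [[3, 1, 0], [4, 2]]
```

A larger budget packs more sentences into each batch and uses more GPU memory, which is why the value is worth tuning to the available hardware.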

In [None]:
!python create_tarred_parallel_dataset.py --shared_tokenizer --src_fname $english_set_tokenized_train --tgt_fname $hindi_set_tokenized_train --out_dir $outdir --encoder_tokenizer_vocab_size 32000 --decoder_tokenizer_vocab_size 32000 --max_seq_length 512 --tokens_in_batch 12500 --n_preproc_jobs 2 

In [None]:
# Moving the preproc folder to current directory
!mv $outdir ./
%env outdir = preproc

## Training NMT model

Now that all the prerequisite steps are complete, we can train a Transformer model for English-to-Hindi machine translation. The encoder and decoder each have 6 layers with 8 attention heads per layer and a word embedding size of 512.
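As a quick back-of-the-envelope check on these sizes, the snippet below derives the per-head dimension and two rough parameter counts. It uses only the numbers stated in this notebook, ignores biases, and is an approximation rather than a count taken from the actual NeMo model.

```python
hidden_size = 512    # word embedding size
num_heads = 8        # attention heads per layer
vocab_size = 32000   # shared BPE vocabulary size

# Each head attends over an equal slice of the hidden dimension.
head_dim = hidden_size // num_heads

# One embedding table shared by the 32k-entry vocabulary.
embedding_params = vocab_size * hidden_size

# Query, key, value, and output projections of one attention block (biases ignored).
attention_params_per_block = 4 * hidden_size * hidden_size

print(head_dim)                    # 64
print(embedding_params)            # 16384000
print(attention_params_per_block)  # 1048576
```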

The remaining hyperparameters come from the AAYN-base ("Attention Is All You Need", base) architecture configuration, which is passed as a YAML file through the `-cn` option. The default configuration file for the model can be found [here](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/machine_translation/conf/aayn_base.yaml). Feel free to experiment with the beam size and length penalty parameters, which control decoding during validation and inference.
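To build some intuition for the length penalty, the sketch below ranks candidate translations with a GNMT-style penalty, `((5 + length) / 6) ** alpha`. The formula choice and the helper names `length_penalty` and `rank_hypotheses` are assumptions for illustration; check the NeMo source for the exact scoring it applies.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty; alpha = 0 disables length normalization."""
    return ((5.0 + length) / 6.0) ** alpha


def rank_hypotheses(hyps, alpha):
    """Sort (tokens, log_prob) pairs by length-normalized score, best first."""
    return sorted(
        hyps,
        key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
        reverse=True,
    )


hyps = [
    (["a", "b"], -2.0),                           # short, higher raw log-probability
    (["a", "b", "c", "d", "e", "f", "g"], -3.0),  # long, lower raw log-probability
]
print(len(rank_hypotheses(hyps, alpha=0.0)[0][0]))  # 2: raw score favors the short output
print(len(rank_hypotheses(hyps, alpha=1.0)[0][0]))  # 7: normalization favors the long output
```

A larger penalty exponent therefore nudges beam search toward longer translations, while a larger beam size keeps more candidates alive at each step at the cost of slower decoding.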

Let's train the model with the following command.

This command will save a `.nemo` file to `exp_manager.exp_dir` that contains the model’s architecture and trained weights. By default, the model is saved as `AAYNBase.nemo`. The argument `trainer.devices` specifies which GPUs to use for training; the list `[0,1,2,3]` selects four GPUs by index. Since we are setting `model.train_ds.use_tarred_dataset=true`, this script will use the parallel tarred dataset that we just created in order to scale the dataloaders during training. We have also specified the path to the trained sub-word BPE tokenizer in `model.encoder_tokenizer.tokenizer_model` and `model.decoder_tokenizer.tokenizer_model`.


In [None]:
!HYDRA_FULL_ERROR=1 python enc_dec_nmt.py \
  -cn aayn_base \
  model.preproc_out_dir=$outdir \
  model.train_ds.use_tarred_dataset=true \
  model.train_ds.metadata_file=$outdir/metadata.tokens.12500.json \
  model.train_ds.tokens_in_batch=12500 \
  model.validation_ds.tokens_in_batch=8192 \
  model.validation_ds.src_file_name=$english_set_tokenized_val \
  model.validation_ds.tgt_file_name=$hindi_set_tokenized_val \
  model.encoder_tokenizer.vocab_size=32000 \
  model.decoder_tokenizer.vocab_size=32000 \
  ~model.test_ds \
  model.max_generation_delta=5 \
  model.shared_tokenizer=true \
  model.encoder_tokenizer.tokenizer_model=$outdir/shared_tokenizer.32000.BPE.model \
  model.decoder_tokenizer.tokenizer_model=$outdir/shared_tokenizer.32000.BPE.model \
  trainer.devices=[0,1,2,3] \
  ~trainer.max_epochs \
  +trainer.max_steps=150000 \
  +exp_manager.exp_dir=$results \
  +exp_manager.create_checkpoint_callback=True \
  +exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU \
  +exp_manager.checkpoint_callback_params.mode=max \
  +exp_manager.checkpoint_callback_params.save_top_k=5

The trained model will be saved at `$results/AAYNBase/<date-time_of_training>/checkpoints/AAYNBase.nemo`

We'll also copy this model to another folder, `model`, for ease of reference during deployment.

In [None]:
# Replace <date-time_of_training> with the timestamped directory created by your training run
!mkdir -p ../model
!cp $results/AAYNBase/<date-time_of_training>/checkpoints/AAYNBase.nemo ../model/
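Rather than typing the timestamped directory by hand, the most recent checkpoint can also be located programmatically. This is a small sketch that assumes the directory layout described above; the helpers `latest_nemo_checkpoint` and `copy_to_model_dir` are hypothetical and not part of NeMo.

```python
import shutil
from pathlib import Path


def latest_nemo_checkpoint(results_dir, model_name="AAYNBase"):
    """Return the most recently modified <model_name>.nemo across all training runs."""
    pattern = f"{model_name}/*/checkpoints/{model_name}.nemo"
    candidates = list(Path(results_dir).glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"No {model_name}.nemo found under {results_dir}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


def copy_to_model_dir(checkpoint, model_dir="../model"):
    """Copy the checkpoint into model_dir, creating the directory if needed."""
    dest = Path(model_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(checkpoint, dest))
```

For example, `copy_to_model_dir(latest_nemo_checkpoint("results"))` would copy the newest `AAYNBase.nemo` into `../model/`.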