# FAIRSEQ

from https://fairseq.readthedocs.io/en/latest/getting_started.html

"Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks." It provides reference implementations of various sequence-to-sequence models, which makes our lives much easier!

In [1]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

## Installation

In [4]:
! pip install --upgrade fairseq
!pip install sacremoses subword_nmt

Requirement already up-to-date: fairseq in /home/aims/anaconda3/envs/aims/lib/python3.7/site-packages (0.9.0)


## Downloading some data and required scripts

In [6]:
! bash data/prepare-wmt14en2fr.sh

Cloning Moses github repository (for tokenization scripts)...
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 147544 (delta 12), reused 15 (delta 5), pack-reused 147514
Receiving objects: 100% (147544/147544), 129.75 MiB | 1.91 MiB/s, done.
Resolving deltas: 100% (113998/113998), done.
Checking out files: 100% (3467/3467), done.
Cloning Subword NMT repository (for BPE pre-processing)...
Cloning into 'subword-nmt'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 551 (delta 14), reused 23 (delta 7), pack-reused 509
Receiving objects: 100% (551/551), 328.51 KiB | 1001.00 KiB/s, done.
Resolving deltas: 100% (320/320), done.
--2020-03-24 11:33:47--  http://statmt.org/wmt14/training-parallel-nc-v9.tgz
Resolving statmt.org (statmt.org)... 1

## Pretrained Model Evaluation

Let's first see how to evaluate a pretrained model in fairseq. We'll download a pretrained model along with its vocabulary.

In [7]:
! curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1909M  100 1909M    0     0  5444k      0  0:05:59  0:05:59 --:--:-- 5091k
wmt14.en-fr.fconv-py/
wmt14.en-fr.fconv-py/model.pt
wmt14.en-fr.fconv-py/dict.en.txt
wmt14.en-fr.fconv-py/dict.fr.txt
wmt14.en-fr.fconv-py/bpecodes
wmt14.en-fr.fconv-py/README.md


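Before the model can translate a sentence, the input must be normalized, tokenized and BPE-encoded in the same way as the training data. We have written a script for this, but as an example, let's do it directly in the Jupyter notebook.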

In [8]:
sentence = 'Why is it rare to discover new marine mammal species ?'

In [9]:
%%bash -s "$sentence"
SCRIPTS=data/mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=data/subword-nmt
BPE_TOKENS=40000
src=en
tgt=fr
echo "$1" | \
            perl $NORM_PUNC $src | \
            perl $REM_NON_PRINT_CHAR | \
            perl $TOKENIZER -threads 8 -a -l $src > temp_tokenized.out         
prep=wmt14.en-fr.fconv-py
BPE_CODE=$prep/bpecodes
python $BPEROOT/apply_bpe.py -c $BPE_CODE < temp_tokenized.out > final_result.out
rm temp_tokenized.out
cat final_result.out
rm final_result.out

Why is it rare to discover new marine mam@@ mal species ?


Tokenizer Version 1.1
Language: en
Number of threads: 8
  args.codes = codecs.open(args.codes.name, encoding='utf-8')


Let's now look at the very cool interactive feature of fairseq. Open a shell, cd to this directory, and type the following command:

In [10]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Why is it rare to discover new marine mam@@ mal species ?" | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --beam 1 --source-lang en --target-lang fr

Namespace(beam=1, bpe=None, buffer_size=1, cpu=False, criterion='cross_entropy', data='wmt14.en-fr.fconv-py', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', input='-', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, load_alignments=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=1, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', p

This generation script produces three types of outputs: a line prefixed with O is a copy of the original source sentence; H is the hypothesis along with an average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker which is omitted from the text. The text is still BPE-segmented, so let's remove the @@ continuation markers in bash, using the source sentence as an example.

In [11]:
!  echo "Why is it rare to discover new marine mam@@ mal species ?" | sed -r 's/(@@ )|(@@ ?$)//g' 

Why is it rare to discover new marine mammal species ?
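
The same clean-up can be done without leaving Python. The following is a minimal sketch that mirrors the `sed` expression above with the standard `re` module; the helper name `remove_bpe` is hypothetical and not part of fairseq.

```python
import re


def remove_bpe(text, marker="@@"):
    """Join BPE pieces by deleting the continuation marker.

    Mirrors the sed expression 's/(@@ )|(@@ ?$)//g': the marker followed by
    a space is removed anywhere, and a dangling marker is removed at the end.
    """
    pattern = re.compile(r"({0} )|({0} ?$)".format(re.escape(marker)))
    return pattern.sub("", text)


segmented = "Why is it rare to discover new marine mam@@ mal species ?"
restored = remove_bpe(segmented)
print(restored)
```

Only the marker and the space that follows it are removed, so ordinary word boundaries are left untouched.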


All good! Now let's train a new model.

## Training

### Data Preprocessing

Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German). We will work with a part of WMT 2014 (English-French), as in the previous section.

To pre-process the WMT 2014 English-French dataset, run <code>bash prepare-wmt14en2fr.sh</code> as we did in the previous section. This will download the data, tokenize it, apply byte pair encoding and split the data into train and test sets.

To binarize the data, we do the following:
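
A minimal sketch of this step, assuming the pre-processed text lives under a hypothetical `data/wmt14_en_fr` directory with `train`, `valid` and `test` file prefixes. The directory, the prefixes and the worker count are assumptions; the flags belong to the `fairseq-preprocess` command-line tool.

```python
import shlex

# Assumption: location of the tokenized, BPE-encoded text files.
text_dir = "data/wmt14_en_fr"
dest_dir = "data-bin/wmt14_en_fr"

preprocess_cmd = [
    "fairseq-preprocess",
    "--source-lang", "en",
    "--target-lang", "fr",
    "--trainpref", f"{text_dir}/train",
    "--validpref", f"{text_dir}/valid",
    "--testpref", f"{text_dir}/test",
    "--destdir", dest_dir,
    "--workers", "4",  # assumption: adjust to the number of CPU cores
]

# Print the command so it can be copied into a shell or a `!` cell.
print(" ".join(shlex.quote(part) for part in preprocess_cmd))

# To run it from the notebook instead:
# import subprocess
# subprocess.run(preprocess_cmd, check=True)
```

The destination directory is the one that `fairseq-train` and `fairseq-generate` later receive as their first argument.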

Of course, we cannot read the binary files directly, but let's check what is in the dictionary.

In [18]:
! ls data-bin/wmt14_en_fr/


In [12]:
! head -5 data-bin/wmt14_en_fr/dict.fr.txt

de 241097
, 209932
. 163838
la 142626
les 109031
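
Each dictionary line is a token followed by its frequency in the training data. As a small sketch, the lines shown above can be parsed in plain Python; the helper name `read_dictionary` is hypothetical, and the sample is limited to the five lines printed by `head`.

```python
from io import StringIO

# The five lines printed above, used as a stand-in for the real file.
sample = StringIO("de 241097\n, 209932\n. 163838\nla 142626\nles 109031\n")


def read_dictionary(lines):
    """Parse 'token count' lines into a list of (token, count) pairs."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last space so the token itself is kept intact.
        token, count = line.rsplit(" ", 1)
        entries.append((token, int(count)))
    return entries


entries = read_dictionary(sample)
for token, count in entries:
    print(f"{token!r}\t{count}")

# To read the real file instead:
# with open("data-bin/wmt14_en_fr/dict.fr.txt", encoding="utf-8") as f:
#     entries = read_dictionary(f)
```

The entries appear in descending order of frequency, which is why punctuation and short function words come first.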


## Model

Fairseq provides many predefined architectures to choose from. For English-French, we will choose an architecture known to work well for the problem. In the next section, we will see how to define custom models in fairseq.

In [20]:
! mkdir -p fairseq_models/checkpoints/fconv_wmt_en_fr
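
A minimal sketch of a training command for the convolutional `fconv_wmt_en_fr` architecture. The hyperparameters follow the style of fairseq's convolutional sequence-to-sequence example and are assumptions, not necessarily the settings used for the results reported in this notebook.

```python
import shlex

data_dir = "data-bin/wmt14_en_fr"
save_dir = "fairseq_models/checkpoints/fconv_wmt_en_fr"

train_cmd = [
    "fairseq-train", data_dir,
    "--arch", "fconv_wmt_en_fr",
    "--dropout", "0.1",
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",
    "--optimizer", "nag",
    "--clip-norm", "0.1",
    "--lr", "0.5",
    "--lr-scheduler", "fixed",
    "--force-anneal", "50",
    "--max-tokens", "3000",  # assumption: lower this if the GPU runs out of memory
    "--save-dir", save_dir,
]

# Print the command so it can be copied into a shell or a `!` cell.
print(" ".join(shlex.quote(part) for part in train_cmd))

# To launch training from the notebook instead:
# import subprocess
# subprocess.run(train_cmd, check=True)
```

Checkpoints are written to the directory created above, which is where the generation step looks for `checkpoint_best.pt`.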

## Generating and Checking BLEU for our model

In [24]:
! pip install sacrebleu



In [25]:
! mkdir -p fairseq_models/logs

In [26]:
%%bash
fairseq-generate data-bin/wmt14_en_fr  \
  --path fairseq_models/checkpoints/fconv_wmt_en_fr/checkpoint_best.pt \
  --beam 1 --batch-size 128 --remove-bpe --sacrebleu >> fairseq_models/logs/our_model.out

In [27]:
! head -10 fairseq_models/logs/our_model.out

Namespace(beam=1, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/wmt14_en_fr', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, load_alignments=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='fairseq_models/chec

In [19]:
! tail -2 fairseq_models/logs/our_model.out

| Translated 3003 sentences (97117 tokens) in 54.2s (55.45 sentences/s, 1793.20 tokens/s)
| Generate test with beam=1: BLEU = 26.30 54.1/32.2/21.2/13.9 (BP = 0.982 ratio = 0.982 hyp_len = 101496 ref_len = 103343)


### Generating and Checking BLEU for the large Pretrained Model

In [20]:
%%bash
fairseq-generate data-bin/wmt14.en-fr.newstest2014  \
  --path wmt14.en-fr.fconv-py/model.pt \
  --beam 1 --batch-size 128 --remove-bpe --sacrebleu >> fairseq_models/logs/pretrained_model.out



In [21]:
! head -10 fairseq_models/logs/pretrained_model.out

Namespace(beam=1, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/wmt14.en-fr.newstest2014', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, load_alignments=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='wmt14.

In [22]:
! tail -2 fairseq_models/logs/pretrained_model.out

| Translated 3003 sentences (95125 tokens) in 52.2s (57.52 sentences/s, 1822.08 tokens/s)
| Generate test with beam=1: BLEU = 43.12 69.0/50.5/39.0/30.4 (BP = 0.956 ratio = 0.957 hyp_len = 95480 ref_len = 99747)
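
To compare runs programmatically, the summary line can be parsed with a regular expression. This is a sketch that assumes the exact line layout shown in the two logs above; the helper name `parse_bleu` is hypothetical.

```python
import re

BLEU_LINE = re.compile(
    r"BLEU = (?P<bleu>[\d.]+) (?P<precisions>[\d./]+) "
    r"\(BP = (?P<bp>[\d.]+) ratio = (?P<ratio>[\d.]+) "
    r"hyp_len = (?P<hyp_len>\d+) ref_len = (?P<ref_len>\d+)\)"
)


def parse_bleu(line):
    """Extract the BLEU score and its components from a summary line."""
    match = BLEU_LINE.search(line)
    if match is None:
        raise ValueError("no BLEU summary found in line")
    return {
        "bleu": float(match["bleu"]),
        "precisions": [float(p) for p in match["precisions"].split("/")],
        "bp": float(match["bp"]),
        "hyp_len": int(match["hyp_len"]),
        "ref_len": int(match["ref_len"]),
    }


ours = parse_bleu(
    "| Generate test with beam=1: BLEU = 26.30 54.1/32.2/21.2/13.9 "
    "(BP = 0.982 ratio = 0.982 hyp_len = 101496 ref_len = 103343)"
)
pretrained = parse_bleu(
    "| Generate test with beam=1: BLEU = 43.12 69.0/50.5/39.0/30.4 "
    "(BP = 0.956 ratio = 0.957 hyp_len = 95480 ref_len = 99747)"
)
gap = pretrained["bleu"] - ours["bleu"]
print(f"pretrained - ours = {gap:.2f} BLEU")
```

The two runs report different reference lengths (103343 versus 99747 tokens), so the references were not segmented identically and the gap should be read as approximate.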


## Writing A Custom Model in FAIRSEQ

We will extend fairseq by adding a new FairseqModel that encodes a source sentence with an LSTM and then passes the final hidden state to a second LSTM that decodes the target sentence (without attention).

### Building an Encoder and Decoder

In this section we’ll define a simple LSTM Encoder and Decoder. All Encoders should implement the FairseqEncoder interface and Decoders should implement the FairseqDecoder interface. These interfaces themselves extend torch.nn.Module, so FairseqEncoders and FairseqDecoders can be written and used in the same ways as ordinary PyTorch Modules.

### Encoder

Our Encoder will embed the tokens in the source sentence, feed them to a torch.nn.LSTM and return the final hidden state.

### Decoder

Our Decoder will predict the next word, conditioned on the Encoder’s final hidden state and an embedded representation of the previous target word, a technique sometimes called teacher forcing. More specifically, we’ll use a torch.nn.LSTM to produce a sequence of hidden states that we’ll project to the size of the output vocabulary to predict each target word.
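
To make the conditioning on the previous target word concrete, here is a small pure-Python sketch of how decoder inputs and prediction targets line up during training. The helper name `teacher_forcing_pair` and the example tokens are hypothetical, and using the end-of-sentence symbol `</s>` as the first decoder input is an assumption about the data convention rather than something shown in this notebook.

```python
def teacher_forcing_pair(target_tokens, eos="</s>"):
    """Return (decoder_inputs, targets) for one training sentence.

    At step t the decoder receives decoder_inputs[t], i.e. the gold token
    from step t-1, and is trained to predict targets[t].
    """
    tokens = list(target_tokens)
    decoder_inputs = [eos] + tokens  # shifted right by one position
    targets = tokens + [eos]         # the model must also predict the end
    return decoder_inputs, targets


decoder_inputs, targets = teacher_forcing_pair(["Pourquoi", "est-il", "rare", "?"])
for step, (prev_word, next_word) in enumerate(zip(decoder_inputs, targets)):
    print(f"step {step}: input={prev_word!r} -> predict {next_word!r}")
```

Because the inputs are gold tokens rather than the model's own predictions, every time step can be trained in parallel; at inference time the previous prediction takes the place of the gold token.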

## Registering the Model

Now that we’ve defined our Encoder and Decoder we must register our model with fairseq using the register_model() function decorator. Once the model is registered we’ll be able to use it with the existing Command-line Tools.

All registered models must implement the BaseFairseqModel interface. For sequence-to-sequence models (i.e., any model with a single Encoder and Decoder), we can instead implement the FairseqModel interface.

Create a small wrapper class in the same file and register it in fairseq with the name 'simple_lstm':

Finally let’s define a named architecture with the configuration for our model. This is done with the register_model_architecture() function decorator. Thereafter this named architecture can be used with the --arch command-line argument, e.g., --arch tutorial_simple_lstm
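
The decorator pattern behind this registration can be illustrated with a toy registry in plain Python. This is a conceptual sketch only: the `toy_` decorators, the two dictionaries, the `build` helper and the default sizes are hypothetical stand-ins, and fairseq's real `register_model()` and `register_model_architecture()` do more than what is shown here.

```python
from types import SimpleNamespace

# Toy registries (hypothetical); fairseq maintains its own internally.
MODEL_REGISTRY = {}
ARCH_REGISTRY = {}


def toy_register_model(name):
    """Class decorator that stores a model class under a name."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


def toy_register_model_architecture(model_name, arch_name):
    """Function decorator that ties a named configuration to a model."""
    def decorator(fn):
        if model_name not in MODEL_REGISTRY:
            raise ValueError(f"unknown model {model_name!r}")
        ARCH_REGISTRY[arch_name] = (MODEL_REGISTRY[model_name], fn)
        return fn
    return decorator


@toy_register_model("simple_lstm")
class ToySimpleLSTMModel:
    def __init__(self, args):
        self.args = args


@toy_register_model_architecture("simple_lstm", "tutorial_simple_lstm")
def tutorial_simple_lstm(args):
    # Fill in defaults only for options the user did not set (illustrative values).
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 256)
    args.encoder_hidden_dim = getattr(args, "encoder_hidden_dim", 256)


def build(arch_name, **overrides):
    """Look up an architecture by name, apply its defaults, build the model."""
    model_cls, apply_defaults = ARCH_REGISTRY[arch_name]
    args = SimpleNamespace(**overrides)
    apply_defaults(args)
    return model_cls(args)


model = build("tutorial_simple_lstm", encoder_hidden_dim=512)
print(type(model).__name__, model.args.encoder_embed_dim, model.args.encoder_hidden_dim)
```

The architecture function only sets values that are missing, so an option given explicitly (here `encoder_hidden_dim=512`) takes precedence over the named architecture's default.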

In [42]:
import fairseq
import os

fairseq_file = os.path.dirname(fairseq.__file__)
fairseq_path = os.path.join(fairseq_file, 'models')
print(fairseq_path)

/opt/anaconda3/lib/python3.7/site-packages/fairseq/models


In [48]:
%%bash -s "$fairseq_path"
cp fairseq_models/custom_models/simple_lstm.py $1

In [49]:
%%bash -s "$fairseq_path"
ls $1 | grep lstm

lstm.py
simple_lstm.py


## Training Our Custom Model

This is just to show how to train a custom model, so we'll only train it for 3 epochs.
Note that the WMT dataset is large, so a model normally needs to be trained for much longer. As we only train for 3 epochs, the BLEU score will be low.

In [46]:
! mkdir -p fairseq_models/checkpoints/tutorial_simple_lstm

%%bash
fairseq-train data-bin/wmt14_en_fr \
  --arch tutorial_simple_lstm \
  --encoder-dropout 0.2 --decoder-dropout 0.2 \
  --optimizer adam --lr 0.005 --lr-shrink 0.5 \
  --max-tokens 12000 \
  --max-epoch 3 --save-dir fairseq_models/checkpoints/tutorial_simple_lstm

In [61]:
!head -10 fairseq_models/logs/custom_model.out

Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/wmt14_en_fr', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, load_alignments=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='fairseq_models/chec

In [62]:
!tail -2 fairseq_models/logs/custom_model.out

| Translated 3003 sentences (100167 tokens) in 29.1s (103.16 sentences/s, 3441.03 tokens/s)
| Generate test with beam=5: BLEU = 4.37 17.9/5.9/3.3/1.1 (BP = 1.000 ratio = 1.191 hyp_len = 123060 ref_len = 103343)
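
The BP field in these summary lines is BLEU's brevity penalty, which only penalizes hypotheses that are shorter than the reference. As a worked check, the standard formula reproduces the values reported in the three logs of this notebook; the function name `brevity_penalty` and the run labels are hypothetical.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the hypothesis is at least as
    long as the reference, otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# (hyp_len, ref_len) pairs copied from the logs above.
runs = {
    "our fconv model": (101496, 103343),
    "pretrained fconv model": (95480, 99747),
    "custom LSTM model": (123060, 103343),
}

penalties = {}
for name, (hyp_len, ref_len) in runs.items():
    penalties[name] = brevity_penalty(hyp_len, ref_len)
    print(f"{name}: ratio = {hyp_len / ref_len:.3f}, BP = {penalties[name]:.3f}")
```

The custom model's output is about 19% longer than the reference, so it receives no brevity penalty; its low score comes from the n-gram precisions.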
