# `fairseq` Tutorial for Machine Translation

Install `fairseq` from [fairseq repository](https://github.com/pytorch/fairseq). Make sure you remember what version you are using.

In [1]:
import numpy as np
import pandas as pd

import fairseq
fairseq.__version__

'1.0.0a0+19793a7'

## Load Data

In [2]:
#read sample dataset
with open("../data/OpenSubtitles/OpenSubtitles_sample.th", "r", encoding='utf-8') as f:
    th_lines = f.readlines()
    th_lines = [i[:-1] for i in th_lines]
with open("../data/OpenSubtitles/OpenSubtitles_sample.zh_cn", "r", encoding='utf-8') as f:
    zh_lines = f.readlines()
    zh_lines = [i[:-1] for i in zh_lines]

In [3]:
df = pd.DataFrame({'zh':zh_lines, 'th':th_lines}).drop_duplicates()
df

Unnamed: 0,zh,th
0,记得自己的洗礼仪式 这可能吗?,คุณจำตอนพิธีล้างบาปของคุณได้ เป็นไปได้ยังไง?
1,不可能?,เป็นไปไม่ได้รึไง? แต่มันจริงนะ
2,可那是事实啊 是听大人们说的吧?,คุณได้ยินผู้ใหญ่เขาคุยกันรึเปล่า?
3,我能感受到透过玻璃的阳光,ฉันรู้สึกได้ถึงแสงอาทิตย์ลอดผ่านกระจกเข้ามา
4,我还记得爸爸的心跳声呢,ฉันยังจำเสียงหัวใจเต้นของพ่อได้
...,...,...
95,是你非常爱惜的,มันมีค่าต่อคุณมากนี่
96,偶尔会在恋爱时常常约会的饭店碰面,บางทีคุณก็เจอกันโดยบังเอิญใน ภัตตาคารที่เคยมาด...
97,也会在经常光顾的酒吧碰面,และเจอกันในบาร์ที่เคยไปเมาด้วยกัน ดื่ม
98,干,ดื่ม


## Clean Data

In this tutorial, we will assume the sample data is already cleaned by:
1. Deduplication
2. Filter out by similarity score
3. Filter out by word/character ratio
4. See more cleaning in [thai2nmt/scripts/clean_text.py](https://github.com/vistec-AI/thai2nmt/blob/master/scripts/clean_text.py)

## Split to Train/Validation/Test

In [4]:
np.random.seed(1112)
train_pct = 0.8
valid_pct = 0.9
random_idxs = np.random.choice(df.index.tolist(), size=df.shape[0]).tolist()
train_idxs = random_idxs[:int(df.shape[0]*train_pct)]
valid_idxs = random_idxs[int(df.shape[0]*train_pct):int(df.shape[0]*valid_pct)]
test_idxs = random_idxs[int(df.shape[0]*valid_pct):]
len(train_idxs), len(valid_idxs), len(test_idxs), len(random_idxs)

(79, 10, 10, 99)

In [6]:
train_df = df.iloc[train_idxs,:]
valid_df = df.iloc[valid_idxs,:]
test_df = df.iloc[test_idxs,:]

train_df.to_csv('../data/fairseq_tutorial/cleaned/train.csv',index=False)
valid_df.to_csv('../data/fairseq_tutorial/cleaned/valid.csv',index=False)
test_df.to_csv('../data/fairseq_tutorial/cleaned/test.csv',index=False)

train_df.shape, valid_df.shape, test_df.shape

((79, 2), (10, 2), (10, 2))

## Train `sentencepiece` Tokenizer

Only train on the training set.

In [7]:
#save deduped
train_df.to_csv('../data/fairseq_tutorial/cleaned/train.txt',header=None, index=None, sep='\n')

In [8]:
import sentencepiece as spm

In [11]:
#train
input_fname = '../data/fairseq_tutorial/cleaned/train.txt'
vocab_size = 500
tokenizer_name = 'sample_zhth'

spm_cmd = (f'--input={input_fname} '
           '--character_coverage=1.0 '
           f'--model_prefix={tokenizer_name} '
           f'--vocab_size={vocab_size} ')
spm.SentencePieceTrainer.train(spm_cmd)

In [13]:
#tokenize
sp = spm.SentencePieceProcessor()
sp.load(f'{tokenizer_name}.model')

True

In [14]:
sp.encode_as_pieces('มานั่งตรงนี้สิ')

['▁', 'มา', 'นั่', 'ง', 'ตร', 'ง', 'นี้', 'สิ']

In [15]:
sp.encode_as_pieces('和别人一起来吗?')

['▁', '和', '别', '人', '一', '起', '来', '吗', '?']

## Tokenize with Trained `sentencepiece`

In [16]:
def tokenize_and_save(df, output_fname, src, tgt, tokenizer):
    #tokenize
    df[f'{src}_tokenized'] = df[f'{src}'].map(tokenizer.encode_as_pieces)
    df[f'{tgt}_tokenized'] = df[f'{tgt}'].map(tokenizer.encode_as_pieces)
    
    #save to file
    with open(f'{output_fname}.{src}', 'w', encoding='utf-8') as f:
        for tokens in df[f'{src}_tokenized'].tolist():
            line = ' '.join(tokens)
            f.write(f"{line}\n")
    with open(f'{output_fname}.{tgt}', 'w', encoding='utf-8') as f:
        for tokens in df[f'{tgt}_tokenized'].tolist():
            line = ' '.join(tokens)
            f.write(f"{line}\n")

In [19]:
tokenize_and_save(train_df, '../data/fairseq_tutorial/tokenized/train', 'zh', 'th', sp)
tokenize_and_save(valid_df, '../data/fairseq_tutorial/tokenized/valid', 'zh', 'th', sp)
tokenize_and_save(test_df, '../data/fairseq_tutorial/tokenized/test', 'zh', 'th', sp)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


## Binarize with `fairseq-preprocess`

`fairseq` takes input as binarized files created by [fairseq-preprocess](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) command.

In [20]:
!fairseq-preprocess --source-lang zh --target-lang th \
    --trainpref ../data/fairseq_tutorial/tokenized/train \
    --validpref ../data/fairseq_tutorial/tokenized/valid \
    --testpref ../data/fairseq_tutorial/tokenized/test \
    --destdir ../data/fairseq_tutorial/binarized \
    --workers 30 \
    --nwordssrc 500 \
    --nwordstgt 500 --joined-dictionary

2021-05-30 18:10:42 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='../data/fairseq_tutorial/binarized', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=500, nwordstgt=500, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, simul_type=None, source_lang='zh', srcdict=None, suppress_crashes=False, ta

## Train Transformer with `fairseq-train`

`fairseq` takes input as binarized files created by [fairseq-train](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) command.

In [24]:
import wandb
wandb.init(project='fairseq-tutorial', entity='cstorm125')

[34m[1mwandb[0m: wandb version 0.10.31 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


In [28]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train ../data/fairseq_tutorial/binarizedR \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --skip-invalid-size-inputs-valid-test \
    --max-tokens 5000 \
    --save-dir ../data/fairseq_tutorial/models \
    --update-freq 16 \
    --wandb-project $WANDB_PROJECT \
    --fp16 \
    --keep-last-epochs 25 \
    --max-epoch 10 \
    --num-workers 0

2021-05-30 18:27:54 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': 'fairseq-tutorial', 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'mode

2021-05-30 18:27:54 | INFO | fairseq.tasks.translation | [zh] dictionary: 472 types
2021-05-30 18:27:54 | INFO | fairseq.tasks.translation | [th] dictionary: 472 types
2021-05-30 18:27:55 | INFO | fairseq_cli.train | TransformerModel(
  (encoder): TransformerEncoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(472, 512, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
  

2021-05-30 18:27:58 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
2021-05-30 18:27:58 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-05-30 18:27:58 | INFO | fairseq.utils | rank   0: capabilities =  7.0  ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB                    
2021-05-30 18:27:58 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-05-30 18:27:58 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-05-30 18:27:58 | INFO | fairseq_cli.train | max tokens per device = 5000 and max sentences per device = None
2021-05-30 18:27:58 | INFO | fairseq.trainer | Preparing to load checkpoint ../data/fairseq_tutorial/models/checkpoint_last.pt
2021-05-30 18:27:59 | INFO | fairseq.trainer | Loaded checkpoint ../data/fairseq_tutorial/models/checkpoint_last.pt (epoch 2 @ 1 upda

2021-05-30 18:28:48 | INFO | valid | epoch 006 | valid on 'valid' subset | loss 9.665 | nll_loss 9.665 | ppl 811.79 | wps 2837.3 | wpb 71.5 | bsz 5 | num_updates 6 | best_loss 9.665
2021-05-30 18:28:48 | INFO | fairseq.checkpoint_utils | Preparing to save checkpoint for epoch 6 @ 6 updates
2021-05-30 18:28:48 | INFO | fairseq.trainer | Saving checkpoint to ../data/fairseq_tutorial/models/checkpoint6.pt
2021-05-30 18:28:49 | INFO | fairseq.trainer | Finished saving checkpoint to ../data/fairseq_tutorial/models/checkpoint6.pt
2021-05-30 18:29:06 | INFO | fairseq.checkpoint_utils | Saved checkpoint ../data/fairseq_tutorial/models/checkpoint6.pt (epoch 6 @ 6 updates, score 9.665) (writing took 17.626613359999737 seconds)
2021-05-30 18:29:06 | INFO | fairseq_cli.train | end of epoch 6 (average epoch stats below)
2021-05-30 18:29:06 | INFO | train | epoch 006 | loss 9.627 | nll_loss 9.627 | ppl 790.75 | wps 61.3 | ups 0.06 | wpb 1093 | bsz 79 | num_updates 6 | lr 8.4985e-07 | gnorm 6.399 | l

## Make Predictions with `fairseq-interactive`

In [32]:
!cat ../data/fairseq_tutorial/tokenized/test.zh \
    | fairseq-interactive ../data/fairseq_tutorial/binarized \
      --task translation \
      --source-lang zh --target-lang th \
      --path ../data/fairseq_tutorial/models/checkpoint_best.pt \
      --buffer-size 2500 \
      --fp16 \
      --max-tokens 20000 \
      --beam 4 \
    > ../data/fairseq_tutorial/predictions/zhth_test.out

## Measure Performance with BLEU and chrF

To be added.