# FairSeq NMT Tutorial
**F**acebook **AI R**esearch **Seq**uence-to-Sequence Toolkit written in Python

A Fast, Extensible Toolkit for Sequence Modeling

Reference
- https://fairseq.readthedocs.io/en/latest/command_line_tools.html
- https://rlqof7ogm.toastcdn.net/references/2021_session_20.pdf
- https://github.com/matsunagadaiki151/FairseqTutorial/blob/main/FairseqTranslation.ipynb

## Mount Drive & Files

In [1]:
from google.colab import drive
drive.mount('/content/drive')

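import os
import sys

# Link a Drive folder of installed packages into the runtime and put it on sys.path.
my_path = '/content/notebooks'
if not os.path.islink(my_path):  # os.symlink raises FileExistsError when the cell is re-run
    os.symlink('/content/drive/MyDrive/AllforOne/package_collection', my_path)
sys.path.insert(0, my_path)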

Mounted at /content/drive


In [2]:
!nvidia-smi

Fri Mar 24 03:44:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 1. Dataset
AI Hub 'Korean-English pair corpus'  
https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=126

Same as OpenNMT Tutorial.  
https://github.com/Judy-Choi/NMT_Series/blob/main/Model/OpenNMT/NMT_ko-en.ipynb

All files can be downloaded from:  
https://drive.google.com/drive/folders/1xgNQaaEqJArx3iofoC4JJ8tRmfPSZTTM?usp=share_link


### Split Dataset
- Val(Dev) : 5,000
- Test : 3,000
- Train : 1,493,750
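
The split above can be reproduced with a shuffle followed by slicing. The following is a minimal sketch, not the code used to build the dataset: the helper name `split_parallel`, the fixed seed, and the toy sentence pairs are assumptions for illustration. For the real corpus the sizes would be `n_valid=5000` and `n_test=3000`, with the remainder used for training.

```python
import random

def split_parallel(pairs, n_valid, n_test, seed=0):
    """Shuffle sentence pairs and split them into train/valid/test lists."""
    if n_valid + n_test > len(pairs):
        raise ValueError("not enough pairs for the requested split sizes")
    shuffled = list(pairs)                      # keep the caller's list untouched
    random.Random(seed).shuffle(shuffled)       # fixed seed makes the split reproducible
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]         # everything left over is training data
    return train, valid, test

# Toy data standing in for the real Korean-English corpus.
pairs = [(f"ko sentence {i}", f"en sentence {i}") for i in range(20)]
train, valid, test = split_parallel(pairs, n_valid=5, n_test=3)
print(len(train), len(valid), len(test))  # 12 5 3
```

Shuffling source and target together as pairs keeps the two sides aligned, which is what the later preprocessing step relies on.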

## 2. Subword Tokenization
Same as OpenNMT Tutorial.
- https://github.com/Judy-Choi/NMT_Series/blob/main/Model/OpenNMT/NMT_ko-en.ipynb

SentencePiece
- Unsupervised text tokenizer and detokenizer
- Language-independent
- Does not rely on whitespace to separate words, so it works whether or not the text is spaced
- Alleviates the open-vocabulary (OOV) problem
- Supports the **BPE (Byte-Pair Encoding)** and **Unigram** language model algorithms
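
To make the BPE idea concrete, here is a toy sketch of how merges are learned: start from characters and repeatedly merge the most frequent adjacent pair. This is an illustration only, not SentencePiece's implementation (which works on raw text and handles whitespace itself); the function name `learn_bpe` and the example words are assumptions.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, no word-boundary marker)."""
    # Represent each distinct word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent stem `low` is a single symbol, while the rarer endings stay split into characters. This is how subword vocabularies cover unseen words without an explicit OOV token.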

## 3. Model
### FairSeq
https://github.com/facebookresearch/fairseq

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for:
- translation
- summarization
- language modeling
- other text generation tasks

This toolkit supports:
- Distributed training across multiple GPUs and machines
- Fast mixed-precision training and inference on modern GPUs
- PyTorch

### Install Fairseq

In [3]:
!git clone https://github.com/pytorch/fairseq

Cloning into 'fairseq'...
remote: Enumerating objects: 34534, done.
remote: Total 34534 (delta 0), reused 0 (delta 0), pack-reused 34534
Receiving objects: 100% (34534/34534), 24.04 MiB | 28.32 MiB/s, done.
Resolving deltas: 100% (25095/25095), done.


In [15]:
cd /content/fairseq

/content/fairseq


In [16]:
ls

CODE_OF_CONDUCT.md  fairseq_cli/       MANIFEST.in       scripts/
CONTRIBUTING.md     fairseq.egg-info/  pyproject.toml    setup.cfg
docs/               hubconf.py         README.md         setup.py
examples/           hydra_plugins/     RELEASE.md        tests/
fairseq/            LICENSE            release_utils.py  train.py


In [17]:
!pip install --editable ./

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/fairseq
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting bitarray
  Using cached bitarray-2.7.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (269 kB)
Collecting omegaconf<2.1
  Using cached omegaconf-2.0.6-py3-none-any.whl (36 kB)
Collecting hydra-core<1.1,>=1.0.7
  Using cached hydra_core-1.0.7-py3-none-any.whl (123 kB)
Collecting sacrebleu>=1.4.12
  Using cached sacrebleu-2.3.1-py3-none-any.whl (118 kB)
Collecting antlr4-python3-runtime==4.8
  Using cached antlr4-python3-runtime-4.8.tar.gz (112 kB)
  Preparing metadata (setup.py) ... done
Collecting colorama
  Using cached colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collect

### Install Libraries

In [7]:
!pip install pyproject-toml
# For large datasets
!pip install pyarrow
# For tensorboard log
!pip install tensorboardX

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorboardX
  Downloading tensorboardX-2.6-py2.py3-none-any.whl (114 kB)
     114.5/114.5 KB 2.7 MB/s eta 0:00:00
Installing collected packages: tensorboardX
Successfully installed tensorboardX-2.6
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyproject-toml
  Downloading pyproject_toml-0.0.10-py3-none-any.whl (6.9 kB)
Installing collected packages: pyproject-toml
Successfully installed pyproject-toml-0.0.10
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Command-line Tools
https://fairseq.readthedocs.io/en/latest/command_line_tools.html

Fairseq provides several command-line tools for training and evaluating models:

- fairseq-preprocess: Data pre-processing: build vocabularies and binarize training data
- fairseq-train: Train a new model on one or multiple GPUs
- fairseq-generate: Translate pre-processed data with a trained model
- fairseq-interactive: Translate raw text with a trained model
- fairseq-score: BLEU scoring of generated translations against reference translations
- fairseq-eval-lm: Language model evaluation

### Preprocess

In [10]:
cd /content/drive/MyDrive/AllforOne/Lecture/Fairseq

/content/drive/MyDrive/AllforOne/Lecture/Fairseq


In [11]:
dir = "/content/drive/MyDrive/AllforOne/Lecture/Fairseq"

In [9]:
!fairseq-preprocess \
    --source-lang ko \
    --target-lang en \
    --validpref $dir/dataset/valid \
    --trainpref $dir/dataset/train \
    --destdir $dir/preprocess

/bin/bash: fairseq-preprocess: command not found
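
The `command not found` output above means the shell could not find the `fairseq-preprocess` executable when the cell ran. A small pre-flight check can surface this, along with missing input files, before the command is launched. The sketch below is an assumption-laden helper, not part of fairseq: the names `expected_inputs` and `preflight` are hypothetical, and it relies on `fairseq-preprocess` reading `<prefix>.<lang>` files, e.g. `dataset/train.ko` and `dataset/train.en` for `--trainpref dataset/train`.

```python
import shutil
from pathlib import Path

def expected_inputs(data_dir, src="ko", tgt="en", splits=("train", "valid")):
    """Return the raw text files implied by the split prefixes plus language suffixes."""
    return [Path(data_dir) / f"{split}.{lang}" for split in splits for lang in (src, tgt)]

def preflight(data_dir):
    """Collect readable problems instead of raising, so all issues show up at once."""
    problems = []
    if shutil.which("fairseq-preprocess") is None:
        problems.append("fairseq-preprocess is not on PATH (is fairseq installed in this runtime?)")
    for path in expected_inputs(data_dir):
        if not path.is_file():
            problems.append(f"missing input file: {path}")
    return problems

# Example with a placeholder directory; replace it with the real dataset folder.
for problem in preflight("dataset"):
    print(problem)
```

An empty result means the executable and all four input files were found, so the preprocess cell can be run.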


## Train

In [18]:
!pip install sacremoses

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### fairseq-train arguments
**Model**  
- --arch [model]   
- --optimizer [optimizer]   
- --max-epoch [force stop training at specified epoch]   
- --batch-size, --max-sentences [number of examples in a batch]   

**Set validation metric (e.g. BLEU)**   
- --scoring [scoring metric]   
- --best-checkpoint-metric [metric to use for saving “best” checkpoints]   

`--best-checkpoint-metric bleu` does not work on its own: BLEU must also be computed during validation, so add these arguments as well.
- --eval-bleu
- --eval-bleu-args
- --eval-bleu-detok moses
- --eval-bleu-remove-bpe
- --eval-bleu-print-samples

**Save checkpoint**
- --maximize-best-checkpoint-metric
  - select the largest metric value for saving “best” checkpoints
- --no-epoch-checkpoints
  - only store last and best checkpoints
- --continue-once [checkpoint_last.pt]
  - continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
- --save-dir [path to save checkpoints]
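
With this many flags, quoting mistakes are easy to make, especially in the JSON value passed to `--eval-bleu-args`. One option is to assemble the command in Python and let `shlex.join` handle the shell quoting. This is a sketch under assumptions: `build_train_command` is a hypothetical helper, the paths are placeholders, and it only uses the flags listed above.

```python
import json
import shlex

def build_train_command(data_dir, save_dir, eval_bleu_args):
    """Return the fairseq-train invocation as a list of arguments."""
    return [
        "fairseq-train", data_dir,
        "--source-lang", "ko",
        "--target-lang", "en",
        "--arch", "transformer",
        "--optimizer", "adam",
        "--max-epoch", "300",
        "--max-sentences", "100",
        "--scoring", "sacrebleu",
        "--eval-bleu",
        # json.dumps guarantees valid JSON; shlex.join quotes it for the shell later.
        "--eval-bleu-args", json.dumps(eval_bleu_args),
        "--eval-bleu-detok", "moses",
        "--eval-bleu-remove-bpe",
        "--eval-bleu-print-samples",
        "--best-checkpoint-metric", "bleu",
        "--maximize-best-checkpoint-metric",
        "--no-epoch-checkpoints",
        "--continue-once", "checkpoint_last.pt",
        "--save-dir", save_dir,
    ]

# Placeholder paths; substitute the preprocess and checkpoint directories in use.
cmd = build_train_command(
    "preprocess", "checkpoints", {"beam": 5, "max_len_a": 1.2, "max_len_b": 10}
)
print(shlex.join(cmd))
```

The printed line can be pasted after `!` in a notebook cell, or the list can be passed directly to `subprocess.run(cmd, check=True)` without going through a shell.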

In [None]:
# Run training.
!fairseq-train /content/drive/MyDrive/AllforOne/Lecture/Fairseq/preprocess \
--source-lang ko \
--target-lang en \
--arch transformer \
--optimizer adam \
--max-epoch 300 \
--max-sentences 100 \
--scoring sacrebleu \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--no-epoch-checkpoints \
--continue-once checkpoint_last.pt \
--save-dir $dir/checkpoints/

2023-03-24 03:58:58 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2023-03-24 03:59:03 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': Fal