# 1. Preparation

## 1.1 Install pip and torch

In [None]:
import lmod
await lmod.purge(force=True)
await lmod.load('compiler/gnu/13.3')

In [None]:
import sys
# sys.executable should point to your virtualenv's Python interpreter.
# sys.path should include your virtualenv's site-packages directory.
print(sys.executable)
print(sys.path)
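
Reading `sys.executable` by eye works, but the same check can be made programmatic. The following is a minimal sketch, assuming the environment was created with `venv` or a recent `virtualenv`; the helper name `in_virtualenv` is introduced here for illustration only.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If this prints `False`, the kernel is not running from the virtualenv, and packages are only found through the `PYTHONPATH` and `PATH` adjustments made in the next cell.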

In [None]:
import os

os.environ["PYTHONPATH"] = "/pfs/data5/home/kit/stud/u____/myEnv/lib/python3.9/site-packages:" + os.environ.get("PYTHONPATH", "")
os.environ["PATH"] = "/pfs/data5/home/kit/stud/u____/myEnv/bin:" + os.environ["PATH"]

!which python
!which pip
!echo $PYTHONPATH

In [None]:
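# Pin pip to 24.0; newer pip releases may reject the metadata of some older fairseq dependencies.
!pip install pip==24.0
# Confirm that torch is already installed and show its version
!pip show torch | grep Version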

## 1.2 Install fairseq

In [None]:
!git clone https://github.com/facebookresearch/fairseq.git
%cd fairseq
!pip install --editable ./ 

Next, append the fairseq checkout to the existing `PYTHONPATH` variable so that the fairseq command-line tools launched from this notebook can import the package.

In [None]:
!echo $PYTHONPATH
os.environ['PYTHONPATH'] += ":/pfs/data5/home/kit/stud/u____/fairseq/"
!echo $PYTHONPATH
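
Re-running the cell above appends the same directory again each time. A minimal sketch of a duplicate-safe alternative follows; `add_to_path_var` is a hypothetical helper, not part of fairseq or the standard library, and the `/opt/...` paths are placeholders.

```python
import os


def add_to_path_var(name, directory, env=None, prepend=False):
    """Add `directory` to a path-list variable unless it is already listed."""
    env = os.environ if env is None else env
    # Split the current value and drop empty entries left by stray separators.
    parts = [p for p in env.get(name, "").split(os.pathsep) if p]
    if directory not in parts:
        if prepend:
            parts.insert(0, directory)
        else:
            parts.append(directory)
    env[name] = os.pathsep.join(parts)
    return env[name]


# Demonstration on a plain dict, so the real environment is left untouched.
demo_env = {"PYTHONPATH": "/opt/site-packages"}
add_to_path_var("PYTHONPATH", "/opt/fairseq", env=demo_env)
add_to_path_var("PYTHONPATH", "/opt/fairseq", env=demo_env)  # second call is a no-op
print(demo_env["PYTHONPATH"])
```

Note that changing `os.environ` affects processes started afterwards (such as `!` shell commands), but not the import path of the already running kernel, which is controlled by `sys.path`.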

## 1.3 Install other packages

In [None]:
!pip install sacremoses
!pip install sentencepiece
!pip install sacrebleu

## 1.4 Check GPU

In [None]:
import torch

print(torch.__version__)
if torch.cuda.is_available():
    # current_device() returns the index of the selected GPU
    device = torch.cuda.current_device()
    print('Current device:', torch.cuda.get_device_name(device))
else:
    device = 'cpu'
    print('Current device: CPU.')

All required packages are now installed.
From here on, simply execute the remaining cells in order.

# 2. Data Preparation

## 2.1 Download dataset

Here we again use the TED2020 dataset as an example.

In [None]:
!wget -O sample_data.zip https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download
!unzip sample_data.zip -d dataset

## 2.2 Preprocessing

Segment the text into subwords using BPE.

In [None]:
import sentencepiece as spm

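# Training writes two files to the working directory: bpe.model and bpe.vocab.
# model_type defaults to "unigram", so request BPE explicitly to match the text above.
spm.SentencePieceTrainer.train(input="dataset/sample_data/train.de-en.en,dataset/sample_data/train.de-en.de",
                               model_prefix="bpe",
                               model_type="bpe",
                               vocab_size=10000)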

print('Finished training sentencepiece model.')
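
To make the idea behind BPE concrete, here is a toy sketch of how merge rules are learned: the most frequent adjacent symbol pair is merged repeatedly. This is a simplified illustration in plain Python, not SentencePiece's actual implementation; the function name `learn_bpe` and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters; Counter keeps frequencies.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by comparing the pairs.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["low", "low", "lower", "lowest"], 2))
```

Frequent character sequences such as `low` end up as single symbols, while rare words stay split into smaller pieces; `vocab_size` in the real trainer plays a role comparable to the number of merges here.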

Then we use the trained segmentation model to preprocess the sentences from train/dev/test sets:

In [None]:
# Load the trained sentencepiece model
spm_model = spm.SentencePieceProcessor(model_file="bpe.model")

# Segment every split in both languages; always read and write as UTF-8.
for partition in ["train", "dev", "tst"]:
    for lang in ["de", "en"]:
        src_path = f"dataset/sample_data/{partition}.de-en.{lang}"
        out_path = f"dataset/sample_data/spm.{partition}.de-en.{lang}"
        with open(src_path, "r", encoding="utf-8") as f_in, \
             open(out_path, "w", encoding="utf-8") as f_out:
            for line in f_in:
                # Segment the sentence into subword pieces
                pieces = spm_model.encode(line.strip(), out_type=str)
                # Write the pieces as one space-separated line
                f_out.write(" ".join(pieces) + "\n")
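
Because the corpus is parallel, each segmented file should have exactly as many lines as its counterpart in the other language. The following is a small sanity-check sketch; the helper names `count_lines` and `same_line_count` are introduced here for illustration, and the demo runs on temporary files rather than the real dataset.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def same_line_count(paths):
    """Return (ok, counts): ok is True when all files have the same line count."""
    counts = {p: count_lines(p) for p in paths}
    return len(set(counts.values())) <= 1, counts


# Self-contained demo with two tiny temporary files.
with tempfile.TemporaryDirectory() as tmp:
    de_path = os.path.join(tmp, "demo.de")
    en_path = os.path.join(tmp, "demo.en")
    with open(de_path, "w", encoding="utf-8") as f:
        f.write("▁hallo\n▁welt\n")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("▁hello\n▁world\n")
    ok, counts = same_line_count([de_path, en_path])
    print("Line counts match:", ok)
```

In this notebook, the same check could be applied to pairs such as `dataset/sample_data/spm.train.de-en.de` and `dataset/sample_data/spm.train.de-en.en`.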

Now, we will binarize the data for training with fairseq.  
Feel free to check the [documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) of fairseq commands.

In [None]:
# mBART: https://github.com/facebookresearch/fairseq/blob/main/examples/mbart/README.md
# Download the model from the page linked above before using it.

# Directory that contains the segmented spm.* files written in the previous cell
TEXT="dataset/sample_data"
!echo $TEXT
# Binarize the data for training
!fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref $TEXT/spm.train.de-en \
    --validpref $TEXT/spm.dev.de-en \
    --testpref $TEXT/spm.tst.de-en \
    --destdir binarized_data/iwslt14.de-en \
    --joined-dictionary \
    --workers 8
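
A quick way to confirm that binarization succeeded is to look for the files it is expected to write. The sketch below assumes fairseq's usual output naming (`dict.<lang>.txt` plus `<split>.<src>-<tgt>.<lang>.bin` and `.idx`); the helper `missing_binarized_files` is hypothetical, and the list should be adjusted if your fairseq version names files differently.

```python
from pathlib import Path


def missing_binarized_files(destdir, src="de", tgt="en",
                            splits=("train", "valid", "test")):
    """Return the expected output files that are not present in `destdir`."""
    destdir = Path(destdir)
    # One dictionary per language (identical content with --joined-dictionary).
    expected = [f"dict.{src}.txt", f"dict.{tgt}.txt"]
    # One binary data file and one index file per split and language.
    for split in splits:
        for lang in (src, tgt):
            for ext in ("bin", "idx"):
                expected.append(f"{split}.{src}-{tgt}.{lang}.{ext}")
    return [name for name in expected if not (destdir / name).exists()]


missing = missing_binarized_files("binarized_data/iwslt14.de-en")
print("Missing files:", missing if missing else "none")
```

An empty result means all expected files are in place; any listed name points to a split that was not binarized, for example because of a wrong `--trainpref` path.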

Data preprocessing is now complete.

# 3. Training

In [None]:
# Pass the --adam-betas value in double quotes, "(0.9, 0.98)"; single quotes may cause an error.
# Use a small learning rate.
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
    binarized_data/iwslt14.de-en \
    --arch mbart_large --share-decoder-input-output-embed \
    --optimizer adam --adam-betas "(0.9, 0.98)" --clip-norm 1.0 \
    --lr 3e-5 --lr-scheduler inverse_sqrt --warmup-updates 1000 \
    --dropout 0.3 --weight-decay 0.01 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 5 \
    --max-tokens 4096 \
    --max-epoch 30 \
    --fp16
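
To see what `--lr 3e-5 --lr-scheduler inverse_sqrt --warmup-updates 1000` means in practice, the schedule can be sketched in a few lines: the learning rate rises linearly during warmup and then decays with the inverse square root of the update number. This is an approximation of the scheduler's behaviour, not fairseq's code; the function name and the warmup start value of 0 are assumptions.

```python
def inverse_sqrt_lr(step, peak_lr=3e-5, warmup_updates=1000, warmup_init_lr=0.0):
    """Approximate learning rate at update `step` (step >= 1)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_updates / step) ** 0.5


for step in (1, 500, 1000, 4000, 16000):
    print(f"update {step:>6}: lr = {inverse_sqrt_lr(step):.2e}")
```

The peak value is reached at update 1000, and the rate halves each time the update count quadruples, which is why a small `--lr` keeps the updates gentle throughout training.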

# 4. Decoding

Now we can generate translations with the trained model.

In [None]:
!fairseq-generate binarized_data/iwslt14.de-en \
      --task translation \
      --source-lang de \
      --target-lang en \
      --path checkpoints/checkpoint_best.pt \
      --batch-size 256 \
      --beam 4 \
      --remove-bpe=sentencepiece > "en-de.decode.log"

We extract the hypotheses and references from the decoding log file.

In [None]:
%%bash
grep ^H "en-de.decode.log" | sed 's/^H-//g' | cut -f 3 | sed 's/ ##//g' > ./hyp.txt
grep ^T "en-de.decode.log" | sed 's/^T-//g' | cut -f 2 | sed 's/ ##//g' > ./ref.txt
head ./hyp.txt
echo ""
head ./ref.txt
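
The same extraction can be done in Python, which makes the log format explicit: `H-<id>`, a score, and the hypothesis on hypothesis lines, and `T-<id>` plus the reference on target lines, all tab-separated. The sketch below assumes exactly that layout, as implied by the `grep`/`cut` commands above; `parse_generate_log` is a name chosen for this example.

```python
def parse_generate_log(lines):
    """Collect hypotheses and references from fairseq-generate output lines."""
    hyps, refs = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("H-"):
            # H-<id> <TAB> <score> <TAB> <hypothesis>
            sent_id, _score, text = line[2:].split("\t", 2)
            hyps[int(sent_id)] = text
        elif line.startswith("T-"):
            # T-<id> <TAB> <reference>
            sent_id, text = line[2:].split("\t", 1)
            refs[int(sent_id)] = text
    # Keep only sentences that have both parts, ordered by sentence id.
    ids = sorted(hyps.keys() & refs.keys())
    return [hyps[i] for i in ids], [refs[i] for i in ids]


# Hand-written sample lines in the assumed format (not real model output).
sample = [
    "S-1\tguten Morgen\n",
    "T-1\tgood morning\n",
    "H-1\t-0.25\tgood morning\n",
    "S-0\thallo\n",
    "T-0\thello\n",
    "H-0\t-0.10\thello\n",
]
hyp_lines, ref_lines = parse_generate_log(sample)
print(hyp_lines)
print(ref_lines)
```

Unlike the shell pipeline, which keeps the order of the log, this version sorts by sentence id, so the output follows the original order of the test set. For the real log, pass an open file handle, for example `parse_generate_log(open("en-de.decode.log", encoding="utf-8"))`.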

# 5. Evaluation

Here we use BLEU as an example metric.

In [None]:
!echo $PWD
!bash -c "cat hyp.txt | sacrebleu ref.txt"