<a href="https://colab.research.google.com/github/GuyRobot/AINotesBook/blob/main/translation/hf_bart_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune BART for Translation on WMT16 Dataset (and Train new Tokenizer)
This notebook is supplementary material for a story published on the NLPiation Medium blog. Follow [the link](https://medium.com/@nlpiation/fine-tune-bart-for-translation-on-wmt16-dataset-and-train-new-tokenizer-4d0fbdc4aa2e) for a detailed explanation of how to train a new tokenizer and use it in a translation task. The cells below apply the same recipe to the IWSLT 2015 English-Vietnamese dataset (`mt_eng_vietnamese`).

Run the cells below and experiment with them to get a firm grasp of the concepts. Training on a GPU and adjusting the fine-tuning hyperparameters should give better results.

In [1]:
!pip install -q transformers==4.26.1 datasets==2.10.1 evaluate rouge_score nltk

# Load Dataset

In [2]:
import datasets

In [3]:
dataset = datasets.load_dataset("mt_eng_vietnamese", "iwslt2015-en-vi", cache_dir="./mt_eng_vietnamese")

In [4]:
print(dataset['train'][0])

{'translation': {'en': 'Rachel Pike : The science behind a climate headline', 'vi': 'Khoa học đằng sau một tiêu đề về khí hậu'}}


In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 133318
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1269
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1269
    })
})


In [6]:
def flatten(batch):
    batch['en'] = batch['translation']['en']
    batch['vi'] = batch['translation']['vi']

    return batch
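
As a quick illustration of what `flatten` does to one row, here is a minimal sketch on a plain dictionary shaped like `dataset['train'][0]`. The `row` and `flat` names are only for this example. Note that the nested `translation` column is kept and two new top-level columns are added.

```python
# A single row shaped like dataset['train'][0] (plain dict, no datasets library needed).
row = {
    "translation": {
        "en": "Rachel Pike : The science behind a climate headline",
        "vi": "Khoa học đằng sau một tiêu đề về khí hậu",
    }
}

def flatten(batch):
    # Copy each language out of the nested dict into its own column.
    batch["en"] = batch["translation"]["en"]
    batch["vi"] = batch["translation"]["vi"]
    return batch

flat = flatten(dict(row))  # shallow copy so `row` keeps its original keys
print(sorted(flat.keys()))
```

Because the original columns remain in the dataset, the trainer later reports `en`, `translation`, and `vi` as ignored columns.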

In [7]:
train = dataset['train'].map( flatten )

In [8]:
print("en => ", train[0]['en'])
print("vi => ", train[0]['vi'])

en =>  Rachel Pike : The science behind a climate headline
vi =>  Khoa học đằng sau một tiêu đề về khí hậu


In [9]:
test = dataset['test'].map( flatten )
validation = dataset['validation'].map( flatten )

In [10]:
train.save_to_disk("./dataset/train")
test.save_to_disk("./dataset/test")
validation.save_to_disk("./dataset/validation")

# Create Tokenizer

In [11]:
from tokenizers import normalizers, pre_tokenizers, Tokenizer, models, trainers

In [12]:
# Build a tokenizer
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.normalizer = normalizers.Lowercase()
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
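
To get a feel for what the normalizer and pre-tokenizer do before BPE training, here is a rough pure-Python approximation. It assumes the `Whitespace` pre-tokenizer splits on the regex `\w+|[^\w\s]+`, as described in the Tokenizers documentation. Python's `re` module treats some Unicode characters (for example combining marks) differently from the library, so treat this as a sketch rather than an exact replica. The helper name `normalize_and_split` is hypothetical.

```python
import re

def normalize_and_split(text):
    # Lowercase (like normalizers.Lowercase), then split into runs of word
    # characters or runs of punctuation (approximating pre_tokenizers.Whitespace).
    return re.findall(r"\w+|[^\w\s]+", text.lower())

pieces = normalize_and_split("Rachel Pike : The science behind a climate headline")
print(pieces)
```

BPE merges are then learned inside each of these pieces, never across them.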

In [13]:
trainer = trainers.BpeTrainer(
    vocab_size=50265,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

In [14]:
def batch_iterator():
    batch_length = 1000
    for i in range(0, len(train), batch_length):
        yield train[i : i + batch_length]["vi"]

In [15]:
bpe_tokenizer.train_from_iterator( batch_iterator(), length=len(train), trainer=trainer )
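
Conceptually, BPE training repeatedly finds the most frequent adjacent symbol pair and merges it into a new symbol until the vocabulary reaches the requested size. The sketch below shows a single merge step on a toy corpus in plain Python. It is a simplified illustration, not the `tokenizers` implementation, and the helper names `count_pairs` and `merge_pair` are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters mapped to its frequency.
words = {tuple("low"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
pairs = count_pairs(words)
best = max(pairs, key=pairs.get)      # most frequent adjacent pair
words_after = merge_pair(words, best)
print(best, pairs[best])
```

Here the pair `('e', 'r')` occurs 9 times (6 in "newer" and 3 in "wider"), so it is merged first. A real trainer repeats this step many times and records each merge.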

In [16]:
bpe_tokenizer.save("./vi_tokenizer.json")

In [17]:
# To read the tokenizer later:
# from transformers import PreTrainedTokenizerFast
# tmp = PreTrainedTokenizerFast(tokenizer_file="./vi_tokenizer.json")

# Fine-Tuning

In [18]:
from transformers import BartForConditionalGeneration, AutoTokenizer, PreTrainedTokenizerFast
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_from_disk

## Load model

In [19]:
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

## Load Tokenizers

In [20]:
en_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
# Load the tokenizer trained above from its serialized JSON file.
vi_tokenizer = PreTrainedTokenizerFast(tokenizer_file="./vi_tokenizer.json")

In [21]:
vi_tokenizer.pad_token = en_tokenizer.pad_token

# Trainer

In [22]:
train = load_from_disk("./dataset/train")
test = load_from_disk("./dataset/test")
validation = load_from_disk("./dataset/validation")

In [23]:
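def tokenize_dataset(sample):
    model_inputs = en_tokenizer(sample['en'], padding='max_length', max_length=120, truncation=True)
    label = vi_tokenizer(sample['vi'], padding='max_length', max_length=120, truncation=True)

    # Pass only `labels`: BartForConditionalGeneration builds decoder_input_ids
    # by shifting the labels one position to the right. Reusing the unshifted
    # labels as decoder input would let the decoder see the token it must predict.
    # Padding positions are set to -100 so the loss ignores them.
    pad_id = vi_tokenizer.pad_token_id
    model_inputs["labels"] = [
        [tok if tok != pad_id else -100 for tok in ids] for ids in label["input_ids"]
    ]

    return model_inputs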
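
In teacher-forced sequence-to-sequence training, the decoder input is the target sequence shifted one position to the right, and padded label positions are commonly set to -100 so that the loss skips them. The sketch below shows that relationship on a toy sequence. The token ids, `PAD_ID`, and `START_ID` are assumed values chosen for illustration, and the helper names are hypothetical.

```python
PAD_ID = 1           # assumed pad token id, for illustration only
START_ID = 2         # assumed decoder start token id, for illustration only
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids):
    # Padded positions should not contribute to the loss.
    return [tok if tok != PAD_ID else IGNORE_INDEX for tok in label_ids]

def shift_right(label_ids):
    # The decoder sees the start token, then the target delayed by one step.
    shifted = [START_ID] + label_ids[:-1]
    # Ignored positions become ordinary padding on the input side.
    return [PAD_ID if tok == IGNORE_INDEX else tok for tok in shifted]

target = [10, 11, 12, PAD_ID, PAD_ID]
labels = mask_padding(target)
decoder_input_ids = shift_right(labels)
print(labels)
print(decoder_input_ids)
```

At each position the decoder input holds the previous target token, so the model has to predict the current label instead of copying it.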

In [24]:
# train = train.select(range(2000))
# test = test.select(range(100))
# validation = validation.select(range(100))

In [25]:
train_tokenized = train.map(tokenize_dataset, batched=True)
test_tokenized = test.map(tokenize_dataset, batched=True)
validation_tokenized = validation.map(tokenize_dataset, batched=True)

In [26]:
import nltk
import numpy as np
from evaluate import load

nltk.download('punkt')
metric = load("rouge")
bleu = load("bleu")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = vi_tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, vi_tokenizer.pad_token_id)
    decoded_labels = vi_tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    result["bleu"] = bleu.compute(predictions=decoded_preds, references=decoded_labels)['bleu']
    print("Predictions: ", decoded_preds)
    print("References: ", decoded_labels)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != vi_tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
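
BLEU is built on clipped n-gram precision: each n-gram in the prediction counts only as often as it appears in the reference. The following sketch computes that quantity for a single sentence pair in plain Python. It leaves out the brevity penalty and the geometric mean over n-gram orders, so it is not a substitute for the `evaluate` metric, and the helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

reference = "the cat is on the mat".split()

# Repeating a common word does not inflate the score: 2 of 7 unigrams are credited.
p1 = clipped_precision("the the the the the the the".split(), reference, 1)

# 3 of the 5 candidate bigrams also appear in the reference.
p2 = clipped_precision("the cat sat on the mat".split(), reference, 2)
print(p1, p2)
```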

In [27]:
# Set training arguments. These values are not tuned; feel free to change them.
import torch

batch_size = 8
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    # logging_steps=2,  # set to 1000 for full training
    # save_steps=64,  # set to 500 for full training
    # eval_steps=64,  # set to 8000 for full training
    # warmup_steps=1,  # set to 2000 for full training
    overwrite_output_dir=True,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=torch.cuda.is_available(),  # mixed precision requires a GPU
)
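
As a sanity check on these settings, the number of optimizer steps for one epoch can be estimated from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation (the defaults), and the variable names are only for illustration.

```python
import math

num_examples = 133318      # size of the train split
per_device_batch_size = 8  # batch_size above
num_devices = 1            # assumption: a single GPU
grad_accum_steps = 1       # assumption: default, no gradient accumulation
num_epochs = 1

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)  # last batch may be partial
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 16665
```

This agrees with the `Total optimization steps = 16665` line in the training log.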

In [28]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(en_tokenizer, model=model)

In [29]:
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=validation_tokenized,
    data_collator=data_collator,  # the DataCollatorForSeq2Seq created in the previous cell
    compute_metrics=compute_metrics,
    tokenizer=en_tokenizer,
)

Using cuda_amp half precision backend


In [None]:
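# Each call trains for `num_train_epochs` (1 here), so this loop makes ten passes
# over the data. Setting num_train_epochs=10 and calling trainer.train() once
# would instead use a single continuous learning-rate schedule.
for i in range(10):
    trainer.train()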

The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: en, translation, vi. If en, translation, vi are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 133318
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 16665
  Number of trainable parameters = 139420416
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
