<a href="https://www.kaggle.com/code/aisuko/translation-nlp?scriptVersionId=164642674" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Translation converts a sequence of text from one language to another. It is one of several tasks we can formulate as a sequence-to-sequence problem, a powerful framework for producing an output sequence from an input sequence, as in translation or summarization. Translation systems are most commonly used to translate text between languages, but they can also be used for speech, or for combinations of the two such as text-to-speech or speech-to-text. We are going to fine-tune a pretrained model, `t5-small`, on a translation dataset, OPUS Books.

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install sacrebleu==2.3.3

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune model t5-small"
os.environ["WANDB_NAME"] = "ft-t5-small-with-opusbook"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Loading OPUS Books dataset

Here we start by loading the first 500 examples of the English-French subset of the OPUS Books dataset.

In [3]:
from datasets import load_dataset

books=load_dataset("opus_books", "en-fr", split="train[:500]")
books

Dataset({
    features: ['id', 'translation'],
    num_rows: 500
})

In [4]:
books=books.train_test_split(test_size=0.2)
books["train"]

Dataset({
    features: ['id', 'translation'],
    num_rows: 400
})

# Preprocess

Here the preprocess function needs to:

* Prefix the input with a prompt, so the model knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.

* Tokenize the input and target separately (the target is passed through the `text_target` argument) because we can't tokenize target French text with a tokenizer pretrained on an English vocabulary.

* Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.
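
The three steps can be sketched without downloading a model. The snippet below is a minimal illustration and not the notebook's pipeline: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the T5 tokenizer, `toy_preprocess` is a hypothetical helper, and the two sentence pairs are made up for the example.

```python
# Hypothetical stand-in for a real tokenizer: split on whitespace, then truncate.
def toy_tokenize(text, max_length):
    return text.split()[:max_length]

prefix = "translate English to French: "

# Made-up batch with the same shape as the "translation" column of OPUS Books.
batch = {
    "translation": [
        {"en": "The cat sleeps.", "fr": "Le chat dort."},
        {"en": "I like reading old books very much.",
         "fr": "J'aime beaucoup lire de vieux livres."},
    ]
}

def toy_preprocess(examples, max_length=8):
    # 1. Prefix every source sentence with the task prompt.
    inputs = [prefix + pair["en"] for pair in examples["translation"]]
    # 2. Keep the targets separate from the inputs.
    targets = [pair["fr"] for pair in examples["translation"]]
    # 3. Truncate both sides to at most max_length tokens.
    return {
        "input_tokens": [toy_tokenize(text, max_length) for text in inputs],
        "label_tokens": [toy_tokenize(text, max_length) for text in targets],
    }

result = toy_preprocess(batch)
for tokens in result["input_tokens"]:
    print(len(tokens), tokens)
```

The first input has 7 whitespace tokens and is left alone; the second has 11 and is cut to 8. A real subword tokenizer would produce different counts, but the prefixing and truncation logic is the same.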

In [5]:
from transformers import AutoTokenizer

model_name="t5-small"
tokenizer=AutoTokenizer.from_pretrained(model_name)

In [6]:
source_lang="en"
target_lang="fr"
prefix="translate English to French: "

def preprocess_function(examples):
    inputs =[prefix+example[source_lang] for example in examples["translation"]]
    targets=[example[target_lang] for example in examples["translation"]]
    model_inputs=tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs


tokenized_books=books.map(preprocess_function, batched=True)

We use `DataCollatorForSeq2Seq` to dynamically pad the sentences to the longest length in each batch during collation, instead of padding the whole dataset to the maximum length.
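
The idea behind dynamic padding can be sketched in plain Python. This is only an illustration of the concept, not the collator's implementation: `pad_batch` is a hypothetical helper and the token ids are made up. The pad id `0` matches `pad_token_id` in the T5 config, and `-100` is the label value that the loss ignores.

```python
# Pad a batch of token-id lists to the longest sequence in that batch only.
def pad_batch(sequences, pad_value):
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # label padding value, skipped by the loss

# Two made-up batches with different maximum lengths.
batch_a = [[13, 7, 1], [13, 7, 42, 9, 1]]
batch_b = [[5, 1], [8, 3, 1]]

padded_a = pad_batch(batch_a, PAD_TOKEN_ID)
padded_b = pad_batch(batch_b, PAD_TOKEN_ID)
padded_labels = pad_batch([[21, 1], [21, 30, 31, 1]], LABEL_PAD_ID)

print(padded_a)       # every row has length 5
print(padded_b)       # every row has length 3
print(padded_labels)  # short rows are filled with -100
```

Each batch is only as wide as its own longest sequence, so short batches waste no computation on padding.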

In [7]:
from transformers import DataCollatorForSeq2Seq

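# Note: the `model` argument expects the model instance; a checkpoint name string is accepted but not used to prepare decoder inputs.
data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)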
print(data_collator)

DataCollatorForSeq2Seq(tokenizer=T5TokenizerFast(name_or_path='t5-small', vocab_size=32100, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id

# Evaluate

Here we use the SacreBLEU metric, which provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.


## BLEU score

The field of machine translation faces an under-recognized problem: scores from its dominant metric, BLEU, are reported inconsistently. BLEU is in fact a parameterized metric whose values can vary wildly with changes to its parameters. These parameters are often not reported or are hard to find, and consequently BLEU scores from different papers cannot be directly compared.
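
To make the parameter sensitivity concrete, the sketch below computes clipped unigram precision, one building block of BLEU, under two tokenization schemes. It is not a full BLEU implementation and not what SacreBLEU runs internally; the function names and the sentence pair are made up for illustration.

```python
import re
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n):
    # Clipped n-gram precision: each hypothesis n-gram counts at most
    # as often as it appears in the reference.
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp.values())
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total if total else 0.0

def whitespace_tokens(text):
    return text.split()

def punct_tokens(text):
    # Split punctuation off from words.
    return re.findall(r"\w+|[^\w\s]", text)

hypothesis = "Le chat dort, enfin."
reference = "Le chat dort enfin."

p_whitespace = ngram_precision(whitespace_tokens(hypothesis), whitespace_tokens(reference), 1)
p_punct = ngram_precision(punct_tokens(hypothesis), punct_tokens(reference), 1)

print(p_whitespace)  # 0.75
print(p_punct)       # 5/6, about 0.83
```

The same hypothesis scores 0.75 or about 0.83 depending only on how the text is tokenized, which is why two BLEU numbers are comparable only when such choices are fixed and reported.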

Here we create a function that passes our predictions and labels to `metric.compute` to calculate the SacreBLEU score:

In [8]:
import numpy as np
import evaluate


def postprocess_text(preds, labels):
    preds=[pred.strip() for pred in preds]
    labels=[[label.strip()] for label in labels]
    
    return preds, labels


def compute_metrics(eval_preds):
    preds, labels=eval_preds
    if isinstance(preds, tuple):
        preds=preds[0]
    
    decoded_preds=tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    labels=np.where(labels !=-100, labels, tokenizer.pad_token_id)
    decoded_labels=tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    decoded_preds,decoded_labels=postprocess_text(decoded_preds, decoded_labels)
    
    result=metric.compute(predictions=decoded_preds, references=decoded_labels)
    
    result={"bleu":result["score"]}
    
    prediction_lens=[np.count_nonzero(pred!=tokenizer.pad_token_id) for pred in preds]
    result["gen_len"]=np.mean(prediction_lens)
    result={k: round(v,4) for k, v in result.items()}
    return result


metric=evaluate.load("sacrebleu")
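
Two details of `compute_metrics` are easier to see in isolation: the `-100` label padding must be replaced before decoding, because it is not a valid vocabulary id, and each reference is wrapped in its own list. The sketch below uses a hypothetical `toy_vocab` and `toy_decode` in place of `tokenizer.batch_decode`, and a list comprehension in place of `np.where`; the token ids are made up.

```python
IGNORE_INDEX = -100   # value used to pad labels so the loss skips them
PAD_TOKEN_ID = 0      # T5's pad token id

# Hypothetical id-to-text table; pad (0) and eos (1) decode to nothing,
# mimicking skip_special_tokens=True.
toy_vocab = {0: "", 1: "", 10: "Bonjour", 11: "le", 12: "monde"}

def toy_decode(ids):
    return " ".join(toy_vocab[i] for i in ids if toy_vocab[i])

labels = [
    [10, 11, 12, 1, -100, -100],
    [10, 1, -100, -100, -100, -100],
]

# Same idea as np.where(labels != -100, labels, tokenizer.pad_token_id).
cleaned = [[tok if tok != IGNORE_INDEX else PAD_TOKEN_ID for tok in row] for row in labels]
decoded = [toy_decode(row) for row in cleaned]

# The metric accepts several references per prediction,
# so every label becomes a one-element list.
references = [[text.strip()] for text in decoded]
print(references)  # [['Bonjour le monde'], ['Bonjour']]
```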

# Training

In [9]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model=AutoModelForSeq2SeqLM.from_pretrained(model_name)
print(model.config)

T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
     

In [10]:
training_args=Seq2SeqTrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)

trainer=Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

wandb: Currently logged in as: urakiny (causal_language_trainer). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.3 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /kaggle/working/wandb/run-20240228_055730-n9nr7821
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft-t5-small-with-opusbook
wandb: ⭐️ View project at https://wandb.ai/causal_language_trainer/Fine-tune-models
wandb: 🚀 View run at https://wandb.ai/causal_language_trainer/Fine-tune-models/runs/n9nr7821
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to t

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | No log | 1.800537 | 4.5307 | 17.53 |
| 2 | No log | 1.783673 | 4.7189 | 17.58 |
| 3 | No log | 1.773319 | 4.6952 | 17.56 |
| 4 | No log | 1.767541 | 4.7149 | 17.56 |
| 5 | No log | 1.765586 | 4.7012 | 17.56 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=65, training_loss=2.0582505446213943, metrics={'train_runtime': 69.7218, 'train_samples_per_second': 28.685, 'train_steps_per_second': 0.932, 'total_flos': 52597207597056.0, 'train_loss': 2.0582505446213943, 'epoch': 5.0})
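
The numbers in `TrainOutput` can be checked by hand. The sketch below assumes the run used two devices; that is an inference from `global_step=65` and is not stated anywhere in the notebook. The other values come from the dataset split, the training arguments, and the logged runtime.

```python
import math

train_examples = 400          # rows in books["train"]
per_device_batch_size = 16    # per_device_train_batch_size
num_epochs = 5                # num_train_epochs
num_devices = 2               # ASSUMPTION: inferred from the step count, not shown in the notebook

# Steps with the assumed two devices (effective batch size 32).
effective_batch = per_device_batch_size * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 65, matching global_step

# For comparison: a single device would need more steps.
single_device_steps = math.ceil(train_examples / per_device_batch_size) * num_epochs
print(single_device_steps)  # 125

# Throughput from the logged runtime.
train_runtime = 69.7218
samples_per_second = train_examples * num_epochs / train_runtime
print(round(samples_per_second, 3))  # 28.685
```

The Training Loss column reads "No log" most likely because the whole run takes 65 steps, fewer than the `Trainer` default logging interval of 500 steps.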

In [11]:
import math

eval_results=trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")



Perplexity: 5.84
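
Perplexity is the exponential of the average per-token cross-entropy loss. As a small worked example with made-up probabilities (not model output), suppose the model assigns probabilities 0.5, 0.25 and 0.125 to the correct token at three positions:

```python
import math

# Hypothetical probabilities assigned to the correct token at three positions.
token_probs = [0.5, 0.25, 0.125]

# Cross-entropy loss: mean negative log-likelihood over the tokens.
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(loss)
print(round(perplexity, 4))  # 4.0

# The same formula applied to the epoch-5 validation loss from the training table.
reported = math.exp(1.765586)
print(f"{reported:.2f}")
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens at each position.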


In [12]:
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))

'https://huggingface.co/aisuko/ft-t5-small-with-opusbook/tree/main/'

# Inference

In [13]:
from transformers import pipeline

text="translate English to French: Legumes share resources with nitrogen-fixing bacteria."
translator=pipeline("translation",model=os.getenv("WANDB_NAME"))
translator(text)



[{'translation_text': 'Legumes partagent les ressources avec les bactéries fixatrices de azote.'}]

# Inference with PyTorch

In [14]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(os.getenv("WANDB_NAME"))
inputs=tokenizer(text, return_tensors="pt").input_ids

In [15]:
from transformers import AutoModelForSeq2SeqLM

model=AutoModelForSeq2SeqLM.from_pretrained(os.getenv("WANDB_NAME"))
outputs=model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)

In [16]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

"Ces lègumes partagent leurs ressources avec des bactéries fixatrices d'azote."