📝 Translation Model Fine-Tuning with Hugging Face
This notebook demonstrates how to fine-tune a transformer-based sequence-to-sequence model for English-to-French translation using the Hugging Face ecosystem.
We use the opus_books dataset and train a model to generate French translations from English inputs. The process includes preprocessing, training with 🤗 Accelerate, and evaluation using the SacreBLEU metric.

✅ Key Details
Dataset: opus_books (en-fr)

Model: Pretrained MarianMT checkpoint (Helsinki-NLP/opus-mt-en-fr)

Training Tool: Hugging Face Accelerate

Metric: SacreBLEU (standard for translation quality)



In [3]:
# pip install -U datasets huggingface_hub fsspec

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("opus_books", "en-fr")

{'id': '0', 'translation': {'en': 'The Wanderer', 'fr': 'Le grand Meaulnes'}}

In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 127085
    })
})

#### Split the dataset into training and validation sets

In [5]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets["validation"] = split_datasets.pop("test")
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 114376
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 12709
    })
})
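
As a quick sanity check on the split sizes above, the arithmetic can be sketched as follows. This assumes the training size is the floor of `train_size * n` and that the remainder goes to validation. It reproduces the counts shown, but it is not a description of the library's internals.

```python
import math

n_examples = 127_085   # rows in the original train split
train_fraction = 0.9   # same value as train_size above

# Assumption: floor for the training part, remainder for validation.
n_train = math.floor(train_fraction * n_examples)
n_validation = n_examples - n_train

print(n_train, n_validation)
```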

In [6]:
split_datasets["train"][1]["translation"]

{'en': 'I thought Diana very provoking, and felt uncomfortably confused; and while I was thus thinking and feeling, St. John bent his head; his Greek face was brought to a level with mine, his eyes questioned my eyes piercingly--he kissed me.',
 'fr': 'Je trouvai Diana un peu hardie, et je me sentais confuse. Cependant Saint-John pencha sa tête, et sa belle figure grecque se trouva à la hauteur de la mienne; ses yeux perçants interrogeaient les miens.'}

### Import Tokenizer

In [7]:
from transformers import AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
# return_tensors is a call-time argument, so it is not passed to from_pretrained
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [10]:
# !pip install sacremoses

**Important:** the tokenizer must process the targets in the output language (here, French). Passing them through `text_target` takes care of this.

In [11]:
en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]

inputs = tokenizer(en_sentence, text_target=fr_sentence)
inputs

{'input_ids': [47, 2479, 44904, 420, 1057, 4994, 6608, 2, 10, 5283, 34, 673, 8608, 6655, 25864, 50, 10, 791, 47, 69, 2159, 6067, 10, 8361, 2, 1221, 3, 2211, 45, 313, 179, 2326, 50, 179, 7517, 941, 69, 2506, 12, 15, 518, 42, 4429, 2, 179, 5416, 22040, 240, 5416, 30680, 7512, 244, 21, 21, 2808, 19838, 124, 143, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [131, 34359, 263, 180, 17080, 34, 729, 2625, 894, 2, 11, 157, 143, 39980, 38008, 51, 3, 2373, 1262, 21, 23490, 7007, 6667, 146, 3028, 2, 11, 146, 5455, 2182, 16709, 95, 34359, 17, 8, 6073, 5, 8, 30934, 50, 163, 4219, 329, 25796, 9, 38398, 816, 16, 26882, 9, 3, 0]}
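
To see why the targets need their own treatment, here is a toy illustration. It assumes a tokenizer with separate source and target vocabularies. The vocabularies and the `toy_tokenize` helper are hypothetical and only mimic the shape of the output above.

```python
# Toy vocabularies, purely hypothetical; real tokenizers use subword models.
SOURCE_VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sleeps": 3}
TARGET_VOCAB = {"<unk>": 0, "le": 1, "chat": 2, "dort": 3}


def encode(text, vocab):
    """Map whitespace-separated words to ids, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def toy_tokenize(source, text_target=None):
    """Encode the source with the source vocab and the target with the target vocab."""
    encoded = {"input_ids": encode(source, SOURCE_VOCAB)}
    if text_target is not None:
        encoded["labels"] = encode(text_target, TARGET_VOCAB)
    return encoded


correct = toy_tokenize("The cat sleeps", text_target="Le chat dort")
# Encoding the French sentence as if it were source text loses every word.
wrong_labels = encode("Le chat dort", SOURCE_VOCAB)

print(correct)
print(wrong_labels)
```

In this toy setup the French sentence maps entirely to the unknown token when it is encoded with the source vocabulary, which is the kind of silent degradation that `text_target` is meant to avoid.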

In [13]:
max_length = 128  # cap on sequence length; the sentences in this dataset are fairly short


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

In [14]:
# Preprocess with the tokenizer
tokenized_dataset = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

### Import model

In [15]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

### Adding Data Collation
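A data collator is needed to pad each batch dynamically to the length of its longest sequence. `DataCollatorForSeq2Seq` also pads the labels with -100 so that padded positions are ignored by the loss.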

In [22]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
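
The idea of dynamic padding can be sketched in plain Python. This is a simplified illustration, not the collator's actual implementation. `pad_batch` and `PAD_TOKEN_ID` are hypothetical names and values, and the real collator does more, for example it can also build decoder inputs when given the model.

```python
PAD_TOKEN_ID = 9      # hypothetical pad id, for illustration only
LABEL_PAD_ID = -100   # label positions with this value are skipped by the loss


def pad_batch(features):
    """Pad every sequence to the longest one in this batch (not a global maximum)."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad_inputs = max_inputs - len(f["input_ids"])
        n_pad_labels = max_labels - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad_inputs)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad_inputs)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_pad_labels)
    return batch


features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [3, 4]},
    {"input_ids": [8], "attention_mask": [1], "labels": [1, 2, 3, 4]},
]
batch = pad_batch(features)
print(batch)
```

Because padding is computed per batch, short batches stay short, which saves computation compared with padding everything to `max_length`.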

### Metric
BLEU (Bilingual Evaluation Understudy) evaluates machine translation by measuring n-gram overlap, that is, how many words and phrases in the model's output also appear in the reference translation. Classic BLEU expects pre-tokenized text, so scores depend on the tokenizer and can be inconsistent across models and tools.

SacreBLEU is a standardized implementation of BLEU that applies its own tokenization, which makes scores easier to compare reliably across models and datasets.
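
The n-gram overlap at the heart of BLEU can be sketched as a clipped precision. This is only one ingredient: full BLEU combines the precisions for several n-gram orders with a brevity penalty at corpus level, and the sketch below assumes naive whitespace tokenization on a made-up sentence pair.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Naive whitespace tokenization; a simplifying assumption for this sketch.
candidate = "le chat est sur le tapis".split()
reference = "le chat dort sur le tapis".split()

p1 = clipped_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(p1, p2)
```

Clipping means a candidate cannot earn credit for repeating a word more often than it occurs in the reference.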

In [21]:
# !pip install sacrebleu evaluate


In [19]:
import evaluate

metric = evaluate.load("sacrebleu")

### Fine Tuning the model

In [23]:
from torch.utils.data import DataLoader

tokenized_dataset.set_format("torch")
train_dataloader = DataLoader(
    tokenized_dataset["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_dataset["validation"], collate_fn=data_collator, batch_size=8
)

In [24]:
# Set up the optimizer
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [26]:
# Setting Accelerator
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Next, we create a linear learning rate scheduler. It decreases the learning rate from its initial value to zero over the course of training, which helps the model converge more smoothly and avoids overshooting late in training.

In [40]:
from transformers import get_scheduler

num_train_epochs = 2
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
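
The shape of the linear schedule can be written out by hand. The sketch below uses a hypothetical `linear_lr` helper and assumes a single process, so the number of steps per epoch is simply the number of training examples divided by the batch size, rounded up.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Values taken from this notebook, assuming a single process.
steps_per_epoch = math.ceil(114_376 / 8)  # training examples / batch size
total_steps = 2 * steps_per_epoch         # two epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 2e-5, 0, total_steps))
```

With no warmup, the rate starts at the full 2e-5, reaches half of that at the midpoint, and ends at zero on the last step.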

In [41]:
output_dir = "opus_books-en-to-fr-accelerate"

### Training loop
We are now ready to write the full training loop. To simplify the evaluation step, we first define a `postprocess()` function that takes predictions and labels and converts them into the lists of strings the metric object expects:

In [42]:
import numpy as np  # needed for np.where below


def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
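
The two non-obvious steps in `postprocess()` can be seen in isolation on toy data. The pad id and the strings below are hypothetical placeholders, not real tokenizer values or model outputs.

```python
import numpy as np

PAD_TOKEN_ID = 9  # hypothetical pad id, for illustration only

labels = np.array([
    [12, 7, 3, -100, -100],
    [5, 3, -100, -100, -100],
])

# Keep real token ids; swap the -100 padding for an id the tokenizer can decode.
decodable = np.where(labels != -100, labels, PAD_TOKEN_ID)
print(decodable)

# Placeholder strings standing in for decoded text.
decoded_preds = ["Bonjour le monde ", " Merci"]
decoded_refs = ["Bonjour tout le monde", "Merci "]

# Predictions are a flat list of strings; each reference is wrapped in its own
# list because an example may have several valid reference translations.
predictions = [p.strip() for p in decoded_preds]
references = [[r.strip()] for r in decoded_refs]
print(predictions, references)
```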

In [43]:
from tqdm.auto import tqdm
import torch
import numpy as np


progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)


epoch 0, BLEU score: 28.93




epoch 1, BLEU score: 30.52


### Check the Result

In [2]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "opus_books-en-to-fr-accelerate"
translator = pipeline("translation", model=model_checkpoint)
translator("My name is Tamal")