# Fine-tuning a FLAN-T5 model for English → French translation

This notebook shows how to fine-tune the Hugging Face `flan-t5-small` model on a **translation** task using an instruction format (instruction fine-tuning).

## Objectives:
- Learn how to format a dataset for instruction fine-tuning
- Understand how a sequence-to-sequence model works
- Translate from English to French with a T5 model
- Use a subset of 1000 examples to keep training fast



## Setup Environment

### Installing the required libraries

We use:
- `transformers` to load models and tokenizers,
- `datasets` to load datasets,
- `accelerate` to make efficient use of the GPU.

The install command also pulls in `peft` and `bitsandbytes`, which the cells in this notebook do not use.




In [1]:
!pip install -q transformers datasets accelerate peft bitsandbytes


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.0/67.0 MB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m37.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m30.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m23.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## Load Dataset
>We load the **WMT14** English-French dataset from Hugging Face, then select the **first 1000 examples** to keep training fast.


In [2]:
!pip install fsspec==2023.5.0
!pip install --upgrade datasets huggingface_hub



In [3]:
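import os
from huggingface_hub import login

# Never hard-code an access token in a notebook. Read it from the HF_TOKEN
# environment variable; if it is unset, login() prompts for the token instead.
login(token=os.environ.get("HF_TOKEN"))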


In [5]:
from datasets import load_dataset

# This is a large and clean dataset for EN->FR translation
dataset = load_dataset("wmt14", "fr-en")




The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split: 40836715 examples
Generating validation split: 3000 examples
Generating test split: 3003 examples

In [16]:
small_train = dataset["train"].select(range(1000))
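
Selecting 1000 rows this way still downloads all 30 training shards first. As a possible alternative, the sketch below streams the dataset and keeps only the first rows of the stream. It assumes the `streaming=True` mode of `load_dataset` works for this dataset in your environment; the helper name `first_n_examples` is hypothetical and not part of the original notebook.

```python
def first_n_examples(n: int = 1000):
    """Stream WMT14 fr-en and materialise only the first n training rows."""
    # Imported lazily so that defining the helper does not require the library.
    from datasets import Dataset, load_dataset

    # With streaming=True, rows are read on the fly instead of downloading every shard.
    stream = load_dataset("wmt14", "fr-en", split="train", streaming=True)
    # take(n) yields the first n rows; from_list turns them into a regular Dataset.
    return Dataset.from_list(list(stream.take(n)))

# Requires network access:
# small_train = first_n_examples(1000)
```

The result is a regular in-memory `Dataset`, so later calls such as `.map(...)` should behave as they do on the `select`-based subset.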

## Step 3: Formatting the data

We convert each example into an instruction of the form:

- **Input**: "Translate to French: I love cats."
- **Output**: "J'aime les chats."

This follows the format used by FLAN-style models.
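
To make the template concrete, here is a minimal sketch that applies it to two hand-written records. It is written in batched form, where the `translation` field arrives as a list of `{"en": ..., "fr": ...}` dicts, which is how `Dataset.map(..., batched=True)` passes rows. The names `PROMPT_PREFIX` and `format_instruction_batch` are illustrative assumptions.

```python
from typing import Dict, List

PROMPT_PREFIX = "Translate to French: "

def format_instruction_batch(batch: Dict[str, List[dict]]) -> Dict[str, List[str]]:
    """Build instruction-style inputs and target outputs for a batch of records."""
    pairs = batch["translation"]
    return {
        "input": [PROMPT_PREFIX + pair["en"] for pair in pairs],
        "output": [pair["fr"] for pair in pairs],
    }

# Toy batch mirroring the structure of WMT14 records.
toy_batch = {
    "translation": [
        {"en": "I love cats.", "fr": "J'aime les chats."},
        {"en": "Resumption of the session", "fr": "Reprise de la session"},
    ]
}

formatted = format_instruction_batch(toy_batch)
print(formatted["input"][0])   # Translate to French: I love cats.
print(formatted["output"][0])  # J'aime les chats.
```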


In [None]:
def format_instruction(example):
    return {
        "input": f"Translate to French: {example['translation']['en']}",
        "output": example['translation']['fr']
    }

train_data = small_train.map(format_instruction)



In [None]:
train_data

Dataset({
    features: ['translation', 'input', 'output'],
    num_rows: 1000
})

## Load Tokenizer and Model and Tokenize the Dataset
We use `flan-t5-small`, a pre-trained model designed to follow textual instructions.

The tokenizer converts text into numeric token ids, and back again.



In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)



We convert the input texts (instructions) and output texts (expected answers) into sequences of token ids (`input_ids`, `labels`), padding and truncating them to a fixed length so they line up within batches.
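
The following is a minimal, library-free sketch of what padding, truncation, and label masking do to a sequence of token ids. It is not the tokenizer's actual implementation: the helper names are hypothetical, the target ids `[7, 8, 9, 1]` are made up, and the simple slice used for truncation does not re-append the end-of-sequence token the way a real tokenizer does. It assumes T5's padding id of 0 and the `-100` label value that PyTorch's cross-entropy loss ignores by default.

```python
from typing import List

PAD_ID = 0           # T5's padding token id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = PAD_ID) -> List[int]:
    """Cut the sequence to max_length, then right-pad it with pad_id."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def attention_mask(ids: List[int], pad_id: int = PAD_ID) -> List[int]:
    """1 for real tokens, 0 for padding positions."""
    return [0 if tok == pad_id else 1 for tok in ids]

def mask_label_padding(labels: List[int], pad_id: int = PAD_ID) -> List[int]:
    """Replace padding in the labels so those positions do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in labels]

# Token ids of "Translate to French: Resumption of the session" (13 ids, ending in EOS = 1).
source_ids = [30355, 15, 12, 2379, 10, 419, 4078, 102, 1575, 13, 8, 2363, 1]
padded = pad_or_truncate(source_ids, max_length=16)
mask = attention_mask(padded)

# Made-up target ids, padded to length 6 and then masked.
labels = mask_label_padding(pad_or_truncate([7, 8, 9, 1], max_length=6))

print(padded[-4:])  # [1, 0, 0, 0]
print(sum(mask))    # 13
print(labels)       # [7, 8, 9, 1, -100, -100]
```

Masking the label padding matters because padding ids left in `labels` are otherwise treated as tokens the model must learn to predict.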


In [None]:
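def preprocess(example):
    # Tokenize the instruction prompts (encoder inputs).
    model_inputs = tokenizer(example["input"], max_length=128, truncation=True, padding="max_length")
    # Tokenize the French targets (decoder labels).
    labels = tokenizer(text_target=example["output"], max_length=128, truncation=True, padding="max_length")
    # Replace padding ids in the labels with -100 so the loss ignores those positions.
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs

tokenized_train = train_data.map(preprocess, batched=True)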



In [None]:
for i in range(5):
  print(tokenized_train[i])

{'translation': {'en': 'Resumption of the session', 'fr': 'Reprise de la session'}, 'input': 'Translate to French: Resumption of the session', 'output': 'Reprise de la session', 'input_ids': [30355, 15, 12, 2379, 10, 419, 4078, 102, 1575, 13, 8, 2363, 1, 0, 0, 0, ...], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...], ...

## Define Training Arguments and Trainer

We use `Seq2SeqTrainingArguments` to configure training:

- `output_dir`: where checkpoints are saved
- `per_device_train_batch_size`: batch size per device
- `learning_rate`: learning rate
- `num_train_epochs`: number of passes over the training data
- `fp16`: enables mixed-precision training, which is faster and uses less GPU memory
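
As a rough worked example, the arithmetic below estimates how many optimizer steps this notebook's configuration produces (1000 examples, batch size 8, 3 epochs) and at which steps a checkpoint is written with `save_steps=200`. It assumes a single device, no gradient accumulation, and that the last partial batch of an epoch is kept; the variable names are illustrative.

```python
import math

num_examples = 1000
per_device_train_batch_size = 8
num_train_epochs = 3
save_steps = 200

# Assumptions: one device, gradient_accumulation_steps = 1, last partial batch kept.
steps_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a step-based checkpoint would be written.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch)   # 125
print(total_steps)       # 375
print(checkpoint_steps)  # [200]
```

Under these assumptions only one intermediate checkpoint lands in `output_dir`, so a smaller `save_steps` may be worth considering if more frequent checkpoints are wanted.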


In [None]:
import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-small-fr-translate",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_dir="./logs",
    save_steps=200,
    fp16=torch.cuda.is_available(),  # mixed precision only when a CUDA GPU is present
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    tokenizer=tokenizer,  # recent transformers releases name this argument `processing_class`
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)




### Training

We launch fine-tuning on our 1000-example dataset. This can take a few minutes depending on the available GPU resources.


In [None]:
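# Requires the previous cell (which defines `trainer`) to have been run in this session;
# otherwise this call raises the NameError shown below.
trainer.train()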


NameError: name 'trainer' is not defined

## Test the model

We test the fine-tuned model by giving it new sentences to translate. The instruction is built dynamically and the model generates the French translation.


In [None]:
def translate(text):
    prompt = f"Translate to French: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translate("Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful."))


Bien entendu, il s'est perçue que le dreaded 'millennium' a l'objet de l'a l'intention de produire des droux
