<a href="https://colab.research.google.com/github/vincentmartin/tp-fine-tuning-student-version/blob/main/tp-fine-tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TP fine-tuning de LLM

Dans ce notebook vous allez fine tuner un LLM de base, Flan T5, avec la technique PEFT et LoRA.

### Instruction à suivre pour exécution sur Google Colab

Aller dans `Execution -> Modifier le type d'exécution` puis sélectionner `T4-GPU` pour exploiter les fonctionnalités GPU.

![Colab GPU](resources/colab_gpu.png "T4-GPU")

Installationd des dépendances

In [1]:
%pip install -U datasets

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch \
    torchdata --quiet

%pip install \
    transformers \
    evaluate \
    rouge_score \
    loralib \
    peft \
    bitsandbytes

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m511.6/511.6 kB[0m [31m13.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (47.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.7/47.7 MB[0m [31m17.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pyarrow, datasets
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 18.1.0
    Uninstalling pyarrow-18.1.0:
      Successfully uninstalled pyarrow-18.1.0
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
Successfully installed datasets-4.4.1 pya

Import des dépendances

In [1]:
from datasets import load_dataset
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, BitsAndBytesConfig
import torch
import time
import evaluate
import pandas as pd
import numpy as np
import os
import bitsandbytes
os.environ["WANDB_DISABLED"] = "true"

Chargement du LLM de base.

In [2]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`torch_dtype` is deprecated! Use `dtype` instead!


Création d'une fonction pour afficher le nombre de paramètres entraînables.

In [3]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


## Fine tuning avec PEFT et LoRA

Le fine tuning complet d'un modèle n'est pas un choix judicieux pour un particulier ou une entreprise qui n'a pas une énorme puissance de calcul. La méthode la plus appropriée est d'utiliser PEFT (_Parameter Efficient Fine-Tuning_).

PEFT est un ensemble de technique qui incluant LORA (_Low Rank Adaptation_) et le _prompt tuning_ (**différent du prompt engineering**). LORA permet de fine tuner un modèle avec peu de ressources matérielles (un ou deux GPU). LORA permet de créer des adapteurs composés de 1-10% des paramètres du LLM original. De plus, le LLM original n'est pas modifié, ce qui permet de rapidement changer d'adapteurs en fonction du cas d'usage.

### Configuration de PEFT / LoRA

Premièrement, configurons PEFT/LoRA pour fine tuner notre modèle de base avec ce que l'on appelle _adapteur_.

PEFT/LoRA gêle les couches du LLM original pour n'entraîner que l'adapteur.

In [13]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank : plus il est grand, plus il y a de paramètres. Idéal : 16-32
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # Pour FLANT5, laisser ce type
)

Ajouter l'adapteur au LLM original.

In [14]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%




## Lancement de l'entraînement

Chargeons le jeu de données pour l'entraînement.

In [15]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Pour que l'entraînement prenne un temps acceptable dans ce notebook, nous diminuons la taille du jeu de données.

In [16]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

**Exercice**  : en vous aidant de la documentation https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/trainer#transformers.TrainingArguments, créer une instance de **Trainer** pour entraîner le LLM. Vous utiliserez les paramètres suivants :
- auto_find_batch_size=True,
- learning_rate=1e-3,
- num_train_epochs=5,
- logging_steps=1,
- max_steps=1   

Le jeu de données à utiliser pour l'entraînement est `tokenized_datasets["train"]`.

**Dans Google Colab, utiliser `report_to=None` sinon il vous sera demandé une clef Wanadb.**

In [17]:
output_dir = './training-output'

peft_training_args = TrainingArguments(
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,
    logging_steps=1,
    max_steps=20,
    output_dir=output_dir,
    report_to=None,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


**Exercice** : Lancer l'entraînement et sauvegarder le modèle (adapteur)  et le tokenizer dans le dossier `training-output-checkpoint`.

In [18]:
# Start the training
peft_trainer.train()

# Save the PEFT model and the tokenizer
save_dir = './training-output-checkpoint'
peft_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

Step,Training Loss
1,48.0
2,46.5
3,42.5
4,39.0
5,34.25
6,30.25
7,28.25
8,27.25
9,25.375
10,24.0


('./training-output-checkpoint/tokenizer_config.json',
 './training-output-checkpoint/special_tokens_map.json',
 './training-output-checkpoint/spiece.model',
 './training-output-checkpoint/added_tokens.json',
 './training-output-checkpoint/tokenizer.json')

### Evaluation du modèle fine tuné

Une erreur classique lorsque l'on début est d'évaluer les performances en 'regardant' quelques générations manuellement. C'est une mauvaise idée car (1) ce n'est pas quantifié et (2) ce qui fonctionne sur quelques exemples ne fonctionne peut être pas sur des milliers d'exemples (principe de généralisation).

Lorsque l'on fine tune un modèle, il est donc capital de mesurer les performances pour savoir si **globalement** les résultats sont meilleurs.

In [19]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       'training-output-checkpoint',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)


In [20]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

device = "cuda" if torch.cuda.is_available() else "cpu"


prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

original_model_outputs = original_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)


peft_model_outputs = peft_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'RESUME HUMAIN:\n{human_baseline_summary}')
print(dash_line)
print(f'RESUME AVEC MODELE ORIGINAL:\n{original_model_text_output}')
print(dash_line)
print(dash_line)
print(f'RESUME AVEC MODELE PEFT: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
RESUME HUMAIN:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
RESUME AVEC MODELE ORIGINAL:
You might also want to upgrade your hardware.
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
RESUME AVEC MODELE PEFT: Upgrade your computer.


Inférence sur 10 exemples du jeu de test.

In [None]:
# Calculate ROUGE scores
rouge = evaluate.load('rouge')

original_model_rouge_scores = rouge.compute(predictions=original_model_summaries, references=human_baseline_summaries)
peft_model_rouge_scores = rouge.compute(predictions=peft_model_summaries, references=human_baseline_summaries)
print("Original Model ROUGE scores:", original_model_rouge_scores)
print("PEFT Model ROUGE scores:", peft_model_rouge_scores)

In [21]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])

In [22]:
for e in zipped_summaries:
    print("\n")
    print(dash_line)
    print(f'RESUME HUMAIN:\n{e[0]}')
    print(dash_line)
    print(f'RESUME AVEC MODELE ORIGINAL:\n{e[1]}')
    print(dash_line)
    print(f'RESUME AVEC MODELE PEFT:\n{e[2]}')



---------------------------------------------------------------------------------------------------
RESUME HUMAIN:
Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.
---------------------------------------------------------------------------------------------------
RESUME AVEC MODELE ORIGINAL:
The memo will be distributed to all employees by this afternoon.
---------------------------------------------------------------------------------------------------
RESUME AVEC MODELE PEFT:
This memo is to be distributed to all employees by this afternoon.


---------------------------------------------------------------------------------------------------
RESUME HUMAIN:
In order to prevent employees from wasting time on Instant Message programs, #Person1# decides to terminate the use of those programs and asks Ms. Dawson to send out a memo to all employees by the afternoon.
--------

**Exercice** : en utilisant la documentation https://huggingface.co/docs/evaluate/main/en/choosing_a_metric, calculer le score ROUGE entre :
- Les résumés du modèle original  (`original_model_summaries`)  vs. résumés humain (`human_baseline_summaries`).
- Les résumés du modèle peft  (`peft_model_summaries`) vs. résumé humain (`human_baseline_summaries`).

Afficher les scores et commentez les.

In [26]:
# Calculate ROUGE scores
rouge = evaluate.load('rouge')

original_model_rouge_scores = rouge.compute(predictions=original_model_summaries, references=human_baseline_summaries)
peft_model_rouge_scores = rouge.compute(predictions=peft_model_summaries, references=human_baseline_summaries)
print("Original Model ROUGE scores:", original_model_rouge_scores)
print("PEFT Model ROUGE scores:", peft_model_rouge_scores)

Original Model ROUGE scores: {'rouge1': np.float64(0.1610763018657756), 'rouge2': np.float64(0.07463565891472869), 'rougeL': np.float64(0.1479215131846711), 'rougeLsum': np.float64(0.15302808302808302)}
PEFT Model ROUGE scores: {'rouge1': np.float64(0.23087972212972213), 'rouge2': np.float64(0.08590250329380764), 'rougeL': np.float64(0.19755969955969954), 'rougeLsum': np.float64(0.20024055574055571)}


In [27]:
# Calculate performance improvement
original_rouge1 = original_model_rouge_scores['rouge1']
peft_rouge1 = peft_model_rouge_scores['rouge1']
improvement = peft_rouge1 - original_rouge1
print(f"Improvement in ROUGE-1 score after PEFT fine-tuning: {improvement:.4f}")

original_rouge_l = original_model_rouge_scores['rougeL']
peft_rouge_l = peft_model_rouge_scores['rougeL']
improvement_l = peft_rouge_l - original_rouge_l
print(f"Improvement in ROUGE-L score after PEFT fine-tuning: {improvement_l:.4f}")

Improvement in ROUGE-1 score after PEFT fine-tuning: 0.0698
Improvement in ROUGE-L score after PEFT fine-tuning: 0.0496


**Exercice** : calculer le gain de performance en pourcentage du modèle PEFT sur le modèle original

## Fine tuning de Llama 3 ou Qwen 3 1.7B

Le modèle `flan-t5-base`que nous avons utilisé jusqu'à maintenant est bien pour comprendre les principes mais c'est un modèle ancien aux performances dépassées par rapport aux modèles récents tels que Llama 3.

Dans cet exercice, vous allez charger puis fine tuner un LLM bien plus performant tout en conservant une taille acceptable de 3B de paramètres : Llama 3.2 - 3B. Nous pouvons aussi tester avec Qwen 3 1.7B (https://huggingface.co/Qwen/Qwen3-1.7B).

Afin que le modèle puisse être chargé en VRAM, nous utiliserons une version quantisée en 4bits : https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-bnb-4bit. L'utilisation de la bibliothèque `bitsandbytes`est alors indispensable.

**Redémarrer la session à ce stade pour réinitialiser la RAM et la VRAM**

### Conseils pour réaliser l'exercice :

- Le modèle n'est plus de type _Encoder Decoder_ (Seq2Seq) mais _Decoder only_ (CausalLM). Effectuer les modifications en conséquence
- Réduire la taille du jeu de données d'entraînement pour rester dans des temps acceptables (100 exemples)
- Modifier les arguments d'entraînement (`TrainingArguments`) pour prendre accélérer le traitement : considérer les paramètres `per_device_train_batch_size`, `gradient_accumulation_steps`, `gradient_chekpointing`.

L'exercice peut prendre un certain temps, faites votre maximum et avancer pas à pas.

In [30]:
from peft import LoraConfig, get_peft_model, TaskType, PeftModel, prepare_model_for_kbit_training

model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Prepare the model for k-bit training to fix gradient issues with quantized models
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

peft_model = get_peft_model(model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)



trainable model parameters: 9175040
all model parameters: 1812638720
percentage of trainable model parameters: 0.51%


In [31]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    full_text = [p + summary + tokenizer.eos_token for p, summary in zip(prompt, example["summary"])]
    tokenized = tokenizer(full_text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    example['input_ids'] = tokenized.input_ids
    example['attention_mask'] = tokenized.attention_mask
    example['labels'] = tokenized.input_ids.clone()
    return example

In [32]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])

# Reduce dataset size
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

output_dir = './llama-training-output'
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    learning_rate=1e-3,
    logging_steps=1,
    max_steps=30,
    save_steps=10,
    gradient_checkpointing=True,
    report_to=None,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
# Train
trainer.train()

# Save
save_dir = './llama-training-output-checkpoint'
peft_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
1,6.8642
2,6.6279
3,4.5978
4,5.5671
5,5.9509
6,6.4508


In [149]:
# Evaluation
# Load saved model for inference
base_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
peft_model_loaded = PeftModel.from_pretrained(base_model, save_dir, is_trainable=False)

# Example inference
index = 200
dialogue = dataset['test'][index]['dialogue']
prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

outputs = peft_model_loaded.generate(input_ids=input_ids, max_new_tokens=200, num_beams=1)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)

# Calculate ROUGE scores for evaluation
rouge = evaluate.load('rouge')
index = 250
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']
prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
print(dash_line)

rouge = evaluate.load("rouge")
reference = dataset['test'][index]['summary']

generated_summary = summary.split("Summary:")[-1].strip()

results = rouge.compute(
    predictions=[generated_summary],
    references=[reference]
)

print("ROUGE-1:", results["rouge1"])
print("ROUGE-2:", results["rouge2"])
print("ROUGE-L:", results["rougeL"])




Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:  #Person1# suggests upgrading #Person2#'s computer system.
---------------------------------------------------------------------------------------------------
ROUGE-1: 0.1379310344827