## Fine-tuning the DistilBART model on a generated dataset

Notebook run on Kaggle.

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "sshleifer/distilbart-xsum-12-1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)


In [3]:
!pip install datasets



In [5]:
import pandas as pd

data = pd.read_csv("/kaggle/input/ytb-summaries-gen/new_ref_data_final.csv")

In [6]:
data

Unnamed: 0,article,summary
0,oh my God whoa holy [ __ ] [Music] holy [ __ ]...,In this enthusiastic reaction to the first epi...
1,She-Hulk Peter Pan and Wendy the rings of powe...,The video critiques modern female protagonists...
2,this is jinx pretty evil right well what if I ...,"The video reexamines Jinx from Arcane, arguing..."
3,imagine having to give up everything you ever ...,"The relationship between Ekko and Jinx, known ..."
4,ever since arcane came out there was one quest...,Arcane is not officially considered part of th...
...,...,...
1188,"- [Narrator] Fortress of Kustrin, Brandenburg,...",Frederick the Great is one of the most dynamic...
1189,- [Narrator] King Louie and his family were no...,King Louie and his family were now in the Tuil...
1190,- This video was made possible by Honey. Keep ...,The video was made possible by Honey. Keep wat...
1191,- [Deeply Voice] This video was made possible ...,Russian Tsars had no time for pathetic ideas l...


In [47]:
# Split into train, validation, and test sets (60/20/20)
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2
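
The two-stage split above yields roughly 60/20/20 proportions. Here is a minimal sketch of the arithmetic. It assumes the dataset has 1193 rows (the total implied by the `Map` progress counts of 715, 239, and 239 shown further down) and that `train_test_split` rounds a fractional `test_size` up to a whole number of rows. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 1193  # assumption: 715 + 239 + 239 from the Map output

# First split: hold out 20% as the test set.
n_trainval, n_test = split_sizes(n_total, 0.2)

# Second split: 25% of the remaining 80% becomes the validation set.
n_train, n_val = split_sizes(n_trainval, 0.25)

print(n_train, n_val, n_test)  # 715 239 239
```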

In [54]:
val_df.loc[673].name

673

In [57]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

max_input_length = 512  # Maximum length of the source text, in tokens
max_target_length = 128  # Maximum length of the target summary, in tokens

def preprocess_function(examples):
    articles = [str(a).strip("'\"") for a in examples["article"]]
    summaries = [str(s).strip("'\"") for s in examples["summary"]]

    inputs = tokenizer(
        articles,
        max_length=max_input_length, 
        truncation=True, 
        padding="max_length"
    )
    targets = tokenizer(
        summaries, 
        max_length=max_target_length, 
        truncation=True, 
        padding="max_length"
    )
    inputs["labels"] = targets["input_ids"]
    return inputs
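
Because both tokenizer calls pad to `max_length`, the `labels` built above contain padding token ids, and those positions count toward the loss. A common adjustment, shown here only as a sketch and not something this notebook does, is to replace padding ids in the labels with -100, which PyTorch's cross-entropy loss ignores by default. The helper `mask_label_padding` is hypothetical, and the pad id of 1 is an assumption about the BART vocabulary; in practice one would pass `tokenizer.pad_token_id`.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_label_padding(batch_label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Return a copy of the label ids with every padding id replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in ids]
        for ids in batch_label_ids
    ]

PAD_ID = 1  # assumption: stands in for tokenizer.pad_token_id

# Toy label sequences, already padded to a common length.
padded = [[0, 713, 16, 2, 1, 1], [0, 100, 2, 1, 1, 1]]
masked = mask_label_padding(padded, PAD_ID)
print(masked)
```

Inside `preprocess_function`, this could be applied as `inputs["labels"] = mask_label_padding(targets["input_ids"], tokenizer.pad_token_id)`. The `np.where(labels != -100, ...)` line in `compute_metrics` already maps -100 back to the pad id before decoding.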

In [58]:
train_data = train_dataset.map(preprocess_function, batched=True)
val_data = val_dataset.map(preprocess_function, batched=True)
test_data = test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/715 [00:00<?, ? examples/s]

Map:   0%|          | 0/239 [00:00<?, ? examples/s]

Map:   0%|          | 0/239 [00:00<?, ? examples/s]

In [61]:
train_data[0]

{'article': "this video is sponsored by squarespace we're back because of course we are you know when you make a video is just sort of a one-off joke but then that video out of nowhere becomes one of the most popular things you've ever done and memes start happening and your comment section fills with demands for more videos just like it i'm not mad about it i'm not ungrateful i just i don't get it plenty of the videos on this channel they take weeks and months of intense research writing editing shooting visual effects all to put together and then there's this series the study of atmospheres and overall weather on planets other than earth is called [Music] exometerology people with banana allergies have an increased risk for latex allergies because they both share similar proteins [Music] the third word in the third chapter of the third harry potter book is several the king of hearts is the only king in a deck of cards without a mustache highway gothic is the name of the font develope

In [62]:
from transformers import DataCollatorForSeq2Seq

# Pass the model object (not its name) so the collator can build decoder_input_ids from the labels.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [63]:
import numpy as np

def compute_metrics(eval_pred):
    # Unpacks the evaluation predictions tuple into predictions and labels.
    predictions, labels = eval_pred

    # Decodes the tokenized predictions back to text, skipping any special tokens (e.g., padding tokens).
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replaces any -100 values in labels with the tokenizer's pad_token_id.
    # This is done because -100 is often used to ignore certain tokens when calculating the loss during training.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decodes the tokenized labels back to text, skipping any special tokens (e.g., padding tokens).
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Computes the ROUGE metric between the decoded predictions and decoded labels.
    # The use_stemmer parameter enables stemming, which reduces words to their root form before comparison.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Calculates the length of each prediction by counting the non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Computes the mean length of the predictions and adds it to the result dictionary under the key "gen_len".
    result["gen_len"] = np.mean(prediction_lens)

    # Rounds each value in the result dictionary to 4 decimal places for cleaner output, and returns the result.
    return {k: round(v, 4) for k, v in result.items()}
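
For intuition about what the ROUGE-1 score returned by `compute_metrics` measures, here is a simplified sketch of a unigram-overlap F-measure. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly. `rouge1_f` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure based on clipped unigram counts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat", "the cat sat on the mat")
print(round(score, 4))  # 0.6667 (precision 1.0, recall 0.5)
```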

In [64]:
!pip install evaluate rouge_score

Successfully installed evaluate-0.4.3 rouge_score-0.1.2


In [65]:
import evaluate

rouge = evaluate.load("rouge")


In [66]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="/kaggle/working/my_fine_tuned_model",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    logging_dir=None,
    report_to="none",
)



In [67]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [68]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,1.340251,0.3914,0.2553,0.3196,0.3193,45.113
2,No log,1.18617,0.432,0.2973,0.3576,0.3575,49.2929
3,1.550300,1.133298,0.4708,0.338,0.3979,0.3975,52.6946
4,1.550300,1.107224,0.4938,0.3584,0.4178,0.4176,54.3975
5,1.550300,1.154282,0.4993,0.3765,0.433,0.4328,52.4561
6,0.543400,1.153011,0.5173,0.3905,0.4465,0.4466,55.59
7,0.543400,1.172014,0.5278,0.4018,0.4565,0.4563,56.2385
8,0.543400,1.206222,0.5204,0.3947,0.4531,0.4519,56.6234
9,0.315100,1.242708,0.5206,0.3928,0.45,0.4489,57.4561
10,0.315100,1.267136,0.524,0.3952,0.4526,0.452,56.841
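
Reading the table above: validation loss is lowest at epoch 4 and rises afterwards while training loss keeps falling, whereas ROUGE-1 peaks at epoch 7. A small sketch that picks those epochs out of the logged values (numbers copied from the table; the variable names are illustrative):

```python
# (epoch, validation_loss, rouge1) copied from the training log above
history = [
    (1, 1.340251, 0.3914),
    (2, 1.18617, 0.432),
    (3, 1.133298, 0.4708),
    (4, 1.107224, 0.4938),
    (5, 1.154282, 0.4993),
    (6, 1.153011, 0.5173),
    (7, 1.172014, 0.5278),
    (8, 1.206222, 0.5204),
    (9, 1.242708, 0.5206),
    (10, 1.267136, 0.524),
]

# Epoch with the lowest validation loss.
best_loss_epoch = min(history, key=lambda row: row[1])[0]

# Epoch with the highest ROUGE-1 score.
best_rouge_epoch = max(history, key=lambda row: row[2])[0]

print(best_loss_epoch, best_rouge_epoch)  # 4 7
```

Since the training arguments do not set `load_best_model_at_end`, the model saved afterwards is the one from the final epoch. Selecting a checkpoint by either criterion would be a design choice, not something this notebook does.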


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

TrainOutput(global_step=1790, training_loss=0.7118995964860116, metrics={'train_runtime': 1418.2003, 'train_samples_per_second': 5.042, 'train_steps_per_second': 1.262, 'total_flos': 3689107999948800.0, 'train_loss': 0.7118995964860116, 'epoch': 10.0})
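
The `global_step=1790` in the output above can be checked against the configuration. Here is a sketch of the arithmetic, assuming a single device and no gradient accumulation, with the training set size of 715 taken from the `Map` output:

```python
import math

n_train = 715          # training examples, from the Map output
batch_size = 4         # per_device_train_batch_size
epochs = 10            # num_train_epochs
train_runtime = 1418.2003  # seconds, from TrainOutput

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = n_train * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)  # 179 1790
print(round(samples_per_second, 3), round(steps_per_second, 3))  # 5.042 1.262
```

With the default `logging_steps` of 500, the first training loss is logged at step 500, during the third epoch, which is why the first two rows of the table show `No log`.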

In [69]:
from huggingface_hub import login

login()


In [70]:
model.save_pretrained('/kaggle/working/fine_tuned_model')
tokenizer.save_pretrained('/kaggle/working/fine_tuned_model')

('/kaggle/working/fine_tuned_model/tokenizer_config.json',
 '/kaggle/working/fine_tuned_model/special_tokens_map.json',
 '/kaggle/working/fine_tuned_model/vocab.json',
 '/kaggle/working/fine_tuned_model/merges.txt',
 '/kaggle/working/fine_tuned_model/added_tokens.json',
 '/kaggle/working/fine_tuned_model/tokenizer.json')

In [72]:
model_ft = AutoModelForSeq2SeqLM.from_pretrained('/kaggle/working/fine_tuned_model')
tokenizer_ft = AutoTokenizer.from_pretrained('/kaggle/working/fine_tuned_model')

repo_name = "claradlnv/fine-tuned-distilbart2"

# Push the copies reloaded from disk (same weights as the in-memory `model`).
model_ft.push_to_hub(repo_name)
tokenizer_ft.push_to_hub(repo_name)


CommitInfo(commit_url='https://huggingface.co/claradlnv/fine-tuned-distilbart2/commit/bc4b7c52308204544782b18ff33ea091c5b595ca', commit_message='Upload tokenizer', commit_description='', oid='bc4b7c52308204544782b18ff33ea091c5b595ca', pr_url=None, repo_url=RepoUrl('https://huggingface.co/claradlnv/fine-tuned-distilbart2', endpoint='https://huggingface.co', repo_type='model', repo_id='claradlnv/fine-tuned-distilbart2'), pr_revision=None, pr_num=None)

### TEST

In [74]:
model_test = AutoModelForSeq2SeqLM.from_pretrained(repo_name)
tokenizer_test = AutoTokenizer.from_pretrained(repo_name)

model_test = model_test.to(device)

In [83]:
from tqdm.notebook import tqdm

def generate_summaries(dataset, model, tokenizer, max_length=128, num_beams=4):
    summaries = []
    with tqdm(total=len(dataset), desc="Generating Summaries") as pbar:
        for example in dataset:
            input_ids = tokenizer(
                example["article"], return_tensors="pt", truncation=True, padding=True, max_length=512
            ).input_ids.to(device)
            summary_ids = model.generate(
                input_ids, max_length=max_length, num_beams=num_beams, early_stopping=True
            )
            decoded_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
            summaries.append(decoded_summary)
            pbar.update(1)
    return summaries
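
`generate_summaries` processes one article per `generate` call. If throughput matters, the articles could instead be grouped into batches before tokenization. The chunking step can be sketched as follows; `chunked` is a hypothetical helper and the batch size of 8 is an arbitrary assumption.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

# Placeholder strings standing in for the 239 test articles.
articles = [f"article {i}" for i in range(239)]
batches = list(chunked(articles, 8))

print(len(batches), len(batches[-1]))  # 30 7
```

Each batch would then go through the tokenizer with `padding=True` and `truncation=True`, and the resulting `attention_mask` would be passed to `model.generate` alongside `input_ids` so that padded positions are ignored.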

In [84]:
test_summaries = generate_summaries(test_dataset, model_test, tokenizer_test)

Generating Summaries:   0%|          | 0/239 [00:00<?, ?it/s]

In [86]:
!pip install evaluate rouge_score



In [88]:
from evaluate import load
rouge = load("rouge")

# Collect the reference summaries
test_references = [example["summary"] for example in test_dataset]

# Compute the ROUGE scores
results = rouge.compute(predictions=test_summaries, references=test_references, use_stemmer=True)

# Print the scores
for key, value in results.items():
    print(f"{key}: {value:.4f}")

rouge1: 0.5479
rouge2: 0.4266
rougeL: 0.4833
rougeLsum: 0.4827


In [90]:
from transformers import pipeline

summarizer = pipeline("summarization", model=repo_name, tokenizer=repo_name)

Device set to use cuda:0


In [96]:
text_to_summarize = input("Text to summarize: ")

summary = summarizer(text_to_summarize, max_length=128, min_length=30, num_beams=4, early_stopping=True)
print("Generated summary:", summary[0]['summary_text'])

Text to summarize: In the vast expanse of the universe, humans are but a tiny speck. Yet, the story of human existence is filled with remarkable achievements and extraordinary struggles. From the moment early humans first discovered fire, to the creation of advanced technology, to our exploration of space, humanity has continuously pushed the boundaries of what is possible. The rise of civilizations, the development of language, art, and science, and the ability to build empires and connect across vast distances, all speak to the resilience and creativity of the human spirit. However, this journey has not been without challenges. Wars, natural disasters, and conflicts have shaped history, and many struggles remain as we continue to face issues like climate change, inequality, and the quest for peace. Despite these obstacles, humanity’s ability to adapt and evolve has allowed us to survive and thrive in a constantly changing world.


Generated summary: Humans are but a tiny speck. Yet, the story of human existence is filled with remarkable achievements and extraordinary struggles. Despite these obstacles, humanity’s ability to adapt and evolve has allowed us to survive and thrive in a constantly changing world.
