## Fine-tuning du modèle DistilBart

Notebook executé dans Colab

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "sshleifer/distilbart-xsum-12-1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/443M [00:00<?, ?B/s]

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

In [9]:
import pandas as pd

dev = pd.read_table("dev_mms.tsv")
test = pd.read_table("test_mms.tsv")
train = pd.read_table("train_mms.tsv")

In [10]:
train

Unnamed: 0,id,date,headline,article,abstract
0,COOPOA0009,0,How to Make Perfect POACHED EGGS - Cooking Basics,Hey everyone Its Natasha of NatashasKitchenco...,Intro What pot to use for poached eggs Type of...
1,COOPOA0015,0,Poaching Techniques - Healthy Eating and AGA C...,Music hello Im penny and this is my auger the ...,seal the meat in the hot water sealing it in h...
2,COOPOA0006,0,How to Poach Eggs For Beginners | Food Network,Music egg poaching takes practice because you ...,Do you cover poached eggs
3,COOPOA0014,0,How to Make Poached Eggs,Today on The Stay At Home Chef Im showing you ...,whirlpool method skillet method outro
4,COOPOA0012,0,How to Make Perfect Poached Eggs - 3 Ways | Ja...,hi guys me and the future family together with...,bring your water to the boil put some vinegar ...
...,...,...,...,...,...
3565,SPOFOO0018,0,How To Do a Matthews in Soccer,Today we are learning how to do a matthews in ...,Small Push Touch Inside Touch Outside Take a B...
3566,SPOFOO0000,0,Greatest Moments In College Football History ᴴᴰ,Music Music Applause Music Applause three wide...,Running Play Running Play Touchdown Running Pl...
3567,SPOFOO0020,0,5 MOST BASIC SOCCER/FOOTBALL SKILLS for BEGINNERS,Music if youre just starting to play football ...,BEATING THE GOALKEEPER BASIC PASSING RECEIVING...
3568,SPOFOO0011,0,Soccer Formations Explained,Music whats up guys the snowman here and today...,Intro Formations Other formations


In [11]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(dev)

max_input_length = 512  # Longueur maximale du texte source
max_target_length = 128  # Longueur maximale du résumé cible

def preprocess_function(examples):
    inputs = tokenizer(examples["article"], max_length=max_input_length, truncation=True, padding="max_length")
    targets = tokenizer(examples["abstract"], max_length=max_target_length, truncation=True, padding="max_length")
    inputs["labels"] = targets["input_ids"]
    return inputs

In [12]:
train_data = train_dataset.map(preprocess_function, batched=True)
val_data = val_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/3570 [00:00<?, ? examples/s]

Map:   0%|          | 0/680 [00:00<?, ? examples/s]

In [13]:
train_data

Dataset({
    features: ['id', 'date', 'headline', 'article', 'abstract', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 3570
})

In [14]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)

In [15]:
import numpy as np

def compute_metrics(eval_pred):
    # Unpacks the evaluation predictions tuple into predictions and labels.
    predictions, labels = eval_pred

    # Decodes the tokenized predictions back to text, skipping any special tokens (e.g., padding tokens).
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replaces any -100 values in labels with the tokenizer's pad_token_id.
    # This is done because -100 is often used to ignore certain tokens when calculating the loss during training.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decodes the tokenized labels back to text, skipping any special tokens (e.g., padding tokens).
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Computes the ROUGE metric between the decoded predictions and decoded labels.
    # The use_stemmer parameter enables stemming, which reduces words to their root form before comparison.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Calculates the length of each prediction by counting the non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Computes the mean length of the predictions and adds it to the result dictionary under the key "gen_len".
    result["gen_len"] = np.mean(prediction_lens)

    # Rounds each value in the result dictionary to 4 decimal places for cleaner output, and returns the result.
    return {k: round(v, 4) for k, v in result.items()}

In [16]:
!pip install evaluate rouge_score

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=62335a4ce771409fd2eb534aa467f5482ed1f17e249da8373b50ebbc3043ac6e
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score, evaluate
Successfully installed evaluate-0.4.3 rouge_score-0.1.2


In [17]:
import evaluate

rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [18]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="my_fine_tuned_model",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    logging_dir=None,
    report_to="none",
)



In [19]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

  trainer = Seq2SeqTrainer(


In [20]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,1.639268,0.1732,0.0591,0.1563,0.1562,17.1279
2,2.183800,1.580124,0.1849,0.0694,0.169,0.1685,15.9191
3,1.481200,1.576834,0.1975,0.0802,0.179,0.1789,17.1294


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Tr

TrainOutput(global_step=1341, training_loss=1.7170231511217724, metrics={'train_runtime': 1011.8204, 'train_samples_per_second': 10.585, 'train_steps_per_second': 1.325, 'total_flos': 5525922612510720.0, 'train_loss': 1.7170231511217724, 'epoch': 3.0})

In [22]:
from huggingface_hub import login

login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_model')

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.json',
 './fine_tuned_model/merges.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [23]:
model_ft = AutoModelForSeq2SeqLM.from_pretrained('./fine_tuned_model')
tokenizer_ft = AutoTokenizer.from_pretrained('./fine_tuned_model')

repo_name = "claradlnv/distilbart-fine-tune"

model_ft.push_to_hub(repo_name)
tokenizer_ft.push_to_hub(repo_name)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/886M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/claradlnv/distilbart-fine-tune/commit/47c458b596a8b451ebb1861fa1a9e4c6212e98a5', commit_message='Upload tokenizer', commit_description='', oid='47c458b596a8b451ebb1861fa1a9e4c6212e98a5', pr_url=None, repo_url=RepoUrl('https://huggingface.co/claradlnv/distilbart-fine-tune', endpoint='https://huggingface.co', repo_type='model', repo_id='claradlnv/distilbart-fine-tune'), pr_revision=None, pr_num=None)

### TEST

In [24]:
model_test = AutoModelForSeq2SeqLM.from_pretrained("claradlnv/distilbart-fine-tune")
tokenizer_test = AutoTokenizer.from_pretrained("claradlnv/distilbart-fine-tune")

config.json:   0%|          | 0.00/1.66k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/886M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/299 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/957 [00:00<?, ?B/s]

In [30]:
text = test['article'][200]

In [31]:
text

'hello Im Susan Matthews I am slates news director and Id like to welcome you all to another social distancing social from future tense apprenticeship between slate new America and Arizona State University today were going to be talking about how the pandemic has upended the meat supply chain and I am joined by Henry revoir a staff writer at Slate and Chris Leonard as was the author of the meat racket the secret takeover of Americas food business and Copeland the secret history of the Koch Industries and corporate power in America Henry and Chris welcome thank you for being here thanks ah so I wanted to start by asking a sort of simple question which is that for me during the coronavirus I have started to realize that a lot of things that I thought were similar are not actually as similar so when I used to go into just my chain grocery store I was kind of like oh Im buying things that are not necessarily local or not necessarily the right thing to buy but there are all kind of the thin

In [45]:
inputs = tokenizer_test(text, return_tensors="pt", truncation=True, max_length=tokenizer_test.model_max_length).input_ids
inputs

tensor([[    0, 42891,  5902,  ...,    47,   548,     2]])

In [46]:
outputs = model_test.generate(inputs, max_new_tokens=150, do_sample=False)

In [48]:
pred = tokenizer_test.decode(outputs[0], skip_special_tokens=True)
pred

'What is the deal with the meat that we buy in the grocery store is a lot more likely than any other things are not the right thing to buy a grocery store'

In [37]:
test['abstract'][200]

'Would It Be Advisable To Push for Smaller More Regional Meat Processing Facilities Consolidating Ownership How Are the PlantBased Meat Substitutes Doing What Happens Next'