# Generative Summarizer with T5

In this notebook we fine-tune a generative model (T5-small) to produce summaries from articles in the CNN/DailyMail dataset.
> **Note**: Training on the full dataset is expensive (GPU recommended). By default, we use a subset to validate the pipeline.


## 1. Environment setup


```bash
pip install -q transformers datasets accelerate sentencepiece rouge-score
```


```bash
accelerate config
```


In [1]:
import torch
torch.backends.mps.is_available()

True

In [25]:
import os
import glob
import re
import random
import zipfile
from typing import Dict

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import torch
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments
)
from rouge_score import rouge_scorer

import nltk
from nltk.tokenize import sent_tokenize

# Check the transformers version
import transformers
print(f" transformers version: {transformers.__version__}")

# Select the device up front (MPS, then CUDA, then CPU)
def get_device():
    if torch.backends.mps.is_available():
        return torch.device("mps")
    elif torch.cuda.is_available():
        return torch.device("cuda")
    else:
        return torch.device("cpu")

device = get_device()
print(f" Using device: {device}")



 transformers version: 4.57.1
 Using device: mps


In [26]:
# General configuration
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

BASE_DIR = os.getcwd()
DATA_ZIP = os.path.join(BASE_DIR, "dataset.zip")
DATA_DIR = os.path.join(BASE_DIR, "dataset")
CNN_DIR = os.path.join(DATA_DIR, "cnn", "stories")

# Limits for quick experimentation (adjust to your GPU resources)
MAX_ARTICLES = 5000           # number of articles loaded for cleaning
MAX_SAMPLES = 15000           # maximum number of articles used for T5 training
MAX_TRAIN_SAMPLES = 12000
MAX_VAL_SAMPLES = 2000
MAX_TEST_SAMPLES = 1000

MODEL_NAME = "t5-small"       # or "t5-base"
OUTPUT_DIR = os.path.join(BASE_DIR, "models", MODEL_NAME.replace("/", "_"))
os.makedirs(OUTPUT_DIR, exist_ok=True)



In [27]:
# Extract the dataset if needed
if not os.path.exists(DATA_DIR):
    assert os.path.exists(DATA_ZIP), "dataset.zip not found. Please place it in the project folder."
    with zipfile.ZipFile(DATA_ZIP, "r") as zip_ref:
        zip_ref.extractall(DATA_DIR)
    print(" Dataset extracted to", DATA_DIR)
else:
    print(" Dataset already present in", DATA_DIR)



 Dataset already present in /Users/nouhailaenihe/Desktop/Projet_NLP/dataset


In [28]:
# Download the required NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)



True

In [29]:
# Utility functions

def parse_story_file(fp: str):
    """Parse a .story file and return (article, summary)."""
    with open(fp, "r", encoding="utf-8", errors="ignore") as f:
        lines = [l.strip() for l in f.readlines()]
    
    article_lines, highlights, in_highlight = [], [], False
    for l in lines:
        if l.lower() == "@highlight":
            in_highlight = True
            continue
        if in_highlight:
            if l:
                highlights.append(l)
        else:
            if l:
                article_lines.append(l)
    
    article = re.sub(r"\s+", " ", " ".join(article_lines)).strip()
    summary = re.sub(r"\s+", " ", " ".join(highlights)).strip()
    return article, summary


def clean_text(text: str) -> str:
    text = re.sub(r"^\(CNN\)\s*-*\s*", "", text)
    text = re.sub(r"\(CNN\)\s*", "", text)
    text = re.sub(r"--", " ", text)
    text = re.sub(r"\s+", " ", text)
    text = text.replace("’", "'").replace("“", '"').replace("”", '"')
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    return text.strip()
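
To make the effect of `clean_text` concrete, here is a small self-contained sketch that copies the same steps and applies them to two made-up strings (the inputs are hypothetical, not taken from the dataset).

```python
import re

def clean_text(text: str) -> str:
    # Same steps as the notebook's clean_text, copied so the example runs on its own
    text = re.sub(r"^\(CNN\)\s*-*\s*", "", text)   # leading "(CNN) --" marker
    text = re.sub(r"\(CNN\)\s*", "", text)          # any remaining "(CNN)" marker
    text = re.sub(r"--", " ", text)                 # double hyphens
    text = re.sub(r"\s+", " ", text)                # collapse whitespace
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    text = re.sub(r"[^\x00-\x7F]+", " ", text)      # drop remaining non-ASCII characters
    return text.strip()

# Hypothetical inputs for illustration only
print(clean_text("(CNN) -- Police said \u201cno comment\u201d"))
print(repr(clean_text("caf\u00e9 \u2014 bar")))
```

The second call shows a side effect worth knowing about: whitespace is collapsed before non-ASCII characters are replaced by spaces, so runs of several spaces can remain in the cleaned text.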



In [30]:
# Load the articles
files = glob.glob(os.path.join(CNN_DIR, "*.story"))
print(f" {len(files)} .story files available")

if MAX_ARTICLES is not None:
    files = random.sample(files, min(MAX_ARTICLES, len(files)))
    print(f" Subsampling to {len(files)} files for cleaning")

rows = []
for fp in tqdm(files, desc="Reading files"):
    article, summary = parse_story_file(fp)
    if article and summary:
        rows.append({"article": article, "summary": summary})

df = pd.DataFrame(rows)
print("Articles loaded:", len(df))



 92579 .story files available
 Subsampling to 5000 files for cleaning


Reading files: 100%|██████████| 5000/5000 [00:01<00:00, 3461.50it/s]

Articles loaded: 4993
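
Fewer articles are kept than files are read because a file with an empty article or without any `@highlight` block is skipped. The sketch below illustrates the `.story` layout on a made-up example; `parse_story` is a simplified, string-based stand-in for `parse_story_file`, and the sample text is hypothetical.

```python
def parse_story(text: str):
    """Split the text of a .story file into (article, summary)."""
    article_lines, highlights, in_highlight = [], [], False
    for line in (raw.strip() for raw in text.splitlines()):
        if line.lower() == "@highlight":
            in_highlight = True      # everything after the first marker is a highlight
            continue
        if not line:
            continue                 # skip blank lines
        (highlights if in_highlight else article_lines).append(line)
    return " ".join(article_lines), " ".join(highlights)

# Hypothetical file content, for illustration only
sample = """First paragraph.

Second paragraph.

@highlight

First key point

@highlight

Second key point
"""

article, summary = parse_story(sample)
print(article)
print(summary)

# A file without highlights yields an empty summary and would be dropped
print(parse_story("Only body text.")[1] == "")
```

Highlights are joined with single spaces and typically carry no final punctuation, so reference summaries read as run-on text; generated summaries tend to imitate that style.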





In [31]:
# Same cleaning as in the extractive pipeline
print(" Cleaning texts...")
df["article"] = df["article"].apply(clean_text)
df["summary"] = df["summary"].apply(clean_text)

# Remove duplicates
before = len(df)
df = df.drop_duplicates(subset=["article", "summary"]).reset_index(drop=True)
print(f" Duplicates removed: {before - len(df)}")

# Compute lengths (in characters) for diagnostics
df["len_article"] = df["article"].str.len()
df["len_summary"] = df["summary"].str.len()
print(df[["len_article", "len_summary"]].describe().round(1))



 Cleaning texts...
 Duplicates removed: 17
       len_article  len_summary
count       4976.0       4976.0
mean        3916.3        255.5
std         2018.9         56.4
min          250.0         54.0
25%         2290.8        215.0
50%         3634.0        258.0
75%         5239.0        300.0
max        11415.0        447.0


In [32]:
# Optional: reduce to MAX_SAMPLES for faster training
if MAX_SAMPLES and len(df) > MAX_SAMPLES:
    df = df.sample(n=MAX_SAMPLES, random_state=SEED).reset_index(drop=True)
    print(f" Subsampling to {len(df)} examples for T5 training")
else:
    print(f" Number of examples used: {len(df)}")



 Number of examples used: 4976


In [33]:
# Split train / validation / test
from sklearn.model_selection import train_test_split

train_df, temp_df = train_test_split(
    df,
    test_size=0.2,
    random_state=SEED
)
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    random_state=SEED
)

if MAX_TRAIN_SAMPLES:
    train_df = train_df.sample(n=min(MAX_TRAIN_SAMPLES, len(train_df)), random_state=SEED)
if MAX_VAL_SAMPLES:
    val_df = val_df.sample(n=min(MAX_VAL_SAMPLES, len(val_df)), random_state=SEED)
if MAX_TEST_SAMPLES:
    test_df = test_df.sample(n=min(MAX_TEST_SAMPLES, len(test_df)), random_state=SEED)

print(f"Train: {len(train_df)} | Val: {len(val_df)} | Test: {len(test_df)}")



Train: 3980 | Val: 498 | Test: 498
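
The split sizes follow from two successive calls to `train_test_split`: 20% is held out first, then that held-out part is halved. The arithmetic can be checked with the sketch below, which assumes scikit-learn rounds a fractional test size up and gives the remainder to the training side; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples: int, test_size: float):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)   # assumption: test share is rounded up
    return n_samples - n_test, n_test

n_train, n_temp = split_sizes(4976, 0.2)   # first split: train vs. held-out
n_val, n_test = split_sizes(n_temp, 0.5)   # second split: validation vs. test
print(n_train, n_val, n_test)
```

Since all three sizes are below `MAX_TRAIN_SAMPLES`, `MAX_VAL_SAMPLES`, and `MAX_TEST_SAMPLES`, the subsequent `sample` calls only shuffle the rows here.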


In [34]:
# Convert to Hugging Face Datasets
hf_datasets = DatasetDict({
    "train": Dataset.from_pandas(train_df[["article", "summary"]]),
    "validation": Dataset.from_pandas(val_df[["article", "summary"]]),
    "test": Dataset.from_pandas(test_df[["article", "summary"]]),
})

# Remove the index column generated by pandas
hf_datasets = hf_datasets.remove_columns([col for col in hf_datasets["train"].column_names if col.startswith("__index")])

hf_datasets


DatasetDict({
    train: Dataset({
        features: ['article', 'summary'],
        num_rows: 3980
    })
    validation: Dataset({
        features: ['article', 'summary'],
        num_rows: 498
    })
    test: Dataset({
        features: ['article', 'summary'],
        num_rows: 498
    })
})

In [35]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# T5 task prefix for summarization
TASK_PREFIX = "summarize: "

MAX_INPUT_LENGTH = 512
MAX_TARGET_LENGTH = 128


def preprocess_function(batch: Dict[str, list]) -> Dict[str, list]:
    inputs = [TASK_PREFIX + article for article in batch["article"]]
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_INPUT_LENGTH,
        truncation=True
    )

    # Use text_target instead of as_target_tokenizer (deprecated)
    labels = tokenizer(
        text_target=batch["summary"],
        max_length=MAX_TARGET_LENGTH,
        truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



In [36]:
tokenized_datasets = hf_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=["article", "summary"]
)

tokenized_datasets


Map: 100%|██████████| 3980/3980 [00:02<00:00, 1851.66 examples/s]
Map: 100%|██████████| 498/498 [00:00<00:00, 2127.01 examples/s]
Map: 100%|██████████| 498/498 [00:00<00:00, 2001.52 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3980
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 498
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 498
    })
})

In [37]:
# Device already defined in an earlier cell
# Check that device is indeed defined
if 'device' not in globals():
    device = get_device()
    print(f"Device set: {device}")
else:
    print(f"Device already set: {device}")



Device already set: mps


In [38]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)



In [39]:
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [40]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
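
The tokenized examples have different lengths, so each batch has to be padded at collation time. The sketch below is a simplified pure-Python illustration of that idea, not the library's implementation: inputs are padded with T5's pad id (0) and labels with -100 so the loss ignores padded positions. `pad_batch` and the token ids are hypothetical.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # positions with this label are ignored by the loss

def pad_batch(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad every sequence in the batch to the longest one in that batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch

# Made-up token ids; 1 stands for the end-of-sequence token
batch = pad_batch([
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [24, 1], "labels": [32, 33, 34, 1]},
])
print(batch["input_ids"])
print(batch["labels"])
```

When a model is passed, as in the cell above, the real collator can additionally derive `decoder_input_ids` from the labels; that step is left out of this sketch.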



In [41]:
rouge_metric = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
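    if isinstance(preds, tuple):
        preds = preds[0]
    # The plain Trainer passes logits (batch, seq_len, vocab); reduce them to token ids
    if preds.ndim == 3:
        preds = np.argmax(preds, axis=-1)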
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Average ROUGE F-measure over all evaluated examples
    rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}
    for pred, label in zip(decoded_preds, decoded_labels):
        score = rouge_metric.score(target=label, prediction=pred)
        for k, v in score.items():
            rouge_scores[k].append(v.fmeasure)

    result = {k: round(float(np.mean(v)) * 100, 4) for k, v in rouge_scores.items() if v}
    return result
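
For intuition about what the ROUGE-1 F-measure reports, here is a simplified hand-rolled version based on unigram overlap. It is only a sketch: it splits on whitespace and skips the stemming and tokenization that `rouge_scorer` applies, so its numbers will not match the library exactly. `rouge1_f` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F-measure on lowercased whitespace tokens (no stemming)."""
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    overlap = sum((ref & pred).values())   # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 4 shared unigrams out of 6 on each side: precision = recall = 4/6
score = rouge1_f("the cat sat on the mat", "the cat lay on a mat")
print(round(score, 4))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.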



In [42]:
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 8
EPOCHS = 3
LEARNING_RATE = 5e-5
WARMUP_STEPS = 500

# Base arguments compatible with all versions
base_args = {
    "output_dir": OUTPUT_DIR,
    "per_device_train_batch_size": BATCH_SIZE,
    "per_device_eval_batch_size": BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "num_train_epochs": EPOCHS,
    "learning_rate": LEARNING_RATE,
    "warmup_steps": WARMUP_STEPS,
    "weight_decay": 0.01,
    "logging_steps": 100,
    "save_steps": 500,
    "save_total_limit": 2,
    "fp16": torch.cuda.is_available(),
    "report_to": ["none"]
}

# Add optional arguments depending on the version
try:
    # Try the modern arguments
    training_args = TrainingArguments(**base_args)
    print("✅ TrainingArguments created with base arguments")
except TypeError as e:
    # On error, drop the problematic arguments
    print(f"⚠️ Error with some arguments: {e}")
    # Drop the arguments that might cause problems
    base_args.pop("report_to", None)
    base_args.pop("save_total_limit", None)
    training_args = TrainingArguments(**base_args)
    print("✅ TrainingArguments created with minimal arguments")



✅ TrainingArguments created with base arguments
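
It can help to estimate how many optimizer steps these settings produce before training. The sketch below is a rough calculation that assumes a single device, the 3980 training examples from the split above, a linear warmup, and that a trailing partial accumulation group still counts as one step.

```python
import math

N_TRAIN = 3980                    # training examples (from the split above)
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 8
EPOCHS = 3
LEARNING_RATE = 5e-5
WARMUP_STEPS = 500

batches_per_epoch = math.ceil(N_TRAIN / BATCH_SIZE)
# Assumption: a trailing partial accumulation group counts as one optimizer step
steps_per_epoch = math.ceil(batches_per_epoch / GRADIENT_ACCUMULATION_STEPS)
total_steps = steps_per_epoch * EPOCHS
effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Linear warmup: the learning rate scales with step / WARMUP_STEPS until warmup ends
peak_lr = LEARNING_RATE * min(1.0, total_steps / WARMUP_STEPS)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}")
print(f"highest learning rate reached: {peak_lr:.2e}")
```

Under these assumptions the run has fewer optimizer steps than `WARMUP_STEPS`, so training would end while the learning rate is still ramping up and `LEARNING_RATE` would never be reached. If that estimate holds for your setup, a smaller `WARMUP_STEPS` or a `warmup_ratio` may be worth trying.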


In [43]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)





## 2. Fine-tuning the T5 model



In [44]:
# Run the training, then save the model, tokenizer, metrics, and trainer state
train_result = trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
trainer.save_state()





| Step | Training Loss |
|------|---------------|
| 100  | 3.5574        |
| 200  | 2.9992        |
| 300  | 2.731         |


***** train metrics *****
  epoch                    =        3.0
  total_flos               =  1504145GF
  train_loss               =     3.0109
  train_runtime            = 0:36:34.97
  train_samples_per_second =       5.44
  train_steps_per_second   =      0.171


## 4. Summary generation (inference)


In [46]:
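from functools import lru_cache


@lru_cache(maxsize=2)
def load_trained_model(model_path: str = OUTPUT_DIR):
    """Load a fine-tuned T5 model and its tokenizer from disk (cached per path)."""
    tok = AutoTokenizer.from_pretrained(model_path)
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_path)
    return tok, mdl


def generate_summary(article: str,
                     model_path: str = OUTPUT_DIR,
                     max_new_tokens: int = 128,
                     num_beams: int = 4) -> str:
    """Summarize one article with beam search, using the model saved at model_path."""
    # The cache avoids reloading the weights from disk on every call
    tok, mdl = load_trained_model(model_path)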
    mdl.eval()
    inputs = tok(
        TASK_PREFIX + article,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_INPUT_LENGTH
    )
    with torch.no_grad():
        outputs = mdl.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_beams=num_beams,
            early_stopping=True
        )
    return tok.decode(outputs[0], skip_special_tokens=True)



In [47]:
# Example (run after training and saving)
# Note: df covers all splits, so this article may have been seen during training
sample_article = df["article"].iloc[0]
print("ARTICLE:", sample_article[:400], "...")
generated_summary = generate_summary(sample_article)
print("\nGENERATED SUMMARY:", generated_summary)



ARTICLE: New Haven, Connecticut A Connecticut doctor whose wife and two daughters were killed in a 2007 home invasion took the stand Tuesday to testify against one of the accused killers, recalling horrific details of being beaten and tied up by his alleged captors while fearing for the well-being of his family. William Petit, testifying on the trial's second day in New Haven Superior Court, calmly relayed ...

GENERATED SUMMARY: William Petit testifies against one of the accused killers Steven Hayes, 47, and Joshua Komisarjevsky, 30, are charged with capital murder, kidnapping, sexual assault, burglary and arson They both could face the death penalty if convicted


In [48]:
# Manual evaluation on the test set (after training)
from tqdm.auto import tqdm

print("Loading the fine-tuned model...")
tokenizer, model = load_trained_model(OUTPUT_DIR)

print("Generating summaries on the test set...")
references, predictions = [], []
for sample in tqdm(hf_datasets["test"], desc="Evaluation test"):
    generated = generate_summary(sample["article"], model_path=OUTPUT_DIR)
    references.append(sample["summary"])
    predictions.append(generated)

print("Computing ROUGE scores...")
rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}
for pred, ref in zip(predictions, references):
    score = rouge_metric.score(target=ref, prediction=pred)
    for k, v in score.items():
        rouge_scores[k].append(v.fmeasure)

# Display the results
results = {k: float(np.mean(v)) * 100 for k, v in rouge_scores.items()}
print("\n ROUGE results on the test set:")
for metric, score in results.items():
    print(f"   {metric.upper()}: {score:.2f}%")

results



Loading the fine-tuned model...
Generating summaries on the test set...


Evaluation test: 100%|██████████| 498/498 [10:03<00:00,  1.21s/it]


Computing ROUGE scores...

 ROUGE results on the test set:
   ROUGE1: 33.59%
   ROUGE2: 13.80%
   ROUGEL: 23.75%


{'rouge1': 33.59053456695846,
 'rouge2': 13.804220334102277,
 'rougeL': 23.750374955697488}