# BART: Summarize 10 random articles + before/after metrics

This notebook:
- draws **10 random articles** from `df` (column `text`)
- summarizes them with **BART**
- measures the **before / after** difference (word counts, reduction, semantic similarity, extractive overlap)
- saves the results to a CSV file, `bart_10_random_before_after.csv`

> ⚠️ `df` must already exist (otherwise load `dataset.csv` via cell 2).

In [1]:
# Cell 1: Imports
import random
import pandas as pd
import numpy as np
import torch

from transformers import pipeline




In [2]:
# Cell 2: Load the dataset (if df does not already exist)
# If df is already in memory, you can comment out this cell.
CSV_PATH = "dataset.csv"  # adjust if needed (e.g. "/mnt/data/dataset.csv")
df = pd.read_csv(CSV_PATH)

print("Columns:", list(df.columns))
print("df shape:", df.shape)
df.head()


Columns: ['text', 'label_encoded']
df shape: (600, 2)


Unnamed: 0,text,label_encoded
0,drive to 'save' festive holidays efforts are b...,0
1,brown hits back in blair rift row gordon brown...,0
2,holmes is hit by hamstring injury kelly holmes...,1
3,the future in your pocket if you are a geek or...,2
4,o'sullivan could run in worlds sonia o'sulliva...,1


In [3]:
# Cell 3: Sanity checks
assert "text" in df.columns, "❌ The 'text' column is missing."


In [4]:
# Cell 4: Parameters + random sampling (10 articles)
N = 10
SEED = 42

MODEL_SUM = "facebook/bart-large-cnn"
MAX_LEN = 130
MIN_LEN = 35
MAX_CHARS_FOR_SUM = 6000  # simple character truncation for the demo

sample_df = df.sample(n=min(N, len(df)), random_state=SEED).reset_index(drop=True)

print("Number of articles selected:", len(sample_df))
sample_df.head()


Number of articles selected: 10


Unnamed: 0,text,label_encoded
0,fa decides not to punish mourinho the football...,1
1,crucial decision on supercasinos a decision on...,0
2,games firms 'face tough future' uk video game ...,2
3,rovers reject third ferguson bid blackburn hav...,1
4,iaaf awaits greek pair's response kostas kente...,1


In [5]:
# Cell 5: Load BART (GPU if available)
device = 0 if torch.cuda.is_available() else -1
print("Device in use:", "GPU" if device == 0 else "CPU")

summarizer = pipeline(
    "summarization",
    model=MODEL_SUM,
    device=device
)


Device in use: CPU


Device set to use cpu


In [7]:
# Cell 6: BART summarization function
def bart_summarize(text: str) -> str:
    text = str(text).strip()
    if not text:
        return ""
    text = text[:MAX_CHARS_FOR_SUM]  # quick demo: plain character cut (proper chunking could be added)
    result = summarizer(
        text,
        max_length=MAX_LEN,
        min_length=MIN_LEN,
        do_sample=False,
        truncation=True
    )
    return result[0]["summary_text"].strip()
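
The character cut in `bart_summarize` simply drops everything past `MAX_CHARS_FOR_SUM`. A minimal sketch of the chunking alternative mentioned in the comment is shown below. The helper names `chunk_words` and `summarize_in_chunks` and the 400-word budget are assumptions made for illustration, not part of the original notebook; the budget is a rough guess intended to stay under BART's 1024-token input limit, and word counts are only a proxy for token counts.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace tokens."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_in_chunks(
    text: str,
    summarize_fn: Callable[[str], str],
    max_words: int = 400,  # hypothetical budget, tune for your model
) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    parts = [summarize_fn(chunk) for chunk in chunk_words(text, max_words)]
    # Drop empty partial summaries before joining.
    return " ".join(part for part in parts if part)


# In the notebook, the summarizer defined above could be plugged in like this:
# sample_df["summary_bart"] = sample_df["text"].apply(
#     lambda t: summarize_in_chunks(t, bart_summarize)
# )
```

Joining partial summaries can produce a longer and less coherent result than a single pass, so a second summarization pass over the joined text may be worth trying for long articles.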


In [8]:
# Cell 7: Generate the summaries
sample_df["summary_bart"] = sample_df["text"].apply(bart_summarize)
sample_df[["text", "summary_bart"]].head(2)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unnamed: 0,text,summary_bart
0,fa decides not to punish mourinho the football...,fa decides not to punish jose mourinho. The fo...
1,crucial decision on supercasinos a decision on...,The government has plans for up to eight las v...


In [10]:
# Cell 8: Metric 1: length before/after (words) + reduction %
def word_count(text):
    return len(str(text).split())

sample_df["words_before"] = sample_df["text"].apply(word_count)
sample_df["words_after"]  = sample_df["summary_bart"].apply(word_count)

sample_df["reduction_%"] = (1 - sample_df["words_after"] / sample_df["words_before"]) * 100
sample_df["reduction_%"] = sample_df["reduction_%"].replace([np.inf, -np.inf], np.nan).fillna(0).clip(lower=0)

sample_df[["words_before", "words_after", "reduction_%"]]


Unnamed: 0,words_before,words_after,reduction_%
0,247,52,78.947368
1,194,36,81.443299
2,480,32,93.333333
3,300,39,87.0
4,261,69,73.563218
5,192,29,84.895833
6,434,41,90.552995
7,772,29,96.243523
8,369,37,89.9729
9,256,45,82.421875
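
As a cross-check of the reduction column, the same formula can be written as a plain function. This is a sketch: `reduction_percent` is a hypothetical helper name, and it is meant to mirror the notebook's handling of edge cases (an empty source text or a summary longer than the source both yield 0).

```python
def reduction_percent(words_before: int, words_after: int) -> float:
    """Percentage of words removed by summarization, floored at 0."""
    if words_before <= 0:
        # Mirrors the notebook: inf/NaN from a zero denominator becomes 0.
        return 0.0
    # Mirrors .clip(lower=0): a summary longer than the source counts as 0%.
    return max(0.0, (1 - words_after / words_before) * 100)


# First sampled article: 247 words before, 52 words after.
first_row = reduction_percent(247, 52)
print(round(first_row, 6))
```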


In [11]:
# Cell 9: Metric 2: semantic similarity (cosine similarity)
# If needed: pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

emb_before = embedder.encode(sample_df["text"].tolist(), convert_to_tensor=True)
emb_after  = embedder.encode(sample_df["summary_bart"].tolist(), convert_to_tensor=True)

cos_scores = util.cos_sim(emb_before, emb_after).diagonal()
sample_df["semantic_similarity"] = cos_scores.cpu().numpy()

sample_df[["semantic_similarity"]]


Unnamed: 0,semantic_similarity
0,0.759438
1,0.664145
2,0.815582
3,0.826539
4,0.850337
5,0.698977
6,0.837851
7,0.683956
8,0.7614
9,0.704157
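
`util.cos_sim(emb_before, emb_after)` returns the full matrix of similarities between every article and every summary; taking the diagonal keeps only the pairs where article i meets summary i. The sketch below shows that paired computation in pure Python on toy vectors. The function names `cosine` and `paired_cosine` are illustrative and do not come from the notebook or from sentence-transformers.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def paired_cosine(
    before: Sequence[Sequence[float]],
    after: Sequence[Sequence[float]],
) -> List[float]:
    """Similarity of row i in `before` with row i in `after` (the diagonal)."""
    return [cosine(u, v) for u, v in zip(before, after)]


# Toy 2-D "embeddings": the first pair is identical, the second is 45 degrees apart.
scores = paired_cosine([[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
```

For a sample of 10 articles the full matrix is cheap to compute, but for larger samples a paired computation avoids building N×N scores only to discard most of them.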


In [12]:
# Cell 10: Metric 3: extractive overlap (3-gram), i.e. the share of summary 3-grams copied verbatim from the source (values near 1 mean a largely extractive summary)
def ngrams(tokens, n=3):
    return set(tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)) if len(tokens) >= n else set()

def extractive_overlap_ratio(original, summary, n=3):
    o = str(original).lower().split()
    s = str(summary).lower().split()
    s_ng = ngrams(s, n)
    if not s_ng:
        return 0.0
    o_ng = ngrams(o, n)
    return len(s_ng & o_ng) / len(s_ng)

sample_df["extractive_overlap_3gram"] = sample_df.apply(
    lambda r: extractive_overlap_ratio(r["text"], r["summary_bart"], n=3),
    axis=1
)

sample_df[["extractive_overlap_3gram"]]


Unnamed: 0,extractive_overlap_3gram
0,0.8
1,0.852941
2,0.933333
3,0.810811
4,0.940299
5,0.703704
6,0.820513
7,0.740741
8,0.828571
9,0.976744
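
Because the overlap metric splits on whitespace, punctuation stays attached to words: a summary token such as `mourinho.` does not match `mourinho` in a source that has no punctuation, so the reported ratio slightly understates copying. The sketch below compares whitespace tokenization with a regex tokenizer on toy strings. The regex variant and the names `overlap_ratio`, `whitespace_tokens`, and `word_tokens` are assumptions added for illustration, not part of the notebook.

```python
import re
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(original: List[str], summary: List[str], n: int = 3) -> float:
    """Share of summary n-grams that also occur in the original."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams & ngrams(original, n)) / len(summary_ngrams)


def whitespace_tokens(text: str) -> List[str]:
    """Tokenization used in the notebook: lowercase, split on whitespace."""
    return text.lower().split()


def word_tokens(text: str) -> List[str]:
    """Hypothetical variant: keep letters, digits, apostrophes; drop punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Toy strings: the summary copies the source and only adds a final period.
original = "fa decides not to punish mourinho the football association"
summary = "fa decides not to punish mourinho."

strict = overlap_ratio(whitespace_tokens(original), whitespace_tokens(summary))
loose = overlap_ratio(word_tokens(original), word_tokens(summary))
print(strict, loose)  # 0.75 1.0
```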


In [13]:
# Cell 11: BEFORE/AFTER display + metrics
def clip(text, n=500):
    text = str(text)
    return text[:n] + ("..." if len(text) > n else "")

show_df = pd.DataFrame({
    "BEFORE (excerpt)": sample_df["text"].apply(lambda x: clip(x, 500)),
    "AFTER (BART summary)": sample_df["summary_bart"],
    "words_before": sample_df["words_before"],
    "words_after": sample_df["words_after"],
    "reduction_%": sample_df["reduction_%"].round(1),
    "semantic_sim": sample_df["semantic_similarity"].round(3),
    "extractive_overlap": sample_df["extractive_overlap_3gram"].round(3)
})

show_df


Unnamed: 0,BEFORE (excerpt),AFTER (BART summary),words_before,words_after,reduction_%,semantic_sim,extractive_overlap
0,fa decides not to punish mourinho the football...,fa decides not to punish jose mourinho. The fo...,247,52,78.9,0.759,0.8
1,crucial decision on supercasinos a decision on...,The government has plans for up to eight las v...,194,36,81.4,0.664,0.853
2,games firms 'face tough future' uk video game ...,Three leading uk video game companies predicte...,480,32,93.3,0.816,0.933
3,rovers reject third ferguson bid blackburn hav...,Blackburn have rejected a third bid from range...,300,39,87.0,0.827,0.811
4,iaaf awaits greek pair's response kostas kente...,kostas kenteris and katerina thanou are yet to...,261,69,73.6,0.85,0.94
5,iaaf launches fight against drugs the iaaf at...,iaaf chief lamine diack and namibian athlete f...,192,29,84.9,0.699,0.704
6,ask jeeves joins web log market ask jeeves has...,Ask jeeves buys bloglines to improve the way i...,434,41,90.6,0.838,0.821
7,costin aims for comeback in 2006 jamie costin ...,costin aims for comeback in 2006. Back broken ...,772,29,96.2,0.684,0.741
8,us top of supercomputing charts the us has pus...,us top of supercomputing charts with ibm's pro...,369,37,90.0,0.761,0.829
9,commodore finds new lease of life the oncefamo...,commodore computer brand could be resurrected ...,256,45,82.4,0.704,0.977


In [14]:
# Cell 12: Global summary + CSV export
print("=== GLOBAL SUMMARY (10 articles) ===")
print(f"Words before (mean)  : {sample_df['words_before'].mean():.1f}")
print(f"Words after  (mean)  : {sample_df['words_after'].mean():.1f}")
print(f"Mean reduction       : {sample_df['reduction_%'].mean():.1f}%")
print(f"Semantic similarity  : {sample_df['semantic_similarity'].mean():.3f}")
print(f"Extractive overlap   : {sample_df['extractive_overlap_3gram'].mean():.3f}")

out_csv = "bart_10_random_before_after.csv"
show_df.to_csv(out_csv, index=False, encoding="utf-8")
print("✅ File saved:", out_csv)


=== GLOBAL SUMMARY (10 articles) ===
Words before (mean)  : 350.5
Words after  (mean)  : 40.9
Mean reduction       : 85.8%
Semantic similarity  : 0.760
Extractive overlap   : 0.841
✅ File saved: bart_10_random_before_after.csv
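
The 85.8% figure is the mean of the ten per-article reductions, which weights every article equally. Computing the reduction on the total word counts instead gives a different number, because long articles then weigh more. The sketch below reproduces both from the word counts reported in cell 8; the variable names `macro_reduction` and `micro_reduction` are illustrative.

```python
# Word counts copied from the cell 8 output (before, after), one pair per article.
word_counts = [
    (247, 52), (194, 36), (480, 32), (300, 39), (261, 69),
    (192, 29), (434, 41), (772, 29), (369, 37), (256, 45),
]

# Mean of per-article reductions: what the notebook prints.
per_article = [(1 - after / before) * 100 for before, after in word_counts]
macro_reduction = sum(per_article) / len(per_article)

# Reduction computed on the totals: long articles count for more.
total_before = sum(before for before, _ in word_counts)
total_after = sum(after for _, after in word_counts)
micro_reduction = (1 - total_after / total_before) * 100

print(f"Mean of per-article reductions: {macro_reduction:.1f}%")  # 85.8%
print(f"Reduction on total word counts: {micro_reduction:.1f}%")  # 88.3%
```

Which figure to report depends on the question: the per-article mean describes a typical article, while the total-based figure describes the corpus as a whole.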
