📦 Cell 1 – Setup & imports

In [4]:
# ===========================================
# 1) Imports and NLTK resources
# ===========================================
import sys, subprocess, os, re, io, zipfile, requests, unicodedata
import numpy as np
import pandas as pd
import networkx as nx
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt_tab')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


📁 Cell 2 – Loading the CSV

In [5]:
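# NOTE: latin-1 never fails to decode, but the summary printed in Cell 9 shows
# dropped punctuation (e.g. "hed"). If the file is actually Windows-1252,
# encoding='cp1252' may preserve it (assumption, not verified here).
df = pd.read_csv('tennis_articles.csv', encoding='latin-1')
df = df.drop(columns=['article_title'], errors='ignore')
print(f"{len(df)} articles loaded")
df.head()

8 articles loaded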


Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP)  Roger Federer advanc...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


🔪 Cell 3 – Preparation: sentence tokenization

In [6]:
sentences = []
for text in df['article_text'].dropna():
    sentences.extend(sent_tokenize(text))
print("Sentences extracted:", len(sentences))

Sentences extracted: 130


⬇️ Cell 4 – Downloading and loading GloVe 100d

In [7]:
GLOVE_DIR = '/content/glove'
os.makedirs(GLOVE_DIR, exist_ok=True)

glove_zip_path = os.path.join(GLOVE_DIR, "glove.6B.zip")
glove_txt_path = os.path.join(GLOVE_DIR, "glove.6B.100d.txt")

if not os.path.isfile(glove_txt_path):
    print("Downloading GloVe 6B 100d...")
    url = "http://nlp.stanford.edu/data/glove.6B.zip"
    # Stream the archive to disk in chunks instead of holding it all in memory
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(glove_zip_path, "wb") as fh:
            for chunk in r.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    # Extract only the 100d file, not the whole archive
    with zipfile.ZipFile(glove_zip_path) as z:
        z.extract(os.path.basename(glove_txt_path), GLOVE_DIR)
    print("✅ GloVe extracted.")

# Load the word → vector dictionary
glove_vectors = {}
with open(glove_txt_path, encoding="utf8") as f:
    for line in f:
        parts = line.split()
        # Expect 1 token + 100 floats; skip malformed lines
        if len(parts) != 101:
            continue
        word = parts[0]
        vec = np.array(parts[1:], dtype='float32')
        glove_vectors[word] = vec
print(f"{len(glove_vectors):,} words loaded")

Downloading GloVe 6B 100d...
✅ GloVe extracted.
400,000 words loaded


🧹 Cell 5 – Cleaning & normalization

In [8]:
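def clean_sentence(sentence):
    # Unicode normalization (NFKD decomposes compatibility characters).
    # It cannot restore characters already lost upstream, such as the
    # apostrophe missing from "hed" in the printed summary.
    sentence = unicodedata.normalize("NFKD", sentence)
    sentence = sentence.lower().strip()
    # Keep word characters (letters, digits, underscore), whitespace and apostrophes
    sentence = re.sub(r"[^\w\s']", " ", sentence)
    tokens = [w for w in sentence.split()
              if w not in stop_words and len(w) > 2]
    return tokens

# Drop sentences that are empty after cleaning, and filter `sentences` in
# lockstep so that index i refers to the same sentence in both lists
pairs = [(s, clean_sentence(s)) for s in sentences]
sentences = [s for s, tok in pairs if tok]
cleaned_sentences = [tok for s, tok in pairs if tok]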
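
To see what the cleaning step keeps and what it drops, here is a self-contained sketch of the same logic applied to two inputs. `DEMO_STOP_WORDS` and `clean_sentence_demo` are hypothetical names, and the tiny stop-word set is only a stand-in for NLTK's English list so the snippet runs without any download.

```python
import re
import unicodedata

# Hypothetical miniature stop-word list (the notebook uses NLTK's English list)
DEMO_STOP_WORDS = {"the", "was", "and", "she", "but", "not"}

def clean_sentence_demo(sentence, stop_words=DEMO_STOP_WORDS):
    # Same steps as clean_sentence: normalize, lowercase, strip punctuation, filter
    sentence = unicodedata.normalize("NFKD", sentence).lower().strip()
    sentence = re.sub(r"[^\w\s']", " ", sentence)
    return [w for w in sentence.split() if w not in stop_words and len(w) > 2]

demo_tokens = clean_sentence_demo("She was happy, but the match wasn't over!")
print(demo_tokens)                 # ['happy', 'match', "wasn't", 'over']

# A fragment made only of short tokens cleans to an empty list
print(clean_sentence_demo("No."))  # []
```

The second call shows why some sentences can end up empty: any fragment made only of tokens of two characters or fewer, or of stop words, yields no tokens at all.
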

🔢 Cell 6 – Sentence vectorization

In [9]:
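EMB_DIM = 100  # dimensionality of the GloVe vectors loaded in Cell 4

def sentence_vector(tokens):
    # Average the GloVe vectors of the in-vocabulary tokens
    vecs = [glove_vectors[w] for w in tokens if w in glove_vectors]
    if not vecs:
        # No known token: fall back to a zero vector
        return np.zeros(EMB_DIM, dtype='float32')
    return np.mean(vecs, axis=0)

sentence_vectors = np.array([sentence_vector(tok) for tok in cleaned_sentences])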
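
As a small worked example of the averaging above, the sketch below uses a hypothetical 3-dimensional embedding table (`toy_vectors`) in place of the 100-dimensional GloVe dictionary. The words and values are invented for illustration only.

```python
import numpy as np

# Hypothetical 3-d embeddings standing in for the 100-d GloVe table
toy_vectors = {
    "federer": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "wins":    np.array([3.0, 2.0, 0.0], dtype="float32"),
}
TOY_DIM = 3

def toy_sentence_vector(tokens):
    # Mean of the known vectors; zero vector when no token is in the table
    vecs = [toy_vectors[w] for w in tokens if w in toy_vectors]
    if not vecs:
        return np.zeros(TOY_DIM, dtype="float32")
    return np.mean(vecs, axis=0)

known = toy_sentence_vector(["federer", "wins", "basel"])  # "basel" is out of vocabulary
unknown = toy_sentence_vector(["basel"])
print(known)    # [2. 1. 1.]
print(unknown)  # [0. 0. 0.]
```

Out-of-vocabulary tokens are silently ignored, and a sentence with no known token becomes a zero vector, which has a cosine similarity of 0 with every other sentence.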

🕸️ Cell 7 – Building the similarity graph

In [10]:
SIM_THRESHOLD = 0.25
n = len(sentence_vectors)
sim_mat = cosine_similarity(sentence_vectors)

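# Only pairs above the threshold are added as edges; a sentence with no such
# pair never becomes a node (hence raw_scores.get(i, 0) in Cell 8)
graph = nx.Graph()
for i in range(n):
    for j in range(i+1, n):
        if sim_mat[i, j] > SIM_THRESHOLD:
            graph.add_edge(i, j, weight=sim_mat[i, j])

print(f"Graph built: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")

Graph built: 129 nodes, 8247 edges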
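
The node and edge counts reported above can be turned into a density figure, which shows how close the graph is to being complete. The helper `graph_density` is a hypothetical name. On the live graph, `nx.density(graph)` computes the same ratio.

```python
def graph_density(num_nodes, num_edges):
    """Fraction of the possible undirected edges that are present."""
    possible = num_nodes * (num_nodes - 1) // 2
    return num_edges / possible

# Counts taken from the cell output above
density = graph_density(129, 8247)
print(f"{density:.4f}")  # 0.9989
```

With 8,247 of 8,256 possible pairs connected, almost every sentence is linked to every other one at this threshold.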


🔹 Cell 8 – PageRank + length penalty

In [11]:

# PageRank with an exponential penalty
#    to reduce the advantage of very long sentences
#    (length is counted in cleaned tokens, i.e. after stop-word removal)

MAX_WORDS = 40
penalty = {i: np.exp(-len(tok)/MAX_WORDS)
           for i, tok in enumerate(cleaned_sentences)}

raw_scores = nx.pagerank(graph, weight='weight')
scores = {i: raw_scores.get(i, 0) * penalty[i] for i in range(n)}
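
To get a feel for the size of this penalty, the sketch below evaluates the same formula, exp(-k / MAX_WORDS), for a few token counts. The function name `length_penalty` and the chosen counts are illustrative assumptions.

```python
import math

MAX_WORDS = 40  # same constant as in the cell above

def length_penalty(num_tokens, max_words=MAX_WORDS):
    # Multiplicative factor applied to the PageRank score
    return math.exp(-num_tokens / max_words)

for k in (3, 10, 20, 40):
    print(k, round(length_penalty(k), 3))
# 3 0.928
# 10 0.779
# 20 0.607
# 40 0.368
```

A 3-token sentence keeps about 93% of its PageRank score while a 20-token sentence keeps about 61%. When the raw scores are nearly uniform, this factor can dominate the ranking, which may explain why the summary consists mostly of very short sentences.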

✂️ Cell 9 – Generating the *summary*

In [12]:
N = 10
top_idx = sorted(scores, key=scores.get, reverse=True)[:N]
# Put the selected sentences back in document order
summary = ' '.join(sentences[i] for i in sorted(top_idx))
print(f"=== Automatic summary ({N} sentences) ===\n")
print(summary)

=== Automatic summary (10 sentences) ===

But ultimately tennis is just a very small part of what we do. Copil has two after also beating No. They only left me three days to decide, Federer said. He has lost eight straight finals since. It seems pretty friendly right now, said Davenport. It's just getting used to it. This is where the first rounds can be tricky. It's a big deal because...you never forget your first. Not because hed been out on a bender or anything  those days were in the past. I was always happy to work hard, he said.


📊 Cell 10 – Quick analysis of the results

In [13]:
vals = list(scores.values())
print("Top-5 scores :", sorted(vals, reverse=True)[:5])
print("Std dev      :", round(np.std(vals), 4))

Top-5 scores : [np.float64(0.00721986265212178), np.float64(0.007139190997039178), np.float64(0.00710844949774012), np.float64(0.007058373721557581), np.float64(0.0070333833215309935)]
Std dev      : 0.001


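The pipeline runs end to end, but two points need fixing right away. First, the scores are nearly uniform: the standard deviation is about 0.001 and the top five scores all fall between 0.0070 and 0.0072. The graph has 8,247 edges out of 8,256 possible pairs among its 129 nodes, so it is almost complete and gives PageRank little structure to discriminate on. Second, a 10-sentence summary is long for 8 short articles. A higher similarity threshold and a smaller N address these two points.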

Possible improvements:

*Make the graph sparser*


SIM_THRESHOLD = 0.30  # or 0.35

→ fewer edges → wider spread between the scores.
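
Before settling on a value, it can help to count how many edges each candidate threshold would keep. The sketch below does this on a small synthetic matrix. `edge_count` and `demo_sim` are hypothetical names and the similarity values are invented. In the notebook the same function could be called on `sim_mat`.

```python
import numpy as np

def edge_count(sim, threshold):
    """Number of unordered pairs (i < j) whose similarity exceeds the threshold."""
    upper = np.triu_indices(sim.shape[0], k=1)  # strictly above the diagonal
    return int(np.sum(sim[upper] > threshold))

# Synthetic 4x4 similarity matrix, for illustration only
demo_sim = np.array([
    [1.00, 0.90, 0.32, 0.20],
    [0.90, 1.00, 0.28, 0.40],
    [0.32, 0.28, 1.00, 0.33],
    [0.20, 0.40, 0.33, 1.00],
])

for t in (0.25, 0.30, 0.35):
    print(t, edge_count(demo_sim, t))
# 0.25 5
# 0.3 4
# 0.35 2
```

If the edge count barely moves between 0.25 and 0.35 on the real matrix, a threshold alone will not sparsify the graph, and a higher value or a different embedding would be needed.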

*Shorten the summary*


N = 4  # or 5 at most

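*Add a "first occurrence" score*
The first sentences of an article often carry more value.


# NOTE: i is the global sentence index across all articles, so this bonus
# favors the earliest articles rather than the first sentences of each article
position_bonus = {i: np.exp(-i/50) for i in range(n)}
scores = {i: raw_scores.get(i, 0) * penalty[i] * position_bonus[i] for i in range(n)}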
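
If the goal is to reward the opening sentences of each article, the bonus has to be computed from the position inside the article rather than from the global index. The following is a minimal sketch under that assumption. `article_positions`, `position_bonus_per_article` and the `decay` constant are hypothetical and not part of the notebook. The per-article sentence counts would have to be recorded during tokenization in Cell 3 and kept aligned with any later filtering.

```python
import math

def article_positions(sentences_per_article):
    """For each sentence in global order, its 0-based rank inside its own article."""
    positions = []
    for count in sentences_per_article:
        positions.extend(range(count))
    return positions

def position_bonus_per_article(sentences_per_article, decay=10.0):
    # decay is a hypothetical tuning constant, not a value from the notebook
    return [math.exp(-p / decay) for p in article_positions(sentences_per_article)]

# Two toy articles with 3 and 2 sentences
demo_positions = article_positions([3, 2])
demo_bonus = position_bonus_per_article([3, 2])
print(demo_positions)  # [0, 1, 2, 0, 1]
```

With this variant, the first sentence of every article receives the same bonus of 1.0, whichever article it belongs to.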

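*Switch to SBERT (optional but very effective)*
Contextual embeddings tend to give more widely spread similarities from the start.


!pip install -q sentence-transformers
from sentence_transformers import SentenceTransformer
sbert = SentenceTransformer('all-MiniLM-L6-v2')
# SBERT encodes the raw sentences; they must stay index-aligned with
# cleaned_sentences, which the length penalty in Cell 8 relies on
sentence_vectors = sbert.encode(sentences, show_progress_bar=True)
# Then re-run Cells 7 to 9 on the new sentence_vectors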