# Fine-tuning the Fake News Classifier

This notebook walks through fine-tuning the RoBERTa model for fake news detection.

**Steps:**
1. Loading and exploring the dataset
2. Fine-tuning the model
3. Detailed evaluation
4. Comparison before/after fine-tuning
5. Testing on real Bluesky posts

In [None]:
import sys
sys.path.insert(0, '..')

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

sns.set_theme(style='whitegrid')
print('Setup OK')

---
## 1. Loading and exploring the dataset

In [None]:
from src.training.dataset import (
    load_liar_dataset, load_fake_news_kaggle, merge_datasets,
    get_dataset_stats, ID2LABEL, LABEL2ID
)

# Option A: LIAR only (recommended to start with)
dataset = load_liar_dataset()

# Option B: LIAR + Kaggle combined (uncomment for more data)
# liar = load_liar_dataset()
# kaggle = load_fake_news_kaggle()
# dataset = merge_datasets(liar, kaggle)

# Option C: custom dataset
# from src.training.dataset import load_custom_csv
# dataset = load_custom_csv('data/raw/mon_dataset.csv')

print(dataset)

In [None]:
# Dataset statistics: label counts and percentages per split
stats = get_dataset_stats(dataset)
for split, info in stats.items():
    print(f'\n=== {split.upper()} ({info["total"]} examples) ===')
    for label, data in info['distribution'].items():
        bar = '█' * int(float(data['pct'].replace('%', '')) / 2)
        print(f'  {label:>10s} : {data["count"]:5d} ({data["pct"]:>6s}) {bar}')

In [None]:
# Label distribution
train_labels = [ID2LABEL[l] for l in dataset['train']['label']]
fig = px.histogram(x=train_labels, color=train_labels,
                   color_discrete_map={'fiable': '#2ecc71', 'douteux': '#f39c12', 'fake': '#e74c3c'},
                   title='Label distribution (train set)')
fig.show()

In [None]:
# Text length, measured in whitespace-separated words
train_texts = dataset['train']['text']
lengths = [len(t.split()) for t in train_texts]

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(lengths, bins=50, edgecolor='black', alpha=0.7)
ax.axvline(np.mean(lengths), color='red', linestyle='--', label=f'Mean: {np.mean(lengths):.0f} words')
ax.axvline(np.median(lengths), color='blue', linestyle='--', label=f'Median: {np.median(lengths):.0f} words')
ax.set_xlabel('Word count')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of text length (train)')
ax.legend()
plt.tight_layout()
plt.show()

print(f'Min: {min(lengths)}, Max: {max(lengths)}, Mean: {np.mean(lengths):.1f}')
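
The word counts above can inform the `max_length` passed to training in section 2. The following is a minimal sketch, not part of this project: the helper `suggest_max_length` is hypothetical, and the 1.3 tokens-per-word factor is a rough guess, since whitespace word counts only approximate the number of subword tokens RoBERTa's tokenizer produces.

```python
import numpy as np

def suggest_max_length(word_counts, percentile=95, tokens_per_word=1.3, multiple=8):
    """Return a max_length covering `percentile`% of texts, rounded up to `multiple`."""
    # Hypothetical helper: tokens_per_word is an assumed ratio, not a measured one.
    cutoff_words = np.percentile(word_counts, percentile)
    # +2 accounts for the <s> and </s> special tokens around a single sequence.
    cutoff_tokens = cutoff_words * tokens_per_word + 2
    return int(np.ceil(cutoff_tokens / multiple) * multiple)

example_lengths = [12, 18, 25, 9, 40, 17, 22, 30, 15, 20]  # toy word counts
suggested = suggest_max_length(example_lengths)
print(suggested)
```

In the notebook, `lengths` would replace `example_lengths`. For a reliable value, measure token counts with the actual tokenizer instead of relying on the assumed ratio.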

In [None]:
# Up to three examples per label
for label_id, label_name in ID2LABEL.items():
    examples = [t for t, l in zip(train_texts, dataset['train']['label']) if l == label_id][:3]
    print(f'\n=== {label_name.upper()} ===')
    for ex in examples:
        print(f'  - {ex[:120]}...' if len(ex) > 120 else f'  - {ex}')

---
## 2. Fine-tuning

In [None]:
from src.training.train import train

# Launch fine-tuning
# Adjust the hyperparameters to your resources:
#   - GPU: epochs=3, batch_size=16
#   - CPU: epochs=2, batch_size=8, max_length=128

metrics = train(
    model_name='roberta-base',       # or 'camembert-base' for French
    dataset_name='liar',             # 'liar', 'kaggle', 'liar+kaggle', 'custom'
    output_dir='../data/models/fake_news_detector',
    epochs=3,
    batch_size=16,
    learning_rate=2e-5,
    max_length=256,
)
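
To gauge how long a run with these hyperparameters will take, it helps to count optimizer steps. This is a minimal sketch with a made-up training-set size. `count_training_steps` is an illustrative helper, not a function of `src.training.train`, and whether `train` uses gradient accumulation is not known from this notebook.

```python
import math

def count_training_steps(num_examples, batch_size, epochs, grad_accum_steps=1):
    # Batches per epoch; a final partial batch still counts as one batch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Optimizer updates per epoch, allowing for optional gradient accumulation.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * epochs

n_train = 10_000  # hypothetical size; use len(dataset['train']) in the notebook
gpu_steps = count_training_steps(n_train, batch_size=16, epochs=3)
cpu_steps = count_training_steps(n_train, batch_size=8, epochs=2)
print(f'GPU settings: {gpu_steps} steps, CPU settings: {cpu_steps} steps')
```

The CPU settings use fewer epochs but a smaller batch, so they can still require more optimizer steps than the GPU settings.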

In [None]:
# Display the metrics on the test split
print('=== Final metrics ===')
for key, value in metrics['test_metrics'].items():
    if isinstance(value, float):
        print(f'  {key:25s} : {value:.4f}')

---
## 3. Detailed evaluation

In [None]:
from src.training.evaluate import evaluate

report = evaluate(
    model_path='../data/models/fake_news_detector',
    dataset_name='liar',
)

In [None]:
# Per-class metrics
per_class = pd.DataFrame(report['per_class']).T
per_class
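
For reference, per-class precision, recall, and F1 follow directly from a confusion matrix. This is a minimal NumPy sketch, assuming rows are true labels and columns are predictions. The matrix values and the label order are made up for illustration, and `per_class_metrics` is not part of `src.training.evaluate`.

```python
import numpy as np

def per_class_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)            # correct predictions for each class
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurs
    # Guard against division by zero for classes never predicted or never present.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy confusion matrix (illustrative values, assumed label order)
toy_cm = [[50, 10, 5],
          [8, 30, 12],
          [2, 10, 40]]
precision, recall, f1 = per_class_metrics(toy_cm)
macro_f1 = f1.mean()
for name, p, r, f in zip(['fiable', 'douteux', 'fake'], precision, recall, f1):
    print(f'{name:>8s}  P={p:.2f}  R={r:.2f}  F1={f:.2f}')
print(f'macro F1 = {macro_f1:.2f}')
```

Macro F1 averages the per-class scores with equal weight, which makes it more sensitive to a weak minority class than accuracy is.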

In [None]:
# Display the generated confusion matrix
from IPython.display import Image
Image('../data/models/fake_news_detector/confusion_matrix.png')

In [None]:
# Display the confidence distribution
Image('../data/models/fake_news_detector/confidence_distribution.png')

---
## 4. Comparison before/after fine-tuning

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

from src.models.fake_news_detector import FakeNewsDetector

# Load the fine-tuned model (loaded automatically from data/models/)
detector_ft = FakeNewsDetector()
print(f'Fine-tuned model: {detector_ft.is_finetuned}')

# Load the base model for comparison. __new__ bypasses __init__, which would
# otherwise load the fine-tuned weights; the attributes are then set by hand.
detector_base = FakeNewsDetector.__new__(FakeNewsDetector)
detector_base.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
detector_base.tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# The 3-class classification head is randomly initialized here, not pretrained.
detector_base.model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-base', num_labels=3
).to(detector_base.device)
detector_base.model.eval()
detector_base.is_finetuned = False
print('Base model loaded for comparison')

In [None]:
# Compare both models on a few examples
test_texts = [
    "The president announced new economic measures at the press conference.",
    "BREAKING: Aliens have landed in Paris, the government is hiding the truth!!!",
    "According to WHO, the vaccine is safe and effective after phase 3 clinical trials.",
    "They're hiding everything! The elites control the world with 5G, wake up!",
    "The football match ended with a score of 2-1.",
    "Scientists have discovered that drinking bleach cures all diseases.",
    "The unemployment rate decreased by 0.5% this quarter according to official statistics.",
]

# Prediction columns are 14 characters wide so that 'douteux (100%)' still fits
print(f'{"Text":<60s} | {"Base":>14s} | {"Fine-tuned":>14s}')
print('-' * 94)

for text in test_texts:
    pred_base = detector_base.predict(text)
    pred_ft = detector_ft.predict(text)

    base_str = f'{pred_base["label"]} ({pred_base["confidence"]:.0%})'
    ft_str = f'{pred_ft["label"]} ({pred_ft["confidence"]:.0%})'

    print(f'{text[:58]:<60s} | {base_str:>14s} | {ft_str:>14s}')
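
The `confidence` values printed above are presumably the highest softmax probability over the three class logits. That is an assumption about `FakeNewsDetector.predict`, whose implementation is not shown in this notebook. Below is a minimal NumPy sketch of that computation, with hypothetical logits and an assumed label order:

```python
import numpy as np

LABELS = ['fiable', 'douteux', 'fake']  # assumed order, for illustration only

def logits_to_prediction(logits):
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    best = int(probs.argmax())
    return {'label': LABELS[best], 'confidence': float(probs[best])}

# Near-equal logits, as an untrained classification head tends to produce
untrained = logits_to_prediction([0.05, -0.02, 0.01])
# Well-separated logits, as a trained head can produce
trained = logits_to_prediction([-1.2, 0.3, 3.1])
print(untrained)
print(trained)
```

With three classes, a confidence close to 1/3 means the model has no real preference, which is the pattern to expect from the base model in the comparison table.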

---
## 5. Testing on real Bluesky posts

In [None]:
import os
from dotenv import load_dotenv
from src.collector.bluesky_client import BlueskyCollector
from src.preprocessing.text_processor import preprocess_batch

load_dotenv('../.env')

collector = BlueskyCollector(os.getenv('BLUESKY_HANDLE'), os.getenv('BLUESKY_PASSWORD'))
raw_posts = collector.search_posts('fake news', lang='fr', limit=20)
processed = preprocess_batch(raw_posts)

print(f'{len(processed)} posts collected and preprocessed')

In [None]:
# Classify with the fine-tuned model
for post in processed[:10]:
    pred = detector_ft.predict(post['clean_text'])
    emoji = {'fiable': '✅', 'douteux': '⚠️', 'fake': '❌'}[pred['label']]
    print(f'{emoji} [{pred["label"]:>7s}] ({pred["confidence"]:.0%}) @{post["author_handle"]}')
    print(f'   {post["text"][:100]}')
    print()

---
## Conclusion

Fine-tuning significantly improves the model's performance:
- The base model assigns near-random scores, because its classification head was never trained for this task
- The fine-tuned model separates the 3 classes, with the F1-scores reported in section 3

**Possible improvements:**
- Use `camembert-base` for better performance on French text
- Extend the dataset with annotated French data
- Combine LIAR + Kaggle for more diversity
- Experiment with lower learning rates (1e-5) and more epochs
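
The last suggestion can be organized as a small grid of runs. This is a minimal sketch that only builds the configurations: the epoch count of 5 and the output directory naming are illustrative assumptions, and the call to `train` is left commented out because each run is expensive.

```python
from itertools import product

learning_rates = [2e-5, 1e-5]  # current value and the lower one suggested above
epoch_counts = [3, 5]          # 5 is an arbitrary example of "more epochs"

configs = [
    {
        'model_name': 'roberta-base',
        'dataset_name': 'liar',
        'learning_rate': lr,
        'epochs': ep,
        # One directory per run so that results do not overwrite each other
        'output_dir': f'../data/models/fake_news_detector_lr{lr:g}_ep{ep}',
    }
    for lr, ep in product(learning_rates, epoch_counts)
]

for cfg in configs:
    print(cfg['output_dir'])
    # metrics = train(**cfg, batch_size=16, max_length=256)  # uncomment to launch
```

Comparing runs on the validation split rather than the test split keeps the final test metrics unbiased.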