# Introduction

In this notebook, we implement a hybrid approach to fake news detection built on the BERT model. The approach has two main stages:

1. **General pre-training:** We first train a model on a large fake news corpus so that it learns robust linguistic representations and captures features common to the whole dataset.

2. **Specialized fine-tuning:** We then fine-tune this model on topic-specific sub-datasets (for example politics, COVID, and entertainment) to adapt it to the particularities of each domain.

This strategy combines the generalization ability of pre-trained models with targeted expertise for each topic, with the goal of improving the overall performance of our fake news detection system.


---


**Data used**

*   General base dataset:

  https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

*   Political data for fine-tuning:

  https://github.com/KaiDMML/FakeNewsNet/blob/master/dataset/politifact_fake.csv
  https://github.com/KaiDMML/FakeNewsNet/blob/master/dataset/politifact_real.csv

*   Entertainment data for fine-tuning:

  https://github.com/KaiDMML/FakeNewsNet/blob/master/dataset/gossipcop_fake.csv
  https://github.com/KaiDMML/FakeNewsNet/blob/master/dataset/gossipcop_real.csv

*   COVID data for fine-tuning:

  Clean_Constraint_English
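
The FakeNewsNet links above point to GitHub's HTML file viewer rather than to the CSV files themselves. Here is a minimal sketch, assuming the files will later be read directly by URL (for example with `pd.read_csv`). The helper `to_raw_url` is a hypothetical name that does not appear in the notebook; it rewrites a `github.com/.../blob/...` link into its `raw.githubusercontent.com` equivalent.

```python
def to_raw_url(blob_url: str) -> str:
    """Rewrite a GitHub 'blob' page URL into the matching raw-file URL.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    # "owner/repo/blob/branch/path" -> ("owner/repo", "branch/path")
    repo, file_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{repo}/{file_path}"


politifact_fake_url = to_raw_url(
    "https://github.com/KaiDMML/FakeNewsNet/blob/master/dataset/politifact_fake.csv"
)
print(politifact_fake_url)
```

The resulting URL can then be passed to a CSV reader in place of the page link.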


# Step 1: Preparing the general data

In [2]:
import sklearn

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("clmentbisaillon/fake-and-real-news-dataset")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/clmentbisaillon/fake-and-real-news-dataset/versions/1


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the general dataset
fake_df = pd.read_csv(path + "/Fake.csv")
real_df = pd.read_csv(path + "/True.csv")

fake_df.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [5]:
print("fake_df: ", fake_df.shape)
print("real_df: ",  real_df.shape)

fake_df:  (23481, 4)
real_df:  (21417, 4)


In [6]:
fake_df['label'] = 'fake'
real_df['label'] = 'real'

# Concatenate the two dataframes
df = pd.concat([fake_df, real_df], ignore_index=True)
df

| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | fake |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | fake |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | fake |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | fake |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | fake |
| ... | ... | ... | ... | ... | ... |
| 44893 | 'Fully committed' NATO backs new U.S. approach... | BRUSSELS (Reuters) - NATO allies on Tuesday we... | worldnews | August 22, 2017 | real |
| 44894 | LexisNexis withdrew two products from Chinese ... | LONDON (Reuters) - LexisNexis, a provider of l... | worldnews | August 22, 2017 | real |
| 44895 | Minsk cultural hub becomes haven from authorities | MINSK (Reuters) - In the shadow of disused Sov... | worldnews | August 22, 2017 | real |
| 44896 | Vatican upbeat on possibility of Pope Francis ... | MOSCOW (Reuters) - Vatican Secretary of State ... | worldnews | August 22, 2017 | real |


In [7]:
# Shuffle the rows and rebuild a clean 0..n-1 index (drop takes a boolean, not the string 'True')
df = sklearn.utils.shuffle(df).reset_index(drop=True)
print(df.shape)
print(df.head())
print(df.tail())

(44898, 5)
                                               title  \
0  Georgian policeman, three terrorism suspects k...   
1  Two Florida nuclear plants likely to shut if I...   
2  FAST AND FURIOUS Hearing Rips Obama and Holder...   
3  France unveils labor reforms in first step to ...   
4  A LEGEND IN HIS OWN MIND? Joe Biden Has Regret...   

                                                text          subject  \
0  TBILISI (Reuters) - One Georgian special force...        worldnews   
1  WASHINGTON (Reuters) - Energy firm Florida Pow...        worldnews   
2  Members of a congressional committee ripped Ob...  Government News   
3  PARIS (Reuters) - French President Emmanuel Ma...        worldnews   
4  Seriously? VP Biden wants us all to know he re...         politics   

                 date label  
0  November 22, 2017   real  
1  September 6, 2017   real  
2         Jun 9, 2017  fake  
3    August 31, 2017   real  
4        May 11, 2016  fake  
                                 

In [8]:
# Class proportions (in %)
print(df['label'].value_counts(normalize=True) * 100)

label
fake    52.298543
real    47.701457
Name: proportion, dtype: float64


In [9]:
df = df.drop(['title', 'subject', 'date'], axis=1)
df.head()

| | text | label |
|---|---|---|
| 0 | TBILISI (Reuters) - One Georgian special force... | real |
| 1 | WASHINGTON (Reuters) - Energy firm Florida Pow... | real |
| 2 | Members of a congressional committee ripped Ob... | fake |
| 3 | PARIS (Reuters) - French President Emmanuel Ma... | real |
| 4 | Seriously? VP Biden wants us all to know he re... | fake |


In [10]:
# Encode the labels: fake -> 0, anything else (here, real) -> 1
df['label'] = df['label'].apply(lambda x: 0 if x.lower() == "fake" else 1)
df.head()

| | text | label |
|---|---|---|
| 0 | TBILISI (Reuters) - One Georgian special force... | 1 |
| 1 | WASHINGTON (Reuters) - Energy firm Florida Pow... | 1 |
| 2 | Members of a congressional committee ripped Ob... | 0 |
| 3 | PARIS (Reuters) - French President Emmanuel Ma... | 1 |
| 4 | Seriously? VP Biden wants us all to know he re... | 0 |
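
The encoding above fixes the convention fake = 0 and real = 1, and any later report that lists class names has to follow that id order. As a hedged sketch, the same mapping can be written as an explicit dictionary so that the reverse lookup and the ordered class names are derived from one place. The names `LABEL2ID`, `ID2LABEL`, `encode_label` and `TARGET_NAMES` are illustrative and not part of the notebook. Unlike the lambda in the cell, this version raises a `KeyError` on an unexpected label instead of silently mapping it to 1.

```python
# Single source of truth for the label convention (illustrative names).
LABEL2ID = {"fake": 0, "real": 1}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def encode_label(label: str) -> int:
    """Map a textual label to its integer id; unknown labels raise KeyError."""
    return LABEL2ID[label.strip().lower()]


# Class names sorted by id, suitable for reports that expect id order.
TARGET_NAMES = [ID2LABEL[i].capitalize() for i in sorted(ID2LABEL)]

print(encode_label("Fake"), encode_label("real"))  # 0 1
print(TARGET_NAMES)                                # ['Fake', 'Real']
```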


In [11]:
from sklearn.model_selection import train_test_split

# 1. 80% for training and 20% for a temporary hold-out set
train_data, temp_data = train_test_split(df, test_size=0.2, random_state=42)

# 2. Split the hold-out set in half: 10% of the total for test, 10% for validation
test_data, val_data = train_test_split(temp_data, test_size=0.5, random_state=42)
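
With `test_size=0.2` followed by `test_size=0.5`, the two calls produce an 80/10/10 split. The short calculation below estimates the resulting row counts for the 44,898 rows reported earlier. It is a sketch that assumes scikit-learn rounds the second subset up (ceiling of `test_size * n_samples`) when `test_size` is a float; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (first_subset, second_subset) sizes for a fractional split.

    Assumption: the second subset gets ceil(test_size * n_samples) rows.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_train, n_temp = split_sizes(44898, 0.2)  # train vs. temporary hold-out
n_test, n_val = split_sizes(n_temp, 0.5)   # hold-out halved into test and validation

print(n_train, n_test, n_val)  # 35918 4490 4490
```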


In [12]:
import gc
del df
gc.collect()

62

# Step 2: Pre-training on the general dataset

In [13]:
# Load the model and the tokenizer
# We use the BERT model from the HuggingFace Transformers library

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Helper for ad-hoc tokenization that returns PyTorch tensors (not used below)
def tokenize_texts(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Tokenize the texts: truncate to the model's maximum length and pad to the longest sequence of each split
train_encodings = tokenizer(list(train_data['text']), truncation=True, padding=True)
test_encodings = tokenizer(list(test_data['text']), truncation=True, padding=True)
val_encodings = tokenizer(list(val_data['text']), truncation=True, padding=True)

# Custom Dataset that pairs the encodings with their labels
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encodings, list(train_data['label']))
test_dataset = NewsDataset(test_encodings, list(test_data['label']))
val_dataset = NewsDataset(val_encodings, list(val_data['label']))
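
Because `padding=True` is applied to a whole split at once, every text is padded to the longest sequence in that split, capped at 512 tokens for `bert-base-uncased`. The sketch below uses made-up token lengths to compare how many token slots are stored with split-wide padding versus padding each batch separately (often called dynamic padding; in Transformers this is usually done with `DataCollatorWithPadding`). Both function names and the sample lengths are illustrative assumptions, not measurements from this dataset.

```python
def padded_slots_global(lengths, max_length=512):
    """Token slots when every sequence is padded to the longest one overall."""
    capped = [min(n, max_length) for n in lengths]  # emulate truncation
    return len(capped) * max(capped)


def padded_slots_per_batch(lengths, batch_size=8, max_length=512):
    """Token slots when each batch is padded only to its own longest sequence."""
    capped = [min(n, max_length) for n in lengths]
    total = 0
    for start in range(0, len(capped), batch_size):
        batch = capped[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for 16 articles (one very long article in the first batch).
lengths = [40, 55, 60, 70, 90, 120, 300, 900, 35, 45, 50, 65, 80, 85, 95, 100]

print(padded_slots_global(lengths))     # 8192 (16 sequences * 512)
print(padded_slots_per_batch(lengths))  # 4896 (8 * 512 + 8 * 100)
```

When a split contains a few very long articles, per-batch padding can reduce memory use and compute, at the cost of moving the padding step into the data collator.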


In [15]:
import torch

del train_data, test_data, val_data, train_encodings, test_encodings, val_encodings
gc.collect()

torch.cuda.empty_cache()

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    # Note: recent transformers releases rename this argument to `eval_strategy`
    evaluation_strategy="steps",
    logging_steps=50,
    eval_steps=100,
    save_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33maliyahadjibade[0m ([33maliyahadjibade-universit-d-abomey-calavi[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


| Step | Training Loss | Validation Loss |
|---|---|---|
| 100 | 0.0299 | 0.001508 |
| 200 | 0.0157 | 0.008284 |
| 300 | 0.0001 | 0.003534 |
| 400 | 0.0001 | 0.003824 |
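
To put the logged steps in context, the following sketch estimates the length of the full schedule implied by `num_train_epochs=3` and `per_device_train_batch_size=8`. It assumes 35,918 training examples (80% of 44,898 rows), a single device, and no gradient accumulation. None of these is confirmed by the notebook output, and `training_steps` is a hypothetical helper.

```python
import math


def training_steps(n_train: int, batch_size: int, epochs: int) -> tuple[int, int]:
    """Return (steps per epoch, total optimizer steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = training_steps(n_train=35918, batch_size=8, epochs=3)

print(steps_per_epoch, total_steps)       # 4490 13470
print(total_steps // 100)                 # 134 evaluations/checkpoints at a 100-step interval
print(round(100 * 400 / total_steps, 1))  # 3.0 (% of the schedule reached at step 400)
```

Under these assumptions, the four logged rows cover roughly the first 3% of training, and saving every 100 steps would write well over a hundred checkpoints unless a limit is configured.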


In [None]:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Evaluation
eval_results = trainer.evaluate(eval_dataset=val_dataset)
print("Evaluation results :", eval_results)


In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

def evaluate_model(trainer, dataset, target_names=["Fake", "Real"], plot_confusion=True):
    """
    Evaluate a model through a HuggingFace Trainer on the given dataset.

    Parameters:
      - trainer: the Trainer instance used to train the model.
      - dataset: the dataset to evaluate (e.g. val_dataset or test_dataset).
      - target_names: class names in label-id order (default ["Fake", "Real"],
        matching the encoding fake=0, real=1).
      - plot_confusion: whether to plot the confusion matrix.

    Prints:
      - Overall metrics (accuracy, and macro-averaged precision, recall, F1-score).
      - The detailed per-class classification report.
      - The confusion matrix.
    """
    # Generate predictions for the dataset
    predictions = trainer.predict(dataset)
    preds = np.argmax(predictions.predictions, axis=1)
    labels = predictions.label_ids

    # Compute the overall metrics
    accuracy = accuracy_score(labels, preds)
    precision_macro = precision_score(labels, preds, average='macro')
    recall_macro = recall_score(labels, preds, average='macro')
    f1_macro = f1_score(labels, preds, average='macro')

    print("=== Overall Results ===")
    print(f"Accuracy : {accuracy:.4f}")
    print(f"Precision (macro) : {precision_macro:.4f}")
    print(f"Recall (macro) : {recall_macro:.4f}")
    print(f"F1-score (macro) : {f1_macro:.4f}\n")

    # Detailed per-class report
    print("=== Classification Report ===")
    print(classification_report(labels, preds, target_names=target_names))

    # Compute and display the confusion matrix (rows: true labels, columns: predictions)
    cm = confusion_matrix(labels, preds)
    print("=== Confusion Matrix ===")
    print(cm)

    if plot_confusion:
        plt.figure(figsize=(6,4))
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=target_names, yticklabels=target_names)
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.title("Confusion Matrix")
        plt.show()
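
The macro-averaged scores printed by `evaluate_model` give each class equal weight, regardless of how many samples it has. As an illustration, the sketch below recomputes accuracy and the macro precision, recall and F1 from a confusion matrix in plain Python. It assumes the scikit-learn layout (rows are true classes, columns are predictions), and the counts are made up rather than taken from this model. `macro_metrics` is a hypothetical helper.

```python
def macro_metrics(cm):
    """Compute accuracy and macro-averaged precision/recall/F1.

    cm[i][j] = number of samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    total = sum(sum(row) for row in cm)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Made-up counts: class 0 = fake, class 1 = real.
example_cm = [[90, 10],
              [20, 80]]
for name, value in macro_metrics(example_cm).items():
    print(f"{name}: {value:.4f}")
```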


In [None]:
evaluate_model(trainer, val_dataset)

In [None]:
evaluate_model(trainer, test_dataset)