# **A Multi-Label Dataset of French Fake News**

This NLP project was carried out as part of the *"Machine Learning for NLP"* course in the second year of the master's program at ENSAE. It is based on the paper [**A Multi-Label Dataset of French Fake News: Human and Machine Insights**](https://arxiv.org/abs/2403.16099) by Icard et al. (2024), and uses the associated GitHub repository [**OBSINFOX**](https://github.com/obs-info/obsinfox), which provides both the dataset used in this study and its accompanying documentation.

---

### Dataset Overview — OBSINFOX

* **Labels**: 11 distinct labels annotated for each document, with detailed definitions provided in the paper and repository README.
* **Metadata**: title, annotator ID, article URL.
* **Sources**: The dataset includes articles from **17 French media sources** identified as unreliable by watchdog organizations such as *NewsGuard* and *Conspiracy Watch*.

The dataset consists of **100 documents** carefully selected from the aforementioned sources using a specific methodology, which will be explained later in the notebook.

---

### Why This Dataset Matters for Fake News Detection

This dataset provides a rich foundation for studying how linguistic cues can signal that a text is **counterfactual**, **subjective**, or **satirical**. Several key aspects make it particularly valuable:

* Each document is annotated along **11 complementary dimensions**, encompassing linguistic, stylistic, and factual properties.
* Annotations were conducted by a panel of **8 expert annotators**, ensuring high-quality and nuanced evaluation.
* The multi-label structure enables **multi-dimensional analysis** of fake news, capturing not only factual inaccuracies but also subtleties of tone, exaggeration, and bias.

As such, OBSINFOX serves as a powerful resource for training and evaluating models capable of detecting weak signals of misinformation and editorial bias in French-language news content.

---

### *Importing Required Packages*

In [5]:
import pandas as pd
import numpy as np
import requests

### **Preprocessing the OBSINFOX Dataset**

In this section, we begin by importing the OBSINFOX dataset and applying an initial round of preprocessing to prepare the data for analysis and modeling.

The preprocessing pipeline will include the following steps:

1. **Loading the dataset**: Reading the structured data and metadata from the OBSINFOX repository.

2. **Text cleaning**: Removing unnecessary whitespace, special characters, and correcting encoding issues if present.

3. **Normalization**: Lowercasing text and standardizing punctuation to reduce vocabulary size.

4. **Exploratory filtering**: Ensuring the dataset is consistent and complete by removing empty or malformed entries.


This stage is essential to ensure the quality and consistency of the textual data before moving on to any linguistic analysis or model training.
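Steps 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the helper `clean_text` and the toy `titles` frame are hypothetical, and the normalization rules chosen here (NFKC Unicode normalization, straight quotes, lowercasing) are one plausible choice among several.

```python
import re
import unicodedata

import pandas as pd


def clean_text(text: str) -> str:
    """Hypothetical helper: normalize Unicode, unify quotes, collapse whitespace, lowercase."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces) into plain forms
    text = unicodedata.normalize("NFKC", text)
    # Standardize curly apostrophes and quotes
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # Collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


# Toy data standing in for a text column (not taken from OBSINFOX)
titles = pd.DataFrame(
    {"Title": ["  L\u2019Économie   mondiale ", "", None, "Confinement\u00a0à Brest"]}
)

# Exploratory filtering: drop missing entries, clean the rest, then drop empty strings
titles = titles.dropna(subset=["Title"]).copy()
titles["clean_title"] = titles["Title"].map(clean_text)
titles = titles[titles["clean_title"] != ""].reset_index(drop=True)

print(titles["clean_title"].tolist())
```

Lowercasing reduces vocabulary size but also discards capitalization cues, so whether to apply it depends on the downstream model.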

In [3]:
df = pd.read_csv('obsinfox.csv')
df.head()

| | URL | Title | Fake News | Places, Dates, People | Facts | Opinions | Subjective | Reported information | Sources Cited | False Information | Insinuation | Exaggeration | Offbeat Title | Annotator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://lesakerfrancophone.fr/la-relation-entr... | La relation entre la technologie et la religion | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | rater1 |
| 1 | https://www.breizh-info.com/2021/01/27/157958/... | Confinement. Les habitants de Brest, Morlaix e... | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | rater1 |
| 2 | https://reseauinternational.net/la-chine-le-pr... | La Chine : Le premier marché mondial de Smartp... | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | rater1 |
| 3 | https://lezarceleurs.blogspot.com/2021/12/emma... | Emmanuel à Olivier : « Tiens bon, on les aura ... | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | rater1 |
| 4 | https://lesakerfrancophone.fr/selon-ubs-les-pr... | Selon UBS, les « propriétés d’assurance tant d... | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | rater1 |


#### *Metadata & Labels*

Each row of the dataset contains the annotations made by one annotator for one text. The dataset also includes metadata that gives context for each text: the title, the URL, and the anonymous ID of the annotator. The dataset therefore has 800 rows, corresponding to 100 texts each annotated by 8 people.
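The 100 × 8 structure can be checked directly. The sketch below uses a small toy frame that reuses the `URL` and `Annotator` column names (the URLs and rows are made up); on the real `df`, the same calls would be expected to report 100 articles with 8 annotators each, assuming every article has its own URL.

```python
import pandas as pd

# Toy frame: 3 made-up articles, each annotated by the same 2 raters
toy = pd.DataFrame({
    "URL": ["https://example.org/a", "https://example.org/b", "https://example.org/c"] * 2,
    "Annotator": ["rater1"] * 3 + ["rater2"] * 3,
})

n_articles = toy["URL"].nunique()
n_annotators = toy["Annotator"].nunique()
# Distinct annotators per article; a complete design has the same count everywhere
raters_per_article = toy.groupby("URL")["Annotator"].nunique()

print(f"{n_articles} articles x {n_annotators} annotators = {len(toy)} rows")
print("Complete design:", bool((raters_per_article == n_annotators).all()))
```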

The OBSINFOX dataset includes eleven distinct labels that capture various **factual**, **stylistic**, and **interpretative** features of news articles. These labels were designed to reflect both the objective content and the subjective framing often present in fake or misleading news. Below is a description of each label:


* **Fake News**
  The article contains at least one false or exaggerated fact.

* **Places, Dates, People**
  The article refers to at least one identifiable place, date, or person.

* **Facts**
  The article reports at least one factual element — that is, a state of affairs or event, whether true or false.

* **Opinions**
  The article expresses at least one opinion, judgment, or personal interpretation.

* **Subjective**
  The article contains more opinions than verifiable facts, highlighting a subjective tone or perspective.

* **Reported Information**
  The article relays information that is attributed to an external source and is not directly endorsed by the author.

* **Sources Cited**
  The article includes at least one cited source that supports or contextualizes a factual claim.

* **False Information**
  The article contains information that is demonstrably false or factually incorrect.

* **Insinuation**
  The article implies or suggests a certain interpretation without stating it explicitly.

* **Exaggeration**
  The article presents a real fact using language or framing that amplifies or distorts its significance.

* **Offbeat Title**
  The article has a misleading or sensational headline that does not accurately reflect the actual content.

These labels aim to offer a nuanced characterization of the articles' content. Some of them are **objectively measurable** (e.g., *Places, Dates, People*, *Facts*, *Sources Cited*), while others are more **subject to interpretation**, reflecting the annotators’ perception and judgment.

Studying the **co-occurrence patterns** among these labels provides valuable insight into how alternative or misleading news content is structured and perceived by human readers.
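One way to look at such co-occurrence patterns is a label-by-label count matrix. The sketch below is illustrative only: the toy values are made up, and only the three column names are borrowed from OBSINFOX.

```python
import numpy as np
import pandas as pd

# Toy binary annotations reusing three OBSINFOX label names (values are made up)
toy = pd.DataFrame({
    "Fake News":     [1, 1, 0, 0, 1],
    "Exaggeration":  [1, 0, 0, 1, 1],
    "Sources Cited": [0, 0, 1, 1, 0],
})

# For 0/1 columns, X.T @ X counts the rows where both labels are set;
# the diagonal holds each label's own frequency
cooccurrence = toy.T.dot(toy)

# Dividing each row by its diagonal entry gives P(column label | row label)
diagonal = pd.Series(np.diag(cooccurrence.to_numpy()), index=cooccurrence.index)
conditional = cooccurrence.div(diagonal, axis=0)

print(cooccurrence)
print(conditional.round(2))
```

The conditional view is often easier to read than raw counts because it is not dominated by labels that are almost always set, such as *Facts*.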

In [11]:
print(" DataFrame Summary")
print("="*40)
print(f"Number of rows    : {df.shape[0]}")
print(f"Number of columns : {df.shape[1]}")
print("\n Column List: \n")

for i, col in enumerate(df.columns):
    print(f"{i+1:>2}. {col}")


 DataFrame Summary
Number of rows    : 800
Number of columns : 14

 Column List: 

 1. URL
 2. Title
 3. Fake News
 4. Places, Dates, People
 5. Facts
 6. Opinions
 7. Subjective
 8. Reported information
 9. Sources Cited
10. False Information
11. Insinuation
12. Exaggeration
13. Offbeat Title
14. Annotator


Because the dataset is small, its data are of high quality, with no missing values and no duplicated rows, which simplifies the cleaning step of the preprocessing.

In [None]:
df.isna().sum()

URL                      0
Title                    0
Fake News                0
Places, Dates, People    0
Facts                    0
Opinions                 0
Subjective               0
Reported information     0
Sources Cited            0
False Information        0
Insinuation              0
Exaggeration             0
Offbeat Title            0
Annotator                0
dtype: int64

In [18]:
df.duplicated().sum()

0

To make it easy to access all the annotations for a specific article, we add an ``article_id`` column to the dataframe that uniquely identifies each of the 100 texts.

In [None]:
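# Assumption: each article has its own URL, so the URL identifies the text;
# factorize numbers the distinct URLs 0, 1, 2, ... in order of first appearance
df['article_id'] = pd.factorize(df['URL'])[0]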

A few examples illustrate typical manipulations of the dataset:

- Averaging the annotations made by each annotator, for comparison (some annotators are more critical than others, for example)
- Comparing the annotations that a given article received from the different annotators

In [None]:
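# The 11 label columns sit between 'Title' and 'Annotator'
cols = df.columns[2:13]
# For 0/1 labels, the mean is the share of articles each annotator flagged
grouped_means = df.groupby('Annotator')[cols].mean()

print("Mean values by Annotator for selected columns")
display(grouped_means)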

Mean values by Annotator for selected columns


| Annotator | Fake News | Places, Dates, People | Facts | Opinions | Subjective | Reported information | Sources Cited | False Information | Insinuation | Exaggeration | Offbeat Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rater1 | 0.21 | 0.9 | 0.93 | 0.8 | 0.71 | 0.14 | 0.75 | 0.08 | 0.22 | 0.37 | 0.03 |
| rater2 | 0.29 | 0.94 | 0.99 | 0.93 | 0.74 | 0.21 | 0.82 | 0.09 | 0.24 | 0.44 | 0.21 |
| rater3 | 0.64 | 1.0 | 0.96 | 0.74 | 0.63 | 0.14 | 0.68 | 0.46 | 0.46 | 0.61 | 0.14 |
| rater4 | 0.34 | 1.0 | 1.0 | 0.87 | 0.52 | 0.22 | 0.87 | 0.32 | 0.54 | 0.42 | 0.16 |
| rater5 | 0.47 | 0.93 | 0.96 | 0.66 | 0.67 | 0.76 | 0.76 | 0.22 | 0.73 | 0.53 | 0.17 |
| rater6 | 0.34 | 0.75 | 0.93 | 0.72 | 0.6 | 0.21 | 0.6 | 0.11 | 0.42 | 0.35 | 0.04 |
| rater7 | 0.29 | 1.0 | 0.91 | 0.45 | 0.47 | 0.66 | 0.78 | 0.24 | 0.43 | 0.35 | 0.01 |
| rater8 | 0.44 | 1.0 | 0.99 | 0.66 | 0.48 | 0.11 | 0.46 | 0.34 | 0.59 | 0.52 | 0.08 |


In [None]:
import matplotlib.pyplot as plt

# Transpose so that labels are on the x-axis, with one bar per annotator
grouped_means.T.plot(kind='bar', figsize=(12, 6))

plt.title("Mean Annotation Scores by Annotator")
plt.xlabel("Annotation Labels")
plt.ylabel("Mean Score")
plt.xticks(rotation=45)
plt.legend(title="Annotator")
plt.tight_layout()
plt.show()
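
The second manipulation listed earlier, comparing the annotations that a single article received, could look like the sketch below. The toy rows and `article_id` values are made up, and the strict majority vote is an illustrative aggregation choice rather than a rule taken from the paper.

```python
import pandas as pd

# Toy long-format annotations: 2 articles x 3 raters (values are made up)
toy = pd.DataFrame({
    "article_id":   [0, 0, 0, 1, 1, 1],
    "Annotator":    ["rater1", "rater2", "rater3"] * 2,
    "Fake News":    [1, 1, 0, 0, 0, 0],
    "Exaggeration": [1, 0, 0, 0, 1, 1],
})
label_cols = ["Fake News", "Exaggeration"]

# All annotations for article 0, one row per rater
article_0 = toy.loc[toy["article_id"] == 0].set_index("Annotator")[label_cols]
print(article_0)

# Share of raters who set each label, per article
share = toy.groupby("article_id")[label_cols].mean()
# Strict majority vote: keep a label when more than half of the raters set it
majority = (share > 0.5).astype(int)
print(majority)
```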

--- 

### **Descriptive Statistics**

To get familiar with the data before going further into the analysis, we compute a series of statistics that describe the dataset.

### *Human Annotations*

We now turn our attention to evaluating the quality and structure of the human annotations in the OBSINFOX dataset.

Specifically, we aim to:

* Assess the **consistency of the annotations** with expectations based on the dataset’s construction methodology,
* Examine the **class balance** across the different labels,
* Measure the **inter-annotator agreement**,
* Analyze the **correlation between labels** to better understand their relationships.
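
As a rough sketch of the class-balance and label-correlation checks, the snippet below works on a toy frame that borrows three OBSINFOX column names (the values are made up). On the real data the same two calls would be applied to the 11 label columns.

```python
import pandas as pd

# Toy binary annotations (values are made up)
toy = pd.DataFrame({
    "Fake News":         [1, 1, 0, 0, 1, 0],
    "False Information": [1, 1, 0, 0, 0, 0],
    "Sources Cited":     [0, 0, 1, 1, 0, 1],
})

# Class balance: for 0/1 labels, the mean is the share of positive annotations
positive_rate = toy.mean().sort_values(ascending=False)
print(positive_rate.round(2))

# Pearson correlation on binary columns equals the phi coefficient
label_corr = toy.corr()
print(label_corr.round(2))
```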

This analysis is a crucial step in our study, as it allows us to evaluate the **reliability and informativeness** of the labels provided. Understanding how human annotators identify fake news — and which linguistic or stylistic cues they rely on — will guide the rest of our project.

In particular, this insight will help us identify:

* Which labels are the most **discriminative** or **informative** for fake news detection,
* How closely machine learning models can replicate or differ from **human reasoning**.

Later in the project, we will compare these human-driven patterns with the predictions and internal logic of automated models, bridging the gap between human and machine interpretations of misinformation.
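
Inter-annotator agreement on a single binary label can be quantified in several ways. The sketch below implements Fleiss' kappa by hand as one common option; it is not assumed to be the statistic used in the paper, and the helper name `fleiss_kappa_binary` and the toy annotations are hypothetical.

```python
import numpy as np
import pandas as pd


def fleiss_kappa_binary(ratings: pd.DataFrame) -> float:
    """Fleiss' kappa for one binary label; rows are articles, columns are raters."""
    n_raters = ratings.shape[1]
    positives = ratings.sum(axis=1).to_numpy()
    # Per-article counts for the two categories (0 and 1)
    counts = np.column_stack([n_raters - positives, positives])
    # Observed agreement per article, then averaged over articles
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the overall category proportions
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.sum(p_j ** 2)
    return float((p_bar - p_e) / (1 - p_e))


# Toy long-format annotations (values are made up): 4 articles x 3 raters
toy = pd.DataFrame({
    "article_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Annotator": ["rater1", "rater2", "rater3"] * 4,
    "Fake News": [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Wide table: one row per article, one column per rater
wide = toy.pivot(index="article_id", columns="Annotator", values="Fake News")
kappa = fleiss_kappa_binary(wide)
print(f"Fleiss' kappa for 'Fake News': {kappa:.3f}")
```

The function is undefined when every rating falls in the same category (chance agreement equals 1), so labels that are set on nearly all articles need separate handling.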

---
### **Topic & Genre Analysis of OBSINFOX**

Go to the GATE Cloud website to generate a free API key, then run the topic and genre analysis.

In [6]:
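from pathlib import Path

api_key = "YOUR_GATE_CLOUD_API_KEY"
# Placeholder endpoint: replace the last path segment with the service to call
url = "https://cloud.gate.ac.uk/process-document/gatecloud-service-name"

headers = {
    "Accept": "application/json",
    # document.txt is plain text, so declare it as such
    "Content-Type": "text/plain",
    "GATECloud-API-Key": api_key
}

# The document must exist locally; otherwise a FileNotFoundError is raised
document = Path("document.txt").read_bytes()

# Send the raw text as the request body so that it matches the Content-Type header
response = requests.post(url, headers=headers, data=document)
response.raise_for_status()
result = response.json()

print(result)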


FileNotFoundError: [Errno 2] No such file or directory: 'document.txt'

---
### **Understanding Human vs Machine Characterization of Fake News**

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "VAGOsolutions/SauerkrautLM-7b-HerO"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ValueError: Converting from SentencePiece and Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast converters: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Les dernières nouvelles sur l'économie mondiale indiquent que"
result = generator(prompt, max_length=100, do_sample=True, temperature=0.7)

print(result[0]['generated_text'])