# Fake News Detection - Phase 2: Data Preprocessing

---

### Objectives of this notebook

This notebook implements the preprocessing steps identified during data exploration. It follows the methodology described in the reference article (Roumeliotis et al., 2025):

1. Data cleaning (missing values, duplicates)
2. Merging the title and text columns
3. Text normalization
4. Tokenization
5. Stopword removal
6. Lemmatization
7. Limiting text length (part of the reference methodology; no truncation is applied in the cells of this notebook)
8. Train/validation/test split

## 1. Environment setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re
import string
from pathlib import Path
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings('ignore')

# Download the NLTK resources (tokenizer models, stopword list, WordNet)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

True

In [2]:
# Paths
raw = Path('../data/raw')
processed = Path('../data/processed')
models = Path('../models')

# Create the data folders if needed
for path in [raw, processed]:
    path.mkdir(parents=True, exist_ok=True)

## 2. Loading the data

We load the combined dataset `combined_raw`, which was created during exploration.

In [3]:
# Load the dataset
df = pd.read_csv(raw / "combined_raw.csv", sep=",")

print(f"Dataset loaded: {len(df)} articles")
print(f"Columns: {list(df.columns)}")
df.head(3)

Dataset loaded: 44898 articles
Columns: ['title', 'text', 'subject', 'date', 'Detection', 'detect_label', 'text_length_chars', 'text_length_words']


|  | title | text | subject | date | Detection | detect_label | text_length_chars | text_length_words |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Ben Stein Calls Out 9th Circuit Court: Committ... | 21st Century Wire says Ben Stein, reputable pr... | US_News | February 13, 2017 | 0 | Fake | 1028 | 171 |
| 1 | Trump drops Steve Bannon from National Securit... | WASHINGTON (Reuters) - U.S. President Donald T... | politicsNews | April 5, 2017 | 1 | True | 4820 | 771 |
| 2 | Puerto Rico expects U.S. to lift Jones Act shi... | (Reuters) - Puerto Rico Governor Ricardo Rosse... | politicsNews | September 27, 2017 | 1 | True | 1848 | 304 |


## 3. Data cleaning

### 3.1 Handling missing values

In [4]:
# State before cleaning: per column, count values that are empty once all whitespace is removed
print("Empty values before cleaning:")
print(df.apply(lambda x: x.astype(str).str.replace(r'\s+', '', regex=True).eq('').sum()))

initial_count = len(df)
print(f"Initial number of rows: {initial_count}")

Empty values before cleaning:
title                  0
text                 631
subject                0
date                   0
Detection              0
detect_label           0
text_length_chars      0
text_length_words      0
dtype: int64
Initial number of rows: 44898


In [5]:
# Drop the rows whose text is empty or contains only whitespace
df = df[~df['text'].astype(str).str.replace(r'\s+', '', regex=True).eq('')]

print(f"Articles removed: {initial_count - len(df)}")

Articles removed: 631
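
The check above only catches strings that are blank once whitespace is stripped. As a minimal sketch on hypothetical toy data (the `toy` frame and its values are assumptions, not part of the dataset), the same test can be combined with `isna()` so that genuinely missing values are caught as well:

```python
import pandas as pd

# Toy frame (hypothetical data): one blank text, one missing text, one valid text
toy = pd.DataFrame({"text": ["   ", None, "Some article body"]})

# Same whitespace test as in the notebook: blank once all whitespace is removed
blank = toy["text"].astype(str).str.replace(r"\s+", "", regex=True).eq("")

# A missing value never compares equal to "", so it needs its own check
missing = toy["text"].isna()

# Keep only the rows that are neither blank nor missing
cleaned = toy[~(blank | missing)]

print(blank.tolist())
print(missing.tolist())
print(len(cleaned))
```

In this dataset the whitespace test alone already accounts for the 631 removed rows, so the extra `isna()` mask is a safeguard rather than a required step.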


### 3.2 Removing duplicates

In [6]:
# Articles are considered duplicates when their text is identical; keep the first occurrence
initial_count = len(df)
df = df.drop_duplicates(subset=['text'], keep='first')

print(f"Duplicates removed: {initial_count - len(df)}")
print(f"Articles remaining: {len(df)}")

Duplicates removed: 5623
Articles remaining: 38644
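
To illustrate what `drop_duplicates(subset=['text'], keep='first')` does, here is a small sketch on hypothetical toy data (the `toy` frame is an assumption made for illustration). Rows are compared on `text` only, and the surviving rows keep their original index labels, which is why gaps appear in the index afterwards:

```python
import pandas as pd

# Toy frame (hypothetical data): rows 0 and 1 share the same text but not the same title
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "text": ["same body", "same body", "other body"],
})

# Only the 'text' column is compared; the first occurrence of each text is kept
deduped = toy.drop_duplicates(subset=["text"], keep="first")

print(deduped["title"].tolist())  # ['A', 'C']
print(deduped.index.tolist())     # [0, 2] (original labels are preserved)
```

Calling `reset_index(drop=True)` afterwards would give a contiguous index, if one is needed.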


## 4. Merging the title and text columns

Following the methodology of the article, we combine the title and the text to give the model the full context.

In [7]:
df['combined_text'] = df['title'].astype(str) + " " + df['text'].astype(str)
print("Columns 'title' and 'text' merged into 'combined_text'")

Columns 'title' and 'text' merged into 'combined_text'


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38644 entries, 0 to 44896
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   title              38644 non-null  object
 1   text               38644 non-null  object
 2   subject            38644 non-null  object
 3   date               38644 non-null  object
 4   Detection          38644 non-null  int64 
 5   detect_label       38644 non-null  object
 6   text_length_chars  38644 non-null  int64 
 7   text_length_words  38644 non-null  int64 
 8   combined_text      38644 non-null  object
dtypes: int64(3), object(6)
memory usage: 2.9+ MB


In [9]:
# Prepare the data for the interim split: merged raw text (before text preprocessing) and label
df_first = (
    df[['combined_text', 'Detection']]
    .rename(columns={'combined_text': 'text', 'Detection': 'label'})
    .copy()
)

# Add a sequential identifier as the first column
df_first.insert(0, 'id', range(1, len(df_first) + 1))


In [10]:
df_first.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38644 entries, 0 to 44896
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      38644 non-null  int64 
 1   text    38644 non-null  object
 2   label   38644 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.2+ MB


In [11]:
X = df_first[['id', 'text']]
y = df_first['label']

# First split: 80% train+val, 20% test
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Second split: 80% train, 20% validation (of the train+val portion)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.20, random_state=42, stratify=y_train_val
)

print("Dataset split:")
print(f"  Training set:   {len(X_train):,} samples ({len(X_train)/len(df_first)*100:.1f}%)")
print(f"  Validation set: {len(X_val):,} samples ({len(X_val)/len(df_first)*100:.1f}%)")
print(f"  Test set:       {len(X_test):,} samples ({len(X_test)/len(df_first)*100:.1f}%)")

Dataset split:
  Training set:   24,732 samples (64.0%)
  Validation set: 6,183 samples (16.0%)
  Test set:       7,729 samples (20.0%)


In [12]:
# Build the interim DataFrames (merged text, not yet preprocessed)
train_df = pd.DataFrame({
    'id': X_train['id'].values,
    'text': X_train['text'].values,
    'label': y_train.values
})

val_df = pd.DataFrame({
    'id': X_val['id'].values,
    'text': X_val['text'].values,
    'label': y_val.values
})

test_df = pd.DataFrame({
    'id': X_test['id'].values,
    'text': X_test['text'].values,
    'label': y_test.values
})

print("Datasets prepared:")
print(f"  train_df: {train_df.shape}")
print(f"  val_df:   {val_df.shape}")
print(f"  test_df:  {test_df.shape}")

Datasets prepared:
  train_df: (24732, 3)
  val_df:   (6183, 3)
  test_df:  (7729, 3)


In [13]:
# Create the output directory
import os
output_path = "../data/interim/"
os.makedirs(output_path, exist_ok=True)

# Save the CSV files
train_df.to_csv(output_path + "train.csv", index=False)
val_df.to_csv(output_path + "validation.csv", index=False)
test_df.to_csv(output_path + "test.csv", index=False)

# Save the full cleaned dataset (text preprocessing has not been applied yet)
df.to_csv(output_path + "full_interim.csv", index=False)

print("Files saved:")
print(f"  - {output_path}train.csv")
print(f"  - {output_path}validation.csv")
print(f"  - {output_path}test.csv")
print(f"  - {output_path}full_interim.csv")

Files saved:
  - ../data/interim/train.csv
  - ../data/interim/validation.csv
  - ../data/interim/test.csv
  - ../data/interim/full_interim.csv


## 5. Text preprocessing

### 5.1 Defining the preprocessing functions

In [14]:
# Initialize the NLP tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Lowercase the text and strip URLs, mentions, hashtags, punctuation and digits."""
    # Convert to lowercase
    text = str(text).lower()
    
    # Remove URLs ('http\S+' already covers 'https' links)
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    
    # Remove ASCII punctuation (string.punctuation does not cover curly quotes such as ’)
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove digits
    text = re.sub(r'\d+', '', text)
    
    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
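
As a quick illustration of the cleaning steps, the sketch below repeats the function in a self-contained form and applies it to two hypothetical strings (the sample inputs are assumptions chosen for illustration). It shows two side effects worth knowing: removing a hyphen joins the surrounding words, and a curly apostrophe survives because it is not part of `string.punctuation`.

```python
import re
import string

def clean_text(text):
    """Self-contained copy of the cleaning steps, for illustration only."""
    text = str(text).lower()                                          # lowercase
    text = re.sub(r'http\S+|www\S+', '', text)                        # URLs
    text = re.sub(r'@\w+|#\w+', '', text)                             # mentions, hashtags
    text = text.translate(str.maketrans('', '', string.punctuation))  # ASCII punctuation
    text = re.sub(r'\d+', '', text)                                   # digits
    return re.sub(r'\s+', ' ', text).strip()                          # extra whitespace

first = clean_text("BREAKING: visit https://example.com NOW!!! @user #fake 2017")
second = clean_text("Dem’s Sit-In")

print(first)   # breaking visit now
print(second)  # dem’s sitin
```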

In [15]:
def preprocess_text(text):
    """Clean, tokenize, filter and lemmatize a text; return the tokens joined by spaces."""
    # Initial cleaning
    text = clean_text(text)
    
    # Tokenization
    tokens = word_tokenize(text)
    
    # Filtering: drop stopwords and tokens shorter than 3 characters
    tokens = [token for token in tokens 
              if token not in stop_words and len(token) > 2]
    
    # Lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Rebuild the text
    return ' '.join(tokens)
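
The filtering step can be sketched without NLTK, using a tiny hypothetical stopword set (`STOP_WORDS` and `filter_tokens` are illustrative names, not part of the notebook). The token list assumes the tokenizer isolates the curly apostrophe as its own token, which is what the outputs in this notebook suggest; the length filter then discards both the apostrophe and the stray `s`.

```python
# Hypothetical, much smaller stand-in for NLTK's English stopword list
STOP_WORDS = {"the", "a", "to", "on", "in", "of"}

def filter_tokens(tokens, stop_words=STOP_WORDS, min_len=3):
    """Keep tokens that are not stopwords and have at least min_len characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

# Tokens as they might look after cleaning and tokenization (assumed for illustration)
tokens = "paul ryan responds to dem ’ s sitin on gun control".split()
kept = filter_tokens(tokens)

print(kept)  # ['paul', 'ryan', 'responds', 'dem', 'sitin', 'gun', 'control']
```

`len(t) >= 3` is the same condition as `len(token) > 2` in `preprocess_text`.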


### 5.2 Applying the preprocessing

In [16]:
# Test on one example
sample_text = df['combined_text'].iloc[0]
print("Original text (excerpt):")
print(sample_text[:200])
print("\nPreprocessed text:")
print(preprocess_text(sample_text)[:200])

Original text (excerpt):
Ben Stein Calls Out 9th Circuit Court: Committed a ‘Coup d’état’ Against the Constitution 21st Century Wire says Ben Stein, reputable professor from, Pepperdine University (also of some Hollywood fame

Preprocessed text:
ben stein call circuit court committed coup état constitution century wire say ben stein reputable professor pepperdine university also hollywood fame appearing show film ferris bueller day made provo


In [17]:
# Apply the preprocessing to the whole dataset
df['processed_text'] = df['combined_text'].apply(preprocess_text)

In [18]:
# Check the results
print("Before/after preprocessing comparison:")
print(f"Original ({len(df['combined_text'].iloc[5])} chars):")
print(df['combined_text'].iloc[5][:150])
print(f"Preprocessed ({len(df['processed_text'].iloc[5])} chars):")
print(df['processed_text'].iloc[5][:150])

Before/after preprocessing comparison:
Original (2216 chars):
 Paul Ryan Responds To Dem’s Sit-In On Gun Control In The Most DISGUSTING Way (VIDEO) On Wednesday, Democrats took a powerful stance against the GOP s
Preprocessed (1478 chars):
paul ryan responds dem sitin gun control disgusting way video wednesday democrat took powerful stance gop refusal vote gun control measure staging sit


## 7. Removing empty texts

After preprocessing, some texts may become empty (for example, when they contain only stopwords, digits, or punctuation). We remove them.

In [19]:
# Check for empty texts
empty_texts = (df['processed_text'].str.strip() == '').sum()
print(f"Empty texts after preprocessing: {empty_texts}")

# Drop them if any (copy so that later column assignments do not act on a view)
initial_count = len(df)
df = df[df['processed_text'].str.strip() != ''].copy()

print(f"Articles removed: {initial_count - len(df)}")
print(f"Articles remaining: {len(df)}")

Empty texts after preprocessing: 5
Articles removed: 5
Articles remaining: 38639


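## 8. Adding a unique identifier


In [20]:
# Add the identifier (reassigned after dropping empty texts, so these ids
# do not necessarily line up with the ids written to the interim files)
df['id'] = range(1, len(df) + 1)

# Reorder the columns: id, processed text and label first
cols = ['id', 'processed_text', 'Detection']
additional_cols = [c for c in df.columns if c not in cols]
df = df[cols + additional_cols]

df.head()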

|  | id | processed_text | Detection | title | text | subject | date | detect_label | text_length_chars | text_length_words | combined_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | ben stein call circuit court committed coup ét... | 0 | Ben Stein Calls Out 9th Circuit Court: Committ... | 21st Century Wire says Ben Stein, reputable pr... | US_News | February 13, 2017 | Fake | 1028 | 171 | Ben Stein Calls Out 9th Circuit Court: Committ... |
| 1 | 2 | trump drop steve bannon national security coun... | 1 | Trump drops Steve Bannon from National Securit... | WASHINGTON (Reuters) - U.S. President Donald T... | politicsNews | April 5, 2017 | True | 4820 | 771 | Trump drops Steve Bannon from National Securit... |
| 2 | 3 | puerto rico expects lift jones act shipping re... | 1 | Puerto Rico expects U.S. to lift Jones Act shi... | (Reuters) - Puerto Rico Governor Ricardo Rosse... | politicsNews | September 27, 2017 | True | 1848 | 304 | Puerto Rico expects U.S. to lift Jones Act shi... |
| 3 | 4 | oops trump accidentally confirmed leaked israe... | 0 | OOPS: Trump Just Accidentally Confirmed He Le... | On Monday, Donald Trump once again embarrassed... | News | May 22, 2017 | Fake | 1244 | 183 | OOPS: Trump Just Accidentally Confirmed He Le... |
| 4 | 5 | donald trump head scotland reopen golf resort ... | 1 | Donald Trump heads for Scotland to reopen a go... | GLASGOW, Scotland (Reuters) - Most U.S. presid... | politicsNews | June 24, 2016 | True | 3137 | 529 | Donald Trump heads for Scotland to reopen a go... |


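## 10. Splitting the dataset

We split the dataset following the scheme of the reference article. The data is first split 80/20 into train+validation and test, then the train+validation portion is split 80/20 again:
- **Training set**: 64% of the total (80% of the train+validation portion)
- **Validation set**: 16% of the total (20% of the train+validation portion)
- **Test set**: 20% of the total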
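
The split sizes follow from simple arithmetic. The sketch below is an illustration, not scikit-learn's own code: `split_sizes` is a hypothetical helper that assumes the held-out part of each split is rounded up, which reproduces the counts reported in this notebook. `Fraction` is used only to keep the arithmetic exact.

```python
import math
from fractions import Fraction

def split_sizes(n_samples, held_out=Fraction(1, 5)):
    """Return (train, validation, test) sizes for two successive 80/20 splits."""
    n_test = math.ceil(n_samples * held_out)    # 20% of everything, rounded up
    n_train_val = n_samples - n_test            # remaining 80%
    n_val = math.ceil(n_train_val * held_out)   # 20% of the remainder, rounded up
    n_train = n_train_val - n_val               # about 64% of the total
    return n_train, n_val, n_test

print(split_sizes(38639))  # (24728, 6183, 7728)
print(split_sizes(38644))  # (24732, 6183, 7729)
```

The first call matches the processed dataset (38,639 articles) and the second the interim dataset (38,644 articles).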

In [21]:
# Prepare the data for the split
X = df[['id', 'processed_text']]
y = df['Detection']

# First split: 80% train+val, 20% test
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Second split: 80% train, 20% validation (of the train+val portion)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.20, random_state=42, stratify=y_train_val
)

print("Dataset split:")
print(f"  Training set:   {len(X_train):,} samples ({len(X_train)/len(df)*100:.1f}%)")
print(f"  Validation set: {len(X_val):,} samples ({len(X_val)/len(df)*100:.1f}%)")
print(f"  Test set:       {len(X_test):,} samples ({len(X_test)/len(df)*100:.1f}%)")

Dataset split:
  Training set:   24,728 samples (64.0%)
  Validation set: 6,183 samples (16.0%)
  Test set:       7,728 samples (20.0%)


In [22]:
# Check the class distribution in each split
print("Class distribution per split:")
print("Training set:")
print(y_train.value_counts())
print("Validation set:")
print(y_val.value_counts())
print("Test set:")
print(y_test.value_counts())

Class distribution per split:
Training set:
Detection
1    13562
0    11166
Name: count, dtype: int64
Validation set:
Detection
1    3391
0    2792
Name: count, dtype: int64
Test set:
Detection
1    4238
0    3490
Name: count, dtype: int64
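
To confirm that stratification preserved the class balance, the counts printed above can be turned into proportions. This is a small worked check using only the numbers reported in this notebook; label 1 corresponds to `True` and label 0 to `Fake`, as in the `detect_label` column.

```python
# (label 1 count, label 0 count) per split, copied from the output above
counts = {
    "train": (13562, 11166),
    "validation": (3391, 2792),
    "test": (4238, 3490),
}

# Share of label 1 (True) articles in each split
shares = {name: ones / (ones + zeros) for name, (ones, zeros) in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.3f}")

# Largest gap between splits; a small value means the balance is preserved
spread = max(shares.values()) - min(shares.values())
print(f"spread: {spread:.5f}")
```

All three splits contain roughly 54.8% label 1 articles, which is the behavior expected from `stratify`.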


## 11. Preparing the final datasets

In [23]:
# Build the final DataFrames
train_df = pd.DataFrame({
    'id': X_train['id'].values,
    'text': X_train['processed_text'].values,
    'label': y_train.values
})

val_df = pd.DataFrame({
    'id': X_val['id'].values,
    'text': X_val['processed_text'].values,
    'label': y_val.values
})

test_df = pd.DataFrame({
    'id': X_test['id'].values,
    'text': X_test['processed_text'].values,
    'label': y_test.values
})

print("Datasets prepared:")
print(f"  train_df: {train_df.shape}")
print(f"  val_df:   {val_df.shape}")
print(f"  test_df:  {test_df.shape}")

Datasets prepared:
  train_df: (24728, 3)
  val_df:   (6183, 3)
  test_df:  (7728, 3)


## 12. Saving the datasets

In [24]:
# Create the output directory
import os
output_path = "../data/processed/"
os.makedirs(output_path, exist_ok=True)

# Save the CSV files
train_df.to_csv(output_path + "train.csv", index=False)
val_df.to_csv(output_path + "validation.csv", index=False)
test_df.to_csv(output_path + "test.csv", index=False)

# Save the full preprocessed dataset
df.to_csv(output_path + "full_processed.csv", index=False)

print("Files saved:")
print(f"  - {output_path}train.csv")
print(f"  - {output_path}validation.csv")
print(f"  - {output_path}test.csv")
print(f"  - {output_path}full_processed.csv")

Files saved:
  - ../data/processed/train.csv
  - ../data/processed/validation.csv
  - ../data/processed/test.csv
  - ../data/processed/full_processed.csv


## 13. Preprocessing summary

In [25]:
print("PREPROCESSING SUMMARY")
print("--------------------------------------------------------")

print("1. DATA CLEANING")
print("   - Unused columns removed")
print("   - Missing values handled")
print("   - Duplicates removed")

print("2. TEXT PREPROCESSING")
print("   - Lowercasing")
print("   - Removal of URLs, mentions, hashtags")
print("   - Removal of punctuation and digits")
print("   - Tokenization")
print("   - Stopword removal")
print("   - Lemmatization")

print("3. DATASET SPLIT")
print(f"   - Training:   {len(train_df)} samples")
print(f"   - Validation: {len(val_df)} samples")
print(f"   - Test:       {len(test_df)} samples")

print("4. GENERATED FILES")
print(f"   - {output_path}train.csv")
print(f"   - {output_path}validation.csv")
print(f"   - {output_path}test.csv")

PREPROCESSING SUMMARY
--------------------------------------------------------
1. DATA CLEANING
   - Unused columns removed
   - Missing values handled
   - Duplicates removed
2. TEXT PREPROCESSING
   - Lowercasing
   - Removal of URLs, mentions, hashtags
   - Removal of punctuation and digits
   - Tokenization
   - Stopword removal
   - Lemmatization
3. DATASET SPLIT
   - Training:   24728 samples
   - Validation: 6183 samples
   - Test:       7728 samples
4. GENERATED FILES
   - ../data/processed/train.csv
   - ../data/processed/validation.csv
   - ../data/processed/test.csv


---

## Next step

The next notebook (`fnd_03_modeling_classical.ipynb`) will implement the classical classification models (TF-IDF + Logistic Regression, Naive Bayes, SVM) as a baseline before moving on to deep learning models.

---

**References:**
- Roumeliotis, K.I., Tselikas, N.D., & Nasiopoulos, D.K. (2025). Fake News Detection and Classification: A Comparative Study of CNNs, LLMs, and NLP Models. *Future Internet*, 17, 28.