# Fake News Detection - Phase 2: Data Preprocessing

---

### Objectives of this notebook

This notebook implements the preprocessing steps identified during exploration. It follows the methodology described in the reference article (Roumeliotis et al., 2025), in the following steps:

1. Data cleaning (missing values, duplicates)
2. Merging the title and text columns
3. Text normalization
4. Tokenization
5. Stopword removal
6. Lemmatization
7. Limiting text length
8. Train/validation/test split

## 1. Environment setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings('ignore')

# Download the NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

True

## 2. Loading the data

We load the combined dataset `combined_raw` produced during exploration.

In [20]:
# Load the dataset
df = pd.read_csv("data/combined_raw.csv", sep=",")

print(f"Dataset loaded: {len(df)} articles")
print(f"Columns: {list(df.columns)}")
df.head(3)

Dataset loaded: 44898 articles
Columns: ['title', 'text', 'subject', 'date', 'Detection', 'detect_label', 'text_length_chars', 'text_length_words']


| | title | text | subject | date | Detection | detect_label | text_length_chars | text_length_words |
|---|---|---|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 | Fake | 2893 | 495 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 | Fake | 1898 | 305 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 | Fake | 3597 | 580 |


## 3. Data cleaning

### 3.1 Handling missing values

In [21]:
# State before cleaning: count cells that are empty or contain only whitespace.
# Note: real NaN values are not counted by this check, only empty strings.
print("Empty values before cleaning:")
print(df.apply(lambda x: x.astype(str).str.replace(r'\s+', '', regex=True).eq('').sum()))

initial_count = len(df)
print(f"Initial number of observations: {initial_count}")

Empty values before cleaning:
title                  0
text                 631
subject                0
date                   0
Detection              0
detect_label           0
text_length_chars      0
text_length_words      0
dtype: int64
Initial number of observations: 44898


In [22]:
# Drop the rows whose text is empty or contains only whitespace
df = df[~df['text'].astype(str).str.replace(r'\s+', '', regex=True).eq('')]

print(f"Articles removed: {initial_count - len(df)}")

Articles removed: 631
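The check above only catches empty or whitespace-only strings. As a minimal sketch, assuming one also wants to treat real `NaN` cells as missing, a single mask can cover both cases. The helper name `blank_mask` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def blank_mask(series: pd.Series) -> pd.Series:
    """Return True where a value is NaN, empty, or whitespace-only (hypothetical helper)."""
    # fillna("") maps real missing values onto the empty string,
    # so one string comparison then covers all three cases.
    return series.fillna("").astype(str).str.strip().eq("")


# Toy data: the literal string "nan" is real content and must be kept.
toy = pd.DataFrame({"text": ["real article", "   ", "", np.nan, "nan"]})
toy_mask = blank_mask(toy["text"])
toy_clean = toy[~toy_mask]

print(toy_mask.tolist())
print(len(toy_clean))
```

Applied to the notebook, the same mask would be used as `df = df[~blank_mask(df['text'])]`. Whether it changes the count of 631 depends on whether the raw file contains real `NaN` values, which this notebook does not show.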


### 3.2 Removing duplicates

In [23]:
initial_count = len(df)
# Keep only the first occurrence of each distinct text
df = df.drop_duplicates(subset=['text'], keep='first')

print(f"Duplicates removed: {initial_count - len(df)}")
print(f"Remaining articles: {len(df)}")

Duplicates removed: 5623
Remaining articles: 38644
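Because `keep='first'` retains whichever copy appears first, a duplicated text that carries two different labels would silently keep only one of them. The sketch below shows one possible way to look for such conflicts before deduplicating. The toy DataFrame is invented for illustration, and this check is not part of the original notebook.

```python
import pandas as pd

# Toy data: "d e f" appears twice with conflicting labels.
toy = pd.DataFrame({
    "text": ["a b c", "a b c", "d e f", "d e f", "g h i"],
    "Detection": [0, 0, 0, 1, 1],
})

# Count the distinct labels attached to each distinct text.
labels_per_text = toy.groupby("text")["Detection"].nunique()
conflicting_texts = labels_per_text[labels_per_text > 1].index.tolist()

# Same call as in the notebook: the first occurrence wins.
deduped = toy.drop_duplicates(subset=["text"], keep="first")

print(conflicting_texts)
print(deduped["Detection"].tolist())
```

If the list of conflicting texts is not empty on the real data, dropping those texts entirely may be safer than keeping an arbitrary label.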


## 4. Merging the title and text columns

Following the article's methodology, we combine the title and the text to give the model the full context.

In [25]:
df['combined_text'] = df['title'].astype(str) + " " + df['text'].astype(str)
print("Columns 'title' and 'text' merged into 'combined_text'")

Columns 'title' and 'text' merged into 'combined_text'
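The concatenation relies on `astype(str)`, which does not treat a missing title specially. As a hedged sketch, assuming a dataset where some titles could be missing, filling them with an empty string first keeps placeholder text out of the merged column. The toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: the second row has no title.
toy = pd.DataFrame({
    "title": ["Headline", np.nan],
    "text": ["Body one", "Body two"],
})

# Replace missing values by "" before concatenating, then trim the
# leading space left behind when the title was empty.
safe_combined = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()

print(safe_combined.tolist())
```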


## 5. Text preprocessing

### 5.1 Defining the preprocessing functions

In [26]:
# Initialize the NLP tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Convert to lowercase
    text = str(text).lower()

    # Remove URLs (http\S+ already matches https links)
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [27]:
def preprocess_text(text):
    # Initial cleaning
    text = clean_text(text)

    # Tokenization
    tokens = word_tokenize(text)

    # Filtering: drop stopwords and tokens of two characters or fewer
    tokens = [token for token in tokens
              if token not in stop_words and len(token) > 2]

    # Lemmatization (WordNet default: tokens are treated as nouns)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Rebuild the text
    return ' '.join(tokens)
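The order of the cleaning steps matters: mentions and hashtags can only be recognized while their `@` and `#` markers are still present. The sketch below reproduces the regex steps with the standard library only (no NLTK) and compares them with a variant that strips punctuation first. The function names `clean_text_sketch` and `punctuation_first` and the sample string are hypothetical, chosen for illustration.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_sketch(text: str) -> str:
    """Regex steps in the notebook's order: URLs, mentions, punctuation, digits."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"@\w+|#\w+", "", text)
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()


def punctuation_first(text: str) -> str:
    """Same steps, but punctuation is stripped before mentions and hashtags."""
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"http\S+|www\S+", "", text)
    # By now '@' and '#' are gone, so this pattern no longer matches anything.
    text = re.sub(r"@\w+|#\w+", "", text)
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Read MORE at https://example.com/news?id=42 @reporter #breaking 3 times!"
in_order = clean_text_sketch(sample)
reordered = punctuation_first(sample)

print(in_order)
print(reordered)
```

In the reordered variant the words `reporter` and `breaking` survive as ordinary tokens, which is why the notebook removes mentions and hashtags before punctuation.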


### 5.2 Applying the preprocessing

In [28]:
# Test on one example
sample_text = df['combined_text'].iloc[0]
print("Original text (excerpt):")
print(sample_text[:200])
print("\nPreprocessed text:")
print(preprocess_text(sample_text)[:200])

Original text (excerpt):
 Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out

Preprocessed text:
donald trump sends embarrassing new year eve message disturbing donald trump wish american happy new year leave instead give shout enemy hater dishonest fake news medium former reality show star one j


In [29]:
# Apply the preprocessing to the whole dataset
df['processed_text'] = df['combined_text'].apply(preprocess_text)

In [31]:
# Check the results
print("Before/after preprocessing comparison:")
print(f"Original ({len(df['combined_text'].iloc[5])} chars):")
print(df['combined_text'].iloc[5][:150])
print(f"Preprocessed ({len(df['processed_text'].iloc[5])} chars):")
print(df['processed_text'].iloc[5][:150])

Before/after preprocessing comparison:
Original (1824 chars):
 Racist Alabama Cops Brutalize Black Boy While He Is In Handcuffs (GRAPHIC IMAGES) The number of cases of cops brutalizing and killing people of color
Preprocessed (1132 chars):
racist alabama cop brutalize black boy handcuff graphic image number case cop brutalizing killing people color seems see end another case need shared 


## 6. Limiting text length

Following the methodology of the reference article, we cap the text length at 2560 characters to stay compatible with the BERT and CNN models, which are limited to 512 tokens. This works out to a budget of five characters per token.

In [32]:
# Maximum length parameter
MAX_LENGTH = 2560

# Statistics before truncation
df['text_length'] = df['processed_text'].apply(len)

print("Length statistics before truncation:")
print(df['text_length'].describe())

# Number of texts exceeding the limit
exceeding = (df['text_length'] > MAX_LENGTH).sum()
print(f"Texts exceeding {MAX_LENGTH} characters: {exceeding} ({exceeding/len(df)*100:.1f}%)")

Length statistics before truncation:
count    38644.000000
mean      1732.370226
std       1332.797909
min          0.000000
25%        958.000000
50%       1560.000000
75%       2149.000000
max      37847.000000
Name: text_length, dtype: float64
Texts exceeding 2560 characters: 6506 (16.8%)


In [33]:
# Apply the truncation
df['processed_text'] = df['processed_text'].apply(lambda x: x[:MAX_LENGTH])

# Update the lengths
df['text_length'] = df['processed_text'].apply(len)

print("Length statistics after truncation:")
print(df['text_length'].describe())

Length statistics after truncation:
count    38644.000000
mean      1532.362592
std        741.310471
min          0.000000
25%        958.000000
50%       1560.000000
75%       2149.000000
max       2560.000000
Name: text_length, dtype: float64
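A plain slice such as `x[:MAX_LENGTH]` can cut the last word in half, leaving a fragment that is not a real token. As a minimal sketch of an alternative, the hypothetical helper `truncate_at_word` below backs up to the previous space. This is a possible variant, not the method used in the notebook or stated in the reference article.

```python
def truncate_at_word(text: str, max_length: int) -> str:
    """Truncate to at most max_length characters without splitting a word."""
    if len(text) <= max_length:
        return text
    # If the cut lands exactly between two words, nothing is split.
    if text[max_length] == " ":
        return text[:max_length].rstrip()
    cut = text[:max_length]
    last_space = cut.rfind(" ")
    # A single token longer than max_length has no space to fall back to.
    return cut if last_space == -1 else cut[:last_space]


example = "fake news detection"
plain_slice = example[:12]
word_safe = truncate_at_word(example, 12)

print(repr(plain_slice))
print(repr(word_safe))
```

The result is never longer than `max_length`, so the character budget is still respected.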


## 7. Removing empty texts

After preprocessing, some texts may end up empty. We remove them.

In [None]:
# Check for empty texts
empty_texts = (df['processed_text'].str.strip() == '').sum()
print(f"Empty texts after preprocessing: {empty_texts}")

# Remove them if needed
initial_count = len(df)
df = df[df['processed_text'].str.strip() != '']

print(f"Articles removed: {initial_count - len(df)}")
print(f"Remaining articles: {len(df)}")

Empty texts after preprocessing: 5
Articles removed: 5
Remaining articles: 38639
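Removing empty texts also matters once the data is written to disk: with default settings, an empty string written to CSV is read back by pandas as `NaN`, which can break later string operations. The sketch below illustrates the round trip with an in-memory buffer. The toy rows are invented for illustration.

```python
import io

import pandas as pd

# Toy data: the second row has an empty processed text.
toy = pd.DataFrame({"id": [0, 1], "text": ["some processed text", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: the empty field comes back as NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
reloaded_raw = pd.read_csv(buffer, keep_default_na=False)

print(int(reloaded["text"].isna().sum()))
print(reloaded_raw["text"].tolist())
```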


## 8. Adding a unique identifier


In [36]:
# Add the identifier
df['id'] = range(len(df))

# Reorder the columns
cols = ['id', 'processed_text', 'Detection']
additional_cols = [c for c in df.columns if c not in cols]
df = df[cols + additional_cols]

df.head()

| | id | processed_text | Detection | title | text | subject | date | detect_label | text_length_chars | text_length_words | combined_text | text_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | donald trump sends embarrassing new year eve m... | 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | Fake | 2893 | 495 | Donald Trump Sends Out Embarrassing New Year’... | 1698 |
| 1 | 1 | drunk bragging trump staffer started russian c... | 0 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | Fake | 1898 | 305 | Drunk Bragging Trump Staffer Started Russian ... | 1425 |
| 2 | 2 | sheriff david clarke becomes internet joke thr... | 0 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | Fake | 3597 | 580 | Sheriff David Clarke Becomes An Internet Joke... | 2215 |
| 3 | 3 | trump obsessed even obama name coded website i... | 0 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | Fake | 2774 | 444 | Trump Is So Obsessed He Even Has Obama’s Name... | 1744 |
| 4 | 4 | pope francis called donald trump christmas spe... | 0 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | Fake | 2346 | 420 | Pope Francis Just Called Out Donald Trump Dur... | 1467 |


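## 10. Splitting the dataset

We split the dataset following the scheme of the reference article:
- **Training set**: 80% of the data before the validation set is carved out (64% of the total afterwards)
- **Validation set**: 20% of that training portion (16% of the total)
- **Test set**: 20% of the total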
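The nested percentages can be checked with a little arithmetic. The sketch below computes the expected split sizes for the 38639 articles left after cleaning, assuming that the held-out share is rounded up at each split. The helper name `nested_split_sizes` is hypothetical, and the rounding rule is an assumption about how the splitting function behaves.

```python
import math


def nested_split_sizes(n_samples: int, test_size: float, val_size: float):
    """Return (train, validation, test) sizes for two successive splits."""
    # Assumption: the held-out part is rounded up at each split.
    n_test = math.ceil(test_size * n_samples)
    n_train_val = n_samples - n_test
    n_val = math.ceil(val_size * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


n_total = 38639
n_train, n_val, n_test = nested_split_sizes(n_total, test_size=0.20, val_size=0.20)

for name, size in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {size} ({size / n_total:.1%})")
```

Taking 20% twice therefore leaves 64% of the data for training, 16% for validation, and 20% for testing.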

In [37]:
# Prepare the data for splitting
X = df[['id', 'processed_text']]
y = df['Detection']

# First split: 80% train+val, 20% test
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Second split: 80% train, 20% val (of train+val)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.20, random_state=42, stratify=y_train_val
)

print("Dataset split:")
print(f"  Training set:   {len(X_train):,} samples ({len(X_train)/len(df)*100:.1f}%)")
print(f"  Validation set: {len(X_val):,} samples ({len(X_val)/len(df)*100:.1f}%)")
print(f"  Test set:       {len(X_test):,} samples ({len(X_test)/len(df)*100:.1f}%)")

Dataset split:
  Training set:   24,728 samples (64.0%)
  Validation set: 6,183 samples (16.0%)
  Test set:       7,728 samples (20.0%)


In [45]:
# Check the class distribution in each split
print("Class distribution per split:")
print("Training set:")
print(y_train.value_counts())
print("Validation set:")
print(y_val.value_counts())
print("Test set:")
print(y_test.value_counts())

Class distribution per split:
Training set:
Detection
1    13562
0    11166
Name: count, dtype: int64
Validation set:
Detection
1    3391
0    2792
Name: count, dtype: int64
Test set:
Detection
1    4238
0    3490
Name: count, dtype: int64
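Raw counts make it hard to see whether stratification worked. The sketch below converts the counts printed above into the share of class 1 in each split. The dictionary simply restates those counts, and the names `split_counts` and `class_one_share` are hypothetical.

```python
# Class counts copied from the output above (label -> number of articles).
split_counts = {
    "train": {1: 13562, 0: 11166},
    "validation": {1: 3391, 0: 2792},
    "test": {1: 4238, 0: 3490},
}


def class_one_share(counts: dict) -> float:
    """Fraction of samples labelled 1 in a split."""
    return counts[1] / (counts[0] + counts[1])


shares = {name: class_one_share(counts) for name, counts in split_counts.items()}
spread = max(shares.values()) - min(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%} of samples are class 1")
print(f"largest gap between splits: {spread:.4%}")
```

The three shares agree to within a tenth of a percentage point, which is what a stratified split is expected to produce.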


## 11. Preparing the final datasets

In [47]:
# Build the final DataFrames
train_df = pd.DataFrame({
    'id': X_train['id'].values,
    'text': X_train['processed_text'].values,
    'label': y_train.values,
    'subject': df.loc[X_train.index, 'subject'].values
})

val_df = pd.DataFrame({
    'id': X_val['id'].values,
    'text': X_val['processed_text'].values,
    'label': y_val.values,
    'subject' : df.loc[X_val.index, 'subject'].values
})

test_df = pd.DataFrame({
    'id': X_test['id'].values,
    'text': X_test['processed_text'].values,
    'label': y_test.values,
    'subject' : df.loc[X_test.index, 'subject'].values
})

print("Datasets prepared:")
print(f"  train_df: {train_df.shape}")
print(f"  val_df:   {val_df.shape}")
print(f"  test_df:  {test_df.shape}")

Datasets prepared:
  train_df: (24728, 4)
  val_df:   (6183, 4)
  test_df:  (7728, 4)


## 12. Saving the datasets

In [48]:
# Create the output directory
import os
OUTPUT_PATH = "./data/processed/"
os.makedirs(OUTPUT_PATH, exist_ok=True)

# Save the CSV files
train_df.to_csv(OUTPUT_PATH + "train.csv", index=False)
val_df.to_csv(OUTPUT_PATH + "validation.csv", index=False)
test_df.to_csv(OUTPUT_PATH + "test.csv", index=False)

# Save the full preprocessed dataset
df.to_csv(OUTPUT_PATH + "full_processed.csv", index=False)

print("Files saved:")
print(f"  - {OUTPUT_PATH}train.csv")
print(f"  - {OUTPUT_PATH}validation.csv")
print(f"  - {OUTPUT_PATH}test.csv")
print(f"  - {OUTPUT_PATH}full_processed.csv")

Files saved:
  - ./data/processed/train.csv
  - ./data/processed/validation.csv
  - ./data/processed/test.csv
  - ./data/processed/full_processed.csv


## 13. Preprocessing summary

In [50]:
print("PREPROCESSING SUMMARY")
print("--------------------------------------------------------")

print("1. DATA CLEANING")
print("   - Unneeded columns removed")
print("   - Missing values handled")
print("   - Duplicates removed")

print("2. TEXT PREPROCESSING")
print("   - Lowercasing")
print("   - Removal of URLs, mentions, hashtags")
print("   - Removal of punctuation and digits")
print("   - Tokenization")
print("   - Stopword removal")
print("   - Lemmatization")

print("3. TEXT LENGTH LIMIT")
print(f"   - Truncation to {MAX_LENGTH} characters")

print("4. DATASET SPLIT")
print(f"   - Training:   {len(train_df)} samples")
print(f"   - Validation: {len(val_df)} samples")
print(f"   - Test:       {len(test_df)} samples")

print("5. GENERATED FILES")
print(f"   - {OUTPUT_PATH}train.csv")
print(f"   - {OUTPUT_PATH}validation.csv")
print(f"   - {OUTPUT_PATH}test.csv")

PREPROCESSING SUMMARY
--------------------------------------------------------
1. DATA CLEANING
   - Unneeded columns removed
   - Missing values handled
   - Duplicates removed
2. TEXT PREPROCESSING
   - Lowercasing
   - Removal of URLs, mentions, hashtags
   - Removal of punctuation and digits
   - Tokenization
   - Stopword removal
   - Lemmatization
3. TEXT LENGTH LIMIT
   - Truncation to 2560 characters
4. DATASET SPLIT
   - Training:   24728 samples
   - Validation: 6183 samples
   - Test:       7728 samples
5. GENERATED FILES
   - ./data/processed/train.csv
   - ./data/processed/validation.csv
   - ./data/processed/test.csv


---

## Next step

The next notebook (`fnd_03_modeling_classical.ipynb`) will implement the classical classification models (TF-IDF + Logistic Regression, Naive Bayes, SVM) as baselines before moving on to deep learning models.

---

**References:**
- Roumeliotis, K.I., Tselikas, N.D., & Nasiopoulos, D.K. (2025). Fake News Detection and Classification: A Comparative Study of CNNs, LLMs, and NLP Models. *Future Internet*, 17, 28.