# Fake News Detection - Phase 5: Cross-Validation
---

### Objectives of this notebook

This notebook implements cross-validation to obtain a more robust and reliable estimate of model performance. Averaging over several train/validation splits reduces the variance of the performance estimate compared with a single split, and comparing training and validation scores helps detect overfitting.

We apply K-fold cross-validation to the classical models and compare the results with the earlier evaluations.

## 1. Environment setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    cross_val_score, cross_validate, StratifiedKFold
)
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score
import warnings
import os

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')

print("Environment configured successfully.")

Environment configured successfully.


## 2. Loading the data

For cross-validation, no separate hold-out split is made here: the preprocessed data loaded from `train.csv` is divided into folds, and each fold serves once as the validation set.

In [14]:
# Load the dataset
processed = "../data/processed/"
train_df = pd.read_csv(processed + "train.csv")

print(f"Training set:   {len(train_df):,} samples")


Training set:   24,728 samples


In [16]:
# Prepare the data
X = train_df['text'].values
y = train_df['label'].values

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (24728,)
y shape: (24728,)


## 3. TF-IDF vectorization

We use the same vectorization parameters as in the previous notebook to keep the results comparable.

In [17]:
# Configure the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.95,
    sublinear_tf=True
)

# Vectorize the whole corpus
X_tfidf = tfidf_vectorizer.fit_transform(X)

print(f"TF-IDF matrix: {X_tfidf.shape}")
print(f"Vocabulary: {len(tfidf_vectorizer.vocabulary_):,} terms")

TF-IDF matrix: (24728, 10000)
Vocabulary: 10,000 terms
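
Fitting the vectorizer on the whole corpus before splitting means the vocabulary and IDF weights are computed with the validation folds included. One possible alternative, sketched below, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that TF-IDF is refitted on the training folds of each split. The toy corpus, the labels and the names `toy_texts`, `toy_labels`, `leak_free_model` and `toy_cv` are illustrative assumptions, not part of the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for train_df['text'] and train_df['label'].
toy_texts = [
    "shocking secret cure doctors hate",
    "you will not believe this miracle trick",
    "aliens secretly control the government",
    "miracle pill melts fat overnight",
    "celebrity reveals shocking secret",
    "secret trick banks do not want you to know",
    "parliament passes the annual budget bill",
    "central bank keeps interest rates unchanged",
    "city council approves new transit plan",
    "researchers publish peer reviewed climate study",
    "court upholds ruling on data privacy",
    "government releases quarterly employment figures",
]
toy_labels = [1] * 6 + [0] * 6

# Chaining the two steps makes cross_validate refit the TF-IDF vocabulary
# and IDF weights on the training folds of every split.
leak_free_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)

toy_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
toy_scores = cross_validate(
    leak_free_model, toy_texts, toy_labels, cv=toy_cv, scoring="accuracy"
)
print(toy_scores["test_score"])
```

The pipeline receives raw text, so on the real data it would be called with `X` and `y` rather than with a precomputed matrix, at the cost of repeating the vectorization in every fold.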


## 4. Cross-validation setup

### 4.1 K-fold cross-validation theory

K-fold cross-validation splits the dataset into K partitions (folds) of approximately equal size. The model is trained K times, each time on K-1 folds and validated on the remaining fold. This yields K performance estimates, one per validation fold; because the training sets overlap, these estimates are not strictly independent, but their mean depends less on any single split than a one-off train/test evaluation does.

We use stratified K-fold so that each fold preserves the class proportions of the full dataset.
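
As a small illustration of stratification, the sketch below splits a hypothetical imbalanced label vector (70% class 0, 30% class 1) and reports the size and positive rate of each validation fold. The arrays `demo_X` and `demo_y` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 70 samples of class 0, 30 of class 1.
demo_y = np.array([0] * 70 + [1] * 30)
demo_X = np.zeros((len(demo_y), 1))  # feature values do not affect the split

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_summary = []
for fold, (train_idx, val_idx) in enumerate(demo_cv.split(demo_X, demo_y), start=1):
    # Share of class 1 in the validation fold
    positive_rate = demo_y[val_idx].mean()
    fold_summary.append((len(train_idx), len(val_idx), positive_rate))
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"positive rate={positive_rate:.2f}")
```

Each sample appears in exactly one validation fold, and the positive rate of every fold stays close to the overall rate, which a plain `KFold` does not guarantee.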

In [18]:
# Cross-validation configuration
N_FOLDS = 5  # Number of folds

# Stratified K-fold to preserve the class balance in each fold
cv_strategy = StratifiedKFold(
    n_splits=N_FOLDS,
    shuffle=True,
    random_state=42
)

print("Cross-validation configuration:")
print(f"  Number of folds: {N_FOLDS}")
print("  Strategy: Stratified K-Fold")
print(f"  Approximate fold size: {len(X) // N_FOLDS:,} samples")

Cross-validation configuration:
  Number of folds: 5
  Strategy: Stratified K-Fold
  Approximate fold size: 4,945 samples


In [19]:
# Define the metrics (precision, recall and F1 use scikit-learn's default positive class, label 1)
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score)
}

print("Metrics evaluated:")
for metric in scoring.keys():
    print(f"  - {metric}")

Metrics evaluated:
  - accuracy
  - precision
  - recall
  - f1
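
To make the three class-specific metrics concrete, here is a small worked example on hypothetical predictions. It computes precision, recall and F1 from the confusion counts and checks them against the scikit-learn functions wrapped by `make_scorer`. It assumes label 1 is the positive class, which is the scikit-learn default; the arrays `demo_true` and `demo_pred` are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = positive class).
demo_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
demo_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Confusion counts for the positive class
tp = int(np.sum((demo_true == 1) & (demo_pred == 1)))  # true positives
fp = int(np.sum((demo_true == 0) & (demo_pred == 1)))  # false positives
fn = int(np.sum((demo_true == 1) & (demo_pred == 0)))  # false negatives

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")

# The manual values agree with the scikit-learn implementations.
assert np.isclose(precision, precision_score(demo_true, demo_pred))
assert np.isclose(recall, recall_score(demo_true, demo_pred))
assert np.isclose(f1, f1_score(demo_true, demo_pred))
```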


## 5. Model definitions

In [20]:
# Dictionary of models to evaluate
models = {
    'Logistic Regression': LogisticRegression(
        max_iter=1000,
        C=1.0,
        random_state=42,
        solver='lbfgs'
    ),
    'Naive Bayes': MultinomialNB(
        alpha=1.0
    ),
    'Linear SVM': LinearSVC(
        C=1.0,
        max_iter=2000,
        random_state=42
    ),
    'Random Forest': RandomForestClassifier(
        n_estimators=100,
        max_depth=50,
        random_state=42,
        n_jobs=-1
    )
}

print(f"Models to evaluate: {len(models)}")
for name in models.keys():
    print(f"  - {name}")

Models to evaluate: 4
  - Logistic Regression
  - Naive Bayes
  - Linear SVM
  - Random Forest


## 6. Running the cross-validation

In [21]:
# Store the results
cv_results = {}

print("Running cross-validation...")


for name, model in models.items():
    print(f"Evaluating {name}...")
    
    # Cross-validate with all metrics
    scores = cross_validate(
        model, X_tfidf, y,
        cv=cv_strategy,
        scoring=scoring,
        return_train_score=True,
        n_jobs=-1
    )
    
    # Store the validation-fold scores (test_*): these are the
    # cross-validation estimates used by the analysis cells below
    cv_results[name] = {
        'accuracy': scores['test_accuracy'],
        'precision': scores['test_precision'],
        'recall': scores['test_recall'],
        'f1': scores['test_f1'],
        'fit_time': scores['fit_time']
    }
    
    # Print the training-fold scores (train_*): they show how closely each
    # model fits its training data, not how well it generalizes
    print(f"  Train accuracy:  {scores['train_accuracy'].mean():.4f} (+/- {scores['train_accuracy'].std():.4f})")
    print(f"  Train precision: {scores['train_precision'].mean():.4f} (+/- {scores['train_precision'].std():.4f})")
    print(f"  Train recall:    {scores['train_recall'].mean():.4f} (+/- {scores['train_recall'].std():.4f})")
    print(f"  Train F1-score:  {scores['train_f1'].mean():.4f} (+/- {scores['train_f1'].std():.4f})")


print("Cross-validation complete.")

Running cross-validation...
Evaluating Logistic Regression...
  Train accuracy:  0.9922 (+/- 0.0003)
  Train precision: 0.9895 (+/- 0.0004)
  Train recall:    0.9965 (+/- 0.0002)
  Train F1-score:  0.9929 (+/- 0.0003)
Evaluating Naive Bayes...
  Train accuracy:  0.9571 (+/- 0.0007)
  Train precision: 0.9614 (+/- 0.0007)
  Train recall:    0.9603 (+/- 0.0006)
  Train F1-score:  0.9608 (+/- 0.0006)
Evaluating Linear SVM...
  Train accuracy:  0.9999 (+/- 0.0001)
  Train precision: 0.9998 (+/- 0.0001)
  Train recall:    1.0000 (+/- 0.0000)
  Train F1-score:  0.9999 (+/- 0.0000)
Evaluating Random Forest...
  Train accuracy:  1.0000 (+/- 0.0000)
  Train precision: 1.0000 (+/- 0.0000)
  Train recall:    1.0000 (+/- 0.0000)
  Train F1-score:  1.0000 (+/- 0.0000)
Cross-validation complete.
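
With `return_train_score=True`, `cross_validate` returns both `train_*` and `test_*` entries. Training scores at or near 1.0 say little on their own, since a flexible model can memorize its training folds; the gap between training and validation scores is the more informative overfitting signal. The sketch below illustrates this on synthetic data with a fully grown decision tree. The dataset, the model choice and the `demo_*` names are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 20% label noise (hypothetical data).
demo_X, demo_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# With a single metric, the result keys are 'train_score' and 'test_score'.
demo_scores = cross_validate(
    DecisionTreeClassifier(random_state=42),
    demo_X, demo_y,
    cv=demo_cv,
    scoring="f1",
    return_train_score=True,
)

train_f1 = demo_scores["train_score"].mean()
val_f1 = demo_scores["test_score"].mean()
gap = train_f1 - val_f1

print(f"Train F1:      {train_f1:.4f}")
print(f"Validation F1: {val_f1:.4f}")
print(f"Gap:           {gap:.4f}")
```

In the notebook's loop the equivalent quantities would be `scores['train_f1']` and `scores['test_f1']`.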


## 7. Analysis of the results

In [None]:
# Build the summary DataFrame
summary_data = []

for name, results in cv_results.items():
    summary_data.append({
        'Model': name,
        'Accuracy (mean)': results['accuracy'].mean(),
        'Accuracy (std)': results['accuracy'].std(),
        'Precision (mean)': results['precision'].mean(),
        'Precision (std)': results['precision'].std(),
        'Recall (mean)': results['recall'].mean(),
        'Recall (std)': results['recall'].std(),
        'F1 (mean)': results['f1'].mean(),
        'F1 (std)': results['f1'].std(),
        'Fit Time (mean)': results['fit_time'].mean()
    })

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df.set_index('Model')

print("Cross-validation results summary:")
print(summary_df.round(4))

In [None]:
# Visualize the score distribution across folds
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics_plot = ['accuracy', 'precision', 'recall', 'f1']
titles = ['Accuracy', 'Precision', 'Recall', 'F1-Score']

for idx, (metric, title) in enumerate(zip(metrics_plot, titles)):
    ax = axes[idx // 2, idx % 2]
    
    data_to_plot = [cv_results[model][metric] for model in models.keys()]
    
    # Tick labels are set separately because the `labels` argument of
    # boxplot is deprecated since Matplotlib 3.9
    bp = ax.boxplot(data_to_plot, patch_artist=True)
    ax.set_xticklabels(list(models.keys()))
    
    colors = ["#03395e", "#6e1d14", "#136e39", "#bbd8f3"]
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.6)
    
    ax.set_title(f'{title} distribution ({N_FOLDS} folds)', fontsize=11)
    ax.set_ylabel('Score')
    ax.tick_params(axis='x', rotation=15)

plt.suptitle('Cross-Validation: score distribution by model', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

### Interpreting the boxplots

The boxplots show the distribution of scores across the folds. A low spread (narrow box) indicates that performance is stable from one split to another. A high spread suggests that the model is sensitive to which samples end up in the training folds. Spread alone does not reveal overfitting, which shows up as a gap between training and validation scores.

In [None]:
# Compare mean F1-scores, with the standard deviation across folds as error bars
fig, ax = plt.subplots(figsize=(10, 6))

model_names = list(models.keys())
f1_means = [cv_results[m]['f1'].mean() for m in model_names]
f1_stds = [cv_results[m]['f1'].std() for m in model_names]

x_pos = np.arange(len(model_names))
colors = ['#3498db', '#e74c3c', '#2ecc71', '#9b59b6']

bars = ax.bar(x_pos, f1_means, yerr=f1_stds, capsize=5, color=colors, alpha=0.7)

ax.set_ylabel('F1-Score')
ax.set_title('Mean F1-Score with standard deviation (Cross-Validation)', fontsize=12)
ax.set_xticks(x_pos)
ax.set_xticklabels(model_names)
ax.set_ylim(0, 1.05)

# Annotate each bar with its mean F1-score
for bar, mean, std in zip(bars, f1_means, f1_stds):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + std + 0.02,
            f'{mean:.3f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()
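
The error bars above are standard deviations, not confidence intervals. If an interval is wanted, one rough option is a Student's t interval on the per-fold scores, sketched below with hypothetical F1 values (`fold_f1` is invented for the example). Because fold scores share training data, such an interval tends to be too narrow and is best treated as approximate.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold F1-scores for one model (5 folds).
fold_f1 = np.array([0.95, 0.94, 0.96, 0.95, 0.93])

k = len(fold_f1)
mean_f1 = fold_f1.mean()

# Standard error of the mean, using the sample standard deviation (ddof=1)
sem = fold_f1.std(ddof=1) / np.sqrt(k)

# Half-width of the 95% interval from the t distribution with k-1 degrees of freedom
half_width = stats.t.ppf(0.975, df=k - 1) * sem
ci_low, ci_high = mean_f1 - half_width, mean_f1 + half_width

print(f"Mean F1: {mean_f1:.4f}")
print(f"Approximate 95% interval: [{ci_low:.4f}, {ci_high:.4f}]")
```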

## 8. Detailed per-fold analysis

In [None]:
# Per-fold score table for the best model
best_model_name = summary_df['F1 (mean)'].idxmax()
best_results = cv_results[best_model_name]

print(f"Per-fold scores for {best_model_name}:")
print("="*60)

fold_details = pd.DataFrame({
    'Fold': range(1, N_FOLDS + 1),
    'Accuracy': best_results['accuracy'],
    'Precision': best_results['precision'],
    'Recall': best_results['recall'],
    'F1-Score': best_results['f1']
})

print(fold_details.round(4).to_string(index=False))

print("\nMean:")
print(f"  Accuracy:  {best_results['accuracy'].mean():.4f}")
print(f"  Precision: {best_results['precision'].mean():.4f}")
print(f"  Recall:    {best_results['recall'].mean():.4f}")
print(f"  F1-Score:  {best_results['f1'].mean():.4f}")

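## 9. Statistical significance test

We run a paired Student's t-test on the per-fold F1-scores of each pair of models to check whether the differences in performance are significant. Because the folds share training data and six pairs are compared without correction, the p-values tend to be optimistic and should be read as indicative.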

In [None]:
from scipy import stats

# Pairwise comparison with a paired Student's t-test
print("Paired Student's t-test on F1-Score differences:")
print("="*60)
print("(p-value < 0.05 indicates a significant difference)")
print()

model_list = list(models.keys())

for i in range(len(model_list)):
    for j in range(i + 1, len(model_list)):
        model1, model2 = model_list[i], model_list[j]
        f1_1 = cv_results[model1]['f1']
        f1_2 = cv_results[model2]['f1']
        
        t_stat, p_value = stats.ttest_rel(f1_1, f1_2)
        
        significance = "Significant" if p_value < 0.05 else "Not significant"
        print(f"{model1} vs {model2}:")
        print(f"  t-statistic: {t_stat:.4f}, p-value: {p_value:.4f} ({significance})")
        print()
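
A standard paired t-test treats the per-fold differences as independent, although the training sets of different folds overlap. One commonly used adjustment is the variance correction proposed by Nadeau and Bengio (2003), which inflates the variance term by the ratio of validation to training size. It was derived for repeated random splits, so applying it to K-fold scores is a heuristic. The sketch below is an illustration under those assumptions: the helper `corrected_paired_ttest` and the score lists are hypothetical, and only the sample count (24,728) and the number of folds (5) come from this notebook.

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Paired t-test with the Nadeau-Bengio variance correction (two-sided)."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    k = len(diffs)
    var = diffs.var(ddof=1)
    if var == 0:
        # Identical differences in every fold: the statistic is undefined
        return float("nan"), float("nan")
    # The plain test uses var / k; the correction adds n_test / n_train
    t_stat = diffs.mean() / np.sqrt((1.0 / k + n_test / n_train) * var)
    p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

# Hypothetical per-fold F1-scores for two models
scores_a = [0.95, 0.94, 0.96, 0.95, 0.93]
scores_b = [0.93, 0.93, 0.94, 0.94, 0.92]

# Fold sizes for 24,728 samples split into 5 folds
n_samples, n_folds = 24728, 5
n_test = n_samples // n_folds
n_train = n_samples - n_test

t_plain, p_plain = stats.ttest_rel(scores_a, scores_b)
t_corr, p_corr = corrected_paired_ttest(scores_a, scores_b, n_train, n_test)

print(f"Plain paired t-test: t={t_plain:.4f}, p={p_plain:.4f}")
print(f"Corrected t-test:    t={t_corr:.4f}, p={p_corr:.4f}")
```

The corrected statistic is always smaller in magnitude than the plain one, so it yields larger p-values for the same scores.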

## 10. Selecting the best model

In [None]:
# Selection based on the mean F1-score
best_model_name = summary_df['F1 (mean)'].idxmax()
best_f1_mean = summary_df.loc[best_model_name, 'F1 (mean)']
best_f1_std = summary_df.loc[best_model_name, 'F1 (std)']

print("="*60)
print("BEST MODEL SELECTION")
print("="*60)
print(f"\nBest model: {best_model_name}")
print(f"Mean F1-Score: {best_f1_mean:.4f} (+/- {best_f1_std:.4f})")
print("\nThis model will be used for the final predictions.")
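
One practical detail when reusing the selected model: `cross_validate` fits clones of the estimator, so the object passed to it is still unfitted afterwards and has to be trained before it can predict. The sketch below shows this on synthetic data; the dataset and the names `candidate` and `final_model` are assumptions for the illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.utils.validation import check_is_fitted

# Synthetic data standing in for the TF-IDF matrix and labels (hypothetical).
demo_X, demo_y = make_classification(n_samples=200, random_state=42)

candidate = LogisticRegression(max_iter=1000)
cross_validate(candidate, demo_X, demo_y, cv=5)

# The original estimator was never fitted: only its clones were.
try:
    check_is_fitted(candidate)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False
print(f"Fitted after cross_validate: {fitted_after_cv}")

# Train a fresh copy on all available data for later predictions.
final_model = clone(candidate).fit(demo_X, demo_y)
print(f"Final model predicts classes: {final_model.classes_}")
```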

## 11. Saving the results

In [None]:
# Save the summary
OUTPUT_PATH = "./results/"
os.makedirs(OUTPUT_PATH, exist_ok=True)

summary_df.to_csv(OUTPUT_PATH + "cross_validation_results.csv")

print(f"Results saved: {OUTPUT_PATH}cross_validation_results.csv")

In [None]:
# Save the detailed per-fold scores
detailed_results = []

for model_name, results in cv_results.items():
    for fold in range(N_FOLDS):
        detailed_results.append({
            'Model': model_name,
            'Fold': fold + 1,
            'Accuracy': results['accuracy'][fold],
            'Precision': results['precision'][fold],
            'Recall': results['recall'][fold],
            'F1': results['f1'][fold]
        })

detailed_df = pd.DataFrame(detailed_results)
detailed_df.to_csv(OUTPUT_PATH + "cross_validation_detailed.csv", index=False)

print(f"Per-fold details saved: {OUTPUT_PATH}cross_validation_detailed.csv")

## 12. Summary and conclusions

In [None]:
# Generate the summary
print("="*60)
print("SUMMARY - CROSS-VALIDATION")
print("="*60)

print("\n1. CONFIGURATION")
print(f"   - Method: Stratified {N_FOLDS}-Fold Cross-Validation")
print(f"   - Dataset: {len(X):,} samples")
print(f"   - Features: TF-IDF ({X_tfidf.shape[1]:,} dimensions)")

print("\n2. RESULTS (F1-Score)")
for name in models.keys():
    mean_f1 = cv_results[name]['f1'].mean()
    std_f1 = cv_results[name]['f1'].std()
    print(f"   {name}: {mean_f1:.4f} (+/- {std_f1:.4f})")

print("\n3. BEST MODEL")
print(f"   {best_model_name}: F1 = {best_f1_mean:.4f}")

print("\n4. OBSERVATIONS")
print("   - Cross-validation gives a per-fold view of how stable the performance is")
print("   - Low standard deviations indicate consistent performance across folds")
print("   - Linear models perform well on text data")

---

## Next step

The next notebook (`fnd_06_prediction.ipynb`) will implement the complete prediction pipeline for classifying new articles.
