# Fake News Detection - Phase 5: Optimization and Evaluation of Classical Models

### Notebook objectives

**Objective**: Optimize the hyperparameters of the classical models with GridSearchCV, evaluate their performance on the test data, save the optimized models, and compare them with the non-optimized models.

**Contents**:
- Loading the preprocessed data
- Hyperparameter optimization (GridSearchCV)
- Evaluation on the test data
- Saving the models (pickle) and the performance metrics (Excel)
- Comparison of optimized vs. non-optimized models

## 1 Importing libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
import warnings
from datetime import datetime

# Machine Learning
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, classification_report, confusion_matrix
)

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')

print("Libraries imported successfully")

Libraries imported successfully


## 2 Loading the data

In [12]:
# Load the datasets
processed = "../data/processed/"
classical = "../models/classical/"

train_df = pd.read_csv(processed + "train.csv")
val_df = pd.read_csv(processed + "validation.csv")
test_df = pd.read_csv(processed + "test.csv")

print(f"Training set:   {len(train_df):,} samples")
print(f"Validation set: {len(val_df):,} samples")
print(f"Test set:       {len(test_df):,} samples")

Training set:   24,728 samples
Validation set: 6,183 samples
Test set:       7,728 samples


In [13]:
# Prepare the features and labels
X_train = train_df['text'].values
y_train = train_df['label'].values

X_val = val_df['text'].values
y_val = val_df['label'].values

X_test = test_df['text'].values
y_test = test_df['label'].values

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

X_train shape: (24728,)
y_train shape: (24728,)


In [14]:
# Load the vectorizer that was used during training
with open('../models/classical/tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

# Transform only (no refit) so the feature columns match what the saved models expect
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)
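
The cell above calls `transform` rather than `fit_transform` because the column layout has to stay the one learned at training time. The sketch below illustrates that idea with a hypothetical `ToyCountVectorizer` that uses plain word counts; it is neither TF-IDF nor scikit-learn's implementation, and the example sentences are invented.

```python
from collections import Counter


class ToyCountVectorizer:
    """Hypothetical bag-of-words vectorizer, for illustration only."""

    def fit(self, texts):
        # Learn a fixed, sorted vocabulary from the training texts
        words = sorted({w for t in texts for w in t.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, texts):
        # Map each text onto the learned columns; unknown words are ignored
        rows = []
        for t in texts:
            counts = Counter(t.lower().split())
            rows.append([counts.get(w, 0) for w in self.vocabulary_])
        return rows


train_texts = ["breaking news today", "fake news spreads fast"]
test_texts = ["fake story spreads today"]

vec = ToyCountVectorizer().fit(train_texts)
train_matrix = vec.transform(train_texts)
test_matrix = vec.transform(test_texts)

# Same number of columns for both sets; "story" was never seen, so it is dropped
print(len(train_matrix[0]), len(test_matrix[0]))
```

Refitting on the test texts would produce a different vocabulary, and the saved models would then receive columns that no longer mean what they meant during training.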

## 3 Loading the baseline (non-optimized) models

In [15]:
baseline_results = {}

models_baseline = {
    'LinearSVC': "../models/classical/linear_svm.pkl",
    'RandomForest': "../models/classical/random_forest.pkl",
    'NaiveBayes': "../models/classical/naive_bayes.pkl",
    'LogisticRegression': '../models/classical/logistic_regression.pkl'
}

# Now use the transformed data for predictions
for name, path in models_baseline.items():
    try:
        with open(path, 'rb') as f:
            model = pickle.load(f)
        y_pred = model.predict(X_test)  # Use transformed data
        baseline_results[name] = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred)
        }
    except FileNotFoundError:
        print(f"Model {name} not found; it will be skipped in the comparison")

## 4 Hyperparameter optimization

GridSearchCV with 5-fold cross-validation is used to find the best hyperparameters for each model. Candidates are ranked by accuracy (`scoring='accuracy'`).
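
One caveat worth keeping in mind: the grid searches in this notebook run on features produced by a vectorizer that was fitted before cross-validation. If that vectorizer was fitted on the full training set, each validation fold has already contributed to the vocabulary and IDF weights. A possible alternative, sketched here under that assumption, is to put the vectorizer inside a `Pipeline` so it is refitted on each training fold. The toy corpus, the choice of `LogisticRegression`, and `cv=2` are illustrative only and are not taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration (1 = fake, 0 = true)
texts = [
    "shocking miracle cure doctors hate",
    "you will not believe this secret trick",
    "aliens secretly control the government",
    "celebrity shocking secret revealed",
    "parliament votes on the budget bill",
    "central bank holds interest rates",
    "city council approves new budget",
    "court rules on the election bill",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so it is fitted on each training fold only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-prefixed names route each parameter to the right pipeline step
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 4))
```

With this layout the search receives raw text, and `search.best_estimator_` is a complete pipeline that can be pickled as a single object.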

### 4.1 Linear SVM

In [18]:
print("Optimizing Linear SVM...")

# Hyperparameter grid
param_grid_svm = {
    'C': [0.1, 1, 10],
    'loss': ['hinge', 'squared_hinge'],
    'max_iter': [1000, 2000]
}

# Grid search
grid_svm = GridSearchCV(
    LinearSVC(random_state=42, dual='auto'),
    param_grid_svm,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_svm.fit(X_train, y_train)

print(f"Best parameters: {grid_svm.best_params_}")
print(f"Best CV score: {grid_svm.best_score_:.4f}")

# Save the best estimator (the target directory must already exist)
with open('../models/classic_ops/linear_svc_optimized.pkl', 'wb') as f:
    pickle.dump(grid_svm.best_estimator_, f)

print("Model saved")

Optimizing Linear SVM...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'C': 10, 'loss': 'squared_hinge', 'max_iter': 1000}
Best CV score: 0.9956
Model saved
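
The "12 candidates, 60 fits" line in the output follows directly from the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A minimal sketch of that arithmetic, using a hypothetical helper `count_fits` that is not part of scikit-learn:

```python
from itertools import product


def count_fits(param_grid, n_folds):
    """Return (n_candidates, n_fits) for an exhaustive grid search."""
    # One candidate per combination of parameter values
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds


param_grid_svm = {
    'C': [0.1, 1, 10],
    'loss': ['hinge', 'squared_hinge'],
    'max_iter': [1000, 2000],
}
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

svm_counts = count_fits(param_grid_svm, n_folds=5)
rf_counts = count_fits(param_grid_rf, n_folds=5)
print(svm_counts, rf_counts)  # (12, 60) (36, 180)
```

The count covers cross-validation fits only. With the default `refit=True`, GridSearchCV fits the best candidate once more on the full training set, which is the estimator exposed as `best_estimator_`.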


### 4.2 Random Forest

In [19]:
print("Optimizing Random Forest...")

# Hyperparameter grid
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_rf.fit(X_train, y_train)

print(f"Best parameters: {grid_rf.best_params_}")
print(f"Best CV score: {grid_rf.best_score_:.4f}")

# Save the best estimator
with open('../models/classic_ops/random_forest_optimized.pkl', 'wb') as f:
    pickle.dump(grid_rf.best_estimator_, f)

print("Model saved")

Optimizing Random Forest...
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300}
Best CV score: 0.9962
Model saved
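
Each optimization cell writes its pickle with a bare `open(..., 'wb')`, which raises `FileNotFoundError` if the target directory does not exist. The sketch below shows one way to guard against that; `save_pickle` and `load_pickle` are hypothetical helper names, and the demo writes to a temporary directory rather than to the project's model folder.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Pickle obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def load_pickle(path):
    """Load a pickled object. Only use this on files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip in a temporary directory (stand-in for a fitted estimator)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "classic_ops" / "demo.pkl"
    save_pickle({"C": 10, "loss": "squared_hinge"}, target)
    restored = load_pickle(target)

print(restored)
```

Unpickling executes code embedded in the file, so model files from an untrusted source should not be loaded this way.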


### 4.3 Naive Bayes

In [20]:
print("Optimizing Naive Bayes...")

# Hyperparameter grid
param_grid_nb = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'fit_prior': [True, False]
}

grid_nb = GridSearchCV(
    MultinomialNB(),
    param_grid_nb,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_nb.fit(X_train, y_train)

print(f"Best parameters: {grid_nb.best_params_}")
print(f"Best CV score: {grid_nb.best_score_:.4f}")

# Save the best estimator
with open('../models/classic_ops/naive_bayes_optimized.pkl', 'wb') as f:
    pickle.dump(grid_nb.best_estimator_, f)

print("Model saved")

Optimizing Naive Bayes...
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters: {'alpha': 0.1, 'fit_prior': False}
Best CV score: 0.9583
Model saved


### 4.4 Logistic Regression

In [None]:
print("Optimizing Logistic Regression...")

# Hyperparameter grid
param_grid_lr = {
    'C': [0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear'],
    'max_iter': [1000, 2000]
}

grid_lr = GridSearchCV(
    LogisticRegression(random_state=42),
    param_grid_lr,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_lr.fit(X_train, y_train)

print(f"Best parameters: {grid_lr.best_params_}")
print(f"Best CV score: {grid_lr.best_score_:.4f}")

# Save the best estimator
with open('../models/classic_ops/logistic_regression_optimized.pkl', 'wb') as f:
    pickle.dump(grid_lr.best_estimator_, f)

print("Model saved")

Optimizing Logistic Regression...
Fitting 5 folds for each of 12 candidates, totalling 60 fits


## 5 Evaluation on the test data

In [None]:
# Dictionary of the optimized models
optimized_models = {
    'LinearSVC': grid_svm.best_estimator_,
    'RandomForest': grid_rf.best_estimator_,
    'NaiveBayes': grid_nb.best_estimator_,
    'LogisticRegression': grid_lr.best_estimator_
}

# Evaluate each model
optimized_results = {}

for name, model in optimized_models.items():
    # Identify which model the metrics below belong to
    print(f"\n{name}")

    # Predictions
    y_pred = model.predict(X_test)

    # Metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    optimized_results[name] = {
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1': f1,
        'predictions': y_pred
    }

    print(f"Accuracy:  {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall:    {rec:.4f}")
    print(f"F1-Score:  {f1:.4f}")

    # target_names are assigned in sorted label order (0, then 1)
    print("Classification report:")
    print(classification_report(y_test, y_pred,
                                target_names=['True News', 'Fake News']))

print("Evaluation complete")
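
For reference, the four scores stored for each model can all be derived from the counts of true/false positives and negatives. The sketch below uses a hypothetical helper `binary_metrics` with invented counts; it returns 0.0 when a denominator is zero, which is a simplification of how scikit-learn handles that case.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts, for illustration only
m = binary_metrics(tp=90, fp=10, fn=30, tn=70)
for key, value in m.items():
    print(f"{key}: {value:.4f}")
```

Because `precision_score`, `recall_score` and `f1_score` default to `pos_label=1`, the values in `optimized_results` describe the class labelled 1.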

## 6 Confusion matrices

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, (name, results) in enumerate(optimized_results.items()):
    cm = confusion_matrix(y_test, results['predictions'])
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['True', 'Fake'],
                yticklabels=['True', 'Fake'],
                ax=axes[idx])
    
    axes[idx].set_title(f'{name} (Optimized)\nF1-Score: {results["f1"]:.4f}')
    axes[idx].set_ylabel('True class')
    axes[idx].set_xlabel('Predicted class')

plt.tight_layout()
plt.savefig('../reports/figures/confusion_matrices_optimized.png', dpi=300, bbox_inches='tight')
plt.show()

print("Confusion matrices saved")
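
To read the heatmaps, it helps to recall the layout `confusion_matrix` uses: rows are true labels and columns are predicted labels. The pure-Python sketch below reproduces that layout with a hypothetical helper `confusion_counts` on invented labels.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Build a matrix where entry [i][j] counts true class i predicted as j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Invented labels, for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_counts(y_true, y_pred)

# For binary labels the four cells unpack in this order
(tn, fp), (fn, tp) = cm
print(cm)  # [[2, 1], [1, 2]]
```

In the plots above, the first row and column therefore correspond to label 0 and the second to label 1.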

## 7 Comparison of optimized vs. non-optimized models

In [None]:
import json

# Rank all models together (baseline + optimized + deep learning)
all_results = []

# Classical models
for model_name in optimized_results.keys():
    if model_name in baseline_results:
        all_results.append({
            'Model': f"{model_name} (Baseline)",
            'F1-Score': baseline_results[model_name]['f1'],
            'Accuracy': baseline_results[model_name]['accuracy'],
            'Precision': baseline_results[model_name]['precision'],
            'Recall': baseline_results[model_name]['recall']
        })

    all_results.append({
        'Model': f"{model_name} (Optimized)",
        'F1-Score': optimized_results[model_name]['f1'],
        'Accuracy': optimized_results[model_name]['accuracy'],
        'Precision': optimized_results[model_name]['precision'],
        'Recall': optimized_results[model_name]['recall']
    })

# Deep learning models
deep_models = {
    'CNN': '../models/deep/cnn_metrics.json',
    'BiLSTM': '../models/deep/history_optimal.json'
}

for model_name, json_path in deep_models.items():
    try:
        with open(json_path, 'r') as f:
            metrics = json.load(f)

        all_results.append({
            'Model': model_name,
            'F1-Score': metrics['f1_score'],
            'Accuracy': metrics['accuracy'],
            'Precision': metrics['precision'],
            'Recall': metrics['recall']
        })
        print(f"✓ {model_name} loaded")
    except FileNotFoundError:
        print(f"⚠ {json_path} not found")
    except KeyError as e:
        print(f"⚠ Missing key in {model_name}: {e}")

# Sort and display
df_all = pd.DataFrame(all_results).sort_values('F1-Score', ascending=False)

print("\n" + "="*80)
print("FULL MODEL RANKING (by F1-Score)")
print("="*80)
print(df_all.to_string(index=False))

# Top 3
print("\n" + "="*80)
print("🏆 TOP 3 MODELS")
print("="*80)
for i in range(min(3, len(df_all))):
    model = df_all.iloc[i]
    print(f"\n#{i+1} - {model['Model']}")
    print(f"   F1-Score:  {model['F1-Score']:.4f}")
    print(f"   Accuracy:  {model['Accuracy']:.4f}")
    print(f"   Precision: {model['Precision']:.4f}")
    print(f"   Recall:    {model['Recall']:.4f}")
print("="*80)
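
The deep learning metrics are read from JSON files that may be absent or may not contain the expected keys (a training-history file, for instance, might hold only per-epoch values). A minimal sketch of a loader that checks both conditions up front is shown below. The helper name `load_metrics`, the file contents, and the numbers are all invented for illustration; only the four key names come from the cell above.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("f1_score", "accuracy", "precision", "recall")


def load_metrics(path):
    """Return the metrics dict, or None if the file or a key is missing."""
    path = Path(path)
    if not path.exists():
        return None
    with path.open("r", encoding="utf-8") as f:
        metrics = json.load(f)
    if any(key not in metrics for key in REQUIRED_KEYS):
        return None
    return {key: float(metrics[key]) for key in REQUIRED_KEYS}


with tempfile.TemporaryDirectory() as tmp:
    # A complete metrics file (made-up values)
    good = Path(tmp) / "cnn_metrics.json"
    good.write_text(json.dumps(
        {"f1_score": 0.97, "accuracy": 0.98, "precision": 0.96, "recall": 0.98}
    ))
    # A file that lacks the summary keys
    incomplete = Path(tmp) / "history.json"
    incomplete.write_text(json.dumps({"loss": [0.5, 0.3]}))

    loaded = load_metrics(good)
    skipped = load_metrics(incomplete)
    missing = load_metrics(Path(tmp) / "absent.json")

print(loaded, skipped, missing)
```

Returning `None` lets the caller skip a model in the ranking without relying on exception handling for control flow.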

In [None]:
# Build a comparison DataFrame
comparison_data = []

for model_name in optimized_results.keys():
    # Baseline
    if model_name in baseline_results:
        comparison_data.append({
            'Model': model_name,
            'Version': 'Baseline',
            'Accuracy': baseline_results[model_name]['accuracy'],
            'Precision': baseline_results[model_name]['precision'],
            'Recall': baseline_results[model_name]['recall'],
            'F1-Score': baseline_results[model_name]['f1']
        })

    # Optimized
    comparison_data.append({
        'Model': model_name,
        'Version': 'Optimized',
        'Accuracy': optimized_results[model_name]['accuracy'],
        'Precision': optimized_results[model_name]['precision'],
        'Recall': optimized_results[model_name]['recall'],
        'F1-Score': optimized_results[model_name]['f1']
    })

df_comparison = pd.DataFrame(comparison_data)
print("\nPerformance comparison:")
print(df_comparison.to_string(index=False))

# Relative F1 change of each optimized model with respect to its baseline
print("Improvements from optimization:")

for model_name in optimized_results.keys():
    if model_name in baseline_results:
        f1_base = baseline_results[model_name]['f1']
        f1_opt = optimized_results[model_name]['f1']
        improvement = ((f1_opt - f1_base) / f1_base) * 100

        print(f"\n{model_name}:")
        print(f"  F1 Baseline:  {f1_base:.4f}")
        print(f"  F1 Optimized: {f1_opt:.4f}")
        print(f"  Improvement:  {improvement:+.2f}%")

## 8 Comparative visualization

In [None]:
# Comparison charts, one panel per metric
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
    df_metric = df_comparison.pivot(index='Model', columns='Version', values=metric)

    df_metric.plot(kind='bar', ax=axes[idx], width=0.8)
    axes[idx].set_title(f'Comparison: {metric}', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel(metric, fontsize=12)
    axes[idx].set_xlabel('Model', fontsize=12)
    axes[idx].legend(title='Version', fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)
    axes[idx].set_ylim([0.9, 1.0])  # Adjust to the range of your results

    # Add the values on top of the bars
    for container in axes[idx].containers:
        axes[idx].bar_label(container, fmt='%.4f', fontsize=8)

plt.tight_layout()
plt.savefig('../reports/figures/comparison_baseline_vs_optimized.png', dpi=300, bbox_inches='tight')
plt.show()

print("Comparison charts saved")
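
The improvement percentages printed in section 7 are relative changes in F1, which is easy to confuse with a difference in percentage points. The sketch below contrasts the two with hypothetical helpers and invented scores; it also guards against a baseline of zero, which the notebook's formula does not.

```python
def relative_improvement(baseline, optimized):
    """Percent change relative to the baseline; None if the baseline is 0."""
    if baseline == 0:
        return None
    return (optimized - baseline) / baseline * 100


def absolute_improvement(baseline, optimized):
    """Difference in percentage points for scores on a 0-1 scale."""
    return (optimized - baseline) * 100


# Invented scores, for illustration only
rel = relative_improvement(0.90, 0.945)
abs_pts = absolute_improvement(0.90, 0.945)
print(f"relative: {rel:+.2f}%  absolute: {abs_pts:+.2f} points")
```

When baseline scores are already close to 1.0, the two numbers are nearly identical, but stating which one is reported avoids ambiguity.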

## 9 Identifying the best model

In [None]:
# Rank the classical models only (baseline + optimized).
# Note: this overwrites the df_all built in section 7.
all_results = []

for model_name in optimized_results.keys():
    if model_name in baseline_results:
        all_results.append({
            'Model': f"{model_name} (Baseline)",
            'F1-Score': baseline_results[model_name]['f1'],
            'Accuracy': baseline_results[model_name]['accuracy']
        })

    all_results.append({
        'Model': f"{model_name} (Optimized)",
        'F1-Score': optimized_results[model_name]['f1'],
        'Accuracy': optimized_results[model_name]['accuracy']
    })

df_all = pd.DataFrame(all_results).sort_values('F1-Score', ascending=False)

print("FULL MODEL RANKING (by F1-Score)")
print(df_all.to_string(index=False))

best_model = df_all.iloc[0]

print("BEST MODEL")
print("="*60)
print(f"Model: {best_model['Model']}")
print(f"F1-Score: {best_model['F1-Score']:.4f}")
print(f"Accuracy: {best_model['Accuracy']:.4f}")
print("="*60)

## 10 Saving the results to Excel

In [None]:
# Create an Excel file with several sheets
# (the 'results/' directory must exist relative to the notebook)
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
excel_file = f'results/model_performance_{timestamp}.xlsx'

with pd.ExcelWriter(excel_file, engine='openpyxl') as writer:
    # Sheet 1: full comparison
    df_comparison.to_excel(writer, sheet_name='Comparison', index=False)

    # Sheet 2: ranking
    df_all.to_excel(writer, sheet_name='Ranking', index=False)

    # Sheet 3: best hyperparameters
    hyperparam_data = [
        {'Model': 'LinearSVC', 'Parameters': str(grid_svm.best_params_)},
        {'Model': 'RandomForest', 'Parameters': str(grid_rf.best_params_)},
        {'Model': 'NaiveBayes', 'Parameters': str(grid_nb.best_params_)},
        {'Model': 'LogisticRegression', 'Parameters': str(grid_lr.best_params_)}
    ]
    df_hyperparam = pd.DataFrame(hyperparam_data)
    df_hyperparam.to_excel(writer, sheet_name='Hyperparameters', index=False)

    # Sheet 4: best model
    best_summary = pd.DataFrame([{
        'Best Model': best_model['Model'],
        'F1-Score': best_model['F1-Score'],
        'Accuracy': best_model['Accuracy'],
        'Date': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    }])
    best_summary.to_excel(writer, sheet_name='Best Model', index=False)

print(f"Results saved to: {excel_file}")

## 11 Final summary

In [None]:
print("\n" + "="*80)
print("OPTIMIZATION SUMMARY")
print("="*80)

print("Generated files:")
print("  - Optimized models: ../models/classic_ops/*_optimized.pkl")
print(f"  - Performance: {excel_file}")
print("  - Visualizations: ../reports/figures/*.png")

print("Main results:")
print(f"  - Best model: {best_model['Model']}")
print(f"  - Performance: F1={best_model['F1-Score']:.4f}, Acc={best_model['Accuracy']:.4f}")

# Count the improvements among models that have a baseline to compare against
compared = [name for name in optimized_results if name in baseline_results]
improvements = 0
for model_name in compared:
    if optimized_results[model_name]['f1'] > baseline_results[model_name]['f1']:
        improvements += 1

print(f"{improvements}/{len(compared)} models improved by optimization")