# 03 - Fake vs Real News Classification - Embedding-based Model

This notebook explores an alternative to the TF-IDF model from the previous notebook: representing each article with word embeddings.

## Project Goal
Build a classifier to distinguish between real (1) and fake (0) news articles using word embeddings.

## Workflow
1. Load preprocessed training and testing data.
2. Use a pre-trained language model from `spaCy` to generate document embeddings.
3. Train multiple classification models on the embeddings.
4. Evaluate and compare model performance against the TF-IDF models.
5. Select the best performing model and save it.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import spacy
import joblib
#from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB # Using GaussianNB for continuous features
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, classification_report)
import time
import os

# Load the spaCy model
nlp = spacy.load('en_core_web_lg')

print("Libraries and spaCy model loaded successfully!")

Libraries and spaCy model loaded successfully!


In [2]:
# Load the preprocessed training and testing data
print("Loading preprocessed data...")

train_data = pd.read_csv('../data/processed/train.csv')
test_data = pd.read_csv('../data/processed/test.csv')


Loading preprocessed data...


In [3]:
train_data.head()

|   | text_processed | label | title_length | text_length | title_word_count | text_word_count |
|---|----------------|-------|--------------|-------------|------------------|-----------------|
| 0 | hear presid trump doubl condemn evil kkk white... | 0 | 125 | 3940 | 18 | 677 |
| 1 | wow christian author give unexpect brilliant a... | 0 | 157 | 1874 | 25 | 324 |
| 2 | obama black live matter terrorist join peopl l... | 0 | 160 | 7174 | 24 | 1177 |
| 3 | brexit deal risk chao drug suppli report warn ... | 1 | 60 | 1434 | 11 | 224 |
| 4 | insan iowa republican liter want let toddler c... | 0 | 66 | 2836 | 10 | 496 |


In [4]:

print(f"Training data shape: {train_data.shape}")
print(f"Testing data shape: {test_data.shape}")


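# Empty text fields are read back from CSV as NaN; replace them with '' so spaCy always receives a string
X_train_text = train_data['text_processed'].fillna('')
y_train = train_data['label']
X_test_text = test_data['text_processed'].fillna('')
y_test = test_data['label']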

print("Data loaded")

Training data shape: (31948, 6)
Testing data shape: (7987, 6)
Data loaded


In [None]:
# Generate document embeddings
print("Generating document embeddings...")

def get_embedding(text):
    """Generate a document embedding by averaging word vectors."""
    doc = nlp(text)
    # We filter out stop words and punctuation, and only use tokens with vectors.
    vectors = [token.vector for token in doc if token.has_vector and not token.is_stop and not token.is_punct]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        # If no vectors are found, return a zero vector of the same dimension.
        return np.zeros(nlp.meta['vectors']['width'])

# Apply the function to the text columns
# This may take a few minutes to run
X_train = np.array([get_embedding(text) for text in X_train_text])
X_test = np.array([get_embedding(text) for text in X_test_text])

print("Embeddings generated successfully!")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

Generating document embeddings...
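
To make the averaging step concrete, here is a minimal, dependency-free sketch of mean pooling over a toy vocabulary. The `toy_vectors` table, the `mean_pool` and `coverage` helpers, and the 3-dimensional vectors are hypothetical stand-ins for spaCy's pre-trained vectors, and the stop-word and punctuation filtering is left out. The sketch also reports what fraction of tokens had a vector at all, which may be worth checking here because `text_processed` holds stemmed tokens (for example `presid`, `doubl`) that a pre-trained vocabulary might not contain.

```python
# Hypothetical 3-dimensional word vectors standing in for spaCy's pre-trained vectors.
toy_vectors = {
    "brexit": [1.0, 0.0, 2.0],
    "deal": [3.0, 2.0, 0.0],
}


def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; fall back to a zero vector."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the values of each dimension together.
    return [sum(values) / len(found) for values in zip(*found)]


def coverage(tokens, vectors):
    """Fraction of tokens that have a vector (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(t in vectors for t in tokens) / len(tokens)


# Two tokens are in the toy vocabulary, two stemmed tokens are not.
tokens = "brexit deal presid doubl".split()
doc_vector = mean_pool(tokens, toy_vectors, dim=3)
doc_coverage = coverage(tokens, toy_vectors)
print(doc_vector, doc_coverage)
```

If coverage on the real data turned out to be low, the document embeddings would be averages over only a small part of each article, which could be one reason for weaker scores.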


In [9]:
# Define machine learning models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42, probability=True),
    'Gaussian Naive Bayes': GaussianNB()
}

print(f"Defined {len(models)} models for training.")

Defined 4 models for training.


In [None]:
# Train and evaluate each model
print("Training and evaluating models...")
print("=" * 60)

results = {}

for name, model in models.items():
    print(f"\n🔄 Training {name}...")
    start_time = time.time()
    
    # Fit the model on training data
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Make predictions on test set
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='binary')
    recall = recall_score(y_test, y_pred, average='binary')
    f1 = f1_score(y_test, y_pred, average='binary')
    
    # Store results
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'training_time': training_time
    }
    
    # Display results
    print(f"  ✓ Training completed in {training_time:.1f} seconds")
    print(f"    Accuracy:    {accuracy:.4f}")
    print(f"    Precision:   {precision:.4f}")
    print(f"    Recall:      {recall:.4f}")
    print(f"    F1-Score:    {f1:.4f}")

print("\n🎯 Model training completed!")

Training and evaluating models...

🔄 Training Logistic Regression...
  ✓ Training completed in 0.7 seconds
    Accuracy:    0.9253
    Precision:   0.9165
    Recall:      0.9360
    F1-Score:    0.9262

🔄 Training Random Forest...
  ✓ Training completed in 41.9 seconds
    Accuracy:    0.9135
    Precision:   0.9113
    Recall:      0.9165
    F1-Score:    0.9139

🔄 Training SVM...
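
As a quick sanity check on the reported metrics, the sketch below recomputes F1 as the harmonic mean of precision and recall. With `average='binary'`, scikit-learn reports these metrics for the positive class, which here is label 1 (real news). The helper name `f1_from` is hypothetical, and the figures are copied from the output above; because they are rounded to four decimals, the comparison uses a small tolerance.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the training output.
reported = {
    "Logistic Regression": (0.9165, 0.9360, 0.9262),
    "Random Forest": (0.9113, 0.9165, 0.9139),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from(precision, recall)
    # Inputs are rounded to 4 decimals, so allow a small tolerance.
    assert abs(f1 - f1_reported) < 5e-4
    print(f"{name}: recomputed F1 = {f1:.4f}, reported = {f1_reported:.4f}")
```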


In [None]:
# Create model comparison table
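# Cast metrics to float: the transposed frame is object dtype, which can break round() and idxmax()
comparison_df = pd.DataFrame(results).T.drop(columns=['model']).astype(float)
print("📊 MODEL COMPARISON (Embeddings)")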
print("=" * 80)
print(comparison_df.round(4))

# Identify best model based on F1-score
best_model_name = comparison_df['f1'].idxmax()
best_model = results[best_model_name]['model']
print(f"\n🏆 Best performing model: {best_model_name}")

# Save the best model
os.makedirs('../outputs/model', exist_ok=True)
model_filename = f'../outputs/model/best_embedding_model_{best_model_name.lower().replace(" ", "_")}.pkl'
joblib.dump(best_model, model_filename)
print(f"✓ Best model saved to: {model_filename}")

## Results and Comparison

The table above shows the performance of different classifiers trained on the document embeddings. 

### Comparison with TF-IDF Models

Let's compare these results with the TF-IDF based models from the previous notebook.

| Model (TF-IDF)         | Accuracy | F1-Score |
|------------------------|----------|----------|
| Logistic Regression    | 0.9850   | 0.9850   |
| Random Forest          | 0.9932   | 0.9933   |
| Naive Bayes            | 0.9395   | 0.9395   |
| Support Vector Machine | 0.9922   | 0.9923   |
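
The gap can be put into numbers for the two models whose embedding results appear in the training output (Logistic Regression and Random Forest). The sketch below only subtracts figures already reported in this notebook; the dictionary names `tfidf`, `embedding`, and `gaps` are hypothetical, and the SVM and Naive Bayes rows are left out because no embedding scores are shown for them.

```python
# Accuracy and F1 copied from the TF-IDF table and the embedding training output.
tfidf = {
    "Logistic Regression": {"accuracy": 0.9850, "f1": 0.9850},
    "Random Forest": {"accuracy": 0.9932, "f1": 0.9933},
}
embedding = {
    "Logistic Regression": {"accuracy": 0.9253, "f1": 0.9262},
    "Random Forest": {"accuracy": 0.9135, "f1": 0.9139},
}

# Positive values mean the TF-IDF model scored higher.
gaps = {
    name: {
        metric: round(tfidf[name][metric] - embedding[name][metric], 4)
        for metric in ("accuracy", "f1")
    }
    for name in tfidf
}

for name, gap in gaps.items():
    print(f"{name}: accuracy gap = {gap['accuracy']:.4f}, F1 gap = {gap['f1']:.4f}")
```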

**Observations:**

* The TF-IDF models, especially Random Forest (0.9932 accuracy) and SVM (0.9922 accuracy), achieved near-perfect scores, which is suspiciously high for this kind of dataset.
* The embedding-based models score lower (for example, Logistic Regression reaches 0.9253 accuracy here versus 0.9850 with TF-IDF), which might be a more realistic picture of the models' performance on unseen data.
* The gap suggests that the TF-IDF approach might be overfitting to the specific vocabulary of the training data.
* The embedding-based approach, by capturing semantic meaning, might be more robust and generalize better, even though it scores lower on this specific test set.
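
One possible way to probe the suspicion in the first bullet is to check whether identical articles appear in both splits, since shared articles would inflate test scores for any model. The helper name `split_overlap` and the toy strings below are hypothetical. This is a sanity check under the assumption that exact-duplicate text is the concern, not evidence that such leakage exists in this dataset.

```python
def split_overlap(train_texts, test_texts):
    """Return the fraction of test texts that also appear verbatim in train."""
    train_set = set(train_texts)
    test_list = list(test_texts)
    if not test_list:
        return 0.0
    shared = sum(text in train_set for text in test_list)
    return shared / len(test_list)


# Toy data for illustration only; in the notebook one could pass
# X_train_text and X_test_text instead.
toy_train = ["brexit deal risk", "obama black live matter", "wow christian author"]
toy_test = ["brexit deal risk", "insan iowa republican"]
overlap = split_overlap(toy_train, toy_test)
print(f"{overlap:.0%} of test texts also occur in the training split")
```

A high overlap on the real splits would suggest that part of the near-perfect score comes from memorized articles; a low overlap would leave vocabulary overfitting as the more plausible explanation.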

**Conclusion:**

While the TF-IDF models show higher metrics, the embedding-based models may be more reliable in a real-world scenario. The choice between the two depends on the project's goals: if the goal is a model that captures the content rather than surface keywords and is less susceptible to simple keyword manipulation, the embedding-based model is the better choice.