# Simple Fake News Classification

This notebook combines data preprocessing and model training in a streamlined workflow.

## Overview
- **Goal**: Build a classifier that distinguishes real (1) from fake (0) news
- **Approach**: Remove source markers, apply standard NLP preprocessing, extract TF-IDF features, and train a Logistic Regression classifier
- **Model**: Logistic Regression with TF-IDF features


## 1. Import Required Libraries


In [13]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

try:
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    print("Libraries imported successfully!")
except Exception as exc:
    # Avoid a bare except so KeyboardInterrupt and SystemExit still propagate
    print(f"NLTK data download failed: {exc}")


Libraries imported successfully!


## 2. Load Dataset


In [14]:
df = pd.read_csv('../dataset/data.csv')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts().sort_index())


Dataset shape: (39942, 5)
Columns: ['label', 'title', 'text', 'subject', 'date']

Label distribution:
label
0    19943
1    19999
Name: count, dtype: int64


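## 3. Text Preprocessing

Preprocessing has two stages: source-specific cleanup (news-source tags and caps+colon prefixes), followed by standard NLP preprocessing. Both stages are implemented in the single `standard_preprocess` function in Section 4.


## 4. Standard Text Preprocessing

`standard_preprocess` first removes news-source tags such as "(Reuters)", "(AP)", and "(AFP)", the standalone words "reuters" and "reuter", and a leading caps+colon prefix (five or more capital letters followed by a colon, e.g. "UNBELIEVABLE:"). It then applies standard NLP preprocessing: lowercase conversion, URL/email removal, tokenization, stopword removal, stemming, and removal of tokens with two or fewer characters.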


In [15]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def standard_preprocess(text):
    if pd.isna(text) or text == "":
        return ""
    
    text_str = str(text)
    
    text_str = re.sub(r'\(reuters\)|\(reuter\)|\(ap\)|\(afp\)', '', text_str, flags=re.IGNORECASE)
    
    text_str = re.sub(r'\breuters\b', '', text_str, flags=re.IGNORECASE)
    text_str = re.sub(r'\breuter\b', '', text_str, flags=re.IGNORECASE)
    
    text_str = re.sub(r'^[A-Z]{5,}:\s*', '', text_str)
    
    text_str = text_str.lower()
    text_str = re.sub(r'http\S+|www\S+|https\S+', '', text_str)
    text_str = re.sub(r'\S+@\S+', '', text_str)
    text_str = re.sub(r'\s+', ' ', text_str)
    
    try:
        tokens = word_tokenize(text_str)
    except LookupError:
        # Fall back to whitespace splitting if the NLTK 'punkt' data is unavailable
        tokens = text_str.split()
    
    tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [token for token in tokens if len(token) > 2]
    
    return ' '.join(tokens)
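
To see what the source-specific cleanup does in isolation, the sketch below applies only the regex steps from `standard_preprocess` to a few made-up strings. The helper name `strip_source_markers` and the sample sentences are assumptions for illustration; tokenization, stopword removal, and stemming are left out so the example needs only the standard library.

```python
import re

def strip_source_markers(text: str) -> str:
    """Hypothetical helper: only the source-specific cleanup steps."""
    # Remove parenthesized wire-service tags, case-insensitively
    text = re.sub(r'\(reuters\)|\(reuter\)|\(ap\)|\(afp\)', '', text, flags=re.IGNORECASE)
    # Remove the standalone words "reuters" / "reuter"
    text = re.sub(r'\breuters?\b', '', text, flags=re.IGNORECASE)
    # Remove a leading prefix of 5+ capital letters followed by a colon
    text = re.sub(r'^[A-Z]{5,}:\s*', '', text)
    # Collapse the whitespace left behind
    return re.sub(r'\s+', ' ', text).strip()

real_like = "WASHINGTON (Reuters) - The House voted on Monday."
fake_like = "UNBELIEVABLE: Watch what happened next"
short_caps = "WOW: Prefixes shorter than five capitals are kept"

cleaned_real = strip_source_markers(real_like)
cleaned_fake = strip_source_markers(fake_like)
cleaned_short = strip_source_markers(short_caps)

print(cleaned_real)
print(cleaned_fake)
print(cleaned_short)
```

Note that the caps+colon pattern is anchored with `^`, so it only strips a prefix at the very start of the string, and a dateline such as "WASHINGTON" survives because it is not followed directly by a colon.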


Apply preprocessing to title and text columns, then combine them.


In [16]:
df['title_processed'] = df['title'].apply(standard_preprocess)
df['text_processed'] = df['text'].apply(standard_preprocess)
df['combined_text'] = df['title_processed'] + ' ' + df['text_processed']

empty_text = df['combined_text'].str.strip() == ''
df = df[~empty_text].reset_index(drop=True)
print(f"Processed {len(df)} articles")


Processed 39935 articles


## 5. Data Preparation

Drop subject and date columns, handle missing values, and create train/test split.


In [17]:
df = df.drop(columns=['subject', 'date'])
print(f"Columns after dropping subject and date: {df.columns.tolist()}")


Columns after dropping subject and date: ['label', 'title', 'text', 'title_processed', 'text_processed', 'combined_text']


In [18]:
missing_values = df.isnull().sum()
print("Missing values:")
print(missing_values[missing_values > 0])

df = df.fillna('')
print(f"Dataset shape after handling missing values: {df.shape}")


Missing values:
Series([], dtype: int64)
Dataset shape after handling missing values: (39935, 6)


Validation: check whether any token beginning with "reuter", or the whole word "reuters", is still present in the processed text.


In [19]:
reuter_in_processed = df['combined_text'].str.contains(r'\breuter', case=False, regex=True, na=False)
reuters_in_processed = df['combined_text'].str.contains(r'\breuters\b', case=False, regex=True, na=False)

print(f"Articles with 'reuter' in processed text: {reuter_in_processed.sum()}")
print(f"Articles with 'reuters' in processed text: {reuters_in_processed.sum()}")

if reuter_in_processed.sum() > 0 or reuters_in_processed.sum() > 0:
    print("\n⚠️  Warning: Some processed texts still contain 'reuter' or 'reuters'")
    print("Sample articles with 'reuter' or 'reuters' in processed text:")
    mask = reuter_in_processed | reuters_in_processed
    for idx in df[mask].index[:3]:
        print(f"\n  Article {idx}:")
        print(f"    Original text (first 150 chars): {df.loc[idx, 'text'][:150]}...")
        print(f"    Processed text (first 150 chars): {df.loc[idx, 'combined_text'][:150]}...")
else:
    print("\n✓ No 'reuter' or 'reuters' found in processed text - preprocessing working correctly!")


Articles with 'reuter' in processed text: 15
Articles with 'reuters' in processed text: 0

Sample articles with 'reuter' or 'reuters' in processed text:

  Article 5119:
    Original text (first 150 chars):  BERKELEY, Calif./LANSING, Mich. (Reuters) - Supporters of Donald Trump clashed with counter-protesters at a rally in the famously left-leaning city o...
    Processed text (first 150 chars): day pro-trump ralli california march turn violent berkeley calif./lans mich. support donald trump clash counter-protest ralli famous left-lean citi be...

  Article 5621:
    Original text (first 150 chars):  WASHINGTON (Reuters) - The U.S. House of Representatives voted on Monday to require law enforcement authorities to obtain a search warrant before see...
    Processed text (first 150 chars): u.s. hous pass bill requir warrant search old email washington u.s. hous repres vote monday requir law enforc author obtain search warrant seek old em...

  Article 12012:
    Original text (first 150 
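
The two validation patterns are not equivalent, which may explain why the counts differ. `\breuter` is a prefix match: it fires on any token that starts with "reuter", including the bare stem. `\breuters\b` only matches the whole word. A minimal sketch with a made-up token (`reuterspolit` is hypothetical, not taken from the dataset):

```python
import re

# Hypothetical processed text; "reuterspolit" stands in for any leftover
# token that merely begins with "reuter".
processed = "washington reuterspolit senat vote"

prefix_hit = re.search(r'\breuter', processed, flags=re.IGNORECASE) is not None
whole_word_hit = re.search(r'\breuters\b', processed, flags=re.IGNORECASE) is not None

print(f"prefix pattern matched:     {prefix_hit}")
print(f"whole-word pattern matched: {whole_word_hit}")
```

A nonzero prefix count alongside a zero whole-word count is therefore consistent with leftover tokens that begin with "reuter" rather than with the cleanup regexes failing on the exact word.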

Create stratified train/test split (80/20).


In [20]:
X = df['combined_text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nTraining label distribution:")
print(y_train.value_counts().sort_index())
print(f"\nTest label distribution:")
print(y_test.value_counts().sort_index())


Training set: 31948 samples
Test set: 7987 samples

Training label distribution:
label
0    15949
1    15999
Name: count, dtype: int64

Test label distribution:
label
0    3987
1    4000
Name: count, dtype: int64
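
As a small illustration of what `stratify=y` does, the sketch below splits a deliberately imbalanced toy label list (60/40, an assumption chosen to make the effect visible) and counts the classes on each side. The class ratio is preserved in both partitions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 60 samples of class 0, 40 of class 1
labels = [0] * 60 + [1] * 40
samples = list(range(100))

s_train, s_test, l_train, l_test = train_test_split(
    samples, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # keep the 60/40 ratio in both partitions
)

train_counts = Counter(l_train)
test_counts = Counter(l_test)
print(f"Train: {dict(train_counts)}")
print(f"Test:  {dict(test_counts)}")
```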


## 6. Model Training

Train a Logistic Regression classifier with TF-IDF features.


In [21]:
tfidf = TfidfVectorizer(
    max_features=1000,
    stop_words='english',
    ngram_range=(1, 1),
    min_df=2,
    max_df=0.95
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF features: {X_train_tfidf.shape[1]}")


TF-IDF features: 1000


In [22]:
lr_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
    C=0.01
)

lr_model.fit(X_train_tfidf, y_train)
print("Logistic Regression model trained successfully")


Logistic Regression model trained successfully


## 7. Model Evaluation

Evaluate the model performance on the test set.


In [23]:
y_pred = lr_model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary')
recall = recall_score(y_test, y_pred, average='binary')
f1 = f1_score(y_test, y_pred, average='binary')

print("Model Performance Metrics:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")


Model Performance Metrics:
Accuracy:  0.9356
Precision: 0.9251
Recall:    0.9483
F1-Score:  0.9365


## 7.1. Feature Importance Analysis

Analyze which words/features the model relies on most for predictions.


In [24]:
feature_names = tfidf.get_feature_names_out()
coefficients = lr_model.coef_[0]

feature_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients,
    'abs_coefficient': np.abs(coefficients)
}).sort_values('abs_coefficient', ascending=False)

print("Top 20 Features by Absolute Coefficient (positive = Real News, negative = Fake News):")
print("=" * 70)
top_overall = feature_importance.head(20)
for _, row in top_overall.iterrows():
    print(f"{row['feature']:<30} Coefficient: {row['coefficient']:>8.4f}")

print("\n" + "=" * 70)
print("Top 20 Features that indicate Fake News (most negative coefficients):")
print("=" * 70)
# nsmallest selects the most negative coefficients; tail(20) of the
# abs-sorted frame would return the 20 weakest features instead
top_fake = feature_importance.nsmallest(20, 'coefficient')
for _, row in top_fake.iterrows():
    print(f"{row['feature']:<30} Coefficient: {row['coefficient']:>8.4f}")


Top 20 Features by Absolute Coefficient (positive = Real News, negative = Fake News):
said                           Coefficient:   2.8243
video                          Coefficient:  -1.9056
th                             Coefficient:  -1.2681
imag                           Coefficient:  -1.2022
hillari                        Coefficient:  -1.0605
trump                          Coefficient:  -1.0109
washington                     Coefficient:   0.9245
minist                         Coefficient:   0.8664
obama                          Coefficient:  -0.8301
watch                          Coefficient:  -0.7905
featur                         Coefficient:  -0.7869
know                           Coefficient:  -0.7858
like                           Coefficient:  -0.7834
senat                          Coefficient:   0.7425
govern                         Coefficient:   0.7373
wednesday                      Coefficient:   0.7269
america                        Coefficient:  -0.7250
com             
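
To make the sign convention concrete: in logistic regression each feature contributes `coefficient * tfidf_value` to the log-odds of the positive class (real news, label 1). The sketch below plugs two coefficients from the output above into the sigmoid. The TF-IDF weight of 0.30 and the zero intercept are assumptions chosen for illustration; they are not values from the trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

coef_said = 2.8243    # from the notebook output
coef_video = -1.9056  # from the notebook output
tfidf_weight = 0.30   # hypothetical TF-IDF value of the term in one article
intercept = 0.0       # hypothetical; the fitted intercept is not shown

# Probability of "real" if the article contained only this one term
p_real_said = sigmoid(intercept + coef_said * tfidf_weight)
p_real_video = sigmoid(intercept + coef_video * tfidf_weight)

print(f"P(real | only 'said'):  {p_real_said:.3f}")
print(f"P(real | only 'video'): {p_real_video:.3f}")
```

A positive coefficient pushes the prediction toward real news and a negative one toward fake news, with the size of the push scaled by how heavily the term is weighted in the article.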

## 7.2. Misclassification Analysis

Examine articles that were incorrectly classified to understand model failures.


In [25]:
false_positives = (y_test == 0) & (y_pred == 1)
false_negatives = (y_test == 1) & (y_pred == 0)

fp_count = false_positives.sum()
fn_count = false_negatives.sum()

print("Misclassification Analysis:")
print("=" * 50)
print(f"False Positives (Fake predicted as Real): {fp_count}")
print(f"False Negatives (Real predicted as Fake): {fn_count}")
print(f"Total Misclassifications: {fp_count + fn_count}")

if fp_count > 0:
    print(f"\nSample False Positives (Fake news predicted as Real):")
    fp_indices = X_test[false_positives].index[:3]
    for idx in fp_indices:
        print(f"\n  Article {idx}:")
        print(f"    Title: {df.loc[idx, 'title'][:100]}...")
        print(f"    Text (first 200 chars): {df.loc[idx, 'text'][:200]}...")

if fn_count > 0:
    print(f"\nSample False Negatives (Real news predicted as Fake):")
    fn_indices = X_test[false_negatives].index[:3]
    for idx in fn_indices:
        print(f"\n  Article {idx}:")
        print(f"    Title: {df.loc[idx, 'title'][:100]}...")
        print(f"    Text (first 200 chars): {df.loc[idx, 'text'][:200]}...")


Misclassification Analysis:
False Positives (Fake predicted as Real): 307
False Negatives (Real predicted as Fake): 207
Total Misclassifications: 514

Sample False Positives (Fake news predicted as Real):

  Article 36034:
    Text (first 200 chars): Deputy Assistant to the President Sebastian Gorka warned Syrian President Bashar Assad after evidence arose earlier this week that the Assad regime could be planning another chemical weapons attack on...

  Article 35351:
    Title: CHICAGO THUG PRESIDENT PERSONALLY LEAKED CHUCK SCHUMER’S OPPOSITION TO HIS DANGEROUS IRAN DEAL...
    Text (first 200 chars): Schumer asked the President not to mention his decision publicly until he could make a formal announcement on Friday.If that s how Obama treats his closest friends who refuse to align with his reckles...

  Article 37250:
    Title: THE TOTAL COST TO TAXPAYERS FOR MOOCH’S EUROPEAN VACATION IS SICKENING...
    Text (first 200 chars): And they re heading to Cape Cod next! Where does it
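
The test-set metrics from Section 7 can be cross-checked by hand against these error counts. With 4,000 real and 3,987 fake articles in the test set, 307 false positives, and 207 false negatives, the remaining confusion-matrix cells follow by subtraction. The sketch below is plain arithmetic using only numbers printed in this notebook.

```python
# Counts taken from the notebook outputs
real_total = 4000   # test articles with label 1
fake_total = 3987   # test articles with label 0
fp = 307            # fake predicted as real
fn = 207            # real predicted as fake

tp = real_total - fn   # real predicted as real
tn = fake_total - fp   # fake predicted as fake

accuracy = (tp + tn) / (real_total + fake_total)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

Because false positives outnumber false negatives, precision comes out lower than recall: the model is somewhat more likely to label a fake article as real than the reverse.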

## 8. Cross-Validation

Perform 5-fold cross-validation on the training data to get a more realistic performance estimate and check for overfitting.


In [26]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=1000,
        stop_words='english',
        ngram_range=(1, 1),
        min_df=2,
        max_df=0.95
    )),
    ('classifier', LogisticRegression(
        random_state=42,
        max_iter=1000,
        C=0.01
    ))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_accuracy = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy', n_jobs=-1)
cv_precision = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='precision', n_jobs=-1)
cv_recall = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='recall', n_jobs=-1)
cv_f1 = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='f1', n_jobs=-1)

print("Cross-Validation Results (5-fold):")
print("=" * 50)
print(f"Accuracy:  {cv_accuracy.mean():.4f} ± {cv_accuracy.std():.4f}")
print(f"Precision: {cv_precision.mean():.4f} ± {cv_precision.std():.4f}")
print(f"Recall:    {cv_recall.mean():.4f} ± {cv_recall.std():.4f}")
print(f"F1-Score:  {cv_f1.mean():.4f} ± {cv_f1.std():.4f}")

print("\n" + "=" * 50)
print("Comparison with Test Set Performance:")
print("=" * 50)
print(f"{'Metric':<12} {'CV Score (mean ± std)':<25} {'Test Score':<15} {'Difference':<15}")
print("-" * 67)

cv_acc_str = f"{cv_accuracy.mean():.4f} ± {cv_accuracy.std():.4f}"
cv_prec_str = f"{cv_precision.mean():.4f} ± {cv_precision.std():.4f}"
cv_rec_str = f"{cv_recall.mean():.4f} ± {cv_recall.std():.4f}"
cv_f1_str = f"{cv_f1.mean():.4f} ± {cv_f1.std():.4f}"

test_acc_str = f"{accuracy:.4f}"
test_prec_str = f"{precision:.4f}"
test_rec_str = f"{recall:.4f}"
test_f1_str = f"{f1:.4f}"

print(f"{'Accuracy':<12} {cv_acc_str:<25} {test_acc_str:<15} {(accuracy - cv_accuracy.mean()):.4f}")
print(f"{'Precision':<12} {cv_prec_str:<25} {test_prec_str:<15} {(precision - cv_precision.mean()):.4f}")
print(f"{'Recall':<12} {cv_rec_str:<25} {test_rec_str:<15} {(recall - cv_recall.mean()):.4f}")
print(f"{'F1-Score':<12} {cv_f1_str:<25} {test_f1_str:<15} {(f1 - cv_f1.mean()):.4f}")

print("\n" + "=" * 50)
print("Interpretation:")
print("=" * 50)
if accuracy - cv_accuracy.mean() > 0.05:
    print("⚠️  WARNING: Test accuracy is significantly higher than CV accuracy.")
    print("   This suggests potential overfitting. The model may not generalize well to new data.")
elif accuracy - cv_accuracy.mean() > 0.02:
    print("⚠️  CAUTION: Test accuracy is slightly higher than CV accuracy.")
    print("   Some overfitting may be present.")
else:
    print("✓ Test and CV scores are similar. Model appears to generalize well.")


Cross-Validation Results (5-fold):
Accuracy:  0.9343 ± 0.0022
Precision: 0.9232 ± 0.0045
Recall:    0.9476 ± 0.0012
F1-Score:  0.9352 ± 0.0020

Comparison with Test Set Performance:
Metric       CV Score (mean ± std)     Test Score      Difference     
-------------------------------------------------------------------
Accuracy     0.9343 ± 0.0022           0.9356          0.0014
Precision    0.9232 ± 0.0045           0.9251          0.0019
Recall       0.9476 ± 0.0012           0.9483          0.0006
F1-Score     0.9352 ± 0.0020           0.9365          0.0013

Interpretation:
✓ Test and CV scores are similar. Model appears to generalize well.
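
One possible refinement, sketched here on a toy corpus rather than the notebook's data: `cross_validate` computes several metrics in a single pass instead of refitting the pipeline once per metric, and `return_train_score=True` exposes the training-fold scores. A large gap between training-fold and validation-fold scores is a more direct overfitting signal than the CV-versus-test difference. The toy texts and labels are assumptions; in the notebook the call would receive `X_train` and `y_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical), 10 documents per class
texts = ["senat said govern wednesday"] * 10 + ["watch video imag featur"] * 10
labels = [1] * 10 + [0] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000, C=0.01)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

# One call fits each fold once and scores all metrics on it
scores = cross_validate(
    pipeline, texts, labels,
    cv=cv, scoring=metrics, return_train_score=True,
)

for metric in metrics:
    train_mean = scores[f"train_{metric}"].mean()
    val_mean = scores[f"test_{metric}"].mean()
    print(f"{metric:<10} train={train_mean:.4f}  validation={val_mean:.4f}  "
          f"gap={train_mean - val_mean:.4f}")
```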


## 9. Save Results

Save model performance metrics and summary to a markdown file.


In [27]:
import os
from datetime import datetime

os.makedirs('../outputs', exist_ok=True)

results_content = f"""# Simple Fake News Classification Results

Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

## Dataset Information
- Total articles: {len(df):,}
- Training samples: {len(X_train):,}
- Test samples: {len(X_test):,}

## Preprocessing
- Removed news sources: (reuters), (reuter), (ap), (afp) - case insensitive
- Removed standalone "reuters" and "reuter" as whole words - case insensitive
- Removed caps+colon patterns at start of text (e.g., "UNBELIEVABLE:")

## Model Configuration
- Algorithm: Logistic Regression
- TF-IDF Features: {X_train_tfidf.shape[1]:,}
- N-grams: (1, 1) - unigrams only
- Max features: 1000
- Regularization (C): 0.01 (strong regularization to reduce overfitting)

## Model Performance (Test Set)
- **Accuracy**: {accuracy:.4f} ({accuracy*100:.2f}%)
- **Precision**: {precision:.4f}
- **Recall**: {recall:.4f}
- **F1-Score**: {f1:.4f}

## Cross-Validation Performance (5-fold)
- **CV Accuracy**: {cv_accuracy.mean():.4f} ± {cv_accuracy.std():.4f}
- **CV Precision**: {cv_precision.mean():.4f} ± {cv_precision.std():.4f}
- **CV Recall**: {cv_recall.mean():.4f} ± {cv_recall.std():.4f}
- **CV F1-Score**: {cv_f1.mean():.4f} ± {cv_f1.std():.4f}

## Overfitting Analysis
- Test vs CV Accuracy Difference: {(accuracy - cv_accuracy.mean()):.4f}
- {'⚠️ WARNING: Significant overfitting detected' if accuracy - cv_accuracy.mean() > 0.05 else '⚠️ CAUTION: Some overfitting may be present' if accuracy - cv_accuracy.mean() > 0.02 else '✓ Model generalizes well'}

## Summary
The Logistic Regression classifier achieved {accuracy*100:.2f}% accuracy on the test set, with a balanced F1-score of {f1:.4f}. Cross-validation shows {cv_accuracy.mean()*100:.2f}% accuracy, indicating {'potential overfitting' if accuracy - cv_accuracy.mean() > 0.02 else 'good generalization'}. The model uses enhanced preprocessing to remove news sources and caps+colon patterns (common in fake news), followed by standard NLP preprocessing (stemming, stopword removal).
"""

# Write as UTF-8 explicitly so the ± and ✓ characters are not garbled
with open('../outputs/simple_classification_results.md', 'w', encoding='utf-8') as f:
    f.write(results_content)

print("Results saved to ../outputs/simple_classification_results.md")


Results saved to ../outputs/simple_classification_results.md


## 10. Export Model

Export the trained model, the TF-IDF vectorizer, and a combined pipeline of the two for GCP deployment.


In [28]:
import joblib
from sklearn.pipeline import Pipeline

os.makedirs('../deployable', exist_ok=True)

model_path = '../deployable/exported_model.pkl'
vectorizer_path = '../deployable/tfidf_vectorizer.pkl'
pipeline_path = '../deployable/model_pipeline.pkl'

joblib.dump(lr_model, model_path)
print(f"✓ Model saved to {model_path}")

joblib.dump(tfidf, vectorizer_path)
print(f"✓ TF-IDF vectorizer saved to {vectorizer_path}")

pipeline = Pipeline([
    ('tfidf', tfidf),
    ('classifier', lr_model)
])
joblib.dump(pipeline, pipeline_path)
print(f"✓ Combined pipeline saved to {pipeline_path}")

print(f"\n📦 Model export completed successfully!")
print(f"   Files ready for GCP deployment:")
print(f"   - {model_path}")
print(f"   - {vectorizer_path}")
print(f"   - {pipeline_path}")


✓ Model saved to ../deployable/exported_model.pkl
✓ TF-IDF vectorizer saved to ../deployable/tfidf_vectorizer.pkl
✓ Combined pipeline saved to ../deployable/model_pipeline.pkl

📦 Model export completed successfully!
   Files ready for GCP deployment:
   - ../deployable/exported_model.pkl
   - ../deployable/tfidf_vectorizer.pkl
   - ../deployable/model_pipeline.pkl
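
A minimal sketch of the round trip a consumer of these files would perform: dump a fitted pipeline with `joblib`, load it back, and call `predict` and `predict_proba`. To stay self-contained, the example fits a tiny pipeline on made-up stemmed text and writes to a temporary directory; the toy corpus and temporary path are assumptions, not the notebook's artifacts.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the trained pipeline (hypothetical data)
texts = ["senat said govern wednesday"] * 10 + ["watch video imag featur"] * 10
labels = [1] * 10 + [0] * 10
fitted = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
]).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_pipeline.pkl"
    joblib.dump(fitted, path)    # same call the notebook uses for export
    loaded = joblib.load(path)   # what a serving process would run

# Inputs must already be preprocessed the same way as the training text
new_docs = ["senat said govern", "watch video"]
predictions = loaded.predict(new_docs)
probabilities = loaded.predict_proba(new_docs)

for doc, label, proba in zip(new_docs, predictions, probabilities):
    print(f"{doc!r}: label={label}, P(real)={proba[1]:.3f}")
```

The exported pipeline contains only the vectorizer and the classifier, so any serving code would also need to apply `standard_preprocess` to raw articles before calling `predict`. Loading a pickle with a different scikit-learn version than the one used for training can also fail or emit warnings, so pinning the version in the deployment environment is a reasonable precaution.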
