# Generate Predictions for Validation Data

This notebook loads the trained Logistic Regression model and generates predictions for `validation_data.csv`, replacing the placeholder value 2 in the `label` column with model predictions (0 = fake, 1 = real).

## Overview
- **Goal**: Generate predictions for validation data using the trained model
- **Model**: Logistic Regression (from notebook 03)
- **Input**: `dataset/validation_data.csv` (every row has the placeholder label 2)
- **Output**: `dataset/validation_data.csv`, overwritten in place with predictions (0 or 1)


## 1. Import Required Libraries


In [1]:
import pandas as pd
import numpy as np
import re
import joblib
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import warnings
warnings.filterwarnings('ignore')

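# nltk.download() returns False on failure rather than raising, so check its result
downloads = [nltk.download(pkg, quiet=True) for pkg in ('punkt', 'stopwords')]
if all(downloads):
    print("Libraries imported successfully!")
else:
    print("NLTK data download failed")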


Libraries imported successfully!


## 2. Load Trained Model and Vectorizer

Load the combined pipeline from the deployable directory if it exists; otherwise fall back to loading the trained Logistic Regression model and TF-IDF vectorizer separately.


In [2]:
import os

model_path = '../deployable/exported_model.pkl'
vectorizer_path = '../deployable/tfidf_vectorizer.pkl'
pipeline_path = '../deployable/model_pipeline.pkl'

try:
    # Try loading pipeline first (easiest option)
    if os.path.exists(pipeline_path):
        pipeline = joblib.load(pipeline_path)
        model = None
        vectorizer = None
        print(f"✓ Pipeline loaded from {pipeline_path}")
        print("  Using combined pipeline for predictions")
    else:
        # Load model and vectorizer separately
        model = joblib.load(model_path)
        vectorizer = joblib.load(vectorizer_path)
        pipeline = None
        print(f"✓ Model loaded from {model_path}")
        print(f"✓ Vectorizer loaded from {vectorizer_path}")
        print("  Using separate model and vectorizer for predictions")
except FileNotFoundError:
    print("❌ Error: Model files not found!")
    print("   Please run notebook 03_simple_classification.ipynb first to export the model.")
    raise
except Exception as e:
    print(f"❌ Error loading model: {e}")
    raise


✓ Pipeline loaded from ../deployable/model_pipeline.pkl
  Using combined pipeline for predictions


## 3. Load Validation Data

Load the validation dataset that needs predictions.


In [3]:
validation_df = pd.read_csv('../dataset/validation_data.csv')

print(f"Validation data shape: {validation_df.shape}")
print(f"Columns: {validation_df.columns.tolist()}")
print(f"\nLabel distribution:")
print(validation_df['label'].value_counts().sort_index())

print(f"\nSample articles:")
for i in range(min(3, len(validation_df))):
    print(f"\n  Article {i+1}:")
    print(f"    Title: {validation_df.iloc[i]['title'][:100]}...")
    print(f"    Text length: {len(validation_df.iloc[i]['text'])} characters")
    print(f"    Current label: {validation_df.iloc[i]['label']}")


Validation data shape: (4956, 5)
Columns: ['label', 'title', 'text', 'subject', 'date']

Label distribution:
label
2    4956
Name: count, dtype: int64

Sample articles:

  Article 1:
    Title: UK's May 'receiving regular updates' on London tube station incident: PM's office...
    Text length: 389 characters
    Current label: 2

  Article 2:
    Title: UK transport police leading investigation of London incident, counter-terrorism police aware...
    Text length: 499 characters
    Current label: 2

  Article 3:
    Title: Pacific nations crack down on North Korean ships as Fiji probes more than 20 vessels...
    Text length: 2685 characters
    Current label: 2
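
Because this notebook later overwrites its own input file, a second run would read labels that are already 0 or 1 instead of the placeholder 2. The following is a minimal sketch of a guard against that, using only the standard library `csv` module on in-memory text; the helper names `label_counts` and `is_unlabeled` are hypothetical and not part of the notebook.

```python
import csv
import io
from collections import Counter

PLACEHOLDER = "2"  # csv yields strings, so the placeholder is compared as text

def label_counts(csv_text):
    """Count the values in the 'label' column of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label"] for row in reader)

def is_unlabeled(csv_text):
    """True only if every row still carries the placeholder label."""
    return set(label_counts(csv_text)) == {PLACEHOLDER}

fresh = "label,title\n2,first article\n2,second article\n"
already_predicted = "label,title\n0,first article\n1,second article\n"

print(is_unlabeled(fresh))
print(is_unlabeled(already_predicted))
```

With pandas, the same check could be written against `validation_df['label']` before the prediction step.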


## 4. Apply Preprocessing

Apply the same preprocessing function used during training to match the model's expected input format.


In [4]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def standard_preprocess(text):
    """Preprocess text to match training preprocessing."""
    if pd.isna(text) or text == "":
        return ""
    
    text_str = str(text)
    
    # Remove news sources (case-insensitive)
    text_str = re.sub(r'\(reuters\)|\(reuter\)|\(ap\)|\(afp\)', '', text_str, flags=re.IGNORECASE)
    text_str = re.sub(r'\breuters\b', '', text_str, flags=re.IGNORECASE)
    text_str = re.sub(r'\breuter\b', '', text_str, flags=re.IGNORECASE)
    
    # Remove caps+colon patterns at start
    text_str = re.sub(r'^[A-Z]{5,}:\s*', '', text_str)
    
    text_str = text_str.lower()
    text_str = re.sub(r'http\S+|www\S+|https\S+', '', text_str)
    text_str = re.sub(r'\S+@\S+', '', text_str)
    text_str = re.sub(r'\s+', ' ', text_str)
    
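    try:
        tokens = word_tokenize(text_str)
    except LookupError:
        # Punkt tokenizer data is missing; fall back to whitespace splitting
        tokens = text_str.split()
    
    # text_str is already lowercased, so tokens can be compared directly
    tokens = [token for token in tokens if token not in stop_words]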
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [token for token in tokens if len(token) > 2]
    
    return ' '.join(tokens)

print("Preprocessing validation data...")
validation_df['title_processed'] = validation_df['title'].apply(standard_preprocess)
validation_df['text_processed'] = validation_df['text'].apply(standard_preprocess)
validation_df['combined_text'] = validation_df['title_processed'] + ' ' + validation_df['text_processed']

# Remove articles with empty processed text
empty_text = validation_df['combined_text'].str.strip() == ''
empty_count = empty_text.sum()
if empty_count > 0:
    print(f"⚠️  Removing {empty_count} articles with empty processed text")
    validation_df = validation_df[~empty_text].reset_index(drop=True)

print(f"✓ Preprocessing completed: {len(validation_df)} articles ready for prediction")
print(f"  Average processed text length: {validation_df['combined_text'].str.split().str.len().mean():.1f} words")


Preprocessing validation data...
⚠️  Removing 2 articles with empty processed text
✓ Preprocessing completed: 4954 articles ready for prediction
  Average processed text length: 299.8 words
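
To see what the regex stage of `standard_preprocess` does on its own, the sketch below reproduces only the cleanup substitutions, without tokenizing, stop-word removal, or stemming, so it runs without any NLTK data. The function name `strip_noise` and the sample sentence are illustrative assumptions; the two `reuters`/`reuter` patterns are merged into one equivalent pattern here.

```python
import re

def strip_noise(text):
    """Regex-only subset of standard_preprocess (no tokenizing or stemming)."""
    # Remove wire-service tags such as "(Reuters)" and bare "Reuters"/"Reuter"
    text = re.sub(r'\(reuters\)|\(reuter\)|\(ap\)|\(afp\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\breuters?\b', '', text, flags=re.IGNORECASE)
    # Remove a leading all-caps prefix followed by a colon, e.g. "BREAKING: "
    text = re.sub(r'^[A-Z]{5,}:\s*', '', text)
    text = text.lower()
    # Remove URLs and email addresses, then collapse runs of whitespace
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    return re.sub(r'\s+', ' ', text)

sample = "WASHINGTON (Reuters) - Read more at https://example.com or mail tips@example.com today"
cleaned = strip_noise(sample)
print(cleaned)  # washington - read more at or mail today
```

Note that `WASHINGTON` survives here because the caps-prefix pattern only matches when a colon follows the capitals.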


## 5. Generate Predictions

Use the trained model to generate predictions for all validation articles.


In [5]:
print("Generating predictions...")

if pipeline is not None:
    # Use pipeline (includes vectorization + prediction)
    predictions = pipeline.predict(validation_df['combined_text'])
    probabilities = pipeline.predict_proba(validation_df['combined_text'])
else:
    # Use separate model and vectorizer
    text_vectorized = vectorizer.transform(validation_df['combined_text'])
    predictions = model.predict(text_vectorized)
    probabilities = model.predict_proba(text_vectorized)

# Replace label column with predictions
validation_df['label'] = predictions

print(f"✓ Predictions generated for {len(validation_df)} articles")
print(f"\nPrediction distribution:")
print(validation_df['label'].value_counts().sort_index())
print(f"\n  Fake (0): {(validation_df['label'] == 0).sum():,} articles ({(validation_df['label'] == 0).sum()/len(validation_df)*100:.1f}%)")
print(f"  Real (1): {(validation_df['label'] == 1).sum():,} articles ({(validation_df['label'] == 1).sum()/len(validation_df)*100:.1f}%)")

print(f"\nSample predictions:")
for i in range(min(5, len(validation_df))):
    pred = int(predictions[i])
    conf = probabilities[i, pred]
    pred_name = 'Fake' if pred == 0 else 'Real'
    title = validation_df.iloc[i]['title'][:80]
    print(f"  {i+1}. {pred_name} (confidence: {conf:.3f}) | '{title}...'")


Generating predictions...
✓ Predictions generated for 4954 articles

Prediction distribution:
label
0    2987
1    1967
Name: count, dtype: int64

  Fake (0): 2,987 articles (60.3%)
  Real (1): 1,967 articles (39.7%)

Sample predictions:
  1. Real (confidence: 0.668) | 'UK's May 'receiving regular updates' on London tube station incident: PM's offic...'
  2. Real (confidence: 0.580) | 'UK transport police leading investigation of London incident, counter-terrorism ...'
  3. Real (confidence: 0.762) | 'Pacific nations crack down on North Korean ships as Fiji probes more than 20 ves...'
  4. Real (confidence: 0.778) | 'Three suspected al Qaeda militants killed in Yemen drone strike...'
  5. Real (confidence: 0.815) | 'Chinese academics prod Beijing to consider North Korea contingencies...'
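
The sample confidences above range from 0.580 to 0.815, so some predictions sit close to the 0.5 decision boundary. As a hedged sketch, the helper below turns rows of class probabilities into a label, a confidence, and a review flag. It assumes the columns are ordered as (fake, real), which matches `predict_proba` only when the model's `classes_` are `[0, 1]`; the name `summarize` and the 0.6 review threshold are hypothetical choices, not values from the notebook.

```python
def summarize(prob_rows, class_names=("Fake", "Real"), review_below=0.6):
    """Return (label, confidence, needs_review) for each row of class probabilities."""
    summary = []
    for row in prob_rows:
        # Index of the most likely class; its probability is the confidence
        pred = max(range(len(row)), key=lambda k: row[k])
        conf = row[pred]
        summary.append((class_names[pred], conf, conf < review_below))
    return summary

# Illustrative probability rows, not taken from the model output
rows = [[0.332, 0.668], [0.420, 0.580], [0.910, 0.090]]
report = summarize(rows)
for label, conf, review in report:
    flag = " (review)" if review else ""
    print(f"{label} (confidence: {conf:.3f}){flag}")
```

In the notebook, `probabilities` could be passed in place of `rows`, since each row supports integer indexing in the same way.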


## 6. Save Updated Validation Data

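Save the validation data with predictions replacing the original label column. The two articles dropped during preprocessing are not written back, so the saved file has 4,954 rows instead of the original 4,956.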


In [6]:
# Prepare output dataframe with original columns (excluding processed columns)
output_df = validation_df[['label', 'title', 'text', 'subject', 'date']].copy()

# Verify the output format
print("Output format verification:")
print(f"  Columns: {output_df.columns.tolist()}")
print(f"  Shape: {output_df.shape}")
print(f"  Label range: {output_df['label'].min()} to {output_df['label'].max()}")
print(f"  Label types: {sorted(output_df['label'].unique())}")

# Save to validation_data.csv (overwriting original)
output_path = '../dataset/validation_data.csv'
output_df.to_csv(output_path, index=False)

print(f"\n✓ Updated validation_data.csv saved to {output_path}")
print(f"  Total articles: {len(output_df):,}")
print(f"  Fake predictions: {(output_df['label'] == 0).sum():,}")
print(f"  Real predictions: {(output_df['label'] == 1).sum():,}")

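# Also save a timestamped copy of the predictions
# (the original file has already been overwritten above, so this does not preserve the label=2 data)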
from datetime import datetime
backup_path = f'../dataset/validation_data_backup_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
output_df.to_csv(backup_path, index=False)
print(f"✓ Backup saved to {backup_path}")

print(f"\n✅ Validation data predictions completed successfully!")


Output format verification:
  Columns: ['label', 'title', 'text', 'subject', 'date']
  Shape: (4954, 5)
  Label range: 0 to 1
  Label types: [np.int64(0), np.int64(1)]

✓ Updated validation_data.csv saved to ../dataset/validation_data.csv
  Total articles: 4,954
  Fake predictions: 2,987
  Real predictions: 1,967
✓ Backup saved to ../dataset/validation_data_backup_20251115_130214.csv

✅ Validation data predictions completed successfully!
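
The cell above writes the timestamped copy after `validation_data.csv` has already been overwritten, so both files contain predictions and the original placeholder data is not preserved. If keeping the original matters, one option is to copy the file before writing. The sketch below shows that ordering with the standard library only, using a temporary directory and plain text in place of the real dataset; `backup_then_write` is a hypothetical helper, not part of the notebook.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup_then_write(path, new_content):
    """Copy `path` to a timestamped sibling file, then overwrite `path`."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup = path.with_name(f"{path.stem}_backup_{stamp}{path.suffix}")
    shutil.copy2(path, backup)  # preserve the original contents first
    path.write_text(new_content, encoding="utf-8")
    return backup

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "validation_data.csv"
    target.write_text("label,title\n2,example\n", encoding="utf-8")

    backup = backup_then_write(target, "label,title\n1,example\n")

    original_kept = backup.read_text(encoding="utf-8")
    updated = target.read_text(encoding="utf-8")
    print(backup.name)
```

In the notebook, the equivalent would be a `shutil.copy2` call on `output_path` placed before `output_df.to_csv(output_path, index=False)`.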
