# RoBERTa Evaluation on ISOT Dataset

## Introduction

This notebook evaluates a fine-tuned RoBERTa model on two distinct fake news detection scenarios:
1. **Titles-only dataset**: Using only article headlines
2. **Full-text dataset**: Using complete articles with both titles and text

This dual evaluation approach provides insights into how well RoBERTa performs with limited context versus full article context. As a larger and more sophisticated model compared to DistilBERT and TinyBERT, RoBERTa may capture more subtle linguistic patterns but at the cost of greater computational requirements. Understanding these trade-offs is crucial for selecting the appropriate model for different deployment scenarios.

## 1. Setting Up the Environment

Let's start by importing all necessary libraries and setting up utility functions to monitor resource usage:

In [None]:
# Import necessary libraries
import os
import time
import numpy as np
import pandas as pd
import torch
import psutil
import gc
import re

These core libraries provide the foundation for data manipulation, model evaluation, and resource monitoring - especially important for understanding RoBERTa's computational requirements.

In [None]:
# Import model and evaluation libraries
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from datasets import Dataset as HFDataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

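These specialized libraries enable us to work with the RoBERTa model and evaluate its performance using standard metrics. RoBERTa keeps BERT's Transformer encoder architecture but changes the pre-training recipe and uses a byte-level BPE tokenizer instead of WordPiece, which is why it needs its own tokenizer class rather than the one used for BERT-based models like DistilBERT.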

In [None]:
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

Visualization tools will help us understand and communicate performance results more effectively, especially for comparing RoBERTa with other models.

In [None]:
# Set device - using CPU for edge device testing
device = torch.device("cpu")
print(f"Using device: {device}")

I deliberately chose to use the CPU rather than GPU for this evaluation because:
1. It provides a consistent comparison with other models evaluated on CPU
2. CPU performance metrics are more relevant for assessing deployment feasibility on standard hardware
3. RoBERTa's larger size makes it particularly important to understand its CPU performance characteristics
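
CPU timings only mean something alongside a description of the machine that produced them. The following is a minimal sketch, using only the standard library, of how that context could be recorded next to the results. The helper name `describe_cpu_environment` is my own and is not part of the original notebook.

```python
import os
import platform


def describe_cpu_environment():
    """Collect basic facts about the machine running a CPU benchmark."""
    return {
        "machine": platform.machine(),            # e.g. x86_64 or arm64
        "system": platform.system(),              # e.g. Linux, Darwin, Windows
        "python_version": platform.python_version(),
        "logical_cpus": os.cpu_count(),           # may be None on unusual platforms
    }


env_info = describe_cpu_environment()
for key, value in env_info.items():
    print(f"{key}: {value}")
```

Saving this dictionary with the metrics makes it easier to tell later whether two runs are comparable.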

In [None]:
# Function to get current memory usage
def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024  # Convert to MB

print(f"Starting memory usage: {get_memory_usage():.2f} MB")

This utility function is especially important for RoBERTa, which has approximately 125M parameters compared to DistilBERT's 67M. Understanding its memory footprint is crucial for determining its viability in memory-constrained environments.
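
To put the parameter counts in perspective, here is a rough back-of-the-envelope estimate of the memory the weights alone would need, assuming 32-bit floats (4 bytes per parameter). This is a sketch: the process memory reported by `get_memory_usage()` will be higher, because it also includes the Python runtime, libraries, and activations.

```python
def parameter_memory_mb(num_parameters, bytes_per_parameter=4):
    """Approximate memory for model weights alone, in MB (1 MB = 1024 * 1024 bytes)."""
    return num_parameters * bytes_per_parameter / 1024 / 1024


# Approximate parameter counts quoted in the text above
roberta_params = 125_000_000
distilbert_params = 67_000_000

roberta_mb = parameter_memory_mb(roberta_params)
distilbert_mb = parameter_memory_mb(distilbert_params)

print(f"RoBERTa weights (float32):    ~{roberta_mb:.0f} MB")
print(f"DistilBERT weights (float32): ~{distilbert_mb:.0f} MB")
print(f"Ratio: {roberta_mb / distilbert_mb:.2f}x")
```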

## 2. Loading the Pre-trained Model

Now I'll load the RoBERTa model that was previously fine-tuned on the ISOT dataset:

In [None]:
# Load the pre-trained RoBERTa model
print("\nLoading model...")
model_path = "../ml_models/roberta-fake-news-detector"

In [None]:
# Initialize tokenizer
start_time = time.time()
tokenizer = RobertaTokenizer.from_pretrained(model_path)

In [None]:
# Load model
model = RobertaForSequenceClassification.from_pretrained(model_path)
model.to(device)  # Move to CPU
load_time = time.time() - start_time

In [None]:
print(f"Model loaded in {load_time:.2f} seconds")
print(f"Memory usage after loading model: {get_memory_usage():.2f} MB")

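Model loading time and memory footprint are critical metrics for RoBERTa. Because it is a larger model, we expect higher memory requirements and potentially longer loading times than for DistilBERT or TinyBERT, and quantifying both is important for deployment planning. Note that the reported load time covers both the tokenizer and the model, since the timer starts before the tokenizer is initialized.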
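
If separate timings for the tokenizer and the model are of interest, a small context manager keeps the bookkeeping tidy. The sketch below uses `time.sleep` as a stand-in for the real loading calls. The `timed` helper and the `timings` dictionary are illustrative names, not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed("tokenizer_load", timings):
    time.sleep(0.01)  # placeholder for the tokenizer loading call

with timed("model_load", timings):
    time.sleep(0.02)  # placeholder for the model loading call

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.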

## 3. Data Loading and Preparation

Next, I'll load various data sources for our evaluation:

In [None]:
# Load all data sources
print("\nLoading all data sources...")

In [None]:
# 1. Load titles_only_real.csv (real news, titles only)
try:
    titles_only_real_df = pd.read_csv('../data/titles_only_real.csv')
    # Ensure label is 1 for real news
    titles_only_real_df['label'] = 1
    print(f"Loaded {len(titles_only_real_df)} real news titles from titles_only_real.csv")
except Exception as e:
    print(f"Error loading titles_only_real.csv: {e}")
    titles_only_real_df = pd.DataFrame(columns=['title', 'label'])

In [None]:
# 2. Load FakeNewsNet (fake news only)
try:
    fake_news_net_df = pd.read_csv('./datasets/simplified_FakeNewsNet.csv')
    # Keep only fake news
    fake_news_net_df = fake_news_net_df[fake_news_net_df['label'] == 0]
    print(f"Loaded {len(fake_news_net_df)} fake news articles from FakeNewsNet")
except Exception as e:
    print(f"Error loading FakeNewsNet: {e}")
    fake_news_net_df = pd.DataFrame(columns=['title', 'label'])

In [None]:
# 3. Load fake_news_evaluation.csv (fake news with text)
try:
    fake_news_eval_df = pd.read_csv('./datasets/fake_news_evaluation.csv')
    # Ensure label is 0 for fake news
    fake_news_eval_df['label'] = 0
    print(f"Loaded {len(fake_news_eval_df)} fake news articles from fake_news_evaluation.csv")
except Exception as e:
    print(f"Error loading fake_news_evaluation.csv: {e}")
    fake_news_eval_df = pd.DataFrame(columns=['title', 'text', 'label'])

In [None]:
# 4. Load manual_real.csv (real news with text)
try:
    manual_real_df = pd.read_csv('./datasets/manual_real.csv')
    # Ensure label is 1 for real news
    manual_real_df['label'] = 1
    print(f"Loaded {len(manual_real_df)} real news articles with text from manual_real.csv")
except Exception as e:
    print(f"Error loading manual_real.csv: {e}")
    manual_real_df = pd.DataFrame(columns=['title', 'text', 'label'])

I'm using the same diverse set of data sources as in the DistilBERT evaluation to enable direct model comparisons. Consistent data is essential for making fair comparisons between different model architectures.
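
Three of the four loading cells above follow the same pattern: read a CSV, overwrite the label, and fall back to an empty frame on failure. A small helper could capture that pattern. This is a sketch under my own naming (`load_labeled_csv` is hypothetical), and it catches a narrower set of exceptions than the cells above. The FakeNewsNet cell filters by label instead of overwriting it, so it would still need its own handling.

```python
import pandas as pd


def load_labeled_csv(path, label, columns):
    """Read a CSV and set every row's label; return an empty frame on failure."""
    try:
        df = pd.read_csv(path)
        df["label"] = label
        print(f"Loaded {len(df)} rows from {path}")
        return df
    except (FileNotFoundError, pd.errors.EmptyDataError, pd.errors.ParserError) as e:
        print(f"Error loading {path}: {e}")
        return pd.DataFrame(columns=columns)


# A missing file yields an empty frame with the expected columns
missing_df = load_labeled_csv("does_not_exist.csv", label=1, columns=["title", "label"])
print(list(missing_df.columns), len(missing_df))
```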

## 4. Preparing Title-Only Dataset

Now I'll prepare the title-only dataset for evaluation:

In [None]:
# Create title-only dataset
print("\nPreparing title-only dataset...")

# Get the target size (number of real news titles) for balancing
real_titles_count = len(titles_only_real_df)
print(f"Target size for balanced dataset: {real_titles_count} articles per class")

In [None]:
# Prepare fake news data (titles only)
# 1. From FakeNewsNet: use the title as the model input, whether or not a text column exists
fake_news_net_df = fake_news_net_df.copy()  # work on a copy of the filtered frame to avoid SettingWithCopyWarning
fake_news_net_df['text'] = fake_news_net_df['title']

In [None]:
# 2. From fake_news_evaluation.csv
fake_news_eval_titles_df = fake_news_eval_df.copy()
fake_news_eval_titles_df['text'] = fake_news_eval_titles_df['title']  # Use only title, not full text

In [None]:
# Combine all fake news sources (titles only)
fake_news_title_only = pd.concat([fake_news_net_df[['text', 'label']], 
                                 fake_news_eval_titles_df[['text', 'label']]], 
                                 ignore_index=True)

In [None]:
# Balance fake news to match real news count
# This lets us handle a growing real news dataset without manual adjustment
if len(fake_news_title_only) > real_titles_count:
    print(f"Balancing fake news dataset: sampling {real_titles_count} articles from {len(fake_news_title_only)} total")
    # Sample randomly to match the real news count
    fake_news_title_only = fake_news_title_only.sample(n=real_titles_count, random_state=42)
else:
    print(f"Note: Not enough fake news articles ({len(fake_news_title_only)}) to match real news count ({real_titles_count})")

In [None]:
# Prepare real news data (titles only)
if 'text' not in titles_only_real_df.columns:
    titles_only_real_df['text'] = titles_only_real_df['title']

In [None]:
# Combine fake and real news (titles only)
title_only_dataset_df = pd.concat([fake_news_title_only, titles_only_real_df[['text', 'label']]], 
                                 ignore_index=True)

In [None]:
# Shuffle to mix real and fake news
title_only_dataset_df = title_only_dataset_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
print(f"Prepared title-only dataset with {len(title_only_dataset_df)} articles")
print(f"Class distribution: {title_only_dataset_df['label'].value_counts().to_dict()}")

In [None]:
# Convert to HuggingFace Dataset format
title_only_dataset = HFDataset.from_pandas(title_only_dataset_df)

For the title-only dataset, I follow the same procedure as with DistilBERT to ensure fair comparisons. Testing RoBERTa on headline-only data is particularly interesting because:

1. RoBERTa's more sophisticated encoding of context might provide advantages even with limited text
2. The model's larger capacity might extract more nuanced features from headlines
3. It helps quantify whether RoBERTa's additional complexity offers benefits in minimal-context scenarios
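
The balancing step above only downsamples fake news when it outnumbers real news. A more general variant, sketched here on toy data, downsamples every class to the size of the smallest one. The function name `downsample_to_balance` is my own, and the sketch assumes a pandas version that provides `DataFrameGroupBy.sample` (1.1 or later).

```python
import pandas as pd


def downsample_to_balance(df, label_col="label", seed=42):
    """Downsample every class to the size of the smallest class."""
    smallest = int(df[label_col].value_counts().min())
    balanced = df.groupby(label_col).sample(n=smallest, random_state=seed)
    # Shuffle so the classes are interleaved, then reset the index
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 5 fake (0) and 3 real (1) headlines
toy_df = pd.DataFrame({
    "text": [f"headline {i}" for i in range(8)],
    "label": [0, 0, 0, 0, 0, 1, 1, 1],
})

balanced_df = downsample_to_balance(toy_df)
print(balanced_df["label"].value_counts().to_dict())
```

With a fixed seed the sample is reproducible, which matters when the same subset needs to be reused across model comparisons.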

## 5. Preparing Full-Text Dataset

Similarly, I'll prepare the full-text dataset:

In [None]:
# Create full-text dataset
print("\nPreparing full-text dataset...")

In [None]:
# Prepare fake news data (with full text)
# Use fake_news_eval_df which already has text
fake_news_full_text_df = fake_news_eval_df.copy()
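# Fill missing titles as well as missing text; a NaN title would otherwise turn the whole string into NaN
fake_news_full_text_df['text'] = fake_news_full_text_df['title'].fillna('') + " " + fake_news_full_text_df['text'].fillna('')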

In [None]:
# Prepare real news data (with full text)
manual_real_text_df = manual_real_df.copy()
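# Fill missing titles as well as missing text; a NaN title would otherwise turn the whole string into NaN
manual_real_text_df['text'] = manual_real_text_df['title'].fillna('') + " " + manual_real_text_df['text'].fillna('')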

In [None]:
# Count examples per class
fake_count = len(fake_news_full_text_df)
real_count = len(manual_real_text_df)

print(f"Full-text dataset - Fake: {fake_count}, Real: {real_count}")

In [None]:
# Balance the datasets if needed
if fake_count > real_count:
    print(f"Balancing full-text dataset: sampling {real_count} fake articles from {fake_count}")
    fake_news_full_text_df = fake_news_full_text_df.sample(n=real_count, random_state=42)
elif real_count > fake_count:
    print(f"Balancing full-text dataset: sampling {fake_count} real articles from {real_count}")
    manual_real_text_df = manual_real_text_df.sample(n=fake_count, random_state=42)

In [None]:
# Combine fake and real news (with full text)
full_text_dataset_df = pd.concat([fake_news_full_text_df[['text', 'label']], 
                                manual_real_text_df[['text', 'label']]], 
                                ignore_index=True)

In [None]:
# Shuffle to mix real and fake news
full_text_dataset_df = full_text_dataset_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
print(f"Prepared full-text dataset with {len(full_text_dataset_df)} articles")
print(f"Class distribution: {full_text_dataset_df['label'].value_counts().to_dict()}")

In [None]:
# Convert to HuggingFace Dataset format
full_text_dataset = HFDataset.from_pandas(full_text_dataset_df)

The full-text evaluation is where RoBERTa's capabilities might shine the most:

1. Self-attention lets every token attend to every other token in the input, which should help with long-range dependencies, though only within the 512-token window set during tokenization
2. The model's larger capacity might better represent complex linguistic patterns
3. The additional pre-training data used in RoBERTa might help it better understand nuanced language
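
Because inputs are truncated to 512 tokens, it can be useful to estimate how many articles lose content before running the real tokenizer. The sketch below uses whitespace word counts as a rough proxy. The `tokens_per_word` factor of 1.3 is an assumption chosen for illustration, not a measured property of the RoBERTa tokenizer, and the function name is hypothetical.

```python
def share_likely_truncated(texts, max_tokens=512, tokens_per_word=1.3):
    """Estimate the fraction of texts that would exceed max_tokens.

    tokens_per_word is an assumed average; subword tokenizers usually emit
    more tokens than whitespace-separated words, but the ratio varies by text.
    """
    if not texts:
        return 0.0
    over_limit = sum(
        1 for text in texts if len(text.split()) * tokens_per_word > max_tokens
    )
    return over_limit / len(texts)


# Toy articles of 50, 400, and 1000 words
toy_articles = [
    "word " * 50,
    "word " * 400,
    "word " * 1000,
]

print(f"Estimated share truncated: {share_likely_truncated(toy_articles):.2f}")
```

For an exact figure, the token counts produced by the tokenizer itself should be used instead of this proxy.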

## 6. Evaluation Utility Functions

Next, I'll define utility functions for evaluation, adapted for RoBERTa:

In [None]:
# Define tokenization function
def tokenize_dataset(dataset):
    """Tokenize a dataset using the RoBERTa tokenizer"""
    print(f"Tokenizing dataset with {len(dataset)} examples...")
    tokenize_start_time = time.time()
    
    # Define tokenization function
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            padding='max_length',
            truncation=True,
            max_length=512,
            return_tensors=None
        )
    
    # Clean dataset to handle edge cases
    def clean_dataset(example):
        example['text'] = str(example['text']) if example['text'] is not None else ""
        return example
    
    # Clean and tokenize
    cleaned_dataset = dataset.map(clean_dataset)
    tokenized_dataset = cleaned_dataset.map(tokenize_function, batched=True)
    tokenized_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    
    tokenize_time = time.time() - tokenize_start_time
    print(f"Dataset tokenized in {tokenize_time:.2f} seconds")
    print(f"Memory usage after tokenization: {get_memory_usage():.2f} MB")
    
    return tokenized_dataset

In [None]:
# Define model evaluation function - Part 1: Setup
def evaluate_model_setup(tokenized_dataset, dataset_name):
    """Create the DataLoader and initialize the accumulators used during evaluation"""
    print(f"\nEvaluating model on {dataset_name} dataset...")
    
    # Reset all counters and lists
    all_preds = []
    all_labels = []
    total_inference_time = 0
    sample_count = 0
    inference_times = []
    memory_usages = []
    
    # Create DataLoader
    from torch.utils.data import DataLoader
    eval_dataloader = DataLoader(
        tokenized_dataset, 
        batch_size=8,  # Smaller batch size for RoBERTa due to larger model size
        shuffle=False
    )
    
    print(f"Starting evaluation on {len(tokenized_dataset)} examples")
    
    return eval_dataloader, all_preds, all_labels, total_inference_time, sample_count, inference_times, memory_usages

Note that I reduced the batch size for RoBERTa to 8 (compared to 16 for DistilBERT) because of its larger memory requirements. Per-batch times are therefore not directly comparable between the two models, so the per-sample inference time is the figure to use for cross-model comparison.
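
To compare latency across runs with different batch sizes, the timings need to be normalized per sample. The sketch below shows that arithmetic on made-up numbers. The timings are illustrative only, not measurements from this notebook, and the function name is my own.

```python
def per_sample_latency_ms(batch_times_s, batch_sizes):
    """Average latency per sample in milliseconds across all batches."""
    total_samples = sum(batch_sizes)
    if total_samples == 0:
        raise ValueError("No samples were processed")
    return sum(batch_times_s) / total_samples * 1000


# Illustrative timings only, not measured results
batches_of_8 = per_sample_latency_ms([0.8, 0.8, 0.4], [8, 8, 4])
batches_of_16 = per_sample_latency_ms([1.6, 0.4], [16, 4])

print(f"batch_size=8:  {batches_of_8:.1f} ms/sample")
print(f"batch_size=16: {batches_of_16:.1f} ms/sample")
```

Dividing total time by total samples, rather than averaging per-batch times, keeps a smaller final batch from skewing the result.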

In [None]:
# Define model evaluation function - Part 2: Inference loop
def run_evaluation_loop(eval_dataloader, all_preds, all_labels, total_inference_time, 
                       sample_count, inference_times, memory_usages):
    """Run the evaluation loop on the provided dataloader"""
    
    # Evaluation loop
    model.eval()
    with torch.no_grad():
        for batch_idx, batch in enumerate(eval_dataloader):
            # Track batch progress
            if batch_idx % 5 == 0:
                print(f"Processing batch {batch_idx}/{len(eval_dataloader)}")
            
            # Extract batch data
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            # Record batch size
            current_batch_size = input_ids.size(0)
            sample_count += current_batch_size
            
            # Memory tracking
            memory_usages.append(get_memory_usage())
            
            # Time the inference
            start_time = time.time()
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            batch_inference_time = time.time() - start_time
            inference_times.append(batch_inference_time)
            total_inference_time += batch_inference_time
            
            # Get predictions
            logits = outputs.logits
            predictions = torch.softmax(logits, dim=-1)
            predicted_labels = torch.argmax(predictions, dim=1).cpu().numpy()
            
            # Store predictions and labels
            all_preds.extend(predicted_labels)
            all_labels.extend(labels.cpu().numpy())
    
    print(f"Evaluation complete. Total predictions: {len(all_preds)}, Total labels: {len(all_labels)}")
    
    return all_preds, all_labels, total_inference_time, sample_count, inference_times, memory_usages

In [None]:
# Define model evaluation function - Part 3: Metrics calculation
def calculate_metrics(all_preds, all_labels, total_inference_time, sample_count, 
                     inference_times, memory_usages, dataset_name):
    """Calculate performance metrics from evaluation results"""
    
    if len(all_preds) == len(all_labels):
        accuracy = accuracy_score(all_labels, all_preds)
        precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='weighted')
        
        print(f"\nEvaluation Results for {dataset_name} dataset:")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1 Score: {f1:.4f}")
        
        # Create confusion matrix
        cm = np.zeros((2, 2), dtype=int)
        for true_label, pred_label in zip(all_labels, all_preds):
            cm[true_label, pred_label] += 1
        
        print(f"\nConfusion Matrix for {dataset_name} dataset:")
        print(cm)
        
        # Resource consumption analysis (timings cover model forward passes only, not data loading)
        print(f"\nResource Consumption Analysis for {dataset_name} dataset:")
        print(f"Total inference time: {total_inference_time:.2f} seconds")
        print(f"Average inference time per batch: {np.mean(inference_times):.4f} seconds")
        print(f"Average inference time per sample: {total_inference_time/sample_count*1000:.2f} ms")
        print(f"Peak memory usage: {max(memory_usages):.2f} MB")
        
        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'confusion_matrix': cm,
            'inference_time_per_sample': total_inference_time/sample_count*1000,
            'peak_memory': max(memory_usages),
        }
    else:
        print("ERROR: Cannot calculate metrics - prediction and label counts don't match")
        return None

In [None]:
# Define model evaluation function - Part 4: Visualization
def visualize_results(metrics_dict, all_labels, all_preds, inference_times, memory_usages, dataset_name):
    """Create visualizations of evaluation results"""
    
    # Plot confusion matrix
    cm = metrics_dict['confusion_matrix']
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Fake', 'Real'], yticklabels=['Fake', 'Real'])
    plt.title(f'Confusion Matrix - {dataset_name}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.savefig(f'roberta_confusion_matrix_{dataset_name.lower().replace(" ", "_")}.png')
    plt.show()
    
    # Plot resource usage
    plt.figure(figsize=(12, 8))
    
    plt.subplot(2, 1, 1)
    plt.plot(inference_times)
    plt.title(f'Inference Time per Batch (CPU) - {dataset_name}')
    plt.xlabel('Batch')
    plt.ylabel('Time (seconds)')
    
    plt.subplot(2, 1, 2)
    plt.plot(memory_usages, label='System Memory')
    plt.title(f'Memory Usage During Evaluation (CPU) - {dataset_name}')
    plt.xlabel('Batch')
    plt.ylabel('Memory (MB)')
    plt.legend()
    
    plt.tight_layout()
    plt.savefig(f'roberta_resource_usage_{dataset_name.lower().replace(" ", "_")}.png')
    plt.show()
    
    # Generate classification report
    print(f"\nDetailed Classification Report for {dataset_name}:")
    report = classification_report(all_labels, all_preds, target_names=['Fake News', 'Real News'])
    print(report)
    
    metrics_dict['classification_report'] = report
    return metrics_dict

In [None]:
# Combined evaluation function
def evaluate_model(tokenized_dataset, dataset_name):
    """Complete evaluation pipeline"""
    # Setup
    eval_dataloader, all_preds, all_labels, total_inference_time, sample_count, inference_times, memory_usages = evaluate_model_setup(tokenized_dataset, dataset_name)
    
    # Run inference
    all_preds, all_labels, total_inference_time, sample_count, inference_times, memory_usages = run_evaluation_loop(
        eval_dataloader, all_preds, all_labels, total_inference_time, sample_count, inference_times, memory_usages
    )
    
    # Calculate metrics
    metrics_dict = calculate_metrics(
        all_preds, all_labels, total_inference_time, sample_count, inference_times, memory_usages, dataset_name
    )
    
    if metrics_dict:
        # Visualize results
        metrics_dict = visualize_results(
            metrics_dict, all_labels, all_preds, inference_times, memory_usages, dataset_name
        )
    
    return metrics_dict

The evaluation functions maintain the same structure as for DistilBERT, with the batch size reduced to accommodate RoBERTa's memory requirements. The careful tracking of memory and inference time is especially valuable for RoBERTa, which we expect to be more resource-intensive than both DistilBERT and TinyBERT.
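
The weighted averages reported by `calculate_metrics` can hide asymmetries between the two classes. As a sketch, per-class precision, recall, and F1 can be derived directly from the 2x2 confusion matrix, using the same layout as above (rows are true labels, columns are predictions, 0 = fake, 1 = real). The counts below are illustrative only.

```python
def per_class_metrics(cm):
    """Precision, recall, and F1 for each class of a 2x2 confusion matrix.

    cm[i][j] counts examples whose true label is i and predicted label is j.
    """
    results = {}
    for cls, name in enumerate(["fake", "real"]):
        true_pos = cm[cls][cls]
        predicted_as_cls = cm[0][cls] + cm[1][cls]   # column sum
        actually_cls = cm[cls][0] + cm[cls][1]       # row sum
        precision = true_pos / predicted_as_cls if predicted_as_cls else 0.0
        recall = true_pos / actually_cls if actually_cls else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[name] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Illustrative counts only, not results from this notebook
toy_cm = [[40, 10],
          [5, 45]]

for name, scores in per_class_metrics(toy_cm).items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

The classification report printed by `visualize_results` contains the same per-class figures. This version just makes the arithmetic explicit.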

## 7. Evaluating Title-Only Dataset

Now I'll evaluate the model on the title-only dataset:

In [None]:
# Tokenize the title-only dataset
title_only_tokenized = tokenize_dataset(title_only_dataset)

In [None]:
# Evaluate model on title-only dataset
title_only_results = evaluate_model(title_only_tokenized, "Title-Only")

For RoBERTa, the title-only evaluation is particularly interesting because:
1. It tests whether RoBERTa's advanced architecture provides benefits even with minimal context
2. It helps quantify whether RoBERTa's additional complexity is justified for headline-only screening
3. It provides insight into whether RoBERTa's more sophisticated language understanding can extract better signals from limited text
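
One caveat when reading the title-only timings: `tokenize_dataset` uses `padding='max_length'` with `max_length=512`, so every headline is padded to 512 tokens and costs roughly as much to process as a full article. The sketch below, on hypothetical headline lengths, counts how many token positions the model processes under fixed padding versus padding each batch to its longest sequence. The function name is my own.

```python
def padded_positions(token_lengths, batch_size, pad_to=None):
    """Count token positions processed when batching sequences.

    pad_to=None pads each batch to its longest sequence (dynamic padding);
    an integer pads every sequence to that fixed length.
    """
    total = 0
    for start in range(0, len(token_lengths), batch_size):
        batch = token_lengths[start:start + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total


# Hypothetical headline lengths in tokens
headline_lengths = [12, 18, 9, 22, 15, 11, 20, 14]

fixed = padded_positions(headline_lengths, batch_size=8, pad_to=512)
dynamic = padded_positions(headline_lengths, batch_size=8)

print(f"Fixed padding to 512: {fixed} positions")
print(f"Dynamic padding:      {dynamic} positions")
print(f"Reduction:            {fixed / dynamic:.1f}x")
```

If headline-only latency is the quantity of interest, padding per batch (for example with the `DataCollatorWithPadding` class from `transformers`) would likely give a more representative figure. I have kept fixed padding here for consistency with the other model evaluations.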

## 8. Evaluating Full-Text Dataset

Next, I'll evaluate the model on the full-text dataset:

In [None]:
# Tokenize the full-text dataset
full_text_tokenized = tokenize_dataset(full_text_dataset)

In [None]:
# Evaluate model on full-text dataset
full_text_results = evaluate_model(full_text_tokenized, "Full-Text")

The full-text evaluation for RoBERTa is expected to showcase its strengths:
1. Full articles supply far more context than headlines, although inputs are still truncated to the 512-token limit applied during tokenization
2. The additional pre-training data used for RoBERTa should help it better understand complex news articles
3. The increased parameter count might allow it to capture more subtle linguistic patterns that distinguish real from fake news

## 9. Comparing Results Between Datasets

Finally, I'll create a comparative analysis of both approaches:

In [None]:
# Create comparison table
if title_only_results and full_text_results:
    comparison_df = pd.DataFrame({
        'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'Inference Time (ms/sample)', 'Peak Memory (MB)'],
        'Title-Only': [
            title_only_results['accuracy'],
            title_only_results['precision'],
            title_only_results['recall'],
            title_only_results['f1'],
            title_only_results['inference_time_per_sample'],
            title_only_results['peak_memory']
        ],
        'Full-Text': [
            full_text_results['accuracy'],
            full_text_results['precision'],
            full_text_results['recall'],
            full_text_results['f1'],
            full_text_results['inference_time_per_sample'],
            full_text_results['peak_memory']
        ]
    })

In [None]:
# Format and display comparison table (only if both evaluations returned results)
if title_only_results and full_text_results:
    def format_value(x):
        # Four decimals for scores and small values, two for large values such as memory in MB
        return f"{x:.4f}" if x < 100 else f"{x:.2f}"

    comparison_df['Title-Only'] = comparison_df['Title-Only'].apply(format_value)
    comparison_df['Full-Text'] = comparison_df['Full-Text'].apply(format_value)

    print("Performance Comparison Between Datasets:")
    print(comparison_df.to_string(index=False))
else:
    print("Comparison skipped: one or both evaluations did not return results")

In [None]:
# Create visualization of metrics comparison
metrics = comparison_df.iloc[:4].copy()  # First 4 rows (accuracy, precision, recall, f1); copy so the casts below do not modify a slice

# Convert to numeric for plotting
metrics['Title-Only'] = metrics['Title-Only'].astype(float)
metrics['Full-Text'] = metrics['Full-Text'].astype(float)

plt.figure(figsize=(10, 6))
bar_width = 0.35
index = np.arange(len(metrics))

plt.bar(index, metrics['Title-Only'], bar_width, label='Title-Only')
plt.bar(index + bar_width, metrics['Full-Text'], bar_width, label='Full-Text')

plt.xlabel('Metrics')
plt.ylabel('Score')
plt.title('RoBERTa Performance Comparison: Title-Only vs Full-Text')
plt.xticks(index + bar_width / 2, metrics['Metric'])
plt.legend()
plt.tight_layout()
plt.savefig('roberta_performance_comparison.png')
plt.show()

This comparative analysis addresses several key questions:
1. Does RoBERTa's more sophisticated architecture provide significant benefits over simpler models?
2. Is the increase in computational requirements justified by improved performance?
3. How does the performance gap between title-only and full-text approaches compare to other models?
4. Does RoBERTa's additional complexity translate to better handling of limited context?
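
To answer these questions numerically, it helps to look at the change from title-only to full-text, in both absolute and relative terms. The sketch below uses placeholder values. In the notebook, the two result dictionaries returned by `evaluate_model` would be passed in instead. The function name `compare_metrics` is my own.

```python
def compare_metrics(title_only, full_text, keys):
    """Absolute and relative change from title-only to full-text for each metric."""
    rows = []
    for key in keys:
        base, other = title_only[key], full_text[key]
        delta = other - base
        relative = delta / base * 100 if base else float("nan")
        rows.append((key, base, other, delta, relative))
    return rows


# Placeholder values for illustration only, not results from this notebook
example_title_only = {"accuracy": 0.80, "f1": 0.79, "inference_time_per_sample": 100.0}
example_full_text = {"accuracy": 0.90, "f1": 0.89, "inference_time_per_sample": 110.0}

rows = compare_metrics(
    example_title_only, example_full_text,
    ["accuracy", "f1", "inference_time_per_sample"],
)
for key, base, other, delta, relative in rows:
    print(f"{key:<28} {base:>8.3f} {other:>8.3f} {delta:>+8.3f} {relative:>+7.1f}%")
```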

## 10. Conclusion and Cleanup

In [None]:
# Free up memory
del model
gc.collect()
print(f"Final memory usage: {get_memory_usage():.2f} MB")

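The table and visualization above compare title-only and full-text fake news detection with RoBERTa. Because this notebook evaluates RoBERTa alone, the cross-model points below are expectations to check against the DistilBERT and TinyBERT evaluations rather than measured findings: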

1. **Performance Characteristics**: 
   - RoBERTa, with its larger size and more sophisticated architecture, is expected to achieve higher accuracy than DistilBERT and TinyBERT, particularly on full-text articles
   - The performance gap between title-only and full-text approaches provides insight into how effectively RoBERTa leverages additional context

2. **Resource Requirements**:
   - RoBERTa is expected to show higher memory usage and longer inference times than lighter models like TinyBERT and DistilBERT; the peak memory and per-sample latency reported above are the figures to compare
   - The specific resource metrics help determine whether RoBERTa is viable for different deployment scenarios

3. **Practical Applications**:
   - For high-stakes applications where maximum accuracy is critical and computational resources are abundant, RoBERTa may be the preferred choice
   - For resource-constrained environments, the trade-off between RoBERTa's potentially higher accuracy and its increased computational demands must be carefully considered

RoBERTa represents the upper end of the performance-efficiency spectrum in our comparative evaluation of transformer models for fake news detection. Its larger capacity and more extensive pre-training make it a strong candidate for scenarios where accuracy is paramount, but its resource requirements may limit its applicability in edge or mobile environments.

By directly comparing RoBERTa, DistilBERT, and TinyBERT across multiple metrics and evaluation scenarios, we gain comprehensive insights into the performance-efficiency trade-offs for fake news detection, enabling informed model selection based on specific deployment constraints and requirements.
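
As a closing illustration, model selection under deployment constraints can be expressed as a simple rule: among the models that fit the memory and latency budgets, pick the most accurate. The sketch below uses placeholder names and numbers. The dictionary keys mirror those returned by `evaluate_model`, but none of the values are measurements.

```python
def select_model(candidates, max_memory_mb, max_latency_ms):
    """Return the most accurate candidate that fits both budgets, or None."""
    feasible = [
        c for c in candidates
        if c["peak_memory"] <= max_memory_mb
        and c["inference_time_per_sample"] <= max_latency_ms
    ]
    return max(feasible, key=lambda c: c["accuracy"], default=None)


# Placeholder numbers for illustration only; substitute measured results
candidates = [
    {"name": "model_a", "accuracy": 0.90, "peak_memory": 1500.0, "inference_time_per_sample": 200.0},
    {"name": "model_b", "accuracy": 0.87, "peak_memory": 900.0, "inference_time_per_sample": 90.0},
    {"name": "model_c", "accuracy": 0.84, "peak_memory": 400.0, "inference_time_per_sample": 30.0},
]

choice = select_model(candidates, max_memory_mb=1000.0, max_latency_ms=100.0)
print(choice["name"] if choice else "No model fits the budget")
```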