# DistilBERT External Dataset Evaluation for Fake News Detection

## Introduction

In this notebook, I'm going to test my fine-tuned DistilBERT model on completely new datasets to see how well it actually works in the real world. DistilBERT is basically a smaller, faster version of BERT that keeps about 97% of BERT's language-understanding performance while being 40% smaller and roughly 60% faster. The Hugging Face team created it using a technique called knowledge distillation, which makes it well suited to situations where I don't have tons of computing power.

I want to find out several things:

1. How well does my model work on external datasets with real news and AI-generated fake news?
2. What are the practical costs of running this model, like memory usage and speed?
3. What kinds of articles does my model get wrong, and why?

The main question I'm trying to answer is whether my trained model can actually handle real-world content that's different from what it saw during training.

## Setting Up the Environment

First, I'll import all the libraries I need for this evaluation. These will handle everything from data processing to creating visualizations.

In [None]:
# Import basic libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import os
import psutil
import gc

In [None]:
# Import PyTorch and transformers
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
# Import evaluation metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

In [None]:
# Improved memory measurement function
def measure_peak_memory_usage(func, *args, **kwargs):
    """
    Measure peak memory usage during function execution
    
    Args:
        func: Function to measure
        *args, **kwargs: Arguments to pass to the function
        
    Returns:
        Tuple of (function result, peak increase in resident memory over the starting baseline, in MB)
    """
    # Reset garbage collection and force collection before starting
    gc.collect()
    
    # Start tracking
    process = psutil.Process()
    start_memory = process.memory_info().rss / (1024 * 1024)
    peak_memory = start_memory
    
    # Define a memory tracking function
    def track_peak_memory():
        nonlocal peak_memory
        current = process.memory_info().rss / (1024 * 1024)
        peak_memory = max(peak_memory, current)
    
    # Set up a timer to periodically check memory
    import threading
    stop_tracking = False
    
    def memory_tracker():
        while not stop_tracking:
            track_peak_memory()
            time.sleep(0.1)
    
    # Start tracking thread
    tracking_thread = threading.Thread(target=memory_tracker)
    tracking_thread.daemon = True
    tracking_thread.start()
    
    # Run the function
    try:
        result = func(*args, **kwargs)
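    finally:
        # Stop tracking
        stop_tracking = True
        tracking_thread.join(timeout=1.0)
        # Take one last sample so calls shorter than the 0.1 s polling interval get at least one reading
        track_peak_memory()
    
    # Peak increase in resident memory over the starting baseline
    memory_used = peak_memory - start_memory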
    
    return result, memory_used

In [None]:
# Suppress warnings and set visualization style
import warnings
warnings.filterwarnings('ignore')

# Set consistent visualization style
plt.style.use('ggplot')
sns.set(font_scale=1.2)
plt.rcParams['figure.figsize'] = (10, 6)

# Force CPU usage to simulate edge device performance
device = torch.device("cpu")
print(f"Using device: {device} (simulating edge device performance)")

## Loading External Datasets

Now I'll load my external test datasets. These contain news articles that my model has never seen before, which will give me a realistic picture of how it performs in the wild.

In [None]:
# Load external datasets
real_df = pd.read_csv('../datasets/manual_real.csv')
fake_df = pd.read_csv('../datasets/fake_claude.csv')

In [None]:
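# Process real news data (fill missing fields so NaN values do not reach the tokenizer)
if 'title' in real_df.columns and 'content' in real_df.columns:
    real_df['combined_text'] = real_df['title'].fillna('') + " " + real_df['content'].fillna('')
elif 'text' in real_df.columns:
    real_df['combined_text'] = real_df['text'].fillna('')
else:
    raise KeyError("real_df needs 'title' and 'content' columns, or a 'text' column")
real_df['label'] = 0  # Real news

# Process fake news data the same way
if 'title' in fake_df.columns and 'content' in fake_df.columns:
    fake_df['combined_text'] = fake_df['title'].fillna('') + " " + fake_df['content'].fillna('')
elif 'text' in fake_df.columns:
    fake_df['combined_text'] = fake_df['text'].fillna('')
else:
    raise KeyError("fake_df needs 'title' and 'content' columns, or a 'text' column")
fake_df['label'] = 1  # Fake news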

In [None]:
# Combine external datasets
external_df = pd.concat(
    [real_df[['combined_text', 'label']], fake_df[['combined_text', 'label']]],
    ignore_index=True
)
X_external = external_df['combined_text']
y_external = external_df['label']

print(f"External dataset: {len(external_df)} articles ({len(real_df)} real, {len(fake_df)} fake)")

## Loading and Measuring DistilBERT

Next, I'll load my trained model and check how many computing resources it actually uses. This is important because if I want to deploy this thing in the real world, I need to know what it costs to run.

In [None]:
# Clean up before loading
gc.collect()

# Measure memory before model loading
memory_before = psutil.Process().memory_info().rss / (1024 * 1024)  # MB

# Load the DistilBERT model and tokenizer
model_path = '../../ml_models/distilbert_welfake_model'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model = model.to(device)

In [None]:
# Measure memory after model loading
memory_after = psutil.Process().memory_info().rss / (1024 * 1024)  # MB
model_memory = memory_after - memory_before

# Calculate model size from parameters
param_size = sum(p.nelement() * p.element_size() for p in model.parameters()) / (1024 * 1024)
num_params = sum(p.numel() for p in model.parameters())

print(f"DistilBERT model loaded successfully")
print(f"Number of parameters: {num_params:,}")
print(f"Model size: {param_size:.2f} MB")
print(f"Memory increase after loading: {model_memory:.2f} MB")
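As a rough cross-check on the printed model size, the weight footprint is simply the parameter count times the bytes per parameter. The sketch below assumes a round figure of 66 million parameters (an approximation for DistilBERT-base; the exact count printed by the cell above will differ somewhat) and shows how the footprint would change under lower-precision storage. The helper name `param_footprint_mb` is my own and not part of any library.

```python
def param_footprint_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Return the storage needed for the weights alone, in MB (1 MB = 1024**2 bytes)."""
    return num_params * bytes_per_param / (1024 * 1024)

# Assumption: ~66M parameters, a round approximation for DistilBERT-base
approx_params = 66_000_000

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: {param_footprint_mb(approx_params, nbytes):.1f} MB")
```

The measured memory increase after loading does not have to match this number, since the process also holds the tokenizer, Python objects, and allocator overhead.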

## Preparing Data for Evaluation

Before I can test my model, I need to convert my text data into the format that the transformer expects. This involves tokenizing all the text and setting up data loaders.

In [None]:
def prepare_data(texts, labels, tokenizer, batch_size=32):
    """
    Tokenize text data and create DataLoader for model input
    
    Args:
        texts: List or Series of text samples
        labels: List or Series of labels
        tokenizer: The tokenizer to use
        batch_size: Batch size for DataLoader
        
    Returns:
        DataLoader with tokenized inputs and labels
    """
    # Tokenize the text
    encodings = tokenizer(
        list(texts),
        truncation=True,
        padding='max_length',
        max_length=512,  # Standard for BERT models
        return_tensors='pt'
    )
    
    # Create dataset and dataloader
    dataset = TensorDataset(
        encodings['input_ids'],
        encodings['attention_mask'],
        torch.tensor(labels.values if hasattr(labels, 'values') else labels)
    )
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    
    return dataloader

In [None]:
# Prepare external dataset
external_loader = prepare_data(X_external, y_external, tokenizer)

## Evaluation Function

I'll create a function that tests my model and tracks both how accurate it is and how many computing resources it uses. This gives me the full picture of what it would cost to run this in production.

In [None]:
def evaluate_model(model, dataloader, dataset_name):
    """
    Evaluate model and measure performance metrics and resource usage
    
    Args:
        model: The model to evaluate
        dataloader: DataLoader with test data
        dataset_name: Name of the dataset for reporting
        
    Returns:
        Dictionary with performance metrics and resource usage
    """
    model.eval()
    
    # Define the prediction function to measure
    def make_predictions():
        all_preds = []
        all_labels = []
        
        start_time = time.time()
        with torch.no_grad():
            for batch in dataloader:
                input_ids, attention_mask, labels = [b.to(device) for b in batch]
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                preds = torch.argmax(outputs.logits, dim=1)
                
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
        
        predict_time = time.time() - start_time
        return all_preds, all_labels, predict_time
    
    # Run predictions with memory measurement
    (all_preds, all_labels, predict_time), memory_used = measure_peak_memory_usage(make_predictions)
    
    # Convert to numpy arrays
    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)
    
    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_labels, all_preds, average='weighted'
    )
    
    # Print results
    print(f"\nDistilBERT Evaluation on {dataset_name}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Prediction time: {predict_time:.2f} seconds for {len(all_labels)} samples")
    print(f"Average prediction time: {predict_time/len(all_labels)*1000:.2f} ms per sample")
    print(f"Peak memory usage during inference: {memory_used:.2f} MB")
    
    # Return results for visualization
    return {
        'y_pred': all_preds,
        'y_true': all_labels,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'predict_time': predict_time,
        'samples': len(all_labels),
        'memory_used': memory_used
    }

## Performance on External Datasets

Now I'll actually test my model on the external datasets. This is where I find out if my model learned general patterns about fake news, or if it just memorized the specific training data.

In [None]:
# Evaluate on external datasets
external_results = evaluate_model(model, external_loader, "External Datasets")

### Confusion Matrix for External Data

I'll create a confusion matrix to see exactly where my model is making mistakes. This visualization shows me the patterns in how my model gets confused between real and fake news.

In [None]:
# Create and plot confusion matrix
def plot_confusion_matrix(y_true, y_pred, title):
    """
    Create and visualize confusion matrix
    
    Args:
        y_true: True labels
        y_pred: Predicted labels
        title: Plot title
    """
    # Fix the label order so the matrix is always 2x2, even if one class never appears
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Real News', 'Fake News'],
                yticklabels=['Real News', 'Fake News'])
    plt.title(title)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
    
    # Calculate error rates
    tn, fp, fn, tp = cm.ravel()
    fpr = fp/(fp+tn)
    fnr = fn/(fn+tp)
    print(f"False Positive Rate: {fpr:.4f} ({fp} real news articles misclassified as fake)")
    print(f"False Negative Rate: {fnr:.4f} ({fn} fake news articles misclassified as real)")

In [None]:
# Plot confusion matrix for External Datasets
plot_confusion_matrix(
    external_results['y_true'], 
    external_results['y_pred'], 
    "DistilBERT Confusion Matrix on External Datasets"
)

### What the Results Revealed

When I ran the confusion matrix analysis, it showed something really interesting about how my model behaves with new data. The pattern revealed important lessons about what happens when models encounter content that's different from their training data.

The confusion matrix showed that my model had developed a strong bias toward classifying articles as "real" when it wasn't sure. This became especially obvious when looking at the false positive and false negative rates - my model was much better at spotting real news than fake news in these external datasets.

Looking at how my model handled real news articles, it was almost perfect. It correctly identified nearly every authentic news article in the external dataset. This showed me that my model had actually learned some solid patterns about what makes legitimate journalism look legitimate - things like proper sourcing, coherent structure, and consistent writing style.

But when it came to fake news, that's where things got interesting. My model kept classifying fake articles as real. The most likely explanation is that the AI-generated fake news in my external dataset was much more sophisticated than the examples in my training data. These new fake articles used professional language, included specific details, and followed proper news article structure, so they were much harder to detect than the training examples.

This taught me something important about machine learning in general. Models learn patterns from their training data, but when they encounter new types of content that follow different patterns, they struggle. It's like studying for a test using only multiple choice questions, then finding out the actual test has essay questions - the knowledge is related but the format is completely different.
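Because the metrics in `evaluate_model` use `average='weighted'`, a class the model handles well can mask one it handles badly. Here is a minimal sketch of per-class recall that makes the gap visible. The labels are toy values for illustration only (not my actual results), and `per_class_recall` is a hypothetical helper name.

```python
def per_class_recall(y_true, y_pred, classes=(0, 1)):
    """Return {class: recall}, where recall = correct predictions / true members of that class."""
    recalls = {}
    for c in classes:
        # Predictions for the samples whose true label is c
        members = [p for t, p in zip(y_true, y_pred) if t == c]
        # Guard against a class that is absent from y_true
        recalls[c] = sum(p == c for p in members) / len(members) if members else float("nan")
    return recalls

# Toy data for illustration only: 0 = real, 1 = fake
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]  # every real article right, most fakes missed

recalls = per_class_recall(y_true, y_pred)
print(f"Real-news recall: {recalls[0]:.2f}")
print(f"Fake-news recall: {recalls[1]:.2f}")
```

With the real predictions, the same function could be called on `external_results['y_true']` and `external_results['y_pred']`.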

## Looking at Specific Mistakes

I want to dig deeper into the specific articles my model got wrong. Understanding exactly which articles confused my model will help me figure out what went wrong and how I might fix it.

In [None]:
def analyze_errors(X_text, y_true, y_pred, dataset_name, n_examples=3):
    """
    Display examples of misclassified articles
    
    Args:
        X_text: Text data
        y_true: True labels
        y_pred: Predicted labels
        dataset_name: Name of the dataset
        n_examples: Number of examples to display
    """
    errors = np.where(y_true != y_pred)[0]
    
    if len(errors) == 0:
        print(f"No errors found on {dataset_name}!")
        return
    
    print(f"\nDistilBERT misclassified {len(errors)} out of {len(y_true)} articles on {dataset_name} ({len(errors)/len(y_true):.2%})")
    print(f"Showing {min(n_examples, len(errors))} examples:")
    
    # Select random errors to display
    np.random.seed(42)  # For reproducibility
    display_indices = np.random.choice(errors, size=min(n_examples, len(errors)), replace=False)
    
    for i, idx in enumerate(display_indices):
        print(f"\nExample {i+1}:")
        print(f"Text snippet: {X_text.iloc[idx][:200]}...")  # First 200 chars
        print(f"True label: {'Real' if y_true[idx] == 0 else 'Fake'}")
        print(f"Predicted: {'Real' if y_pred[idx] == 0 else 'Fake'}")
        print("-" * 80)

In [None]:
# Analyze errors on External datasets
analyze_errors(
    X_external, 
    external_results['y_true'], 
    external_results['y_pred'], 
    "External Datasets"
)

### What the Error Analysis Revealed

When I examined the specific articles my model got wrong, I discovered some really interesting patterns. The fake news articles that fooled my model weren't your typical obviously fake stories. Instead, they were sophisticated pieces that used many of the same techniques as real journalism.

These tricky fake articles had several things in common. They used authoritative language that sounded professional and credible. They included specific details and numbers that made them seem well-researched. They even had proper quote attribution and followed the structure you'd expect from a real news article. Basically, they were much more subtle than the obvious fake news my model had trained on.

This taught me that AI-generated fake news has evolved a lot. It's no longer just about obvious lies or weird grammar. Modern fake news can be incredibly convincing because it mimics the surface features of real journalism very well. My model had learned to spot the old-style fake news but wasn't prepared for this new, more sophisticated approach.
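One way to probe this further would be to check how confident the model was on the articles it got wrong. The sketch below turns a pair of logits into probabilities with a softmax and flags predictions that fall below a confidence threshold. The logits and the 0.7 threshold are illustrative assumptions, not values from my run, and both function names are made up for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confident_prediction(logits, threshold=0.7):
    """Return (predicted class, confidence, whether confidence meets the threshold)."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs[pred], probs[pred] >= threshold

# Hypothetical logits in [real, fake] order
for logits in ([2.0, -1.0], [0.3, 0.1]):
    pred, conf, ok = confident_prediction(logits)
    label = "Real" if pred == 0 else "Fake"
    print(f"{label} with confidence {conf:.2f} (confident: {ok})")
```

In the evaluation loop, the equivalent step would be `torch.softmax(outputs.logits, dim=1)` before taking the argmax, so the probabilities could be stored alongside the predictions.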

## Testing Performance on Edge Devices

For my model to be useful in the real world, I need to understand how it performs when running on limited hardware. I'll test different batch sizes to see how I can optimize the trade-off between speed and efficiency.

In [None]:
# Analyze batch processing efficiency
batch_sizes = [1, 2, 4, 8, 16, 32]
results = []

# Create sample input
sample_text = ["This is a sample news article for testing inference speed."] * 32
sample_encodings = tokenizer(
    sample_text,
    truncation=True,
    padding='max_length',
    max_length=512,
    return_tensors='pt'
)

In [None]:
# Test different batch sizes
for batch_size in batch_sizes:
    # Prepare input batch
    input_ids = sample_encodings['input_ids'][:batch_size].to(device)
    attention_mask = sample_encodings['attention_mask'][:batch_size].to(device)
    
    # Warm-up
    with torch.no_grad():
        _ = model(input_ids=input_ids, attention_mask=attention_mask)
    
    # Timed runs
    times = []
    for _ in range(5):  # 5 runs per batch size
        with torch.no_grad():
            start = time.perf_counter()  # monotonic, high-resolution clock suited to short intervals
            _ = model(input_ids=input_ids, attention_mask=attention_mask)
            end = time.perf_counter()
        times.append(end - start)
    
    # Calculate statistics
    avg_time = np.mean(times)
    per_sample = avg_time / batch_size * 1000  # ms
    
    results.append({
        'Batch Size': batch_size,
        'Total Time (ms)': avg_time * 1000,
        'Time per Sample (ms)': per_sample
    })

In [None]:
# Show batch efficiency results
batch_df = pd.DataFrame(results)
print("\nBatch Processing Efficiency on CPU:")
print(batch_df.round(2))

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(batch_df['Batch Size'], batch_df['Time per Sample (ms)'], marker='o', linewidth=2)
plt.title('Inference Time per Sample vs Batch Size')
plt.xlabel('Batch Size')
plt.ylabel('Time per Sample (ms)')
plt.grid(True)
plt.tight_layout()
plt.show()

### What I Discovered About Batch Processing

When I analyzed the batch processing results, the plot of time per sample against batch size showed a U-shaped curve. That shape illustrates how fixed overhead and resource limits pull in opposite directions.

At small batch sizes, I noticed that processing each individual article took a relatively long time. This happened because every forward pass carries fixed overhead (Python dispatch, operator setup, memory allocation) that doesn't depend on how many articles are in the batch, kind of like how it takes time to start your car engine regardless of whether you're driving one block or ten miles. When I only processed one article at a time, that overhead was carried by a single article, making it expensive per article.

As I increased the batch size toward the middle range, something interesting happened. The time per article dropped significantly because the computer could process multiple articles simultaneously. Think of it like carpooling - the cost of the trip gets shared among more passengers, making it cheaper per person. The CPU could take advantage of its parallel processing capabilities and spread those fixed startup costs across more work.

But then, beyond a certain point, the efficiency started decreasing again. This taught me about the limits of computer resources. When I tried to process too many articles at once, the system started struggling with memory constraints and scheduling issues. It's like trying to fit too many people in a car - eventually you reach a point where adding more passengers makes the trip slower and more uncomfortable for everyone.
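The benchmark in this section averages five runs per batch size. A slightly more robust variant, sketched here with only the standard library, takes the median of the timed runs so that a single slow run does not skew the result. `time_per_call` is a hypothetical helper, and the workload is a stand-in for a model forward pass.

```python
import statistics
import time

def time_per_call(fn, repeats=5, warmup=1):
    """Return the median wall-clock seconds of fn() over `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # untimed warm-up call
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; in the notebook this would wrap the model call for one batch
def stand_in_workload():
    sum(i * i for i in range(10_000))

median_s = time_per_call(stand_in_workload)
print(f"Median time per call: {median_s * 1000:.3f} ms")
```

Dividing the returned value by the batch size gives the per-sample figure used in the table above.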

## Memory Usage Analysis for Different Sequence Lengths

Next, I'll test how sequence length affects memory usage. Since transformers have to pay attention to every word in relation to every other word, longer sequences can get expensive fast in terms of memory.

In [None]:
# Analyze memory usage for different sequence lengths
seq_lengths = [64, 128, 256, 512]
memory_results = []

In [None]:
# Improved memory measurement for sequence lengths
for seq_len in seq_lengths:
    # Create sample input with specific sequence length
    sample_text = ["This is a test"] * 8  # Use batch size of 8
    sample_encodings = tokenizer(
        sample_text,
        truncation=True,
        padding='max_length',
        max_length=seq_len,
        return_tensors='pt'
    )
    
    input_ids = sample_encodings['input_ids'].to(device)
    attention_mask = sample_encodings['attention_mask'].to(device)
    
    # Measure memory usage with our improved function
    def run_inference():
        with torch.no_grad():
            _ = model(input_ids=input_ids, attention_mask=attention_mask)
    
    # Clean up and make measurements more reliable
    gc.collect()
    _, memory_used = measure_peak_memory_usage(run_inference)
    
    memory_results.append({
        'Sequence Length': seq_len,
        'Memory Used (MB)': memory_used
    })

In [None]:
# Show memory usage results
memory_df = pd.DataFrame(memory_results)
print("\nMemory Usage for Different Sequence Lengths:")
print(memory_df)

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(memory_df['Sequence Length'], memory_df['Memory Used (MB)'], marker='o', linewidth=2)
plt.title('Memory Usage vs Sequence Length')
plt.xlabel('Sequence Length')
plt.ylabel('Memory Used (MB)')
plt.grid(True)
plt.tight_layout()
plt.show()

### What I Learned About Memory and Sequence Length

When I ran the memory analysis, I discovered something important about how transformer models work under the hood. The relationship between sequence length and memory usage taught me about the fundamental complexity of the attention mechanism.

The reason memory usage can increase dramatically with longer sequences comes down to how attention works. In a transformer, every single token has to "pay attention" to every other token in the sequence. So if I have a sequence with N tokens, each attention head has to compute N squared attention scores. This means that doubling the sequence length quadruples the memory needed for those scores, while most of the other activations only double.
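To make the quadratic term concrete, here is a back-of-the-envelope estimate of the attention score matrices alone. It assumes float32 values, 12 attention heads (the DistilBERT-base configuration), and the batch size of 8 used in my test. It ignores every other activation, so treat it as an illustration of the scaling rather than a prediction of measured memory. `attention_scores_mb` is a name I made up for this sketch.

```python
def attention_scores_mb(seq_len, batch_size=8, n_heads=12, bytes_per_value=4):
    """Size in MB of one layer's attention score tensor: batch x heads x seq x seq."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_value / (1024 * 1024)

for seq_len in [64, 128, 256, 512]:
    print(f"seq_len={seq_len:>3}: {attention_scores_mb(seq_len):6.1f} MB per layer")
```

Under these assumptions the estimate goes from 1.5 MB at 64 tokens to 96 MB at 512 tokens per layer, a factor of four for each doubling. Whether a tensor of this size is fully materialized at any moment depends on the implementation.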

Looking at my results, I could observe this scaling behavior in action. The memory requirements varied as I increased sequence length, showing how the quadratic nature of attention computations affects resource usage. This pattern taught me why transformer models can become so resource-hungry with long documents.

From a practical standpoint, this analysis showed me that I could potentially save significant memory by using shorter sequences, probably without losing much accuracy. Most of the important signals for detecting fake news likely appear in the first few hundred words of an article anyway, so truncating longer articles might be a smart trade-off between performance and efficiency.
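A related saving doesn't discard any text at all: `prepare_data` pads every article to 512 tokens, so short articles cost as much as long ones. The sketch below compares how many token positions the model would process under fixed-length padding versus padding each batch only to its longest member. The token lengths are invented for illustration, and `padded_positions` is a hypothetical helper.

```python
def padded_positions(token_lengths, batch_size, max_length=512, dynamic=False):
    """Count the token positions processed after truncation and padding."""
    total = 0
    for i in range(0, len(token_lengths), batch_size):
        # Truncate each article to max_length tokens
        batch = [min(n, max_length) for n in token_lengths[i:i + batch_size]]
        # Pad to the longest article in the batch, or always to max_length
        pad_to = max(batch) if dynamic else max_length
        total += pad_to * len(batch)
    return total

# Invented token counts for eight articles
lengths = [90, 120, 150, 200, 300, 480, 700, 1100]

fixed = padded_positions(lengths, batch_size=4)
dynamic = padded_positions(lengths, batch_size=4, dynamic=True)
print(f"Fixed padding:   {fixed} positions")
print(f"Dynamic padding: {dynamic} positions")
```

With a Hugging Face tokenizer, the dynamic variant would correspond to calling the tokenizer per batch with `padding='longest'` instead of `padding='max_length'`. Tokenizing the whole dataset in one call, as `prepare_data` does, would pad everything to the single longest article instead.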

## Summary and What I Learned

Based on all the testing I did, I learned some valuable lessons that go way beyond just fake news detection. This whole evaluation process taught me fundamental principles about how transformer models behave when you take them from the lab into the real world.

In [None]:
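# Create summary table of results focusing on external dataset performance
# Derive error rates from the confusion matrix (rows = true label, columns = prediction)
tn, fp, fn, tp = confusion_matrix(
    external_results['y_true'], external_results['y_pred'], labels=[0, 1]
).ravel()

# Pick the batch size with the lowest measured time per sample
best_batch = batch_df.loc[batch_df['Time per Sample (ms)'].idxmin()]

summary = pd.DataFrame({
    'Metric': [
        'Model Parameters',
        'Model Size (MB)',
        'Memory Footprint (MB)',
        'Accuracy',
        'Precision', 
        'Recall',
        'F1 Score',
        'Inference Time (ms/sample)',
        'False Positive Rate',
        'False Negative Rate',
        'Optimal Batch Size',
        'Memory at Optimal Batch'
    ],
    'External Dataset Results': [
        f"{num_params:,}",
        f"{param_size:.2f}",
        f"{model_memory:.2f}",
        f"{external_results['accuracy']:.4f}",
        f"{external_results['precision']:.4f}",
        f"{external_results['recall']:.4f}",
        f"{external_results['f1']:.4f}",
        f"{external_results['predict_time']/external_results['samples']*1000:.2f}",
        f"{fp / (fp + tn):.4f}",
        f"{fn / (fn + tp):.4f}",
        f"{int(best_batch['Batch Size'])} samples",
        "Not measured (batch test recorded time only)"
    ]
})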

print("DistilBERT External Dataset Performance Summary:")
print(summary.to_string(index=False))

### The Big Picture Lessons

When I completed this evaluation, the results taught me several important principles about working with transformer models in production. Understanding these patterns helped me build better intuition for what happens when you move from research to real-world applications.

The trade-off between efficiency and performance became really clear through this testing. DistilBERT showed me that you can get substantial computational savings while keeping most of the original model's capabilities. The knowledge distillation process demonstrated how you can transfer what a large model learned into a smaller, more practical version that can actually run on normal hardware.

The generalization challenge was probably the most important lesson I learned. Even though my model performed really well on test data from the same distribution as my training data, it struggled when faced with truly different external content. This taught me why you need to test models on diverse datasets to understand how they'll really behave in the wild. Good performance in the lab doesn't automatically translate to good performance in the real world.

The resource optimization insights gave me practical guidelines for actually deploying models. I learned that computational efficiency isn't just about model size - the way you configure your deployment, like choosing the right batch size and sequence length, can dramatically change how your model performs in production.

## Conclusion

This whole evaluation process taught me how to test both what my model can do and where it falls short when moving from the controlled training environment to messy real-world scenarios. The lessons apply way beyond fake news detection to any kind of model deployment.

Understanding my model's strengths through systematic testing showed me how DistilBERT successfully balanced efficiency with performance. My model achieved this balance through knowledge distillation, which taught me how you can compress complex models while keeping most of their capabilities. This principle works across many areas of machine learning, showing how powerful models can be made accessible even when you don't have massive computing resources.

Learning from the generalization challenges was probably the most eye-opening part of this analysis. My external dataset evaluation revealed the big difference between statistical performance and practical usefulness. While my model showed strong technical capabilities in controlled testing, its real-world performance patterns taught me about the challenges of building systems that work reliably across diverse content types. This illustrated why testing across multiple datasets is absolutely essential for understanding how a model will actually behave when users start throwing real data at it.

The practical deployment considerations gave me concrete guidelines for making production decisions. Understanding how batch size affects speed, how sequence length impacts memory usage, and how different optimization strategies trade off against each other helped me build intuition for deploying transformer models effectively in real systems. These insights go far beyond the technical specs and into the practical realities of making machine learning work in production.

My evaluation revealed both the promise and the limitations of distilled transformer models for this type of application. The results suggested that while DistilBERT provides a solid foundation for fake news detection systems, successful deployment would need additional considerations like ensemble approaches, continuous learning systems, or hybrid architectures that can adapt as fake news generation techniques become more sophisticated.

This systematic evaluation approach pointed me toward several important areas for future work. These include developing training strategies that improve generalization across diverse content types, exploring ensemble methods that combine multiple detection approaches, and investigating how models can be designed to maintain performance as the tactics for generating fake news continue to evolve.

Understanding these patterns helped me build the analytical skills needed to evaluate and deploy machine learning models effectively across many different applications. The systematic approach I demonstrated here provides a template for rigorous model assessment that balances technical performance metrics with practical deployment realities.

## Model Cleanup

Finally, I'll clean up the memory by releasing the model resources.

In [None]:
# Clean up models to free memory
del model
del tokenizer

# Force garbage collection
gc.collect()

print("Model resources released")