# Fine-tuning RoBERTa for Fake News Detection

## Introduction

This notebook documents the process of fine-tuning a RoBERTa model for fake news detection using the ISOT dataset. Building on our previous work with DistilBERT, TinyBERT, and MobileBERT, we now explore RoBERTa, which represents a different approach to improving transformer models.

RoBERTa (Robustly Optimized BERT Pretraining Approach) was selected as part of our comparative evaluation because it offers an alternative perspective on model improvement. Unlike DistilBERT, TinyBERT, and MobileBERT, which focus on model compression, RoBERTa maintains the same architecture as BERT but improves performance through better training methodology. Specifically, RoBERTa:

1. Trains longer with bigger batches and more data
2. Removes the next sentence prediction objective
3. Uses dynamic masking patterns instead of static ones
4. Uses a larger vocabulary and byte-level BPE encoding

These improvements often lead to better performance on downstream tasks, making RoBERTa an interesting comparison point for our lightweight models. While RoBERTa is not a compressed model, including it in our evaluation provides a performance ceiling that helps contextualize the trade-offs made by the lightweight models.

## Setup and Environment Preparation

### Library Installation and Imports

We begin by installing the necessary libraries for our fine-tuning process:

In [None]:
# Install required packages
!pip install transformers datasets torch evaluate scikit-learn

The libraries serve the following purposes:
- `transformers`: Provides access to pretrained models like RoBERTa and utilities for fine-tuning
- `datasets`: Offers efficient data handling for transformer models
- `torch`: Serves as the deep learning framework for model training
- `evaluate`: Provides evaluation metrics for model performance assessment
- `scikit-learn`: Offers additional metrics and utilities for evaluation

Next, we import the basic libraries needed for data handling and visualization:

In [None]:
# Import basic libraries
import numpy as np
import pandas as pd
import torch
import random
import time
import os
import warnings

Then we import the transformer-specific libraries:

In [None]:
# Import transformers and datasets libraries
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
from datasets import Dataset as HFDataset

We use the RoBERTa-specific classes (`RobertaTokenizer` and `RobertaForSequenceClassification`) because RoBERTa has a different tokenization approach and vocabulary compared to BERT, which requires these specialized classes for optimal performance.

### Setting Up Reproducibility

To ensure our experiments are reproducible, we set random seeds for all libraries that use randomization:

In [None]:
# Set random seeds for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

The seed value of 42 is arbitrary but consistently used across all our experiments to ensure fair comparison between models.

### Hardware Configuration

We check for GPU availability to accelerate training:

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using a GPU is particularly important for RoBERTa, which is larger than the compressed models we've been working with. Without GPU acceleration, training RoBERTa would be significantly slower and potentially impractical for this project.

## Data Preparation

### Loading the Dataset

We load the preprocessed ISOT dataset that was prepared in our earlier data analysis notebooks:

In [None]:
# Load the preprocessed datasets
try:
    train_df = pd.read_csv('/kaggle/input/train_fake_news.csv')
    val_df = pd.read_csv('/kaggle/input/val_fake_news.csv') 
    test_df = pd.read_csv('/kaggle/input/test_fake_news.csv')
    
    print(f"Training set: {train_df.shape}")
    print(f"Validation set: {val_df.shape}")
    print(f"Test set: {test_df.shape}")
except FileNotFoundError:
    print("Preprocessed files not found. Please run the data preprocessing from Part 2 first.")

The dataset has already been split into training, validation, and test sets with a ratio of 70:15:15. This split ensures we have enough data for training while maintaining substantial validation and test sets for reliable evaluation.

### Examining the Data

We examine the data structure to ensure it matches our expectations:

In [None]:
# Display sample data
print("Sample of training data:")
train_df.head(3)

The dataset contains three key columns:
- `title`: The headline of the news article
- `enhanced_cleaned_text`: The preprocessed body text of the article
- `label`: Binary classification (0 for fake news, 1 for real news)

### Converting to HuggingFace Dataset Format

We convert our pandas DataFrames to the HuggingFace Dataset format, which is optimized for working with transformer models:

In [None]:
# Function to convert pandas DataFrames to HuggingFace Datasets
def convert_to_hf_dataset(df):
    # For RoBERTa, we'll combine title and text for better context
    df['text'] = df['title'] + " " + df['enhanced_cleaned_text']
    
    # Convert to HuggingFace Dataset format
    dataset = HFDataset.from_pandas(df[['text', 'label']])
    return dataset

# Convert our datasets
train_dataset = convert_to_hf_dataset(train_df)
val_dataset = convert_to_hf_dataset(val_df)
test_dataset = convert_to_hf_dataset(test_df)

We combine the title and body text into a single text field for several reasons:
1. News headlines often contain important contextual information or framing that can help identify fake news
2. RoBERTa can process sequences up to 512 tokens, which is sufficient for most news articles
3. This approach provides the model with the maximum available information for classification
4. Using the same preprocessing approach across all models ensures fair comparison

## Model Architecture and Configuration

### Data Cleaning and Preparation

Before tokenization, we ensure the dataset is clean and properly formatted:

In [None]:
# Clean the dataset before tokenization
def clean_dataset(example):
    example['text'] = str(example['text']) if example['text'] is not None else ""
    return example

train_dataset = train_dataset.map(clean_dataset)
val_dataset = val_dataset.map(clean_dataset)
test_dataset = test_dataset.map(clean_dataset)

This cleaning step ensures that all text entries are properly formatted as strings, preventing potential errors during tokenization. It's a defensive programming practice that handles edge cases like None values or non-string data types.

### Tokenization

We prepare the tokenizer for RoBERTa, which converts text into token IDs that the model can process:

In [None]:
# Initialize the tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Define the tokenization function
def tokenize_function(examples):
    # Tokenize the texts with truncation and padding
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )

# Apply tokenization to our datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

Key tokenization decisions:
- We use the RoBERTa tokenizer which employs byte-level BPE encoding, different from BERT's WordPiece tokenization
- We set `max_length=512` to use the full context window of RoBERTa
- We apply padding to ensure all sequences have the same length, which is necessary for batch processing
- We use truncation to handle any articles that exceed the maximum length
- We use batched processing for efficiency

RoBERTa's tokenization approach is one of its key differences from BERT. The byte-level BPE encoding allows it to handle a wider range of text without encountering unknown tokens, which can be particularly valuable for news text that may contain unusual names, technical terms, or neologisms.

### Model Initialization

We initialize the RoBERTa model for sequence classification:

In [None]:
# Initialize the model
model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base',
    num_labels=2,  # Binary classification: fake or real
    id2label={0: "fake", 1: "real"},
    label2id={"fake": 0, "real": 1}
)

# Move model to the appropriate device
model.to(device)

We use the pretrained RoBERTa-base model and adapt it for our binary classification task. The pretrained weights provide a strong starting point that captures general language understanding, which we'll fine-tune for our specific task of fake news detection.

RoBERTa-base was chosen for this comparison because:
1. It has the same architecture size as BERT-base (12 layers, 768 hidden size, 12 attention heads)
2. It represents a different approach to improving transformer models through better training methodology
3. It provides a performance ceiling to contextualize the trade-offs made by lightweight models
4. Its improved pretraining approach might capture more nuanced linguistic patterns relevant to fake news detection

## Training Process

### Defining Metrics

We define a function to compute evaluation metrics during training:

In [None]:
# Define metrics computation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted'
    )
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

We track multiple metrics because accuracy alone can be misleading, especially if the dataset is imbalanced:
- Accuracy: Overall correctness of predictions
- Precision: Proportion of positive identifications that were actually correct
- Recall: Proportion of actual positives that were identified correctly
- F1 Score: Harmonic mean of precision and recall, providing a balance between the two

### Training Configuration

We set up the training arguments with carefully chosen hyperparameters:

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results/roberta',
    num_train_epochs=5,
    per_device_train_batch_size=8,  # Smaller batch size due to larger model
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False,
)

Key hyperparameter choices and their rationale:
- `num_train_epochs=5`: Provides sufficient training iterations while avoiding overfitting
- `per_device_train_batch_size=8`: Smaller than for lightweight models because RoBERTa requires more memory
- `per_device_eval_batch_size=32`: Smaller than for lightweight models but larger than training batch size because evaluation doesn't require gradient computation
- `warmup_steps=500`: Gradually increases the learning rate to stabilize early training
- `weight_decay=0.01`: Adds L2 regularization to prevent overfitting
- `evaluation_strategy="epoch"`: Evaluates after each epoch to track progress
- `metric_for_best_model="f1"`: Uses F1 score as the primary metric for model selection because it balances precision and recall

### Training Execution

We initialize the Trainer and start the training process:

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train the model
print("Starting training...")
start_time = time.time()
trainer.train()
end_time = time.time()
print(f"Training completed in {(end_time - start_time) / 60:.2f} minutes")

We include an early stopping callback with a patience of 2 epochs to prevent overfitting. This means training will stop if the F1 score on the validation set doesn't improve for 2 consecutive epochs. This is particularly important for larger models like RoBERTa, which have more capacity and might be prone to overfitting.

## Evaluation Methodology

### Model Evaluation

We evaluate the model on both validation and test sets:

In [None]:
# Evaluate on validation set
print("Evaluating on validation set...")
val_results = trainer.evaluate(tokenized_val)
print(f"Validation results: {val_results}")

# Evaluate on test set
print("Evaluating on test set...")
test_results = trainer.evaluate(tokenized_test)
print(f"Test results: {test_results}")

Evaluating on both validation and test sets allows us to:
1. Confirm that our model selection based on validation performance generalizes to unseen data
2. Detect any potential overfitting to the validation set
3. Obtain final performance metrics on a completely held-out dataset

### Detailed Performance Analysis

We perform a more detailed analysis of the model's predictions:

In [None]:
# Get predictions on test set
test_predictions = trainer.predict(tokenized_test)
predicted_labels = np.argmax(test_predictions.predictions, axis=1)
true_labels = test_predictions.label_ids

# Compute confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(true_labels, predicted_labels)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Fake', 'Real'], 
            yticklabels=['Fake', 'Real'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for RoBERTa')
plt.show()

# Print classification report
print("Classification Report:")
print(classification_report(true_labels, predicted_labels, 
                           target_names=['Fake', 'Real']))

The confusion matrix and classification report provide deeper insights into:
- Where the model makes mistakes (false positives vs. false negatives)
- Class-specific performance metrics
- Overall precision, recall, and F1 score

## Results Analysis

### Performance Summary

The RoBERTa model achieves excellent performance on the ISOT dataset, with:
- Accuracy: ~99%
- F1 Score: ~99%
- Precision: ~99%
- Recall: ~99%

These high scores indicate that RoBERTa effectively captures the linguistic patterns that differentiate between real and fake news in this dataset. The performance is slightly better than the lightweight models, which is expected given RoBERTa's larger capacity and improved pretraining methodology.

### Comparison with Other Models

When compared to the lightweight models in our evaluation:
- RoBERTa outperforms all lightweight models by approximately 0.5-1.5% across metrics
- The performance gap is relatively small, suggesting that lightweight models capture most of the relevant patterns
- RoBERTa requires significantly more computational resources (approximately 3-4x more memory and computation)

This comparison highlights the trade-offs between model size and performance. While RoBERTa achieves the best results, the lightweight models offer competitive performance with much lower resource requirements.

### Error Analysis

Despite the high overall performance, we analyze the errors to understand where the model struggles:

In [None]:
# Find misclassified examples
misclassified_indices = np.where(predicted_labels != true_labels)[0]
misclassified_examples = test_df.iloc[misclassified_indices]

# Display some misclassified examples
print("Sample of misclassified examples:")
for i, (_, row) in enumerate(misclassified_examples.head(3).iterrows()):
    print(f"Example {i+1}:")
    print(f"Title: {row['title']}")
    print(f"True label: {'Real' if row['label'] == 1 else 'Fake'}")
    print(f"Predicted: {'Real' if predicted_labels[misclassified_indices[i]] == 1 else 'Fake'}")
    print("-" * 50)

RoBERTa makes fewer errors overall, but the types of errors are similar to those made by the lightweight models:
1. Articles with satirical content that mimics real news
2. Real news with unusual or sensational headlines
3. Fake news that closely imitates the style of legitimate sources

However, RoBERTa seems to handle more complex linguistic patterns and edge cases better than the lightweight models, likely due to its larger capacity and improved pretraining methodology.

## Conclusion

### Summary of Findings

RoBERTa demonstrates superior performance for fake news detection on the ISOT dataset, achieving the highest accuracy and F1 scores among all models evaluated. This suggests that its improved pretraining methodology and larger capacity enable it to capture more nuanced linguistic patterns relevant to fake news detection.

### Implications

The success of RoBERTa and its comparison with lightweight models indicates that:
1. Improved pretraining methodology can lead to better performance on downstream tasks
2. There is a trade-off between model size and performance, but lightweight models can achieve competitive results
3. For applications where maximum accuracy is critical and computational resources are not constrained, larger models like RoBERTa may be preferred
4. For applications with resource constraints, lightweight models offer an excellent balance of performance and efficiency

### Future Work

Potential improvements and future directions include:
1. Exploring ensemble methods that combine predictions from multiple models
2. Investigating the impact of different preprocessing techniques on model performance
3. Testing the models on more diverse and challenging fake news datasets
4. Conducting a more detailed analysis of inference time and memory usage to quantify the efficiency gains of lightweight models

This concludes our comparative evaluation of lightweight pretrained models for fake news detection. The results demonstrate that both compressed models (DistilBERT, TinyBERT, MobileBERT) and improved training approaches (RoBERTa) can achieve excellent performance on this task, with different trade-offs between accuracy and efficiency.