# Clickbait Detection with Transformer Models

In this assignment, you will build and evaluate transformer-based models for detecting clickbait headlines. Clickbait refers to content with misleading or sensationalized headlines designed primarily to attract attention and encourage visitors to click on a link, often at the expense of accuracy or quality. Detecting clickbait automatically is an important NLP task with applications in content moderation and media literacy.


In [2]:
# note that in handin.py this step would need to be removed
!pip install transformers datasets wandb



In [3]:
# Standard setup code: imports, experiment tracking, and random seeds

# Import PyTorch
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split
import numpy as np
import matplotlib.pyplot as plt
import os
import json
import random
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset

# For experiment tracking
import wandb

# Fix the random seeds for reproducibility
random.seed(8942764)
torch.random.manual_seed(8942764)
torch.cuda.manual_seed(8942764)
np.random.seed(8942764)

In [4]:
# Set your device by uncommenting the appropriate line below

# On Colab or on a machine with access to an Nvidia GPU, use the following setting
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If you have an Apple Silicon machine with a GPU, use the following setting;
# this should be about 3-4 times faster than running on a plain CPU
# device = 'mps' if torch.backends.mps.is_available() else 'cpu'

# If you will use a CPU, this is the setting
# device = 'cpu'

print(f"Using device: {device}")


Using device: cuda


## Utility Functions
These functions help with model setup, data processing, training, and evaluation.

You'll need to implement the parts marked with TODO comments.

In [5]:
# Function to load the dataset
def load_data():
    """
    Load the clickbait dataset from Hugging Face

    Returns:
        dataset: A dataset dictionary containing train, validation, and test splits
    """
    # TODO: Implement this function to load the dataset from Hugging Face
    # Use the load_dataset function to load the clickbait dataset
    # The dataset ID is "christinacdl/clickbait_notclickbait_dataset"
    clickbait_dataset = load_dataset("christinacdl/clickbait_notclickbait_dataset")
    return clickbait_dataset

In [6]:
# Function to initialize and return an instance of AutoTokenizer for the given model name
def get_tokenizer(model_name):
    """
    Get the appropriate tokenizer for the given model name

    Args:
        model_name: Name of the pre-trained model (e.g., 'bert-base-uncased')

    Returns:
        tokenizer: The tokenizer for the specified model
    """
    # TODO: Implement this function
    # Load and return the pre-trained AutoTokenizer for the specified model name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer

In [7]:
# Tokenization function for data processing
def tokenize(batch, tokenizer):
    """
    Transform text data to tokenized format for model input

    Args:
        batch: Batch of examples from the dataset
        tokenizer: Tokenizer to use for encoding

    Returns:
        Dict with tokenized inputs and labels
    """
    sentences = [x['text'] for x in batch]
    labels = torch.LongTensor([x['label'] for x in batch])
    new_batch = dict(tokenizer(sentences, padding=True, truncation=True, return_tensors="pt"))
    new_batch['label'] = labels
    return new_batch
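
To make the effect of `padding=True` concrete, here is a minimal pure-Python sketch of padding a batch to its longest sequence and building the matching attention mask. The token ids are made up for illustration, `pad_batch` is a hypothetical helper (not part of the `transformers` API), and a pad id of 0 is assumed; the real tokenizer additionally handles truncation and special tokens.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep mask value 1; padding positions get mask value 0
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy "headlines" of different lengths (ids are illustrative only)
toy_batch = [[101, 2321, 2477, 102], [101, 2129, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because padding happens per batch inside the `collate_fn`, each batch is only as wide as its longest headline rather than the longest headline in the whole dataset.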

In [8]:
# Function to initialize wandb for experiment tracking
def init_wandb(config, project_name):
    """Initialize wandb with given config"""
    wandb.init(
        project=project_name,
        config=config
    )
    return wandb.config

In [9]:
# Training function
def train(model,
          train_dataset,
          val_dataset,
          num_epochs,
          batch_size,
          optimizer_cls,
          lr,
          weight_decay,
          device,
          tokenizer,
          use_wandb=False):
    """
    Train the model and track with wandb if specified

    Args:
        model: Model to train
        train_dataset: Training dataset
        val_dataset: Validation dataset
        num_epochs: Number of epochs to train for
        batch_size: Batch size for training
        optimizer_cls: Name of optimizer to use ('SGD', 'Adam', 'AdamW')
        lr: Learning rate
        weight_decay: Weight decay for regularization
        device: Device to train on
        tokenizer: Tokenizer for processing inputs
        use_wandb: Whether to log metrics to wandb

    Returns:
        Tuple of (trained model, training history)
    """
    # TODO: Set the model to training mode and move it to the specified device
    model.train()
    model.to(device)

    dataloader = DataLoader(train_dataset, batch_size, shuffle=True,
                          collate_fn=lambda batch: tokenize(batch, tokenizer))

    # TODO: Initialize the optimizer based on the optimizer_cls parameter, with the specified learning rate and weight decay.
    if optimizer_cls == 'SGD':
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    elif optimizer_cls == 'Adam':
        optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    elif optimizer_cls == 'AdamW':
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    else:
        # Fail early instead of hitting an undefined `optimizer` later in the loop
        raise ValueError(f"Unsupported optimizer: {optimizer_cls}")

    train_loss_history = []
    train_acc_history = []
    val_loss_history = []
    val_acc_history = []

    lossfn = nn.CrossEntropyLoss()  # Using CrossEntropyLoss which expects logits

    global_step = 0

    for e in range(num_epochs):
        epoch_loss_history = []
        epoch_acc_history = []

        # Training loop
        model.train()
        for i, batch in enumerate(dataloader):
            batch = {k:v.to(device) for k,v in batch.items() if isinstance(v, torch.Tensor)}
            y = batch.pop('label')

            # TODO: Implement forward pass and loss computation
            # 1. Pass the batch through the model to get logits
            logits = model(**batch)

            # 2. Calculate the loss using lossfn
            loss = lossfn(logits, y)

            # 3. Calculate the accuracy (percentage of correct predictions)
            predictions = torch.argmax(logits, dim=1)
            accuracy = (predictions == y).float().mean().item()

            # 4. Append this batch's loss and accuracy to epoch_loss_history and epoch_acc_history
            epoch_loss_history.append(loss.item())
            epoch_acc_history.append(accuracy)

            global_step += 1

            # Print every 100 steps
            if global_step % 100 == 0:
                print(f'Epoch: {e+1}, Step: {global_step}, Train Loss: {epoch_loss_history[-1]:.3e}, Train Accuracy: {epoch_acc_history[-1]:.3f}')

                # Log batch metrics to WandB
                if use_wandb:
                    wandb.log({
                        "global_step": global_step,  # Correct step tracking
                        "train_loss_step": epoch_loss_history[-1],  # Current batch loss
                        "train_accuracy_step": epoch_acc_history[-1],  # Current batch accuracy
                        "epoch": e + 1,
                    })

            # TODO: Implement backward pass and optimization step
            # 1. Zero the gradients
            optimizer.zero_grad()

            # 2. Backpropagate the loss
            loss.backward()

            # 3. Update the model parameters using the optimizer
            optimizer.step()

        # Evaluation on validation set
        # TODO: Set the model to Evaluation mode
        model.eval()
        val_loss, val_acc, _, _, _ = evaluate(model, val_dataset, batch_size, device, tokenizer)

        train_loss_history.append(np.mean(epoch_loss_history))
        train_acc_history.append(np.mean(epoch_acc_history))
        val_loss_history.append(val_loss)
        val_acc_history.append(val_acc)

        print(f'epoch: {e + 1}\t train_loss: {train_loss_history[-1]:.3e}\t train_accuracy:{train_acc_history[-1]:.3f}\t val_loss: {val_loss_history[-1]:.3e}\t val_accuracy:{val_acc_history[-1]:.3f}')

        # Log metrics to wandb if enabled
        if use_wandb:
            wandb.log({
                "epoch": e + 1,
                "train_loss": train_loss_history[-1],
                "train_accuracy": train_acc_history[-1],
                "val_loss": val_loss_history[-1],
                "val_accuracy": val_acc_history[-1]
            })

    return model, (train_loss_history, train_acc_history, val_loss_history, val_acc_history)
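
The loss and accuracy computed inside the training loop can be checked by hand on a tiny batch. The sketch below re-implements, in plain Python, what cross-entropy on raw logits with mean reduction computes (the negative log-softmax of the true class, averaged over the batch) together with argmax accuracy. The helper names and the toy logits are hypothetical and only meant for illustration; the notebook itself relies on `nn.CrossEntropyLoss` and `torch.argmax`.

```python
import math

def cross_entropy_from_logits(logits, labels):
    """Mean of -log softmax(logits)[label] over the batch."""
    losses = []
    for row, label in zip(logits, labels):
        shift = max(row)  # subtract the max for numerical stability
        log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the true label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# Three toy examples with two classes; the last one is misclassified
toy_logits = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_labels = [0, 1, 1]

loss = cross_entropy_from_logits(toy_logits, toy_labels)
acc = accuracy_from_logits(toy_logits, toy_labels)
print(f"loss={loss:.4f}, accuracy={acc:.3f}")
```

Note that a single confident mistake dominates the mean loss even though two of three predictions are correct, which is why loss and accuracy can move differently between epochs.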

In [10]:
# Evaluation function
@torch.no_grad()
def evaluate(model, dataset, batch_size, device, tokenizer):
    """
    Evaluate model on dataset

    Args:
        model: Model to evaluate
        dataset: Dataset to evaluate on
        batch_size: Batch size for evaluation
        device: Device to run evaluation on
        tokenizer: Tokenizer for processing inputs

    Returns:
        Tuple of (loss, accuracy, predictions, labels, logits)
    """
    # TODO: Set the model to evaluation mode and move it to the specified device
    model.eval()
    model.to(device)

    dataloader = DataLoader(dataset, batch_size, shuffle=False,
                           collate_fn=lambda batch: tokenize(batch, tokenizer))
    lossfn = nn.CrossEntropyLoss()  # Using CrossEntropyLoss which expects logits

    loss_history = []
    acc_history = []
    all_preds = []
    all_labels = []
    all_logits = []

    for i, batch in enumerate(dataloader):
        batch = {k:v.to(device) for k,v in batch.items() if isinstance(v, torch.Tensor)}
        y = batch.pop('label')

        # TODO: Implement the evaluation loop
        # Loop through batches in the dataloader
        # 1. Get model predictions by passing in the batch (logits)
        logits = model(**batch)

        # 2. Calculate loss
        loss = lossfn(logits, y)

        # 3. Get the predictions from the logits in the variable pred
        pred = torch.argmax(logits, dim=1)

        acc = (pred == y).float().mean()

        all_preds.extend(pred.cpu().numpy())
        all_labels.extend(y.cpu().numpy())
        all_logits.extend(logits.cpu().numpy())

        loss_history.append(loss.item())
        acc_history.append(acc.item())

    # TODO: Calculate and return the evaluation metrics
    # Return the mean loss, mean accuracy, all predictions, all labels, and all logits
    return np.mean(loss_history), np.mean(acc_history), all_preds, all_labels, all_logits

In [11]:
# Function to load a test set and generate predictions
def predict_on_test_set(model, tokenizer, test_file_path, output_file_path, device):
    """
    Generate predictions on a test set and save to file

    Args:
        model: Trained model
        tokenizer: Tokenizer for the model
        test_file_path: Path to the test data file
        output_file_path: Path to save predictions
        device: Device to run inference on
    """
    # Load test data
    with open(test_file_path, 'r') as f:
        test_data = json.load(f)

    print(f"Loaded {len(test_data)} examples from {test_file_path}")

    # Make predictions
    # TODO: Set the model to evaluation mode and move it to the specified device
    model.eval()
    model.to(device)

    # TODO: Initialize an empty list to store predictions
    predictions = []

    for item in test_data:
        # TODO: Tokenize the text and move to the correct device as variable inputs
        inputs = tokenizer(item['text'], padding=True, truncation=True, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Get predictions
        # TODO: Disable gradient propagation
        with torch.no_grad():
            # TODO: Pass the inputs to the model to get logits
            logits = model(**inputs)
            # TODO: Get the prediction from logits
            pred = torch.argmax(logits, dim=1).item()

        # Store prediction as a string (0 or 1)
        predictions.append(str(pred))

    # Write predictions to file - one prediction per line
    with open(output_file_path, 'w') as f:
        f.write('\n'.join(predictions))

    print(f"Predictions saved to {output_file_path}")
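
As written, `predict_on_test_set` expects the test file to be a JSON list of objects with a `text` field, and it writes one label per line. The following sketch exercises just that file format with temporary files. `toy_predict` is a hypothetical stand-in for the trained model (a deliberately crude rule), so the labels it produces say nothing about real model behavior.

```python
import json
import os
import tempfile

def toy_predict(text):
    """Hypothetical stand-in for the model: flags headlines that start with a digit."""
    return 1 if text[:1].isdigit() else 0

examples = [
    {"text": "15 Things You Never Noticed About Owning A Cat"},
    {"text": "US Boy Scouts and hikers airlifted from wildfire in Utah"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    test_path = os.path.join(tmp_dir, "test.json")
    out_path = os.path.join(tmp_dir, "predictions.txt")

    # Input format: a JSON list of {"text": ...} objects
    with open(test_path, "w") as f:
        json.dump(examples, f)

    with open(test_path, "r") as f:
        test_data = json.load(f)

    # Output format: one prediction ("0" or "1") per line, in input order
    predictions = [str(toy_predict(item["text"])) for item in test_data]
    with open(out_path, "w") as f:
        f.write("\n".join(predictions))

    with open(out_path, "r") as f:
        written_lines = f.read().splitlines()

print(written_lines)
```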

## Model Architecture
This section defines the model architecture for transformer-based text classification.


In [12]:
# Base Transformer Model class for text classification
class TransformerForTextClassification(nn.Module):
    def __init__(self, model_name, num_classes, freeze_base=False, hidden_size=128, num_layers=1):
        """
        Transformer model with a classification head

        Args:
            model_name: Name of the base transformer model (e.g., 'bert-base-uncased')
            num_classes: Number of output classes
            freeze_base: Whether to freeze the base model parameters
            hidden_size: Size of the hidden layers in the classifier
            num_layers: Number of hidden layers in the classifier
        """
        super().__init__()

        self.base_model = AutoModel.from_pretrained(model_name)

        # Freeze base model if specified
        self.base_model.requires_grad_(not freeze_base)

        if not freeze_base:
            self.unfreeze_top_k_layers(5)

        # Get the hidden size from the base model config
        base_hidden_size = self.base_model.config.hidden_size

        # Build classifier with variable number of hidden layers
        if num_layers == 1:
            self.classifier = nn.Sequential(
                nn.Linear(base_hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, num_classes)
            )
        elif num_layers == 2:
            self.classifier = nn.Sequential(
                nn.Linear(base_hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, num_classes)
            )
        elif num_layers == 3:
            self.classifier = nn.Sequential(
                nn.Linear(base_hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, num_classes)
            )
        else:
            raise ValueError(f"Unsupported number of layers: {num_layers}")

    def unfreeze_top_k_layers(self, k=5):
        """
        Freeze the whole base model, then unfreeze its top k encoder layers
        (works for BERT and ModernBERT).

        Args:
            k: Number of top layers to unfreeze (default is 5).
        """
        # First, freeze all layers in the base model
        for param in self.base_model.parameters():
            param.requires_grad = False

        # Detect whether the model uses standard BERT or ModernBERT
        if hasattr(self.base_model, "encoder"):  # Standard BERT
            layers = self.base_model.encoder.layer
        elif hasattr(self.base_model, "layers"):  # ModernBERT
            layers = self.base_model.layers
        else:
            raise ValueError("Unrecognized model architecture: Cannot find encoder layers.")

        # Get the total number of layers
        total_layers = len(layers)

        # Unfreeze the last k layers
        for i in range(total_layers - k, total_layers):
            for param in layers[i].parameters():
                param.requires_grad = True

        print(f"Unfroze the last {k} layers out of {total_layers} total layers.")

    def forward(self, **base_model_kwargs):
        """Forward pass through the model"""
        outputs = self.base_model(**base_model_kwargs)
        # Use the final hidden state of the first token as the sequence representation
        pooled_output = outputs.last_hidden_state[:, 0, :]  # Extract the [CLS] token embedding
        # Return logits (not probabilities)
        logits = self.classifier(pooled_output)
        return logits
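
To see which layers `unfreeze_top_k_layers` leaves trainable, the index arithmetic can be reproduced without loading a model. This sketch assumes a 12-layer encoder (the size of `bert-base-uncased`); `top_k_layer_indices` is a hypothetical helper that mirrors the `range(total_layers - k, total_layers)` loop in the class.

```python
def top_k_layer_indices(total_layers, k):
    """Indices of the last k layers, mirroring range(total_layers - k, total_layers)."""
    return list(range(total_layers - k, total_layers))

# Assumption: a 12-layer encoder such as bert-base-uncased
total_layers = 12
unfrozen = top_k_layer_indices(total_layers, k=5)
frozen = [i for i in range(total_layers) if i not in unfrozen]

print("trainable layers:", unfrozen)
print("frozen layers:", frozen)
```

Since the method first freezes every base-model parameter, the embedding layer also stays frozen; only the selected encoder layers and the classifier head receive gradient updates.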

In [13]:
# Function to create model with specified architecture
def get_model(model_name, num_classes, freeze_base=False, hidden_size=128, num_layers=1):
    """Create model with specified architecture"""
    return TransformerForTextClassification(
        model_name=model_name,
        num_classes=num_classes,
        freeze_base=freeze_base,
        hidden_size=hidden_size,
        num_layers=num_layers
    )

## Dataset
We'll be using a dataset of headlines labeled as either clickbait (1) or not clickbait (0). The dataset comes from Hugging Face and includes training, validation, and test splits. You'll have an opportunity to explore the data distribution and characteristics before building your models.


In [14]:
# Load dataset
dataset = load_data()
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 43802
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 2191
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 8760
    })
})


In [15]:
# Look at some examples
print("\nExamples from training set:")
for i in range(3):
    print(f"Example {i}: {dataset['train'][i]}")

print("\nExamples from validation set:")
for i in range(3):
    print(f"Example {i}: {dataset['validation'][i]}")


Examples from training set:
Example 0: {'label': 0, 'text': 'Alphabet Scraps Plan to Blanket Globe With Internet Balloons'}
Example 1: {'label': 0, 'text': 'US Boy Scouts and hikers airlifted from wildfire in Utah'}
Example 2: {'label': 1, 'text': "Here's What Happened When I Road Tripped Around Southern California For A Week"}

Examples from validation set:
Example 0: {'label': 1, 'text': '27 Happy Gifts For People Who Love Jamaica'}
Example 1: {'label': 1, 'text': 'How Adulthood Happens '}
Example 2: {'label': 0, 'text': 'President Donald Trump Has Historically Low Approval Ratings As He Nears 100-Day Mark'}


In [16]:
# Look at class distribution
train_labels = [example['label'] for example in dataset['train']]
val_labels = [example['label'] for example in dataset['validation']]
test_labels = [example['label'] for example in dataset['test']]

print("\nClass distribution:")
print(f"Training set: Clickbait: {train_labels.count(1)}, Not clickbait: {train_labels.count(0)}")
print(f"Validation set: Clickbait: {val_labels.count(1)}, Not clickbait: {val_labels.count(0)}")
print(f"Test set: Clickbait: {test_labels.count(1)}, Not clickbait: {test_labels.count(0)}")



Class distribution:
Training set: Clickbait: 16257, Not clickbait: 27545
Validation set: Clickbait: 813, Not clickbait: 1378
Test set: Clickbait: 3252, Not clickbait: 5508
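
The counts above show that the classes are imbalanced in the same proportion across all three splits. A quick calculation of the majority-class baseline (always predicting "not clickbait") gives a useful floor against which to read the accuracies later in the notebook. The counts are copied from the output above; the dictionary layout is just one convenient way to hold them.

```python
class_counts = {
    "train": {"clickbait": 16257, "not_clickbait": 27545},
    "validation": {"clickbait": 813, "not_clickbait": 1378},
    "test": {"clickbait": 3252, "not_clickbait": 5508},
}

majority_baseline = {}
for split, counts in class_counts.items():
    total = sum(counts.values())
    # Accuracy of a classifier that always predicts the more frequent class
    majority_baseline[split] = max(counts.values()) / total
    print(f"{split}: {total} examples, majority-class accuracy {majority_baseline[split]:.3f}")
```

A model therefore needs to clear roughly 0.63 accuracy before it has learned anything beyond the class prior.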


## Explore tokenization

In [17]:
# Explore tokenization
bert_tokenizer = get_tokenizer("bert-base-uncased")
modernbert_tokenizer = get_tokenizer("answerdotai/ModernBERT-base")

In [18]:
print("\nTokenization examples:")
example_text = dataset['train'][8]['text']
print(f"Original text: '{example_text}'")
print(f"BERT tokenization: {bert_tokenizer.tokenize(example_text)}")
print(f"ModernBERT tokenization: {modernbert_tokenizer.tokenize(example_text)}")


Tokenization examples:
Original text: '15 Things You Never Noticed About Owning A Cat'
BERT tokenization: ['15', 'things', 'you', 'never', 'noticed', 'about', 'owning', 'a', 'cat']
ModernBERT tokenization: ['15', 'ĠThings', 'ĠYou', 'ĠNever', 'ĠNot', 'iced', 'ĠAbout', 'ĠOw', 'ning', 'ĠA', 'ĠCat']
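
The two outputs differ in more than casing. BERT's uncased tokenizer lowercases the text and here keeps every word whole, while the ModernBERT tokens preserve case, split some words into pieces, and carry a `Ġ` prefix. In GPT-2-style byte-level BPE vocabularies, `Ġ` is the printable stand-in for a leading space; assuming ModernBERT's tokenizer follows that convention (the output above is consistent with it), the original string can be recovered with a simple join. `detokenize_byte_level` is a hypothetical helper for illustration; in practice the tokenizer's own `convert_tokens_to_string` method is the safer choice.

```python
# Tokens copied from the ModernBERT output above
modernbert_tokens = ['15', 'ĠThings', 'ĠYou', 'ĠNever', 'ĠNot', 'iced',
                     'ĠAbout', 'ĠOw', 'ning', 'ĠA', 'ĠCat']

def detokenize_byte_level(tokens, space_marker="Ġ"):
    """Join sub-word pieces and turn the space marker back into a real space."""
    return "".join(tokens).replace(space_marker, " ")

reconstructed = detokenize_byte_level(modernbert_tokens)
print(reconstructed)
```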


## Task 1: Model Selection

In this task, you will compare the performance of two different transformer architectures for the clickbait detection task:

1. **BERT (bert-base-uncased)**: A widely-used transformer model developed by Google that has been pre-trained on a large corpus of English text.

2. **ModernBERT (answerdotai/ModernBERT-base)**: A more recent transformer variant that has been trained on newer text data and may have better performance on contemporary language patterns.

You'll train and evaluate both models with the same baseline configuration to determine which architecture provides a stronger foundation for our clickbait detection system.

This comparison will help us understand:
- Which model better captures the linguistic patterns characteristic of clickbait
- Whether the newer ModernBERT has advantages over the classic BERT architecture for this specific application

After completing this task, you'll select the better-performing model to use as the foundation for further refinement in Task 2.

In [19]:
def run_model_selection():
    """Run model selection task comparing BERT and ModernBERT"""

    # Define models to compare
    model_configs = [
        {
            "name": "bert-base-uncased",
            "display_name": "BERT",
            "freeze_base": True
        },
        {
            "name": "answerdotai/ModernBERT-base",
            "display_name": "ModernBERT",
            "freeze_base": True
        }
    ]

    # Training parameters
    train_params = {
        "num_epochs": 3,
        "batch_size": 32,
        "optimizer_cls": "Adam",
        "lr": 1e-3,
        "weight_decay": 1e-4,
        "hidden_size": 128,
        "num_layers": 1
    }

    results = []

    for config in model_configs:
        model_name = config["name"]
        display_name = config["display_name"]
        print(f"\n{'='*50}")
        print(f"Training and evaluating {display_name} model")
        print(f"{'='*50}")

        # Initialize the tokenizer for the selected model
        tokenizer = get_tokenizer(model_name)

        # Initialize wandb
        wandb_config = {**config, **train_params}
        init_wandb(wandb_config, "clickbait-detection-task1")

        # Create model
        model = get_model(
            model_name=model_name,
            num_classes=2,
            freeze_base=config["freeze_base"],
            hidden_size=train_params["hidden_size"],
            num_layers=train_params["num_layers"]
        )

        # Print the number of trainable parameters in the model
        num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print(f"Model trainable parameters: {num_params}")

        # Train model
        model, logs = train(
            model=model,
            train_dataset=dataset['train'],
            val_dataset=dataset['validation'],
            num_epochs=train_params["num_epochs"],
            batch_size=train_params["batch_size"],
            optimizer_cls=train_params["optimizer_cls"],
            lr=train_params["lr"],
            weight_decay=train_params["weight_decay"],
            device=device,
            tokenizer=tokenizer,
            use_wandb=True
        )

        # Evaluate on validation set
        val_loss, val_acc, _, _, _ = evaluate(
            model=model,
            dataset=dataset['validation'],
            batch_size=train_params["batch_size"],
            device=device,
            tokenizer=tokenizer
        )

        # Record results
        results.append({
            "model_name": model_name,
            "display_name": display_name,
            "val_accuracy": val_acc,
            "val_loss": val_loss,
            "logs": logs,
            "model": model,
            "tokenizer": tokenizer
        })

        wandb.finish()

    # Compare results
    print("\nModel Selection Results:")
    print(f"{'Model':<10} {'Validation Accuracy':<20} {'Validation Loss':<15}")
    print("-" * 45)

    for result in results:
        print(f"{result['display_name']:<10} {result['val_accuracy']:.4f}{' '*15} {result['val_loss']:.4f}")

    return results


In [20]:
# Run model selection
task1_results = run_model_selection()


Training and evaluating BERT model

Model trainable parameters: 98690
Epoch: 1, Step: 100, Train Loss: 3.538e-01, Train Accuracy: 0.844
Epoch: 1, Step: 200, Train Loss: 2.389e-01, Train Accuracy: 0.906
Epoch: 1, Step: 300, Train Loss: 2.750e-01, Train Accuracy: 0.938
Epoch: 1, Step: 400, Train Loss: 1.916e-01, Train Accuracy: 0.938
Epoch: 1, Step: 500, Train Loss: 1.992e-01, Train Accuracy: 0.969
Epoch: 1, Step: 600, Train Loss: 3.457e-01, Train Accuracy: 0.844
Epoch: 1, Step: 700, Train Loss: 3.239e-01, Train Accuracy: 0.844
Epoch: 1, Step: 800, Train Loss: 1.829e-01, Train Accuracy: 0.938
Epoch: 1, Step: 900, Train Loss: 3.100e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1000, Train Loss: 2.143e-01, Train Accuracy: 0.844
Epoch: 1, Step: 1100, Train Loss: 2.096e-01, Train Accuracy: 0.906
Epoch: 1, Step: 1200, Train Loss: 2.914e-01, Train Accuracy: 0.906
Epoch: 1, Step: 1300, Train Loss: 3.348e-01, Train Accuracy: 0.844
epoch: 1	 train_loss: 2.907e-01	 train_accuracy:0.884	 val_loss: 2.819e-01	 val_accuracy:0.895
Epoch: 2,

Run history:
epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅▅█████████████
global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇██
train_accuracy,▁▇█
train_accuracy_step,▂▅▆▆▇▂▂▆▃▂▅▅▂▃▃▆▇▅▁▇▇▅▂▂▃▃▃█▆▅▃▅▆▆▆▅▂▅▁▁
train_loss,█▃▁
train_loss_step,▅▃▃▂▂▅▄▂▄▃▂▄▄▄▄▃▃▂█▂▁▃▅▄▅▆▃▂▂▃▃▅▂▂▂▄▃▂▄▅
val_accuracy,▆▁█
val_loss,█▂▁

Run summary:
epoch,3.0
global_step,4100.0
train_accuracy,0.89525
train_accuracy_step,0.8125
train_loss,0.26298
train_loss_step,0.4047
val_accuracy,0.89662
val_loss,0.27334
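
The final `global_step` of 4100 can be reconciled with the training setup by a short calculation. The sketch assumes the `DataLoader` keeps the last partial batch (its default behavior), and uses the training-split size, batch size, and epoch count from this notebook.

```python
import math

train_examples = 43802   # size of the training split
batch_size = 32
num_epochs = 3
log_every = 100          # train() logs step metrics when global_step % 100 == 0

# The last, smaller batch still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Only multiples of log_every are written to the step-level log
last_logged_step = (total_steps // log_every) * log_every

print(steps_per_epoch, total_steps, last_logged_step)
```

This also explains why the last step-level line of epoch 1 is step 1300: the epoch ends at step 1369, before the next multiple of 100 is reached.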



Training and evaluating ModernBERT model


Model trainable parameters: 98690
Epoch: 1, Step: 100, Train Loss: 6.223e-01, Train Accuracy: 0.750
Epoch: 1, Step: 200, Train Loss: 5.656e-01, Train Accuracy: 0.719
Epoch: 1, Step: 300, Train Loss: 1.991e-01, Train Accuracy: 0.969
Epoch: 1, Step: 400, Train Loss: 3.309e-01, Train Accuracy: 0.875
Epoch: 1, Step: 500, Train Loss: 2.957e-01, Train Accuracy: 0.844
Epoch: 1, Step: 600, Train Loss: 2.619e-01, Train Accuracy: 0.844
Epoch: 1, Step: 700, Train Loss: 3.086e-01, Train Accuracy: 0.906
Epoch: 1, Step: 800, Train Loss: 1.548e-01, Train Accuracy: 0.938
Epoch: 1, Step: 900, Train Loss: 4.826e-01, Train Accuracy: 0.781
Epoch: 1, Step: 1000, Train Loss: 5.075e-01, Train Accuracy: 0.781
Epoch: 1, Step: 1100, Train Loss: 1.750e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1200, Train Loss: 1.117e-01, Train Accuracy: 0.969
Epoch: 1, Step: 1300, Train Loss: 1.373e-01, Train Accuracy: 0.938
epoch: 1	 train_loss: 2.987e-01	 train_accuracy:0.882	 val_loss: 2.721e-01	 val_accuracy:0.891
Epoch: 2, Step: 1400, Train Loss: 5.241e-01

Run history:
epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅▅█████████████
global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇██
train_accuracy,▁▇█
train_accuracy_step,▂▁▇▅▄▄▆▆▃▃▆▇▆▃▃▆▅▆█▄▆▃▃▇▆▆▆▆▅▆▆▄▆▃▅█▆▆▄▄
train_loss,█▂▁
train_loss_step,█▇▃▄▄▃▄▂▆▇▂▂▂▇▆▄▃▃▁▆▃▄▅▂▃▄▃▂▂▂▄▄▃▆▆▁▂▄▆▅
val_accuracy,▁▅█
val_loss,█▂▁

Run summary:
epoch,3.0
global_step,4100.0
train_accuracy,0.90072
train_accuracy_step,0.84375
train_loss,0.25751
train_loss_step,0.34687
val_accuracy,0.90528
val_loss,0.24926



Model Selection Results:
Model      Validation Accuracy  Validation Loss
---------------------------------------------
BERT       0.8966                0.2733
ModernBERT 0.9053                0.2493
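
Both runs report 98690 trainable parameters because the base model is frozen, so only the classifier head is trained. That number can be verified by counting weights and biases of the head's linear layers. The sketch assumes a base hidden size of 768, which is the documented size of `bert-base-uncased` and is consistent with ModernBERT printing the same count; `classifier_param_count` is a hypothetical helper.

```python
def classifier_param_count(in_features, hidden_size, num_classes, num_layers):
    """Weights plus biases of the Linear layers in the classifier head."""
    # Layer widths: base hidden size -> num_layers hidden layers -> output classes
    sizes = [in_features] + [hidden_size] * num_layers + [num_classes]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Assumption: both base models expose config.hidden_size == 768
head_params = classifier_param_count(768, hidden_size=128, num_classes=2, num_layers=1)
print(head_params)
```

With so few trainable parameters, the Task 1 comparison mostly measures how useful each model's frozen first-token representation is for this task, not how well each model fine-tunes.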


In [21]:
# TODO: Get the best model from Task 1 results
# 1. Find the model with the highest validation accuracy in task1_results
# 2. Extract the model, model_name, and tokenizer from the best result
# 3. Print information about which model will be used for further tasks

# Get the best model from Task 1 results
# 1. Find the model with the highest validation accuracy in task1_results
best_result = max(task1_results, key=lambda x: x['val_accuracy'])

# 2. Extract the model, model_name, and tokenizer from the best result
best_model = best_result['model']
best_model_name = best_result['model_name']
best_tokenizer = best_result['tokenizer']

# 3. Print information about which model will be used for further tasks
print(f"\nBest model from Task 1: {best_model_name}")
print(f"Using {best_model_name} for further tasks")


Best model from Task 1: answerdotai/ModernBERT-base
Using answerdotai/ModernBERT-base for further tasks


## Task 2: Hyperparameter Tuning

Now that we have selected the best model architecture, let's tune its hyperparameters to optimize performance.

In this task, you'll experiment with different hyperparameter configurations to find the best model. You should explore variations in:

- **Hidden layer sizes**: Try different sizes for the hidden layers in your classifier (e.g., 64, 128, 256, 512)
- **Number of hidden layers**: Experiment with adding more layers to your classifier (e.g., 1, 2, 3 layers)
- **Batch sizes**: Test different batch sizes (e.g., 16, 32, 64) - note that larger batch sizes may cause memory issues
- **Learning rates**: Try different learning rates (e.g., 1e-3, 5e-4, 1e-4)
- **Freezing base parameters**: Experiment with keeping the whole base model frozen vs unfreezing just the top 5 layers of the base model.
- **Optimizer**: You can try different optimizers like Adam, AdamW, or SGD

You should run at least 5 different hyperparameter configurations and track their performance using wandb. Below is a template for setting up your experiments.
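
One possible way to pick configurations is to enumerate the search space listed above and sample from it, instead of choosing values by hand. The sketch below uses the example values from the list; the seed and the `Sampled N` names are arbitrary choices for illustration, and the dictionary keys simply mirror the ones used in the template that follows.

```python
import itertools
import random

search_space = {
    "hidden_size": [64, 128, 256, 512],
    "num_layers": [1, 2, 3],
    "batch_size": [16, 32, 64],
    "learning_rate": [1e-3, 5e-4, 1e-4],
    "freeze_base": [True, False],
    "optimizer": ["Adam", "AdamW", "SGD"],
}

# Full grid: every combination of the listed values
keys = list(search_space)
all_configs = [dict(zip(keys, values))
               for values in itertools.product(*search_space.values())]
print(f"Full grid size: {len(all_configs)}")

# Random search: draw a handful of distinct configurations from the grid
rng = random.Random(0)  # arbitrary seed, only for repeatability
sampled_configs = [
    dict(cfg, config_name=f"Sampled {i}")
    for i, cfg in enumerate(rng.sample(all_configs, k=5), start=1)
]
for cfg in sampled_configs:
    print(cfg)
```

The full grid is far too large to train exhaustively at five epochs per run, which is why a small, deliberately varied subset (or a random sample like this one) is the practical choice.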


In [22]:
# Define your hyperparameter configurations to test
# You should define at least 5 different configurations to explore the hyperparameter space
hp_configs = [
    # Baseline configuration
    {
        "config_name": "Baseline",
        "hidden_size": 128,
        "num_layers": 1,
        "batch_size": 32,
        "optimizer": "Adam",
        "learning_rate": 1e-3,
        "weight_decay": 1e-4,  # Do NOT change
        "freeze_base": True,
        "num_epochs": 5        # Do NOT change
    },

    # Configuration 1 - MODIFY THIS!
    {
        "config_name": "Config 1",
        "hidden_size": 256,    # Increased hidden size
        "num_layers": 2,       # Added an extra layer
        "batch_size": 16,      # Reduced batch size
        "optimizer": "AdamW",  # Changed optimizer
        "freeze_base": False,  # Unfreeze base model
        "learning_rate": 5e-4, # Adjusted learning rate
        "weight_decay": 1e-4,  # Do NOT change
        "num_epochs": 5        # Do NOT change
    },

    # Configuration 2 - MODIFY THIS!
    {
        "config_name": "Config 2",
        "hidden_size": 512,    # Increased hidden size
        "num_layers": 1,       # Single layer
        "batch_size": 64,      # Increased batch size
        "optimizer": "SGD",    # Changed optimizer
        "freeze_base": True,   # Freeze base model
        "learning_rate": 1e-4, # Reduced learning rate
        "weight_decay": 1e-4,  # Do NOT change
        "num_epochs": 5        # Do NOT change
    },

    # Configuration 3 - MODIFY THIS!
    {
        "config_name": "Config 3",
        "hidden_size": 64,     # Reduced hidden size
        "num_layers": 3,       # Added more layers
        "batch_size": 32,      # Default batch size
        "optimizer": "Adam",   # Default optimizer
        "freeze_base": False,  # Unfreeze base model
        "learning_rate": 1e-3, # Default learning rate
        "weight_decay": 1e-4,  # Do NOT change
        "num_epochs": 5        # Do NOT change
    },

    # Configuration 4 - MODIFY THIS!
    {
        "config_name": "Config 4",
        "hidden_size": 128,    # Default hidden size
        "num_layers": 2,       # Added an extra layer
        "batch_size": 16,      # Reduced batch size
        "optimizer": "AdamW",  # Changed optimizer
        "freeze_base": True,   # Freeze base model
        "learning_rate": 1e-4, # Reduced learning rate
        "weight_decay": 1e-4,  # Do NOT change
        "num_epochs": 5        # Do NOT change
    },
    # Configuration 5 - MODIFY THIS!
    {
        "config_name": "Config 5",
        "hidden_size": 32,     # Reduced hidden size
        "num_layers": 2,       # Added an extra layer
        "batch_size": 16,      # Reduced batch size
        "optimizer": "Adam",   # Default optimizer
        "freeze_base": False,  # Unfreeze base model
        "learning_rate": 1e-4, # Reduced learning rate
        "weight_decay": 1e-4,  # Do NOT change
        "num_epochs": 5        # Do NOT change
    }
]

In [23]:
def run_hyperparameter_tuning(model_name, base_tokenizer):
    """
    Run hyperparameter tuning experiments

    Args:
        model_name: Name of the model to use
        base_tokenizer: Tokenizer for the model

    Returns:
        List of experiment results
    """
    print(f"\n{'='*50}")
    print(f"Running Hyperparameter Tuning for {model_name}")
    print(f"{'='*50}")

    results = []
    best_val_acc = 0
    best_config_idx = 0
    best_model = None

    # For each configuration in your hp_configs list
    for i, config in enumerate(hp_configs):
        print(f"\nRunning experiment {i+1}/{len(hp_configs)}: {config['config_name']}")

        # Initialize wandb for this experiment
        wandb_config = {**config, "model_name": model_name}
        init_wandb(wandb_config, "clickbait-detection-task2")

        # Create model with this configuration
        model = get_model(
            model_name=model_name,
            num_classes=2,
            freeze_base=config["freeze_base"],
            hidden_size=config["hidden_size"],
            num_layers=config["num_layers"]
        )

        # Train model
        model, logs = train(
            model=model,
            train_dataset=dataset['train'],
            val_dataset=dataset['validation'],
            num_epochs=config["num_epochs"],
            batch_size=config["batch_size"],
            optimizer_cls=config["optimizer"],
            lr=config["learning_rate"],
            weight_decay=config["weight_decay"],
            device=device,
            tokenizer=base_tokenizer,
            use_wandb=True
        )

        # Evaluate on validation set
        val_loss, val_acc, _, _, _ = evaluate(
            model=model,
            dataset=dataset['validation'],
            batch_size=config["batch_size"],
            device=device,
            tokenizer=base_tokenizer
        )

        # Log final validation metrics
        wandb.log({
            "final_val_loss": val_loss,
            "final_val_accuracy": val_acc
        })

        # Finish wandb run
        wandb.finish()

        # Record results
        results.append({
            "config_name": config["config_name"],
            "hidden_size": config["hidden_size"],
            "num_layers": config["num_layers"],
            "batch_size": config["batch_size"],
            "learning_rate": config["learning_rate"],
            "weight_decay": config["weight_decay"],
            "optimizer": config["optimizer"],
            "val_loss": val_loss,
            "val_accuracy": val_acc,
            "model": model
        })

        # Keep track of best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_config_idx = i
            best_model = model

    # Display results in a table
    print("\nHyperparameter Tuning Results:")
    print("-" * 120)
    print(f"{'Config':<10} {'Hidden Size':<12} {'Layers':<8} {'Batch Size':<12} {'Learning Rate':<14} {'Weight Decay':<14} {'Optimizer':<10} {'Val Accuracy':<15}")
    print("-" * 120)

    for result in results:
        print(f"{result['config_name']:<10} {result['hidden_size']:<12} {result['num_layers']:<8} {result['batch_size']:<12} {result['learning_rate']:<14} {result['weight_decay']:<14} {result['optimizer']:<10} {result['val_accuracy']:.4f}")

    print("-" * 120)
    print(f"Best configuration: {results[best_config_idx]['config_name']} with validation accuracy: {best_val_acc:.4f}")

    return results, best_model, best_config_idx

In [24]:
# Run Task 2: Hyperparameter Tuning
# Run all hyperparameter experiments
tuning_results, best_tuned_model, best_config_idx = run_hyperparameter_tuning(best_model_name, best_tokenizer)



Running Hyperparameter Tuning for answerdotai/ModernBERT-base

Running experiment 1/6: Baseline


Epoch: 1, Step: 100, Train Loss: 4.093e-01, Train Accuracy: 0.844
Epoch: 1, Step: 200, Train Loss: 2.605e-01, Train Accuracy: 0.875
Epoch: 1, Step: 300, Train Loss: 3.315e-01, Train Accuracy: 0.875
Epoch: 1, Step: 400, Train Loss: 2.737e-01, Train Accuracy: 0.938
Epoch: 1, Step: 500, Train Loss: 1.765e-01, Train Accuracy: 0.969
Epoch: 1, Step: 600, Train Loss: 2.064e-01, Train Accuracy: 0.906
Epoch: 1, Step: 700, Train Loss: 1.687e-01, Train Accuracy: 0.969
Epoch: 1, Step: 800, Train Loss: 2.066e-01, Train Accuracy: 0.906
Epoch: 1, Step: 900, Train Loss: 3.053e-01, Train Accuracy: 0.844
Epoch: 1, Step: 1000, Train Loss: 3.278e-01, Train Accuracy: 0.906
Epoch: 1, Step: 1100, Train Loss: 1.584e-01, Train Accuracy: 0.969
Epoch: 1, Step: 1200, Train Loss: 3.128e-01, Train Accuracy: 0.906
Epoch: 1, Step: 1300, Train Loss: 2.775e-01, Train Accuracy: 0.906
epoch: 1	 train_loss: 2.997e-01	 train_accuracy:0.882	 val_loss: 2.773e-01	 val_accuracy:0.888
Epoch: 2, Step: 1400, Train Loss: 3.074e-01

Run history:
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▆███████
final_val_accuracy,▁
final_val_loss,▁
global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇██
train_accuracy,▁▅▇██
train_accuracy_step,▄▅▆▇▅▄▅▅▇▇▅▇▅▅▁▂▅▅▅▇▂▅▅▅▅▅▄█▃▇▅▆▆▅▆▅▅▇▇▅
train_loss,█▄▂▂▁
train_loss_step,▆▄▃▂▂▄▄▁▃▄▃▁▄▅█▃▂▅▇▅▆▂▃▃▂▃█▂▂▅▂▁▄▃▅▂▂▅▁▃
val_accuracy,▁█▄██
val_loss,█▂▃▁▃

Run summary:
epoch,5.0
final_val_accuracy,0.90166
final_val_loss,0.26068
global_step,6800.0
train_accuracy,0.90526
train_accuracy_step,0.84375
train_loss,0.24467
train_loss_step,0.23347
val_accuracy,0.90166
val_loss,0.26068



Running experiment 2/6: Config 1


Unfroze the last 5 layers out of 22 total layers.
Epoch: 1, Step: 100, Train Loss: 4.219e-01, Train Accuracy: 0.875
Epoch: 1, Step: 200, Train Loss: 6.072e-02, Train Accuracy: 1.000
Epoch: 1, Step: 300, Train Loss: 3.629e-01, Train Accuracy: 0.875
Epoch: 1, Step: 400, Train Loss: 2.956e-01, Train Accuracy: 0.938
Epoch: 1, Step: 500, Train Loss: 1.412e-01, Train Accuracy: 0.938
Epoch: 1, Step: 600, Train Loss: 3.165e-01, Train Accuracy: 0.875
Epoch: 1, Step: 700, Train Loss: 5.439e-01, Train Accuracy: 0.812
Epoch: 1, Step: 800, Train Loss: 1.176e-01, Train Accuracy: 1.000
Epoch: 1, Step: 900, Train Loss: 1.683e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1000, Train Loss: 4.456e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1100, Train Loss: 1.863e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1200, Train Loss: 3.853e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1300, Train Loss: 3.920e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1400, Train Loss: 2.033e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1500,

Run history:
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆██████████
final_val_accuracy,▁
final_val_loss,▁
global_step,▁▁▁▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇██
train_accuracy,▁▅▆▆█
train_accuracy_step,▅█▆▆▅▆▅▆▅▁██▃▆▆█▆▅█▅▆█▆▆▅▆▆▆█▆█▆█▆▅█▆▆▆▆
train_loss,█▄▃▂▁
train_loss_step,█▆▃▃▄█▄▂▂▄▁█▃▂▃▂▃▁▃▃▆▂▄▃▂▁▁▃▃▄▃▂▂▃▂▇▅▅▃▃
val_accuracy,▂▆▅█▁
val_loss,▆▁▇▂█

Run summary:
epoch,5.0
final_val_accuracy,0.8969
final_val_loss,0.25753
global_step,13600.0
train_accuracy,0.92748
train_accuracy_step,1.0
train_loss,0.18275
train_loss_step,0.03927
val_accuracy,0.8969
val_loss,0.25753



Running experiment 3/6: Config 2


Epoch: 1, Step: 100, Train Loss: 6.386e-01, Train Accuracy: 0.672
Epoch: 1, Step: 200, Train Loss: 6.129e-01, Train Accuracy: 0.688
Epoch: 1, Step: 300, Train Loss: 6.615e-01, Train Accuracy: 0.562
Epoch: 1, Step: 400, Train Loss: 6.864e-01, Train Accuracy: 0.516
Epoch: 1, Step: 500, Train Loss: 6.307e-01, Train Accuracy: 0.609
Epoch: 1, Step: 600, Train Loss: 5.796e-01, Train Accuracy: 0.734
epoch: 1	 train_loss: 6.322e-01	 train_accuracy:0.625	 val_loss: 6.089e-01	 val_accuracy:0.629
Epoch: 2, Step: 700, Train Loss: 5.922e-01, Train Accuracy: 0.641
Epoch: 2, Step: 800, Train Loss: 5.640e-01, Train Accuracy: 0.703
Epoch: 2, Step: 900, Train Loss: 6.360e-01, Train Accuracy: 0.531
Epoch: 2, Step: 1000, Train Loss: 5.849e-01, Train Accuracy: 0.688
Epoch: 2, Step: 1100, Train Loss: 5.607e-01, Train Accuracy: 0.703
Epoch: 2, Step: 1200, Train Loss: 5.567e-01, Train Accuracy: 0.703
Epoch: 2, Step: 1300, Train Loss: 5.630e-01, Train Accuracy: 0.672
epoch: 2	 train_loss: 5.927e-01	 train_accu

Run history:
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████
final_val_accuracy,▁
final_val_loss,▁
global_step,▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇███
train_accuracy,▁▂▄▆█
train_accuracy_step,▅▅▂▁▃▆▄▅▁▅▅▅▅▅▆▅▅▆▅▆▅▇▄▅▆▇▅▇███▅█▆
train_loss,█▅▄▂▁
train_loss_step,▆▅▇█▆▄▄▃▆▄▃▃▃▂▂▃▅▃▃▄▅▂▄▅▂▂▄▂▁▁▂▂▁▁
val_accuracy,▁▃▄▇█
val_loss,█▆▄▂▁

Run summary:
epoch,5.0
final_val_accuracy,0.80077
final_val_loss,0.52633
global_step,3400.0
train_accuracy,0.79109
train_accuracy_step,0.75
train_loss,0.52877
train_loss_step,0.52128
val_accuracy,0.80077
val_loss,0.52633



Running experiment 4/6: Config 3


Unfroze the last 5 layers out of 22 total layers.
Epoch: 1, Step: 100, Train Loss: 2.287e-01, Train Accuracy: 0.938
Epoch: 1, Step: 200, Train Loss: 3.348e-01, Train Accuracy: 0.906
Epoch: 1, Step: 300, Train Loss: 2.642e-01, Train Accuracy: 0.906
Epoch: 1, Step: 400, Train Loss: 8.130e-02, Train Accuracy: 1.000
Epoch: 1, Step: 500, Train Loss: 2.798e-01, Train Accuracy: 0.875
Epoch: 1, Step: 600, Train Loss: 1.307e-01, Train Accuracy: 0.938
Epoch: 1, Step: 700, Train Loss: 8.572e-02, Train Accuracy: 1.000
Epoch: 1, Step: 800, Train Loss: 3.483e-01, Train Accuracy: 0.844
Epoch: 1, Step: 900, Train Loss: 2.048e-01, Train Accuracy: 0.906
Epoch: 1, Step: 1000, Train Loss: 1.235e-01, Train Accuracy: 0.969
Epoch: 1, Step: 1100, Train Loss: 1.053e-01, Train Accuracy: 0.969
Epoch: 1, Step: 1200, Train Loss: 8.168e-02, Train Accuracy: 0.969
Epoch: 1, Step: 1300, Train Loss: 5.117e-01, Train Accuracy: 0.844
epoch: 1	 train_loss: 2.495e-01	 train_accuracy:0.908	 val_loss: 2.375e-01	 val_accuracy

Run history:
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆███████
final_val_accuracy,▁
final_val_loss,▁
global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇██
train_accuracy,▁▅▆▇█
train_accuracy_step,▆▅▅█▃█▂▅▇▇▆▇▅▆▇▅▇▆▅▁▇▂▃▅▆▃▃▇▇▆▅▃▃▆▇▆▆▇▃▅
train_loss,█▄▃▂▁
train_loss_step,▃▃▁▄▁▆▁▃▁▁▃▃▁▅▂▅▁▂▁▁▄▂▂▂▂▂▂▂▁▃▂█▂▁▂▂▂▂▃▃
val_accuracy,▁▁█▄▅
val_loss,█▄▁▆▄

Run summary:
epoch,5.0
final_val_accuracy,0.91117
final_val_loss,0.22861
global_step,6800.0
train_accuracy,0.92744
train_accuracy_step,0.90625
train_loss,0.18788
train_loss_step,0.24032
val_accuracy,0.91117
val_loss,0.22861



Running experiment 5/6: Config 4


Epoch: 1, Step: 100, Train Loss: 3.759e-01, Train Accuracy: 0.875
Epoch: 1, Step: 200, Train Loss: 2.342e-01, Train Accuracy: 1.000
Epoch: 1, Step: 300, Train Loss: 3.366e-01, Train Accuracy: 0.938
Epoch: 1, Step: 400, Train Loss: 4.158e-01, Train Accuracy: 0.812
Epoch: 1, Step: 500, Train Loss: 4.680e-01, Train Accuracy: 0.812
Epoch: 1, Step: 600, Train Loss: 2.483e-01, Train Accuracy: 0.938
Epoch: 1, Step: 700, Train Loss: 1.695e-01, Train Accuracy: 0.938
Epoch: 1, Step: 800, Train Loss: 2.694e-01, Train Accuracy: 0.875
Epoch: 1, Step: 900, Train Loss: 1.519e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1000, Train Loss: 2.320e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1100, Train Loss: 1.044e-01, Train Accuracy: 1.000
Epoch: 1, Step: 1200, Train Loss: 5.118e-01, Train Accuracy: 0.812
Epoch: 1, Step: 1300, Train Loss: 2.271e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1400, Train Loss: 3.985e-01, Train Accuracy: 0.812
Epoch: 1, Step: 1500, Train Loss: 3.552e-01, Train Accuracy: 0.750
Epoc

Run history:
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆██████
final_val_accuracy,▁
final_val_loss,▁
global_step,▁▁▁▁▂▃▃▃▃▃▃▃▃▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇█████
train_accuracy,▁▆▇██
train_accuracy_step,█▅█▄▄▅▄▇▄▁▇▂█▄▇▄▄▇▄▇▅▄▅▇█▄▇▅▅▇▄▅▇▄▂▄▅▇▇▇
train_loss,█▃▂▂▁
train_loss_step,▃▄▅▂▅▄█▄▅▃▃▄▂▂▂▃▃▆▁▂▂▄▅▅▅▁▂▅▂▂▂▅▃▆▅▃▂▂▁▇
val_accuracy,▁▄▆█▄
val_loss,█▂▁▁▂

Run summary:
epoch,5.0
final_val_accuracy,0.89595
final_val_loss,0.26241
global_step,13600.0
train_accuracy,0.9063
train_accuracy_step,1.0
train_loss,0.24267
train_loss_step,0.07672
val_accuracy,0.89595
val_loss,0.26241



Running experiment 6/6: Config 5


Unfroze the last 5 layers out of 22 total layers.
Epoch: 1, Step: 100, Train Loss: 2.666e-01, Train Accuracy: 0.938
Epoch: 1, Step: 200, Train Loss: 7.686e-01, Train Accuracy: 0.688
Epoch: 1, Step: 300, Train Loss: 2.469e-01, Train Accuracy: 0.938
Epoch: 1, Step: 400, Train Loss: 4.307e-01, Train Accuracy: 0.812
Epoch: 1, Step: 500, Train Loss: 1.394e-01, Train Accuracy: 1.000
Epoch: 1, Step: 600, Train Loss: 2.851e-01, Train Accuracy: 0.875
Epoch: 1, Step: 700, Train Loss: 1.504e-01, Train Accuracy: 1.000
Epoch: 1, Step: 800, Train Loss: 4.758e-01, Train Accuracy: 0.688
Epoch: 1, Step: 900, Train Loss: 2.854e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1000, Train Loss: 2.671e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1100, Train Loss: 4.357e-01, Train Accuracy: 0.812
Epoch: 1, Step: 1200, Train Loss: 1.646e-02, Train Accuracy: 1.000
Epoch: 1, Step: 1300, Train Loss: 1.609e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1400, Train Loss: 3.722e-01, Train Accuracy: 0.875
Epoch: 1, Step: 1500,

Run history:
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆████████
final_val_accuracy,▁
final_val_loss,▁
global_step,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▅▅▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇██
train_accuracy,▁▄▅▆█
train_accuracy_step,▁█▃▁█▃▃▃▆█▆▃▃▆▃█▁▃▃▃▃█▃▆█▆██▆▆█▆███▃▆▆▆█
train_loss,█▅▄▃▁
train_loss_step,█▂▁▃▃▁▂▂▄▂▁▁▄▃▃▃▃▂▁▂▂▁▃▁▂▁▂▁▃▁▂▁▂▄▃▂▁▂▂▂
val_accuracy,▇██▄▁
val_loss,▃▁▂▅█

Run summary:
epoch,5.0
final_val_accuracy,0.8996
final_val_loss,0.26233
global_step,13600.0
train_accuracy,0.94627
train_accuracy_step,0.9375
train_loss,0.13542
train_loss_step,0.08831
val_accuracy,0.8996
val_loss,0.26233



Hyperparameter Tuning Results:
------------------------------------------------------------------------------------------------------------------------
Config     Hidden Size  Layers   Batch Size   Learning Rate  Weight Decay   Optimizer  Val Accuracy   
------------------------------------------------------------------------------------------------------------------------
Baseline   128          1        32           0.001          0.0001         Adam       0.9017
Config 1   256          2        16           0.0005         0.0001         AdamW      0.8969
Config 2   512          1        64           0.0001         0.0001         SGD        0.8008
Config 3   64           3        32           0.001          0.0001         Adam       0.9112
Config 4   128          2        16           0.0001         0.0001         AdamW      0.8960
Config 5   32           2        16           0.0001         0.0001         Adam       0.8996
-----------------------------------------------------------
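
The selection step at the end of `run_hyperparameter_tuning` amounts to taking the entry with the highest validation accuracy. A minimal sketch using the accuracies from the table above; the names `table_rows` and `pick_best` are illustrative and not part of the notebook.

```python
# Validation accuracies copied from the results table above.
table_rows = [
    {"config_name": "Baseline", "val_accuracy": 0.9017},
    {"config_name": "Config 1", "val_accuracy": 0.8969},
    {"config_name": "Config 2", "val_accuracy": 0.8008},
    {"config_name": "Config 3", "val_accuracy": 0.9112},
    {"config_name": "Config 4", "val_accuracy": 0.8960},
    {"config_name": "Config 5", "val_accuracy": 0.8996},
]


def pick_best(rows):
    """Return the row with the highest validation accuracy (first one wins ties)."""
    return max(rows, key=lambda row: row["val_accuracy"])


best = pick_best(table_rows)
print(f"Best configuration: {best['config_name']} with validation accuracy: {best['val_accuracy']:.4f}")
# prints: Best configuration: Config 3 with validation accuracy: 0.9112
```

Apart from the SGD run, the configurations land within roughly 1.5 percentage points of each other, so repeating the top candidates with a different seed is a reasonable check before treating the ranking as settled.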

## Task 3: Final Evaluation and Error Analysis

In this task, we'll:
1. Train the best model configuration from Task 2 on the combined training and validation data
2. Evaluate this final model on the test set
3. Generate predictions for the provided held-out test set
4. Perform detailed error analysis to understand the model's strengths and weaknesses
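
Step 1 calls for retraining on the union of the training and validation splits. A minimal sketch of that merge on plain Python lists of examples is shown below; the toy records and the helper name `combine_splits` are illustrative assumptions. With Hugging Face `datasets` objects, the analogous operation is `datasets.concatenate_datasets([dataset['train'], dataset['validation']])`.

```python
import random


def combine_splits(train_examples, val_examples, seed=0):
    """Concatenate two splits into a new list and shuffle it reproducibly."""
    combined = list(train_examples) + list(val_examples)
    random.Random(seed).shuffle(combined)
    return combined


# Toy records standing in for the real splits.
toy_train = [{"text": "Headline A", "label": 0}, {"text": "Headline B", "label": 1}]
toy_val = [{"text": "Headline C", "label": 1}]

combined = combine_splits(toy_train, toy_val)
print(len(combined))  # 3 examples: every record from both splits
```

Once the validation split is folded into training, validation metrics reported during that run are no longer an independent estimate, so the test set is the only held-out measure left.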

In [25]:
# First, train your best model from Task 2
# Replace these parameters with those from your best configuration in Task 2
# These values are identical to "Config 1" from Task 2. The Task 2 results
# table lists "Config 3" as the highest validation accuracy (0.9112).
best_config = {
    "config_name": "Config 1",
    "hidden_size": 256,    # Increased hidden size
    "num_layers": 2,       # Added an extra layer
    "batch_size": 16,      # Reduced batch size
    "optimizer": "AdamW",  # Changed optimizer
    "freeze_base": False,  # Unfreeze base model
    "learning_rate": 5e-4, # Adjusted learning rate
    "weight_decay": 1e-4,  # Do NOT change
    "num_epochs": 5        # Do NOT change
}

In [26]:
print("\nTraining final model with best configuration...")
final_model = get_model(
    model_name=best_model_name,
    num_classes=2,
    freeze_base=best_config["freeze_base"],
    hidden_size=best_config["hidden_size"],
    num_layers=best_config["num_layers"]
)


Training final model with best configuration...
Unfroze the last 5 layers out of 22 total layers.


In [27]:
final_model, _ = train(
    model=final_model,
    train_dataset=dataset['train'],
    val_dataset=dataset['validation'],
    num_epochs=best_config["num_epochs"],
    batch_size=best_config["batch_size"],
    optimizer_cls=best_config["optimizer"],
    lr=best_config["learning_rate"],
    weight_decay=best_config["weight_decay"],
    device=device,
    tokenizer=best_tokenizer,
    use_wandb=False
)

Epoch: 1, Step: 100, Train Loss: 2.230e-01, Train Accuracy: 0.875
Epoch: 1, Step: 200, Train Loss: 4.124e-01, Train Accuracy: 0.812
Epoch: 1, Step: 300, Train Loss: 1.058e-01, Train Accuracy: 0.938
Epoch: 1, Step: 400, Train Loss: 3.652e-01, Train Accuracy: 0.812
Epoch: 1, Step: 500, Train Loss: 7.551e-02, Train Accuracy: 1.000
Epoch: 1, Step: 600, Train Loss: 2.432e-01, Train Accuracy: 0.875
Epoch: 1, Step: 700, Train Loss: 1.513e-01, Train Accuracy: 0.938
Epoch: 1, Step: 800, Train Loss: 6.277e-01, Train Accuracy: 0.750
Epoch: 1, Step: 900, Train Loss: 1.559e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1000, Train Loss: 4.063e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1100, Train Loss: 6.766e-02, Train Accuracy: 1.000
Epoch: 1, Step: 1200, Train Loss: 2.734e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1300, Train Loss: 1.682e-01, Train Accuracy: 0.938
Epoch: 1, Step: 1400, Train Loss: 4.019e-02, Train Accuracy: 1.000
Epoch: 1, Step: 1500, Train Loss: 1.947e-01, Train Accuracy: 0.938
Epoc

In [28]:
print("\nEvaluating final model on test set...")
test_loss, test_acc, test_preds, test_labels, test_logits = evaluate(
    final_model,
    dataset['test'],
    batch_size=32,
    device=device,
    tokenizer=best_tokenizer
)

print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test Loss: {test_loss:.4f}")



Evaluating final model on test set...
Test Accuracy: 0.9095
Test Loss: 0.2308


In [29]:
# Generate predictions for the held-out test set (second-test-data-DIST.json)
test_file_path = 'second-test-data-DIST.json'
if os.path.exists(test_file_path):
    print(f"\nFound test file: {test_file_path}")
    print("Generating predictions for the held-out test set...")

    predict_on_test_set(
        model=final_model,
        tokenizer=best_tokenizer,
        test_file_path=test_file_path,
        output_file_path='test-results.txt',
        device=device
    )

    # Verify the test-results.txt file was created successfully
    if os.path.exists('test-results.txt'):
        with open('test-results.txt', 'r') as f:
            predictions = f.read().strip().split('\n')
        print(f"SUCCESS: Created test-results.txt with {len(predictions)} predictions")
        # Slicing already handles lists shorter than 5 elements
        print(f"Sample predictions (first 5): {predictions[:5]}")
    else:
        print("ERROR: Failed to create test-results.txt. Please check for errors.")
else:
    print(f"\nERROR: Test file {test_file_path} not found!")
    print("You need this file for your submission. Please make sure it's in your working directory.")
    print(f"If you're working in Colab, upload the {test_file_path} file to your session.")



Found test file: second-test-data-DIST.json
Generating predictions for the held-out test set...
Loaded 882 examples from second-test-data-DIST.json
Predictions saved to test-results.txt
SUCCESS: Created test-results.txt with 882 predictions
Sample predictions (first 5): ['0', '0', '1', '1', '1']


## Error Analysis


In [33]:
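# confusion_matrix comes from scikit-learn; imported here so this cell runs on its own
from sklearn.metrics import confusion_matrix

# Evaluate on validation set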
print("\nEvaluating final model on validation set...")
val_loss, val_acc, val_preds, val_labels, val_logits = evaluate(
    final_model,
    dataset['validation'],
    batch_size=32,
    device=device,
    tokenizer=best_tokenizer
)

# Confusion matrix (rows: true labels, columns: predicted labels)
val_conf_matrix = confusion_matrix(val_labels, val_preds)
print("\nConfusion Matrix for Validation set:")
print(val_conf_matrix)
print("="*50)
print(f"Validation Accuracy: {val_acc:.4f}")
print(f"Validation Loss: {val_loss:.4f}")
print("="*50)
# In the messages below, "Real" means non-clickbait (label 0) and "Fake" means clickbait (label 1)
print(f"False Positives (Real to Fake): {val_conf_matrix[0][1]}")
print(f"False Negatives (Fake to Real): {val_conf_matrix[1][0]}")
print("="*50)

# Get validation texts
val_texts = [item['text'] for item in dataset['validation']]

# Analyze misclassified examples
misclassified_indices = [i for i, (pred, label) in enumerate(zip(val_preds, val_labels)) if pred != label]

if misclassified_indices:
    print("\nAnalyzing misclassified examples...")
    print(f"Total misclassified: {len(misclassified_indices)}")

    # Print first 5 examples of each error type
    false_positives = [i for i in misclassified_indices if val_preds[i] == 1]
    false_negatives = [i for i in misclassified_indices if val_preds[i] == 0]

    if false_positives:
        print("\nFalse Positives (Real news predicted as Fake):")
        for i in false_positives[:5]:
            print(f"\nExample {i}:")
            print(f"Text: {val_texts[i]}")
            print(f"True Label: Real")
            print(f"Predicted Label: Fake")

    if false_negatives:
        print("\nFalse Negatives (Fake news predicted as Real):")
        for i in false_negatives[:5]:
            print(f"\nExample {i}:")
            print(f"Text: {val_texts[i]}")
            print(f"True Label: Fake")
            print(f"Predicted Label: Real")
else:
    print("\nNo misclassified examples found!")


Evaluating final model on validation set...

Confusion Matrix for Validation set:
[[1310   68]
 [ 147  666]]
Validation Accuracy: 0.9016
Validation Loss: 0.2381
False Positives (Real to Fake): 68
False Negatives (Fake to Real): 147

Analyzing misclassified examples...
Total misclassified: 215

False Positives (Real news predicted as Fake):

Example 49:
Text: The Totally, Very Much Unedited Video Of Trump's Truck Photo Op
True Label: Real
Predicted Label: Fake

Example 99:
Text: Kim Kardashian's fashion icon may just be one of your favorite characters from 'The Office'
True Label: Real
Predicted Label: Fake

Example 123:
Text: Here Are The 2017 Grammy Nominations
True Label: Real
Predicted Label: Fake

Example 191:
Text: 5 Andra Day Songs You Should Know
True Label: Real
Predicted Label: Fake

Example 207:
Text: Why Jay Z's New Venture Fund Should Come As No Surprise
True Label: Real
Predicted Label: Fake

False Negatives (Fake news predicted as Real):

Example 18:
Text: Survey Points 
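
Beyond raw error counts, the confusion matrix printed above yields per-class metrics. A minimal sketch, assuming rows are true labels and columns are predicted labels (the scikit-learn convention) and that label 1 is clickbait; the function name `binary_metrics` is illustrative.

```python
def binary_metrics(conf_matrix):
    """Compute precision, recall, and F1 for the positive class (label 1).

    `conf_matrix` is [[TN, FP], [FN, TP]] with rows as true labels.
    """
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Counts taken from the validation confusion matrix above.
metrics = binary_metrics([[1310, 68], [147, 666]])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For this matrix, recall on the clickbait class (about 0.82) is lower than precision (about 0.91): the model misses clickbait headlines more often than it flags ordinary ones, which is where the error analysis is likely to be most informative.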

## Submission Instructions

When you're ready to submit your assignment, please follow these steps:

1. Submit this completed ipynb **without clearing any output cells**, renamed as `hw3.ipynb`
   - Make sure your code can load the best model configuration
   - Include necessary functions for tokenization, prediction, etc.
   - Make sure it can run the test predictions on `test-DIST.json`

2. Download the `test-results.txt` file that this notebook generated with your predictions on the provided test set. Manually check that:
   - Each line contains a single character: '1' for clickbait or '0' for non-clickbait
   - The file contains no other content

3. Read and complete the hw3 report (report.pdf). It has the following sections:
   - Part 1: Model Selection
     - Compare BERT vs ModernBERT performance
     - Explain your choice of model and why it performed better
     - Discuss any observations about tokenization differences
   
   - Part 2: Hyperparameter Tuning
     - Table with all configurations tested and their results
     - Analysis of how different hyperparameters affected performance
     - Include and explain wandb plots
     - Justify your final model configuration
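
The format check on `test-results.txt` described in step 2 can also be scripted. A minimal sketch, assuming the file sits in the working directory; the function name `find_invalid_lines` is illustrative and not part of the assignment code.

```python
from pathlib import Path


def find_invalid_lines(path):
    """Return 1-based line numbers whose content is not exactly '0' or '1'."""
    lines = Path(path).read_text().split("\n")
    # A single trailing newline at the end of the file is acceptable.
    if lines and lines[-1] == "":
        lines.pop()
    return [n for n, line in enumerate(lines, start=1) if line not in ("0", "1")]


results_path = Path("test-results.txt")
if results_path.exists():
    bad = find_invalid_lines(results_path)
    print("All lines valid" if not bad else f"Invalid lines: {bad}")
else:
    print(f"{results_path} not found")
```

This checks only the per-line format. It does not confirm that the number of predictions matches the number of test examples, so that count is still worth comparing by hand.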

Submit the following files to Gradescope:
- `hw3.ipynb`: Your code implementation
- `test-results.txt`: Predictions on the test set
- `report.pdf`: Your detailed report

Make sure your name and Andrew ID are on the first page of your report.