# Spam Classification using Encoder LLMs with Linear Probing [5 points]
In this part, we will use encoder Large Language Models (LLMs) for spam classification. We will leverage the rich features of pre-trained LLMs without fine-tuning them. Instead, we will freeze the LLM weights and train a lightweight classifier head (MLP) on top for spam classification.

**Dataset:** Enron Spam Dataset

**Expected Performance (Best Model):** {Accuracy: >85%, F1: >85%, Precision: >85%, Recall: >82%}

In [None]:
import os
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


1. Load the Enron Spam dataset. Use the train/val/test splits and tokenize the text using your pre-trained LLM’s tokenizer. Use your best judgement for the relevant input fields.

In [None]:
dataset = load_dataset("SetFit/enron_spam")
print("Initial dataset splits:", list(dataset.keys()))
if "validation" not in dataset:
    print("Creating a validation split from the train split...")
    split = dataset["train"].train_test_split(test_size=0.1, seed=42)
    dataset["train"] = split["train"]
    dataset["validation"] = split["test"]
print("Dataset splits after adjustment:", list(dataset.keys()))

def tokenize_dataset(dataset, tokenizer):

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

    tokenized_dataset = {}

    for split in dataset.keys():
        tokenized_split = dataset[split].map(tokenize_function, batched=True)
        if "label" in tokenized_split.column_names:
            tokenized_split = tokenized_split.rename_column("label", "labels")
        tokenized_split.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
        tokenized_dataset[split] = tokenized_split

    return tokenized_dataset



Initial dataset splits: ['train', 'test']
Creating a validation split from the train split...
Dataset splits after adjustment: ['train', 'test', 'validation']
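
To make the effect of `truncation=True, padding="max_length"` concrete, here is a toy sketch that pads or truncates a list of token ids to a fixed length and builds the matching attention mask. It is only an illustration: `pad_or_truncate` and the pad id `0` are names and values assumed here, and the real Hugging Face tokenizer additionally handles subwords and special tokens (for example, it keeps the closing special token when truncating, which this toy does not).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for fixed-length tokenizer output (not the Hugging Face API)."""
    ids = list(token_ids)[:max_length]          # truncation: drop anything past max_length
    n_pad = max_length - len(ids)               # padding needed to reach max_length
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the encoder should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the splits be batched as fixed-shape tensors; the attention mask is what keeps the padding from influencing the encoder.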


2. Model Setup – Probing:

   a. Load a pre-trained LLM (e.g., DistilBERT, BART-encoder) for sequence classification. Choose a lightweight encoder model that is amenable to your GPU size. Consider using DistilBERT, TinyBERT, MobileBERT, ALBERT, or others. **Specify the chosen LLM below.**

  **Chosen Encoder LLM:** DistilBERT (`distilbert-base-uncased`) <br>
DistilBERT is a compact variant of BERT produced by knowledge distillation: a smaller "student" model is trained to mimic a larger, well-trained "teacher" model. It retains about 97% of BERT's language understanding capability while being about 40% smaller and roughly 60% faster, which suits low-latency applications and resource-constrained devices. It gives up a little accuracy on some specialized tasks, but it can be fine-tuned for a wide range of NLP tasks, from text classification to question answering, so it remains a practical choice in both research and real-world applications.

   b. Freeze all base model weights and attach a lightweight MLP (the classification head) that maps the model’s representations to binary labels. You may want to create a separate model class that defines these components and a forward function or use out of the box 🤗 classification wrappers.

 c. Use the [CLS] token if available or mean-pooled final hidden states from the LLM as input to your classifier head.

In [None]:
class SpamClassifier(nn.Module):
    """Frozen pre-trained encoder with a trainable MLP head for binary classification."""

    def __init__(self, base_model_name, hidden_dim=128, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model_name)

        # Linear probing: freeze every encoder weight so only the head is trained.
        for param in self.encoder.parameters():
            param.requires_grad = False

        self.dropout = nn.Dropout(dropout)
        # Lightweight MLP head: hidden_size -> hidden_dim -> 2 logits (one per class).
        self.classifier = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 2)
        )

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the final hidden state at position 0 (the [CLS] token) as the sequence representation.
        cls_output = outputs.last_hidden_state[:, 0, :]
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)
        # Return loss and logits in the dict format the Trainer expects.
        return {"loss": loss, "logits": logits}
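
The classifier above reads the `[CLS]` position. Step 2c also allows mean-pooled final hidden states, and a minimal sketch of that alternative follows. The helper name `masked_mean_pool` is an assumption introduced here, not part of the notebook; it averages only over real tokens so that padding does not dilute the representation.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over non-padding positions only.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 0/1
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Tiny check: the third position is padding and must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

To try it in `SpamClassifier.forward`, the line that selects position 0 could be replaced with `masked_mean_pool(outputs.last_hidden_state, attention_mask)`; the rest of the head would stay unchanged. The results reported in this notebook were obtained with the `[CLS]` variant only.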

3. Configure your training parameters (learning rate, batch size, epochs) and train the model using only the classifier head while the LLM remains frozen.

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

def run_experiment(base_model_name, experiment_label, num_epochs=5):
    print(f"\nRunning experiment: {experiment_label} with model {base_model_name}")
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenized_dataset = tokenize_dataset(dataset, tokenizer)
    model = SpamClassifier(base_model_name, hidden_dim=128)
    model.to(device)
    output_dir = f"./results_{experiment_label}"
    training_args = TrainingArguments(
        output_dir=output_dir,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=1e-3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=num_epochs,
        weight_decay=0.01,
        logging_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        save_total_limit=1,
        report_to="none"
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        compute_metrics=compute_metrics,
    )
    trainer.train()
    metrics = trainer.evaluate(tokenized_dataset["test"])
    print(f"Test Metrics for {experiment_label}:")
    print(metrics)
    return metrics
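
Before training, it can be worth confirming that only the head is trainable. The sketch below counts trainable and frozen parameters by inspecting `requires_grad`. `count_parameters` and `TinyProbe` are hypothetical names used for illustration, and `TinyProbe` is a small stand-in so the example runs without downloading a model.

```python
import torch.nn as nn


def count_parameters(model):
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


class TinyProbe(nn.Module):
    """Stand-in with the same freeze pattern as a linear probe (assumption for the demo)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)        # plays the role of the pre-trained encoder
        for param in self.encoder.parameters():
            param.requires_grad = False       # frozen, as in the probing setup
        self.classifier = nn.Linear(4, 2)     # trainable head


trainable, frozen = count_parameters(TinyProbe())
print(f"trainable={trainable}, frozen={frozen}")
```

Applied to `SpamClassifier` with a 768-dimensional encoder and `hidden_dim=128`, the trainable count should equal the head size, 768 × 128 + 128 + 128 × 2 + 2 = 98,690 parameters; any larger number would indicate that part of the encoder was left unfrozen.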

4. Evaluation and Analysis:

a. Evaluate the model on the test set using accuracy, precision, recall, and F1-score.

In [None]:
metrics_distilbert = run_experiment("distilbert-base-uncased", "DistilBERT", num_epochs=5)


Running experiment: DistilBERT with model distilbert-base-uncased


Map:   0%|          | 0/28544 [00:00<?, ? examples/s]

Map:   0%|          | 0/3172 [00:00<?, ? examples/s]



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1339 | 0.081061 | 0.974464 | 0.967489 | 0.983476 | 0.975417 |
| 2 | 0.0945 | 0.068655 | 0.973518 | 0.974877 | 0.973684 | 0.97428 |
| 3 | 0.0947 | 0.066315 | 0.975095 | 0.979051 | 0.97246 | 0.975745 |
| 4 | 0.057 | 0.058345 | 0.979823 | 0.979243 | 0.98164 | 0.98044 |
| 5 | 0.0771 | 0.056693 | 0.979508 | 0.9769 | 0.983476 | 0.980177 |


Test Metrics for DistilBERT:
{'eval_loss': 0.03968503698706627, 'eval_accuracy': 0.9855, 'eval_precision': 0.9870646766169154, 'eval_recall': 0.9841269841269841, 'eval_f1': 0.9855936413313463, 'eval_runtime': 14.5649, 'eval_samples_per_second': 137.316, 'eval_steps_per_second': 8.582, 'epoch': 5.0}


   b. Select **two** encoder LLMs, repeat steps 2-4 for the second LLM, and compare and discuss any performance trends between the two models. **Specify the second chosen LLM below and report performance comparison.**

   **Second Chosen Encoder LLM:** ALBERT (`albert-base-v2`) <br>
   ALBERT-base-v2 is a parameter-efficient redesign of BERT. It shares parameters across transformer layers and factorizes the word-embedding matrix, which greatly reduces the parameter count while keeping strong language understanding. Pre-trained on large-scale text data, it captures contextual meaning and can be fine-tuned for a wide range of language tasks such as classification or question answering. Its small footprint makes it a practical choice for academic research and real-world applications where memory and scalability matter.

In [None]:
metrics_albert = run_experiment("albert-base-v2", "ALBERT", num_epochs=5)


Running experiment: ALBERT with model albert-base-v2


Map:   0%|          | 0/28544 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3172 [00:00<?, ? examples/s]




| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1416 | 0.092324 | 0.966898 | 0.960265 | 0.976132 | 0.968134 |
| 2 | 0.1062 | 0.077063 | 0.974779 | 0.968637 | 0.982864 | 0.975699 |
| 3 | 0.0876 | 0.06751 | 0.977301 | 0.974484 | 0.98164 | 0.978049 |
| 4 | 0.0868 | 0.065811 | 0.977301 | 0.971049 | 0.985312 | 0.978129 |
| 5 | 0.072 | 0.061235 | 0.978562 | 0.976857 | 0.98164 | 0.979243 |


Test Metrics for ALBERT:
{'eval_loss': 0.055980827659368515, 'eval_accuracy': 0.978, 'eval_precision': 0.9762845849802372, 'eval_recall': 0.9801587301587301, 'eval_f1': 0.9782178217821782, 'eval_runtime': 32.7682, 'eval_samples_per_second': 61.035, 'eval_steps_per_second': 3.815, 'epoch': 5.0}


   **Performance Comparison and Trend Discussion:**

   Test-set results for the two frozen-encoder probes:

   | Model | Accuracy | Precision | Recall | F1 | Test loss | Eval runtime (s) |
   |---|---|---|---|---|---|---|
   | DistilBERT | 0.9855 | 0.9871 | 0.9841 | 0.9856 | 0.0397 | 14.56 |
   | ALBERT | 0.9780 | 0.9763 | 0.9802 | 0.9782 | 0.0560 | 32.77 |

   Both probes already reach roughly 97% validation F1 after the first epoch (0.9754 for DistilBERT, 0.9681 for ALBERT) and gain only about one point or less over the remaining four epochs, which suggests the frozen features are close to separable for this task from the start.

c. The best model is expected to attain {Accuracy: >85%, F1: >85%, Precision: >85%, Recall: >82%}. Report whether your best model achieves these metrics and discuss.

In [None]:
print("\n=== Comparison of Models ===")
print("DistilBERT Metrics:", metrics_distilbert)
print("ALBERT Metrics:", metrics_albert)


=== Comparison of Models ===
DistilBERT Metrics: {'eval_loss': 0.03968503698706627, 'eval_accuracy': 0.9855, 'eval_precision': 0.9870646766169154, 'eval_recall': 0.9841269841269841, 'eval_f1': 0.9855936413313463, 'eval_runtime': 14.5649, 'eval_samples_per_second': 137.316, 'eval_steps_per_second': 8.582, 'epoch': 5.0}
ALBERT Metrics: {'eval_loss': 0.055980827659368515, 'eval_accuracy': 0.978, 'eval_precision': 0.9762845849802372, 'eval_recall': 0.9801587301587301, 'eval_f1': 0.9782178217821782, 'eval_runtime': 32.7682, 'eval_samples_per_second': 61.035, 'eval_steps_per_second': 3.815, 'epoch': 5.0}
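
The comparison against the expected minimums can be made explicit with a small check. This is a sketch: `EXPECTED_MINIMUMS` and `check_thresholds` are names introduced here, and the metric values are copied from the output above.

```python
# Minimums from the assignment statement (strictly greater-than).
EXPECTED_MINIMUMS = {
    "eval_accuracy": 0.85,
    "eval_f1": 0.85,
    "eval_precision": 0.85,
    "eval_recall": 0.82,
}


def check_thresholds(metrics, minimums=EXPECTED_MINIMUMS):
    """Map each required metric to True if it exceeds its minimum."""
    return {name: metrics[name] > floor for name, floor in minimums.items()}


reported = {
    "DistilBERT": {
        "eval_accuracy": 0.9855,
        "eval_precision": 0.9870646766169154,
        "eval_recall": 0.9841269841269841,
        "eval_f1": 0.9855936413313463,
    },
    "ALBERT": {
        "eval_accuracy": 0.978,
        "eval_precision": 0.9762845849802372,
        "eval_recall": 0.9801587301587301,
        "eval_f1": 0.9782178217821782,
    },
}

for model_name, metrics in reported.items():
    passed = check_thresholds(metrics)
    print(model_name, "meets all targets:", all(passed.values()))
```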


   **Performance vs. Expected Metrics Discussion:**

Both models comfortably exceed the expected thresholds (accuracy, F1, and precision above 85%, recall above 82%). On the test set DistilBERT reaches 98.55% accuracy, 98.71% precision, 98.41% recall, and 98.56% F1; ALBERT reaches 97.80%, 97.63%, 98.02%, and 97.82% respectively. DistilBERT is therefore the best model. Its test loss of 0.0397 (versus ALBERT's 0.0560) is also lower, and because it has both higher precision and higher recall on the same test set, it makes fewer false positives and fewer false negatives.

DistilBERT is also more than twice as fast at evaluation: it completes the test run in about 14.6 seconds (137 samples/sec), whereas ALBERT takes roughly 32.8 seconds (61 samples/sec). ALBERT's parameter sharing shrinks the model's size, not its per-token compute: `albert-base-v2` still executes 12 transformer layers against DistilBERT's 6, which likely accounts for most of the gap. ALBERT remains a strong option when a small model footprint matters most, but for the combination of accuracy and throughput, DistilBERT is the better choice here.
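
The claim about false positives and false negatives can be checked by recovering confusion-matrix counts from the reported metrics. The sketch below assumes the test split has 2,000 examples (an assumption, though consistent with `eval_runtime` × `eval_samples_per_second` ≈ 2000 in the output); `confusion_counts` is a helper introduced here for illustration.

```python
def confusion_counts(n_samples, accuracy, precision, recall):
    """Recover TP/FP/FN/TN for the positive class from aggregate metrics.

    Uses errors = FP + FN, FP = TP * (1 - precision) / precision,
    and FN = TP * (1 - recall) / recall.
    """
    errors = round(n_samples * (1.0 - accuracy))   # FP + FN
    fp_per_tp = (1.0 - precision) / precision      # FP / TP
    fn_per_tp = (1.0 - recall) / recall            # FN / TP
    tp = round(errors / (fp_per_tp + fn_per_tp))
    fp = round(tp * fp_per_tp)
    fn = round(tp * fn_per_tp)
    tn = n_samples - tp - fp - fn
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


N_TEST = 2000  # assumed test-set size, see the note above

distilbert = confusion_counts(N_TEST, 0.9855, 0.9870646766169154, 0.9841269841269841)
albert = confusion_counts(N_TEST, 0.978, 0.9762845849802372, 0.9801587301587301)
print("DistilBERT:", distilbert)
print("ALBERT:", albert)
```

Under that assumption, DistilBERT makes 13 false positives and 16 false negatives (29 errors), against 24 and 20 (44 errors) for ALBERT.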

5. References. Include details on all the resources used to complete this part.

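1) Enron Spam dataset (SetFit): https://huggingface.co/datasets/SetFit/enron_spam <br>
2) Hugging Face Datasets, loading datasets (v1.7.0): https://huggingface.co/docs/datasets/v1.7.0/loading_datasets.html <br>
3) Hugging Face Transformers, tokenizer documentation: https://huggingface.co/docs/transformers/en/main_classes/tokenizer <br>
4) Medium article on building a simple spam classifier with scikit-learn: https://medium.com/@devesh_kumar/building-a-simple-spam-classifier-using-scikit-learn-d3a84e6f3112 <br>
5) DistilBERT model card (`distilbert-base-uncased`): https://huggingface.co/distilbert/distilbert-base-uncased <br>
6) ALBERT model card (`albert-base-v2`): https://huggingface.co/albert/albert-base-v2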