
# BERT Spam Classifier — End-to-End (Hugging Face + Transformers)

This notebook walks through a clean, reproducible pipeline:

1. **Install necessary libraries**  
2. **Import & configuration**  
3. **Load dataset** (`bourigue/data_email_spam`)  
4. **Normalize columns** to `text` and `label` (0=ham, 1=spam)  
5. **Create train/validation/test splits** (stratified)  
6. **Tokenize** with `bert-base-uncased`  
7. **Train** with `Trainer`  
8. **Evaluate** (accuracy, precision, recall, F1)  
9. **Save** model & tokenizer  
10. **Reload** and **predict** on sample inputs  


## Necessary libraries

In [None]:

# If running locally, uncomment the next line to install dependencies.
# In Colab, you can leave it as is or run it once.
# !pip install -U datasets transformers accelerate scikit-learn torch evaluate


## Imports and configuration

In [None]:

import os
import re
import numpy as np
from typing import Dict, Any, Tuple

from datasets import load_dataset, DatasetDict, ClassLabel
from sklearn.model_selection import train_test_split

import torch
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    pipeline,
    set_seed,
)

# --- Config (adjust as needed) ---
HF_DATASET = "bourigue/data_email_spam"
BASE_MODEL = "bert-base-uncased"
OUTPUT_DIR = "models/bert-spam"
MAX_LEN = 256
EPOCHS = 3
BATCH_SIZE = 16
LR = 2e-5
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.06
SEED = 42

os.makedirs(OUTPUT_DIR, exist_ok=True)
set_seed(SEED)



## Load the dataset

In [None]:

dataset_raw = load_dataset(HF_DATASET)
dataset_raw



DatasetDict({
    train: Dataset({
        features: ['text_combined', 'label'],
        num_rows: 85782
    })
})

## Quick peek at the first row

In [None]:

for split in dataset_raw:
    print(f"Split: {split}, size={len(dataset_raw[split])}")
    print(dataset_raw[split][0])
    break


Split: train, size=85782
{'text_combined': 'fark rssfeedsspamassassintaintorg url httpwwwnewsisfreecomclick482680291717 date 20020926t0822400100 img httpwwwnewsisfreecomimagesfarkazcentralgif azcentral thu 26 sep 2002 152859 0000 teen dies starvation stepfather puts bus', 'label': 0}


## Data preprocessing — normalize `text` and `label` columns

In [None]:

def guess_columns(example: Dict[str, Any]) -> Tuple[str, str]:
    text_candidates = ["text", "message", "email", "content", "body"]
    label_candidates = ["label", "labels", "category", "target", "spam"]

    keys = set(example.keys())
    text_key = next((k for k in text_candidates if k in keys), None)
    label_key = next((k for k in label_candidates if k in keys), None)

    if text_key is None:
        for k in example.keys():
            if isinstance(example[k], str):
                text_key = k
                break
    if label_key is None:
        for k in example.keys():
            if k != text_key and isinstance(example[k], (int, float, np.integer, np.floating, bool)):
                label_key = k
                break

    if text_key is None or label_key is None:
        raise ValueError(f"Could not infer text/label keys from example keys: {example.keys()}")
    return text_key, label_key

any_split = next(iter(dataset_raw.keys()))
first_row = dataset_raw[any_split][0]
text_key, label_key = guess_columns(first_row)
print(f"Inferred columns -> text: '{text_key}', label: '{label_key}'")

def clean_text(t: str) -> str:
    t = t.strip()
    t = re.sub(r"\s+", " ", t)
    return t

def to_int_label(val) -> int:
    # bool is a subclass of int, so it must be tested first to be reachable
    if isinstance(val, bool):
        return int(val)
    if isinstance(val, (int, np.integer)):
        return int(val)
    if isinstance(val, str):
        low = val.strip().lower()
        if low in {"spam", "1", "true", "yes"}:
            return 1
        if low in {"ham", "0", "false", "no"}:
            return 0
        # Any other string is treated as spam (same fallback as before)
        return 1
    # Floats and other numeric types
    return int(val)

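def normalize_split(ds):
    # Adds a cleaned `text` column and an integer `label` column.
    # The source text and label columns are kept (here `text_combined` stays
    # next to the new `text`); any other columns are dropped.
    return ds.map(
        lambda ex: {
            "text": clean_text(ex[text_key]),
            "label": to_int_label(ex[label_key]),
        },
        remove_columns=[c for c in ds.column_names if c not in [text_key, label_key]],
    )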

normalized_splits = {split: normalize_split(ds) for split, ds in dataset_raw.items()}
normalized = DatasetDict(normalized_splits)
normalized


Inferred columns -> text: 'text_combined', label: 'label'


DatasetDict({
    train: Dataset({
        features: ['text_combined', 'label', 'text'],
        num_rows: 85782
    })
})
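
The normalization helpers can be checked on toy inputs before mapping them over the full dataset. The sketch below is illustrative only: `clean_text` mirrors the helper above, while `LABEL_WORDS` and `label_from_string` are hypothetical names, and this variant raises on unknown strings instead of defaulting them to spam as the notebook does.

```python
import re

def clean_text(t: str) -> str:
    """Trim both ends and collapse whitespace runs to a single space."""
    return re.sub(r"\s+", " ", t.strip())

# Hypothetical lookup table, for illustration only.
LABEL_WORDS = {
    "spam": 1, "1": 1, "true": 1, "yes": 1,
    "ham": 0, "0": 0, "false": 0, "no": 0,
}

def label_from_string(raw: str) -> int:
    """Map a textual label to 0 (ham) or 1 (spam); raise on unknown values."""
    key = raw.strip().lower()
    if key not in LABEL_WORDS:
        raise ValueError(f"Unrecognized label: {raw!r}")
    return LABEL_WORDS[key]

cleaned = clean_text("  Win   a\tFREE\n prize  ")
print(cleaned)
print(label_from_string(" Spam "), label_from_string("ham"))
```

Raising on unknown labels is a stricter choice than the notebook's fallback; it surfaces unexpected label values early instead of silently counting them as spam.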

## Train/Validation/Test splits (stratified)

In [None]:

# If there's no explicit test split, make one from the first available split
if "test" not in normalized:
    base_split_name = next(iter(normalized.keys()))
    base_split = normalized[base_split_name]
    labels = base_split["label"]
    idx_train, idx_test = train_test_split(
        np.arange(len(labels)),
        test_size=0.2,
        random_state=SEED,
        stratify=labels
    )
    normalized = DatasetDict({
        "train": base_split.select(idx_train),
        "test":  base_split.select(idx_test),
    })

# Carve out a validation split (10% of train), stratified
train_labels = normalized["train"]["label"]
idx_train, idx_val = train_test_split(
    np.arange(len(train_labels)),
    test_size=0.1,
    random_state=SEED,
    stratify=train_labels
)

dataset = DatasetDict({
    "train": normalized["train"].select(idx_train),
    "validation": normalized["train"].select(idx_val),
    "test": normalized["test"]
})

# Attach human-readable class names (0 -> "ham", 1 -> "spam") via ClassLabel
dataset = dataset.cast_column("label", ClassLabel(names=["ham", "spam"]))
dataset


DatasetDict({
    train: Dataset({
        features: ['text_combined', 'label', 'text'],
        num_rows: 61762
    })
    validation: Dataset({
        features: ['text_combined', 'label', 'text'],
        num_rows: 6863
    })
    test: Dataset({
        features: ['text_combined', 'label', 'text'],
        num_rows: 17157
    })
})
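
The row counts printed above follow from the two split fractions. A minimal sketch of the arithmetic, assuming the held-out part is rounded up and the remainder goes to training (an assumption that reproduces the sizes shown); `split_sizes` is a hypothetical helper, not part of the notebook:

```python
import math
from typing import Tuple

def split_sizes(n_samples: int, held_out_fraction: float) -> Tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 85782                                      # rows in the raw dataset
n_train_full, n_test = split_sizes(n_total, 0.2)     # 20% test
n_train, n_val = split_sizes(n_train_full, 0.1)      # 10% of the rest for validation

print(n_train, n_val, n_test)
```

The net effect is roughly a 72/8/20 train/validation/test split of the original rows.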

## Tokenizer

In [None]:

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)

def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=MAX_LEN,
        padding=False,  # dynamic padding via data collator
    )

tokenized = dataset.map(tokenize_batch, batched=True, remove_columns=["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
list(tokenized.keys()), tokenized["train"][0].keys()


(['train', 'validation', 'test'],
 dict_keys(['text_combined', 'label', 'input_ids', 'token_type_ids', 'attention_mask']))
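
Setting `padding=False` at tokenization time and padding in the collator means each batch is padded only to its own longest sequence. The sketch below illustrates the saving on a toy batch; the token ids are made up apart from the assumed pad id of 0, and `pad_batch` is a hypothetical stand-in for what the collator does.

```python
from typing import List

PAD_ID = 0      # assumed pad token id, for illustration
MAX_LEN = 256   # same value as the notebook's MAX_LEN

def pad_batch(batch: List[List[int]], target_len: int) -> List[List[int]]:
    """Right-pad every sequence in the batch to target_len."""
    return [seq + [PAD_ID] * (target_len - len(seq)) for seq in batch]

# Three toy, already-tokenized sequences of different lengths.
batch = [[101, 7592, 102], [101, 2489, 3042, 2085, 102], [101, 102]]

dynamic = pad_batch(batch, max(len(seq) for seq in batch))  # longest in batch
static = pad_batch(batch, MAX_LEN)                          # fixed maximum

dynamic_tokens = sum(len(seq) for seq in dynamic)
static_tokens = sum(len(seq) for seq in static)
print(dynamic_tokens, static_tokens)
```

For short emails the difference can be large, which is why dynamic padding tends to speed up both training and evaluation.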

## Metrics

In [None]:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

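def compute_metrics(eval_pred):
    # eval_pred unpacks into raw logits and the gold label ids
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # average="binary" scores the positive class only (label 1 = spam)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    acc = accuracy_score(labels, preds)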
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
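
To make the four reported numbers concrete, here is a dependency-free sketch that computes them from confusion counts on a toy example, with spam (label 1) as the positive class. `binary_metrics` is a hypothetical helper and the labels and predictions are made up.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy data: 4 spam and 6 ham messages.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(labels, preds)
print(metrics)
```

Here two of four spam messages are caught (recall 0.5) and two of three spam predictions are correct (precision about 0.67), even though accuracy is 0.7. That gap is why F1 is a more informative headline metric than accuracy for spam detection.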



## Model and Trainer

In [None]:
# === Model & Trainer (GPU-friendly + version-agnostic) ===
import os, torch
from packaging import version
import transformers as _tf

# Use GPU in Colab if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Torch CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

num_labels = 2
id2label = {0: "ham", 1: "spam"}
label2id = {"ham": 0, "spam": 1}

config = AutoConfig.from_pretrained(
    BASE_MODEL, num_labels=num_labels, id2label=id2label, label2id=label2id
)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, config=config).to(device)

# Newer transformers releases renamed `evaluation_strategy` to `eval_strategy`.
# Pick whichever keyword the installed version accepts, so the full configuration
# (per-epoch evaluation, best-model selection, no external loggers) is used.
import inspect

_ta_params = inspect.signature(TrainingArguments.__init__).parameters
eval_key = "eval_strategy" if "eval_strategy" in _ta_params else "evaluation_strategy"

try:
    training_args = TrainingArguments(
        output_dir=os.path.join(OUTPUT_DIR, "trainer_runs"),
        logging_dir=os.path.join(OUTPUT_DIR, "logs"),
        learning_rate=LR,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=EPOCHS,
        weight_decay=WEIGHT_DECAY,
        warmup_ratio=WARMUP_RATIO,
        save_strategy="epoch",              # must match the evaluation strategy
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        fp16=torch.cuda.is_available(),     # use mixed precision on GPU
        report_to=[],                       # or ["tensorboard"]
        **{eval_key: "epoch"},
    )
except TypeError as e:
    # Reached only if a keyword above is rejected. The fallback trains without
    # per-epoch evaluation, best-model selection, warmup, or `report_to`, so
    # default loggers (such as wandb, if installed) may prompt for a login.
    # The run logged in this notebook took this path because the keyword
    # `evaluation_strategy` was rejected.
    print("Falling back to minimal TrainingArguments:", e)
    base_kwargs = dict(
        output_dir=os.path.join(OUTPUT_DIR, "trainer_runs"),
        logging_dir=os.path.join(OUTPUT_DIR, "logs"),
        learning_rate=LR,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=EPOCHS,
        logging_steps=100,
        save_steps=500,
        fp16=torch.cuda.is_available(),
        # GPU is used automatically when available
    )
    try:
        training_args = TrainingArguments(weight_decay=WEIGHT_DECAY, **base_kwargs)
    except TypeError:
        # Last resort for versions that do not accept weight_decay
        training_args = TrainingArguments(**base_kwargs)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer


Torch CUDA available: True
GPU: Tesla T4


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Falling back to legacy TrainingArguments (older transformers): TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'



<transformers.trainer.Trainer at 0x7d0a104b4740>

## Training

In [None]:

train_result = trainer.train()
train_result




Step,Training Loss
100,0.3379
200,0.176
300,0.1283
400,0.1302
500,0.1242
600,0.1036
700,0.1149
800,0.0841
900,0.1081
1000,0.0871


TrainOutput(global_step=11583, training_loss=0.03135137342808727, metrics={'train_runtime': 3762.8761, 'train_samples_per_second': 49.241, 'train_steps_per_second': 3.078, 'total_flos': 2.436219878229096e+16, 'train_loss': 0.03135137342808727, 'epoch': 3.0})
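
The `global_step=11583` in the output above can be reproduced from the training set size and the configuration. A minimal sketch, assuming a single device, no gradient accumulation, and a partial final batch that still counts as a step; the warmup figure is what `WARMUP_RATIO` would translate to when it is passed to the trainer, assuming the result is rounded up:

```python
import math

train_rows = 61762     # size of the train split shown earlier
batch_size = 16        # BATCH_SIZE
epochs = 3             # EPOCHS
warmup_ratio = 0.06    # WARMUP_RATIO

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

Knowing the total step count up front helps when choosing `logging_steps` and `save_steps`, since both are expressed in optimizer steps, not epochs.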

## Evaluation on validation and test sets

In [None]:

val_metrics = trainer.evaluate(tokenized["validation"])
test_metrics = trainer.evaluate(tokenized["test"])

print("Validation metrics:", val_metrics)
print("Test metrics:", test_metrics)


Validation metrics: {'eval_loss': 0.033346664160490036, 'eval_accuracy': 0.9943173539268542, 'eval_precision': 0.9964881474978051, 'eval_recall': 0.9921328671328671, 'eval_f1': 0.994305738063951, 'eval_runtime': 25.0558, 'eval_samples_per_second': 273.909, 'eval_steps_per_second': 17.122, 'epoch': 3.0}
Test metrics: {'eval_loss': 0.03302673622965813, 'eval_accuracy': 0.9944046161916419, 'eval_precision': 0.9967205434527993, 'eval_recall': 0.9920727442294242, 'eval_f1': 0.9943912129002104, 'eval_runtime': 62.465, 'eval_samples_per_second': 274.666, 'eval_steps_per_second': 17.178, 'epoch': 3.0}
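
Accuracy figures this close to 1.0 are easier to reason about as absolute error counts. The sketch below converts the reported accuracies back into the number of misclassified rows, using the split sizes shown earlier; `misclassified` is a hypothetical helper.

```python
def misclassified(n_rows: int, accuracy: float) -> int:
    """Number of wrong predictions implied by an accuracy over n_rows."""
    return round(n_rows * (1.0 - accuracy))

val_errors = misclassified(6863, 0.9943173539268542)
test_errors = misclassified(17157, 0.9944046161916419)

print(f"validation errors: {val_errors} of 6863")
print(f"test errors: {test_errors} of 17157")
```

With only a few dozen errors per split, reading the misclassified emails directly is a practical next step for spotting labeling noise or systematic failure modes.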


## Save model and tokenizer

In [None]:

trainer.save_model(OUTPUT_DIR)   # saves model & config
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Saved to: {OUTPUT_DIR}")


Saved to: models/bert-spam


In [None]:
# Zip the saved model folder
!zip -r bert-spam-model.zip {OUTPUT_DIR}

# Download to your local machine
from google.colab import files
files.download("bert-spam-model.zip")


  adding: models/bert-spam/ (stored 0%)
  adding: models/bert-spam/vocab.txt (deflated 53%)
  adding: models/bert-spam/config.json (deflated 50%)
  adding: models/bert-spam/logs/ (stored 0%)
  adding: models/bert-spam/logs/events.out.tfevents.1760760017.38fc27add6e2.972.1 (deflated 47%)
  adding: models/bert-spam/logs/events.out.tfevents.1760756222.38fc27add6e2.972.0 (deflated 67%)
  adding: models/bert-spam/training_args.bin (deflated 54%)
  adding: models/bert-spam/tokenizer.json (deflated 71%)
  adding: models/bert-spam/trainer_runs/ (stored 0%)
  adding: models/bert-spam/trainer_runs/checkpoint-10500/ (stored 0%)
  adding: models/bert-spam/trainer_runs/checkpoint-10500/vocab.txt (deflated 53%)
  adding: models/bert-spam/trainer_runs/checkpoint-10500/config.json (deflated 50%)
  adding: models/bert-spam/trainer_runs/checkpoint-10500/optimizer.pt (deflated 11%)
  adding: models/bert-spam/trainer_runs/checkpoint-10500/rng_state.pth (deflated 26%)
  adding: models/bert-spam/trainer_run


## Reload the saved model and predict on sample inputs

In [None]:

clf = pipeline(
    "text-classification",
    model=OUTPUT_DIR,
    tokenizer=OUTPUT_DIR,
    device=0 if torch.cuda.is_available() else -1,
    truncation=True
)

samples = [
    "Congratulations! You've won a $1000 gift card. Click the link to claim now.",
    "Hey, are we still on for lunch tomorrow at noon?",
    "URGENT: Your account will be suspended. Verify here: http://scam.example",
]

for s in samples:
    pred = clf(s, top_k=None)[0]
    print(f"TEXT: {s}\nPREDICTION: {pred['label']}  (score={pred['score']:.4f})\n")


Device set to use cuda:0


TEXT: Congratulations! You've won a $1000 gift card. Click the link to claim now.
PREDICTION: spam  (score=0.9998)

TEXT: Hey, are we still on for lunch tomorrow at noon?
PREDICTION: ham  (score=0.9975)

TEXT: URGENT: Your account will be suspended. Verify here: http://scam.example
PREDICTION: spam  (score=0.9994)
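
The `score` printed for each prediction is a probability derived from the model's two output logits. The sketch below assumes the pipeline applies a softmax over the logits, which is its usual behavior for a single-label classifier with two classes; the logit values are made up for illustration.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "ham", 1: "spam"}   # same mapping as the model config
logits = [-4.2, 4.3]               # made-up logits for one input

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{id2label[best]} (score={probs[best]:.4f})")
```

A gap of several logit units already yields a score above 0.999, so very high scores are common and do not by themselves indicate that the model is well calibrated.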



In [None]:
samples = [
    # Promotions / obvious spam
    "🔥 LIMITED TIME OFFER: Get 90% OFF on all items! Redeem now: http://bit.ly/xyz",
    "You’ve been selected for a FREE iPhone 15 Pro. Confirm your address!",
    "Winner! Claim your $500 Walmart gift card within 24 hours.",
    "URGENT: Your Netflix account will be cancelled. Update payment at http://netflix-payments.example",
    "Final notice: We attempted to deliver your parcel. Pay customs fee: http://tracking-fee.example",

    # Phishing / credential harvest
    "Security Alert: Multiple login attempts detected. Verify your identity here: https://secure-login.example",
    "Your mailbox is almost full. Increase your storage now to avoid losing emails.",
    "We suspended your banking account due to suspicious activity. Download the attached form and verify.",
    "Two-factor reset requested. If this wasn't you, re-enter your password to cancel.",

    # Ham: personal / school / work
    "Hey, can you send me the slides from yesterday’s class? I missed the last 10 minutes.",
    "Mom’s birthday dinner is at 7pm on Friday. Can you pick up the cake?",
    "I pushed the latest changes to the repo—check the PR and leave comments when free.",
    "Great meeting today. Let’s finalize the budget sheet by Tuesday and sync with finance.",
    "Reminder: Project stand-up tomorrow at 9:30 AM in Room 214.",

    # Borderline marketing (legit newsletter style)
    "Weekly Digest: Top 10 data engineering articles you should read",
    "Your order #948231 has shipped. Track your package here.",
    "Thanks for subscribing! Your 10% discount code is WELCOME10.",
    "Status update: Your return has been received. Refund will be processed in 5–7 days.",

    # Obfuscated / spammy tricks
    "C0ngr@tulations!!! You w0n a pr1ze. Cl1ck h3re n0w!!!",
    "Get r.i.c.h quick with our guaranteed system. No skills needed.",
    "This is not a scam. We just need your bank info to transfer the funds.",

    # Social / calendar / transactional ham
    "Google Calendar: Event invitation — ML Study Group, Monday 6–7 PM.",
    "Your verification code is 482199. Do not share this with anyone.",
    "Invoice INV-1042 is due on 11/05. View or pay online.",
    "Your Uber is arriving now. Driver: Alex (Toyota Corolla, Black).",
    "We received your support request (#39201). Our team will reply within 24 hours.",

    # Foreign language + mixed
    "Oferta especial por tiempo limitado: 70% de descuento en cursos online.",
    "Bonjour, pouvez-vous confirmer votre présence à la réunion de demain ?",
    "Пожалуйста, подтвердите адрес доставки для вашего заказа.",
]

# Run in batch for speed
preds = clf(samples, truncation=True)
for s, p in zip(samples, preds):
    print(f"TEXT: {s}\nPREDICTION: {p['label']}  (score={p['score']:.4f})\n")


TEXT: 🔥 LIMITED TIME OFFER: Get 90% OFF on all items! Redeem now: http://bit.ly/xyz
PREDICTION: spam  (score=1.0000)

TEXT: You’ve been selected for a FREE iPhone 15 Pro. Confirm your address!
PREDICTION: spam  (score=0.9998)

TEXT: Winner! Claim your $500 Walmart gift card within 24 hours.
PREDICTION: spam  (score=1.0000)

TEXT: URGENT: Your Netflix account will be cancelled. Update payment at http://netflix-payments.example
PREDICTION: spam  (score=0.9998)

TEXT: Final notice: We attempted to deliver your parcel. Pay customs fee: http://tracking-fee.example
PREDICTION: spam  (score=0.9978)

TEXT: Security Alert: Multiple login attempts detected. Verify your identity here: https://secure-login.example
PREDICTION: spam  (score=0.9964)

TEXT: Your mailbox is almost full. Increase your storage now to avoid losing emails.
PREDICTION: spam  (score=0.9978)

TEXT: We suspended your banking account due to suspicious activity. Download the attached form and verify.
PREDICTION: spam  (score=0.9

## (Optional) Helper: classify your own text

In [None]:

def classify(text: str):
    res = clf(text, top_k=None)[0]
    print(f"Prediction: {res['label']} (score={res['score']:.4f})")
    return res

# Example:
classify("Limited time offer!!! Reply STOP to unsubscribe.")


Prediction: spam (score=1.0000)


{'label': 'spam', 'score': 0.9999948740005493}
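
In a mail filter, a false positive (ham flagged as spam) is often costlier than a missed spam, so it can help to act only on confident spam predictions. A minimal sketch working on the result dictionaries returned above; `decide` and the 0.9 threshold are hypothetical choices, not part of the notebook.

```python
def decide(result: dict, spam_threshold: float = 0.9) -> str:
    """Return 'spam' only for confident spam predictions, otherwise 'ham'."""
    confident_spam = (
        result["label"] == "spam" and result["score"] >= spam_threshold
    )
    return "spam" if confident_spam else "ham"

# Result dictionaries shaped like the pipeline output.
examples = [
    {"label": "spam", "score": 0.9999948740005493},
    {"label": "spam", "score": 0.62},
    {"label": "ham", "score": 0.9975},
]

decisions = [decide(r) for r in examples]
print(decisions)
```

The threshold would normally be tuned on the validation split by trading precision against recall.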