# Lab 3-2 — Weak Supervision: Data Augmentation for Email Classification

In this lab we use Snorkel **transformation functions (TFs)** to synthetically expand our labeled email training set, then measure whether augmentation improves downstream classifier performance.

Data augmentation creates new valid training examples by applying class-preserving transformations to existing ones. In text, this means small edits that change surface form but preserve meaning and label.

## What Changes vs Lab 1

Lab 1 generated labels from scratch using labeling functions on *unlabeled* data.
This lab starts with *labeled* training data (gold labels kept) and focuses on **expanding** it.
The goal is to give the classifier more variation to learn from.

## 1. Setup and Loading Data

In [1]:
import os
import re
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

random.seed(42)
np.random.seed(42)

pd.set_option("display.max_colwidth", 80)

In [2]:
# Load from lab1's data folder
df = pd.read_csv("../lab1/emails.csv")

df_train, df_test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["label"]
)
df_train = df_train.reset_index(drop=True)
df_test  = df_test.reset_index(drop=True)

# Gold labels are kept for both splits in this lab
Y_train = df_train["label"].values
Y_test  = df_test["label"].values

print(f"Training set: {len(df_train)} emails")
print(f"Test set:     {len(df_test)} emails")

Training set: 56 emails
Test set:     24 emails


In [3]:
df_train[["subject", "body", "label"]].head(8)

| | subject | body | label |
| --- | --- | --- | --- |
| 0 | Alert: Direct deposit failed | Your payroll deposit of $2,340 was returned. Verify your account now. | 1 |
| 1 | AWS bill for January - $234.12 | Your AWS invoice for January 2026 is now available in the billing console. | 0 |
| 2 | FREE MONEY - No catch, limited slots! | Government relief funds available. Apply now before slots run out. | 1 |
| 3 | Exclusive: Make $500/day from home | Join thousands earning from home. No experience needed. Sign up free! | 1 |
| 4 | BANK ALERT: Transaction declined - verify now | Your recent transaction was blocked. Confirm your identity immediately. | 1 |
| 5 | New hire: Sarah joining marketing Monday | Please join us in welcoming Sarah Chen to the marketing team starting Monday! | 0 |
| 6 | Special offer: 90% OFF today only | Hurry! This exclusive deal expires in 1 hour. Buy now!! | 1 |
| 7 | New documentation site launched | The new developer docs site is live at docs.example.com. Feedback welcome. | 0 |


## 2. Writing Transformation Functions

Transformation functions take a data point and return a **modified copy** (or `None` if no valid transformation applies). They must be **class-preserving** — the label should still be correct after the transformation.

For emails, good TFs are edits that a real sender might make: rewording a phrase, swapping a synonym, changing punctuation style, or slightly varying the subject line.
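
The two requirements above (return a modified copy or `None`, and leave the label alone) can be checked mechanically. The sketch below does this without Snorkel, representing each email as a plain dict. `check_tf_contract` and `toy_tf` are hypothetical helpers written for illustration; they are not part of the lab or of Snorkel.

```python
import copy

def check_tf_contract(tf, rows, label_key="label"):
    """Run `tf` on every row and verify the basic TF contract.

    Returns (n_applied, n_skipped). Raises AssertionError if the TF
    mutates its input, changes the label, or returns an unchanged copy.
    """
    n_applied = n_skipped = 0
    for row in rows:
        before = copy.deepcopy(row)
        out = tf(row)
        assert row == before, "TF mutated its input"
        if out is None:
            n_skipped += 1
            continue
        assert out[label_key] == row[label_key], "TF changed the label"
        assert out != row, "TF returned an unchanged copy"
        n_applied += 1
    return n_applied, n_skipped

def toy_tf(row):
    """Toy TF: drop exclamation marks from the subject, or skip the row."""
    if "!" not in row["subject"]:
        return None
    out = dict(row)
    out["subject"] = row["subject"].replace("!", "")
    return out

rows = [
    {"subject": "Buy now!", "body": "Deal ends soon.", "label": 1},
    {"subject": "Monthly report", "body": "See attachment.", "label": 0},
]
print(check_tf_contract(toy_tf, rows))  # (1, 1)
```

A check like this cannot prove that meaning is preserved, but it does catch TFs that silently return duplicates or alter the label.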

### a) Synonym substitution

Replace common phishing or work keywords with close synonyms. We use a small hand-built dictionary — no NLTK or external downloads needed.

In [9]:
from snorkel.augmentation import transformation_function

# Small domain-specific synonym map
SYNONYMS = {
    # Phishing synonyms
    "urgent": "immediate",
    "immediately": "right away",
    "verify": "confirm",
    "suspended": "deactivated",
    "claim": "collect",
    "reward": "prize",
    "free": "complimentary",
    "limited": "restricted",
    "account": "profile",
    "payment": "transaction",
    # Legitimate synonyms
    "meeting": "call",
    "feedback": "comments",
    "reminder": "heads-up",
    "attached": "enclosed",
    "schedule": "calendar",
    "team": "group",
    "review": "check",
    "complete": "finish",
    "update": "refresh",
    "submit": "send",
}

_TRAILING_PUNCT = "!?,.:"

def _swap_one_synonym(text):
    """Replace one randomly chosen dictionary word in `text`; return None if none match."""
    words = text.split()
    # Strip trailing punctuation the same way for lookup and for replacement
    candidates = [
        (i, w) for i, w in enumerate(words)
        if w.lower().rstrip(_TRAILING_PUNCT) in SYNONYMS
    ]
    if not candidates:
        return None
    i, word = random.choice(candidates)
    core = word.rstrip(_TRAILING_PUNCT)
    tail = word[len(core):]
    synonym = SYNONYMS[core.lower()]
    # Preserve the casing style of the original word (ALL CAPS or Capitalized)
    if core.isupper():
        synonym = synonym.upper()
    elif core[0].isupper():
        synonym = synonym.capitalize()
    words[i] = synonym + tail
    return " ".join(words)

@transformation_function()
def tf_synonym_subject(x):
    """Replace a word in the subject line with a synonym."""
    new_subject = _swap_one_synonym(x.subject)
    if new_subject is None:
        return None
    result = x.copy()
    result["subject"] = new_subject
    return result

@transformation_function()
def tf_synonym_body(x):
    """Replace a word in the body with a synonym."""
    new_body = _swap_one_synonym(x.body)
    if new_body is None:
        return None
    result = x.copy()
    result["body"] = new_body
    return result
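
Both synonym TFs return `None` when no dictionary word is present, so the size of the synonym map limits how many emails they can touch. A rough way to estimate that coverage is sketched below. `synonym_coverage` is a hypothetical helper, and the three-entry map and sample subjects are stand-ins for `SYNONYMS` and `df_train`.

```python
def synonym_coverage(texts, synonyms, punct="!?,.:"):
    """Fraction of texts containing at least one word found in `synonyms`."""
    def has_match(text):
        return any(w.lower().rstrip(punct) in synonyms for w in text.split())
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(has_match(t) for t in texts) / len(texts)

toy_synonyms = {"verify": "confirm", "free": "complimentary", "feedback": "comments"}
subjects = [
    "BANK ALERT: Transaction declined - verify now",
    "New documentation site launched",
    "FREE MONEY - No catch, limited slots!",
    "Homework feedback - great job on part 2",
]
print(f"{synonym_coverage(subjects, toy_synonyms):.2f}")  # 0.75
```

In the notebook, the same function could be called as `synonym_coverage(df_train["subject"], SYNONYMS)` before and after extending the dictionary.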

### b) Punctuation and casing variation

Phishing emails often use ALL CAPS and excessive punctuation. The TFs below remove these cues, which helps the model stay robust to surface-level formatting instead of relying on it.

In [10]:
@transformation_function()
def tf_remove_exclamations(x):
    """Strip exclamation marks so the model cannot rely on punctuation alone."""
    new_subject = x.subject.replace("!", "")
    # Collapse each run of "!" into one period so "now!!" becomes "now."
    new_body = re.sub(r"!+", ".", x.body)
    if new_subject == x.subject and new_body == x.body:
        return None
    result = x.copy()
    result["subject"] = new_subject
    result["body"] = new_body
    return result

@transformation_function()
def tf_lowercase_subject(x):
    """Convert an ALL CAPS subject to sentence case: same intent, different surface form."""
    # str.isupper() is False for subjects with no letters, so those are skipped too
    if not x.subject.isupper() or len(x.subject) < 5:
        return None
    result = x.copy()
    result["subject"] = x.subject.capitalize()
    return result
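
Runs of exclamation marks need some care when they are replaced rather than deleted. A per-character replacement turns `!!` into `..`, while a regular expression can collapse the whole run into a single period. A small illustration using only the standard library:

```python
import re

text = "Hurry! This exclusive deal expires in 1 hour. Buy now!!"

per_char = text.replace("!", ".")     # each "!" becomes its own "."
collapsed = re.sub(r"!+", ".", text)  # each run of "!" becomes one "."

print(per_char)   # Hurry. This exclusive deal expires in 1 hour. Buy now..
print(collapsed)  # Hurry. This exclusive deal expires in 1 hour. Buy now.
```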

### c) Subject line rewording

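Add or remove a common prefix to vary how the subject is phrased. Note that `tf_add_subject_prefix` picks its prefix based on the label, so in the augmented data each added prefix appears in only one class. It is worth checking that the classifier is not simply learning the prefix.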

In [11]:
PHISHING_PREFIXES = ["Action required: ", "Notice: ", "Alert: "]
LEGIT_PREFIXES    = ["Re: ", "Fwd: ", "Quick note: "]

@transformation_function()
def tf_add_subject_prefix(x):
    """Prepend a label-consistent prefix to the subject line."""
    prefixes = PHISHING_PREFIXES if x.label == 1 else LEGIT_PREFIXES
    # Skip if subject already starts with one of these
    if any(x.subject.startswith(p) for p in prefixes):
        return None
    result = x.copy()
    result["subject"] = random.choice(prefixes) + x.subject
    return result

@transformation_function()
def tf_strip_subject_prefix(x):
    """Remove a Re:/Fwd:/Alert: prefix if present."""
    match = re.match(r'^(Re|Fwd|Fw|Alert|Notice|Action required):\s*', x.subject, re.I)
    if not match:
        return None
    result = x.copy()
    result["subject"] = x.subject[match.end():]
    return result
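
Because one TF adds a prefix and the other strips it, applying them back to back can return the original subject unchanged, which produces an exact duplicate instead of a new example. The sketch below reproduces this with plain functions. `add_prefix` and `strip_prefix` are simplified, hypothetical stand-ins for the two TFs above.

```python
import re

PREFIX_RE = re.compile(r"^(Re|Fwd|Fw|Alert|Notice|Action required):\s*", re.I)

def add_prefix(subject, prefix):
    """Prepend `prefix` unless the subject already starts with it."""
    return subject if subject.startswith(prefix) else prefix + subject

def strip_prefix(subject):
    """Remove one leading Re:/Fwd:/Alert:-style prefix, if present."""
    return PREFIX_RE.sub("", subject, count=1)

original = "Direct deposit failed"
round_trip = strip_prefix(add_prefix(original, "Notice: "))

print(round_trip == original)  # True
```

Dropping exact duplicates of `subject` and `body` after augmentation is one way to guard against this.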

### Preview transformations

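Let's inspect what each TF actually does to a few examples before applying at scale. The preview shows one example per TF and only the subject column, so TFs that edit the body (`tf_synonym_body`, and `tf_remove_exclamations` when only the body contains `!`) look unchanged in the table. A TF that applies to no row is omitted, which is why `tf_lowercase_subject` does not appear below.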

In [12]:
def preview_tfs(df, tfs, random_state=42):
    """Show the first before/after example found for each TF (subject only)."""
    rows = []
    shuffled = df.sample(frac=1, random_state=random_state)
    for tf in tfs:
        for _, row in shuffled.iterrows():
            transformed = tf(row)
            if transformed is not None:
                rows.append({
                    "TF": tf.name,
                    "Original subject": row.subject,
                    "Transformed subject": transformed["subject"],
                    "Label": "PHISHING" if row.label == 1 else "LEGITIMATE",
                })
                break
    return pd.DataFrame(rows)

tfs = [
    tf_synonym_subject,
    tf_synonym_body,
    tf_remove_exclamations,
    tf_lowercase_subject,
    tf_add_subject_prefix,
    tf_strip_subject_prefix,
]

preview_tfs(df_train, tfs)

| | TF | Original subject | Transformed subject | Label |
| --- | --- | --- | --- | --- |
| 0 | tf_synonym_subject | Homework feedback - great job on part 2 | Homework comments - great job on part 2 | LEGITIMATE |
| 1 | tf_synonym_body | Alert: Direct deposit failed | Alert: Direct deposit failed | PHISHING |
| 2 | tf_remove_exclamations | New hire: Sarah joining marketing Monday | New hire: Sarah joining marketing Monday | LEGITIMATE |
| 3 | tf_add_subject_prefix | New hire: Sarah joining marketing Monday | Quick note: New hire: Sarah joining marketing Monday | LEGITIMATE |
| 4 | tf_strip_subject_prefix | Alert: Direct deposit failed | Direct deposit failed | PHISHING |


## 3. Applying Transformation Functions

We define a **policy** that controls how TFs are composed and how many augmented copies to generate per original example.

- `RandomPolicy`: picks TFs uniformly at random
- `MeanFieldPolicy`: picks TFs according to a custom probability distribution

We use `MeanFieldPolicy` to apply synonym TFs more often than structural ones, since synonym swaps produce more natural-sounding variations.
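
Conceptually, a mean-field policy draws each position of a TF sequence independently from the distribution `p`, and `keep_original=True` adds an empty sequence that passes the original row through unchanged. The sketch below imitates that behavior with the standard library. `sample_sequences` is a hypothetical helper written for illustration and is not Snorkel's implementation.

```python
import random

def sample_sequences(p, sequence_length=2, n_per_original=2,
                     keep_original=True, rng=None):
    """Return TF-index sequences for one example, mean-field style."""
    rng = rng or random.Random()
    sequences = [[]] if keep_original else []  # [] means "keep the original"
    for _ in range(n_per_original):
        # Each position is sampled independently according to the weights p
        sequences.append(rng.choices(range(len(p)), weights=p, k=sequence_length))
    return sequences

p = [0.30, 0.30, 0.15, 0.10, 0.10, 0.05]
seqs = sample_sequences(p, rng=random.Random(0))
print(seqs)  # one empty sequence followed by two sequences of two TF indices
```

Because positions are sampled independently, the same TF can appear twice in one sequence, and a pair such as "add prefix" then "strip prefix" is also possible.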

In [13]:
from snorkel.augmentation import PandasTFApplier, MeanFieldPolicy, RandomPolicy

# Weight synonym TFs higher — they produce more natural augmentations
mean_field_policy = MeanFieldPolicy(
    len(tfs),
    sequence_length=2,
    n_per_original=2,
    keep_original=True,
    p=[0.30, 0.30, 0.15, 0.10, 0.10, 0.05],
)

tf_applier = PandasTFApplier(tfs, mean_field_policy)
df_train_augmented = tf_applier.apply(df_train)
Y_train_augmented = df_train_augmented["label"].values

print(f"Original training set:  {len(df_train)} emails")
print(f"Augmented training set: {len(df_train_augmented)} emails")
print(f"Expansion factor: {len(df_train_augmented) / len(df_train):.1f}x")

100%|██████████| 56/56 [00:00<00:00, 1974.07it/s]

Original training set:  56 emails
Augmented training set: 125 emails
Expansion factor: 2.2x
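
The augmented set is smaller than the theoretical maximum. With `keep_original=True` and `n_per_original=2`, each of the 56 emails can contribute at most three rows. The gap is consistent with the applier discarding a sampled sequence when none of its TFs applies to the row (every TF in it returns `None`). That reading of the applier's behavior is an assumption here, and `max_augmented_size` is a hypothetical helper.

```python
def max_augmented_size(n_original, n_per_original, keep_original=True):
    """Upper bound on rows produced when every sampled sequence applies."""
    return n_original * (n_per_original + int(keep_original))

upper = max_augmented_size(56, n_per_original=2)
observed = 125  # size reported in the output above

print(upper)             # 168
print(upper - observed)  # 43
```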





In [14]:
# Spot-check a few augmented examples
df_train_augmented[["subject", "body", "label"]].sample(6, random_state=1)

| | subject | body | label |
| --- | --- | --- | --- |
| 32 | Security breach: change your password NOW | We detected a breach. Reset your password right away at http://reset.xyz | 1 |
| 18 | Act now: IRS tax refund pending | You are owed $1,240. Submit your SSN to process your refund today. | 1 |
| 19 | Quick note: Your subscription receipt - $12.99/month | Thank you for subscribing. Your receipt for February is attached. | 0 |
| 11 | Free vacation package - you qualify | You and a guest qualify for 5 nights in Cancun. Claim before midnight. | 1 |
| 28 | Library book due back next Monday | Just a heads up - your borrowed book is due back on Monday the 10th. | 0 |
| 29 | Monthly expense report due | Please submit your expense reports by the 5th for timely reimbursement. | 0 |


## 4. Training and Comparing Models

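We train the same `LogisticRegression` on both the original and augmented training sets, then compare accuracy on the same held-out test set, which is never augmented. If the augmentation is useful, the model trained on the augmented set should generalize at least as well as the baseline.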

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

def prepare_features(df_tr, df_te):
    """Bag-of-words on subject + body."""
    vec = CountVectorizer(ngram_range=(1, 2), min_df=1)
    X_tr = vec.fit_transform(df_tr["subject"] + " " + df_tr["body"])
    X_te = vec.transform(df_te["subject"] + " " + df_te["body"])
    return X_tr, X_te

def train_and_evaluate(X_tr, Y_tr, X_te, Y_te, label=""):
    clf = LogisticRegression(C=1.0, solver="liblinear", random_state=42)
    clf.fit(X_tr, Y_tr)
    preds = clf.predict(X_te)
    acc = accuracy_score(Y_te, preds)
    print(f"[{label}] Test accuracy: {acc * 100:.1f}%")
    return acc, preds

In [16]:
# Train on original data
X_train_orig, X_test = prepare_features(df_train, df_test)
acc_orig, preds_orig = train_and_evaluate(X_train_orig, Y_train, X_test, Y_test, "Original")

# Train on augmented data
X_train_aug, X_test_aug = prepare_features(df_train_augmented, df_test)
acc_aug, preds_aug = train_and_evaluate(X_train_aug, Y_train_augmented, X_test_aug, Y_test, "Augmented")

print(f"\nImprovement: {(acc_aug - acc_orig) * 100:+.1f} percentage points")

[Original] Test accuracy: 79.2%
[Augmented] Test accuracy: 83.3%

Improvement: +4.2 percentage points
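
With only 24 test emails, each email is worth about 4.2 percentage points, so the gain above corresponds to one additional correct prediction (20 instead of 19). A rough way to see how much uncertainty that leaves is a Wilson score interval for each accuracy. The sketch below uses only the standard library, and `wilson_interval` is a hypothetical helper that is not part of the lab.

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return center - half, center + half

for name, correct in [("Original", 19), ("Augmented", 20)]:
    lo, hi = wilson_interval(correct, 24)
    print(f"{name}: {correct}/24 correct, 95% interval {lo:.2f} to {hi:.2f}")
```

The two intervals overlap substantially, so the improvement is better read as suggestive than as conclusive.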


In [17]:
print("=== Original training data ===")
print(classification_report(Y_test, preds_orig, target_names=["LEGITIMATE", "PHISHING"]))

print("=== Augmented training data ===")
print(classification_report(Y_test, preds_aug, target_names=["LEGITIMATE", "PHISHING"]))

=== Original training data ===
              precision    recall  f1-score   support

  LEGITIMATE       0.89      0.67      0.76        12
    PHISHING       0.73      0.92      0.81        12

    accuracy                           0.79        24
   macro avg       0.81      0.79      0.79        24
weighted avg       0.81      0.79      0.79        24

=== Augmented training data ===
              precision    recall  f1-score   support

  LEGITIMATE       1.00      0.67      0.80        12
    PHISHING       0.75      1.00      0.86        12

    accuracy                           0.83        24
   macro avg       0.88      0.83      0.83        24
weighted avg       0.88      0.83      0.83        24



## Summary

**Key takeaways:**
- Transformation functions must be class-preserving — they change surface form, not meaning or label
- A `MeanFieldPolicy` lets you prioritize higher-quality TFs over weaker ones
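- Augmentation tends to help most on small datasets (like this one), where the classifier has limited variation to learn from. Here the gain is one additional correct test email out of 24 (19 to 20), so it is encouraging but not conclusive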
- The TFs here are intentionally lightweight — no NLP models, no external downloads

**What to try:**
- Add more synonyms to the dictionary and observe coverage changes
- Try `n_per_original=4` and see if more augmentation helps or starts to overfit
- Compare `RandomPolicy` vs `MeanFieldPolicy` accuracy