# Experiment C: Multi-task PCL + Per-Category Classification

Train `PCLDeBERTa` with a **single unified head** (`1 + n_categories` logits):
- **logits[:, 0]**: overall PCL vs non-PCL — binary BCE + `pos_weight`
- **logits[:, 1:]**: multi-label per PCL category — BCE (one logit per category)
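
As a rough illustration of the `pos_weight` term: a common convention is the ratio of negative to positive training examples. Whether `compute_pos_weight` uses exactly this ratio is an assumption, and the helper name below is hypothetical. The counts are the train-split sizes reported in the Data Loading logs.

```python
def pos_weight_from_counts(n_total: int, n_pos: int) -> float:
    """Negatives-to-positives ratio, a common choice for BCE pos_weight (assumed here)."""
    if n_pos <= 0:
        raise ValueError("need at least one positive example")
    return (n_total - n_pos) / n_pos


# Train split reported in the Data Loading logs: 7118 paragraphs, 675 positives.
w = pos_weight_from_counts(7118, 675)
print(f"pos_weight = {w:.2f}")  # pos_weight = 9.55
```

With a weight of this size, each PCL paragraph contributes roughly as much to the binary loss as nine or ten non-PCL paragraphs.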

A single `PCLClassifierHead(n_out=1+n_categories)` produces all outputs from the
same shared pooled representation. Category labels come from
`dontpatronizeme_categories.tsv`: a multi-hot vector indicating which of the 7
PCL categories appear in each paragraph.  
Non-PCL paragraphs have all-zero category labels. This was checked against the
data: all 993 paragraphs with category annotations have `binary_label=1`.

**PCL categories** (7):
`Authority_voice`, `Compassion`, `Metaphors`, `Presupposition`,
`Shallow_solution`, `The_poorer_the_merrier`, `Unbalanced_power_relations`
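
The multi-hot encoding described above can be sketched as follows. This is a minimal illustration, not the code in `utils.data`; the function name `multi_hot` is hypothetical, and the category order is the one logged by the notebook.

```python
PCL_CATEGORIES = [
    "Authority_voice", "Compassion", "Metaphors", "Presupposition",
    "Shallow_solution", "The_poorer_the_merrier", "Unbalanced_power_relations",
]


def multi_hot(categories_present):
    """Return a 0/1 list aligned with PCL_CATEGORIES (hypothetical helper)."""
    present = set(categories_present)
    unknown = present - set(PCL_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if cat in present else 0 for cat in PCL_CATEGORIES]


# A paragraph annotated with two categories.
two_cats = multi_hot(["Shallow_solution", "Unbalanced_power_relations"])
print(two_cats)  # [0, 0, 0, 0, 1, 0, 1]

# A non-PCL paragraph has no annotations, so its vector is all zeros.
non_pcl = multi_hot([])
print(non_pcl)   # [0, 0, 0, 0, 0, 0, 0]
```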

Total loss = `binary_BCE + category_weight × category_BCE`  
A single `BCEWithLogitsLoss(reduction='none')` is applied to the combined
`(B, 1+n_categories)` output; the binary and category components are then
averaged separately before weighting.
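
A minimal sketch of this loss, under stated assumptions: `pos_weight` is assumed to rescale only the binary column (category columns use a weight of 1), and the function name `multitask_bce` is hypothetical. The loss inside `train_category_model` may differ in detail.

```python
import torch
import torch.nn as nn


def multitask_bce(logits, binary_labels, category_labels, pos_weight, category_weight):
    """total = binary_BCE + category_weight * category_BCE on (B, 1+n_categories) logits."""
    n_out = logits.shape[1]
    # Assumption: only the binary column (index 0) is up-weighted for positives.
    pw = torch.ones(n_out, dtype=logits.dtype, device=logits.device)
    pw[0] = pos_weight
    # Stack binary and category targets into one (B, 1+n_categories) tensor.
    targets = torch.cat([binary_labels.unsqueeze(1), category_labels], dim=1).to(logits.dtype)
    per_elem = nn.BCEWithLogitsLoss(reduction="none", pos_weight=pw)(logits, targets)
    binary_loss = per_elem[:, 0].mean()     # mean over the batch
    category_loss = per_elem[:, 1:].mean()  # mean over batch and categories
    return binary_loss + category_weight * category_loss, binary_loss, category_loss


torch.manual_seed(0)
logits = torch.randn(4, 8)                  # batch of 4, 1 binary + 7 category logits
binary = torch.tensor([1.0, 0.0, 0.0, 1.0])
cats = torch.zeros(4, 7)
cats[0, 1] = 1.0
cats[3, 6] = 1.0
total, b_loss, c_loss = multitask_bce(logits, binary, cats, pos_weight=9.0, category_weight=0.5)
print(total.item(), b_loss.item(), c_loss.item())
```

Averaging the two components separately keeps the binary term from being diluted by the seven category columns, so `category_weight` directly sets the relative influence of the auxiliary task.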

Fixed hyperparameters (not searched):
- VAL_FRACTION=0.15, BATCH_SIZE=32, NUM_EPOCHS=12, PATIENCE=4
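- `pooling=MEAN`: used as the default because DeBERTa-v3 is pretrained with replaced token detection (RTD), a per-token objective that gives the [CLS] position no dedicated sentence-level training signal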
- `warmup_fraction=0.10`, `label_smoothing=0.0` — fixed to save trials for key params

Searched: `lr`, `weight_decay`, `hidden_dim ∈ {0, 256}`, `dropout_rate ∈ {0.1, 0.3}`,
`head_lr_multiplier ∈ {1, 3, 5}`, `category_weight ∈ {0.1, 0.2, 0.5, 1.0}`
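
For reference, the fixed mean-pooling choice can be sketched as a mask-aware average over token embeddings. This is an illustrative version only; the function name `masked_mean_pool` is hypothetical and the pooling code in `utils.pcl_deberta` may be implemented differently.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions.

    hidden_states: (B, T, H) encoder output; attention_mask: (B, T) with 1 for real tokens.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts


hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])  # last token is padding
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # tensor([[2., 2.]])
```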

In [1]:
import os
import sys
import random
import logging
import gc
import json

import numpy as np
import torch
from transformers import AutoTokenizer
from sklearn.metrics import classification_report
import optuna
from optuna.visualization.matplotlib import (
    plot_optimization_history,
    plot_param_importances,
    plot_parallel_coordinate,
)
import matplotlib.pyplot as plt

sys.path.insert(0, "..")
from utils.data import load_data_categories, PCL_CATEGORIES
from utils.split import split_train_val
from utils.dataloaders import make_category_dataloaders
from utils.pcl_deberta import PCLDeBERTa, PoolingStrategy
from utils.optim import compute_pos_weight
from utils.training_loop import train_category_model

SEED = 42
DATA_DIR = "../data"
OUT_DIR = "out"
MODEL_NAME = "microsoft/deberta-v3-base"
MAX_LENGTH = 256
VAL_FRACTION = 0.15
BATCH_SIZE = 32
N_TRIALS = 20
NUM_EPOCHS = 12
PATIENCE = 4
N_EVAL_STEPS = 35
N_CATEGORIES = len(PCL_CATEGORIES)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # autotuning can pick non-deterministic kernels

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s:\t%(message)s")
LOG = logging.getLogger(__name__)
LOG.info(f"Device: {DEVICE}")
LOG.info(f"PCL categories ({N_CATEGORIES}): {PCL_CATEGORIES}")
os.makedirs(OUT_DIR, exist_ok=True)

2026-03-01 11:37:05,608 INFO:	Device: cuda
2026-03-01 11:37:05,609 INFO:	PCL categories (7): ['Authority_voice', 'Compassion', 'Metaphors', 'Presupposition', 'Shallow_solution', 'The_poorer_the_merrier', 'Unbalanced_power_relations']


## 1. Data Loading

`load_data_categories` returns DataFrames with columns:
`text`, `binary_label` (0/1), and one binary column per `PCL_CATEGORIES` entry.

In [2]:
train_df, dev_df = load_data_categories(DATA_DIR)
train_sub_df, val_sub_df = split_train_val(train_df, val_frac=VAL_FRACTION, seed=SEED)
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME)

LOG.info(f"Train: {len(train_sub_df)}, Val: {len(val_sub_df)}, Dev: {len(dev_df)}")
LOG.info("Category label counts in train_sub:")
for cat in PCL_CATEGORIES:
    n = train_sub_df[cat].sum()
    LOG.info(f"  {cat}: {n} ({n/len(train_sub_df)*100:.1f}%)")

train_df[train_df["binary_label"] == 1].head()

2026-03-01 11:37:05,915 INFO:	Train/val split: 7118 train, 1257 val (val_frac=0.15)
2026-03-01 11:37:05,918 INFO:	Train, val positive count: 675, 119

| par_id | text | binary_label | Authority_voice | Compassion | Metaphors | Presupposition | Shallow_solution | The_poorer_the_merrier | Unbalanced_power_relations |
|---|---|---|---|---|---|---|---|---|---|
| 33 | Arshad said that besides learning many new asp... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 34 | Fast food employee who fed disabled man become... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 42 | Vanessa had feelings of hopelessness in her fi... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 77 | In September , Major Nottle set off on foot fr... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 83 | The demographics of Pakistan and India are ver... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |


## 2. Hyperparameter Search

`category_weight` is the key experiment-specific hyperparameter, controlling how
much the per-category auxiliary loss influences the shared representation.

Secondary hyperparameters are narrowed to avoid wasting trials:
- `pooling` fixed to MEAN (DeBERTa-v3's RTD pretraining has no sentence-level objective on the [CLS] token)
- `warmup_fraction` fixed to 0.10, `label_smoothing` fixed to 0.0
- `hidden_dim ∈ {0, 256}` (0 = single linear, 256 = MLP)
- `dropout_rate ∈ {0.1, 0.3}` (only sampled when hidden_dim=256)
- `head_lr_multiplier ∈ {1, 3, 5}`
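
One plausible reading of `head_lr_multiplier` is that the classification head trains at `lr × head_lr_multiplier` while the encoder uses the base `lr`. The sketch below shows that setup with AdamW parameter groups. It is an assumption about how `train_category_model` builds its optimizer; `TinyModel`, its `encoder`/`head` attribute names, and `build_optimizer` are hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Stand-in with a backbone and a head (attribute names are hypothetical)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)  # placeholder for the DeBERTa encoder
        self.head = nn.Linear(16, 8)      # 1 binary + 7 category logits


def build_optimizer(model, lr, weight_decay, head_lr_multiplier):
    # Separate parameter groups let the freshly initialised head learn faster
    # than the pretrained encoder.
    groups = [
        {"params": model.encoder.parameters(), "lr": lr},
        {"params": model.head.parameters(), "lr": lr * head_lr_multiplier},
    ]
    return torch.optim.AdamW(groups, lr=lr, weight_decay=weight_decay)


opt = build_optimizer(TinyModel(), lr=1e-5, weight_decay=1e-3, head_lr_multiplier=3)
encoder_lr = opt.param_groups[0]["lr"]
head_lr = opt.param_groups[1]["lr"]
print(encoder_lr, head_lr)
```

With `head_lr_multiplier=1` both groups share the same learning rate, which recovers a single-rate optimizer.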

In [3]:
POOLING = PoolingStrategy.MEAN   # fixed: mean pooling, matching the experiment description above
EXP_NAME = "C_multitask"


def objective(trial: optuna.trial.Trial) -> float:
    lr              = trial.suggest_float("lr", 4e-6, 6e-5, log=True)
    weight_decay    = trial.suggest_float("weight_decay", 1e-5, 1e-2, log=True)
    hidden_dim      = trial.suggest_categorical("hidden_dim", [0, 256])
    dropout_rate    = trial.suggest_categorical("dropout_rate", [0.1, 0.3]) if hidden_dim > 0 else 0.0
    head_lr_mult    = trial.suggest_categorical("head_lr_multiplier", [1, 3, 5])
    category_weight = trial.suggest_categorical("category_weight", [0.1, 0.2, 0.5, 1.0])

    # Fixed — not worth spending trials on
    warmup_fraction = 0.10
    label_smoothing = 0.0

    LOG.info(f"[{EXP_NAME}] Trial {trial.number}: lr={lr:.2e}, hidden={hidden_dim}, "
             f"cat_w={category_weight}, head_lr_mult={head_lr_mult}")

    train_loader, val_loader, dev_loader = make_category_dataloaders(
        train_sub_df, val_sub_df, dev_df, BATCH_SIZE, MAX_LENGTH, tokeniser
    )

    model = PCLDeBERTa(
        hidden_dim=hidden_dim,
        dropout_rate=dropout_rate,
        pooling=POOLING,
        n_out=1 + N_CATEGORIES,
    ).to(DEVICE)

    pos_weight = compute_pos_weight(train_sub_df, DEVICE)

    results = train_category_model(
        model=model, device=DEVICE,
        train_loader=train_loader, val_loader=val_loader, dev_loader=dev_loader,
        pos_weight=pos_weight, lr=lr, weight_decay=weight_decay,
        num_epochs=NUM_EPOCHS, warmup_fraction=warmup_fraction,
        patience=PATIENCE, category_weight=category_weight,
        head_lr_multiplier=head_lr_mult,
        label_smoothing=label_smoothing,
        eval_every_n_steps=N_EVAL_STEPS,
        trial=trial,
    )

    trial.set_user_attr("best_val_f1",    results["best_val_f1"])
    trial.set_user_attr("best_threshold", results["best_threshold"])
    trial.set_user_attr("dev_f1",         results["dev_metrics"]["f1"])
    trial.set_user_attr("dev_precision",  results["dev_metrics"]["precision"])
    trial.set_user_attr("dev_recall",     results["dev_metrics"]["recall"])

    try:
        prev_best = trial.study.best_value
    except ValueError:
        prev_best = -float("inf")
    if results["best_val_f1"] > prev_best:
        torch.save(
            {k: v.cpu() for k, v in model.state_dict().items()},
            os.path.join(OUT_DIR, f"exp_{EXP_NAME}_best_model.pt")
        )
        config = {
            **trial.params,
            "pooling": POOLING.name,
            "warmup_fraction": warmup_fraction,
            "label_smoothing": label_smoothing,
            "batch_size": BATCH_SIZE, "num_epochs": NUM_EPOCHS, "patience": PATIENCE,
            "best_threshold": results["best_threshold"],
            "n_categories": N_CATEGORIES, "pcl_categories": PCL_CATEGORIES,
        }
        with open(os.path.join(OUT_DIR, f"exp_{EXP_NAME}_best_params.json"), "w") as f:
            json.dump(config, f, indent=2)
        LOG.info(f"[{EXP_NAME}] New best saved (val F1={results['best_val_f1']:.4f})")

    del model, train_loader, val_loader, dev_loader
    gc.collect()
    torch.cuda.empty_cache()
    return results["best_val_f1"]

## 3. Run Experiment

In [None]:
gc.collect()
torch.cuda.empty_cache()

study = optuna.create_study(
    direction="maximize",
    study_name=f"pcl_deberta_exp_{EXP_NAME}",
    sampler=optuna.samplers.TPESampler(seed=SEED),
    pruner=optuna.pruners.MedianPruner(n_startup_trials=6, n_warmup_steps=300),
)
study.optimize(objective, n_trials=N_TRIALS)

best = study.best_trial
LOG.info(f"Best trial: {best.number}")
LOG.info(f"Val F1: {best.user_attrs['best_val_f1']:.4f} | Dev F1: {best.user_attrs['dev_f1']:.4f}")
LOG.info(f"Best params: {best.params}")

[I 2026-03-01 11:37:07,343] A new study created in memory with name: pcl_deberta_exp_C_multitask
2026-03-01 11:37:07,345 INFO:	[C_multitask] Trial 0: lr=1.10e-05, hidden=0, cat_w=0.1, head_lr_mult=1

DebertaV2Model LOAD REPORT from: microsoft/deberta-v3-base
Key                                     | Status
----------------------------------------+-----------
lm_predictions.lm_head.bias             | UNEXPECTED
mask_predictions.LayerNorm.bias         | UNEXPECTED
mask_predictions.LayerNorm.weight       | UNEXPECTED
mask_predictions.classifier.bias        | UNEXPECTED
lm_predictions.lm_head.LayerNorm.bias   | UNEXPECTED
mask_predictions.dense.bias             | UNEXPECTED
lm_predictions.lm_head.dense.bias       | UNEXPECTED
lm_predictions.lm_head.LayerNorm.weight | UNEXPECTED
lm_predictions.lm_head.dense.weight     | UNEXPECTED
mask_predictions.dense.weight           | UNEXPECTED
mask_predictions.classifier.weight      | UNEXPECTED

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
2026-03-01 12:39:34,658 INFO:	Bac

## 4. Results

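Dev-set metrics for the binary logit (at the best trial's tuned threshold) and
per-category F1 (at a fixed 0.5 threshold), both computed with the best trial's saved model.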

In [None]:
for plot_fn, suffix in [
    (plot_optimization_history, "history"),
    (plot_param_importances, "importances"),
    (plot_parallel_coordinate, "parallel"),
]:
    plot_fn(study)
    plt.tight_layout()
    plt.savefig(f"{OUT_DIR}/{EXP_NAME}_optuna_{suffix}.png", dpi=300)
    plt.show()

best = study.best_trial
best_params = best.params

model = PCLDeBERTa(
    hidden_dim=best_params["hidden_dim"],
    dropout_rate=best_params.get("dropout_rate", 0.0),
    pooling=POOLING,
    n_out=1 + N_CATEGORIES,
).to(DEVICE)

state_dict = torch.load(
    os.path.join(OUT_DIR, f"exp_{EXP_NAME}_best_model.pt"), map_location=DEVICE
)
model.load_state_dict(state_dict)

_, _, dev_loader = make_category_dataloaders(
    train_sub_df, val_sub_df, dev_df, BATCH_SIZE, MAX_LENGTH, tokeniser
)

best_threshold = best.user_attrs["best_threshold"]
model.eval()
all_binary_scores, all_binary_labels = [], []
all_cat_probs, all_cat_labels = [], []

with torch.no_grad():
    for batch in dev_loader:
        input_ids      = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
        all_binary_scores.append(logits[:, 0].cpu())
        all_binary_labels.append(batch["labels"].cpu())
        all_cat_probs.append(torch.sigmoid(logits[:, 1:]).cpu())
        all_cat_labels.append(batch["category_labels"].cpu())

binary_probs  = torch.sigmoid(torch.cat(all_binary_scores))
binary_labels = torch.cat(all_binary_labels).long().numpy()
binary_preds  = (binary_probs >= best_threshold).long().numpy()
cat_probs  = torch.cat(all_cat_probs).numpy()   # (N, 7)
cat_labels = torch.cat(all_cat_labels).numpy()  # (N, 7)

print(f"\n{'='*60}")
print(f"{EXP_NAME.upper()} — Dev Set Binary Results (threshold={best_threshold:.3f})")
print(f"{'='*60}")
print(classification_report(binary_labels, binary_preds, target_names=["Non-PCL", "PCL"]))

from sklearn.metrics import f1_score as sk_f1
print(f"\n{'='*60}")
print("Per-category dev F1 (threshold=0.5):")
print(f"{'='*60}")
for i, cat in enumerate(PCL_CATEGORIES):
    cat_pred = (cat_probs[:, i] >= 0.5).astype(int)
    f1 = sk_f1(cat_labels[:, i].astype(int), cat_pred, zero_division=0)
    n_pos = int(cat_labels[:, i].sum())
    print(f"  {cat:<35s} F1={f1:.4f}  (dev positives: {n_pos})")

print("\nBest hyperparams:")
for k, v in best_params.items():
    print(f"  {k}: {v}")
print(f"  pooling: {POOLING.name} (fixed)")  # report the strategy actually used
print("  warmup_fraction: 0.10 (fixed)")
print("  label_smoothing: 0.0 (fixed)")

del model
gc.collect()
torch.cuda.empty_cache()