# Experiment F: DeBERTa-v3-large

Fine-tune `microsoft/deberta-v3-large` (~400M params, hidden=1024, 24 layers) on the PCL task.

## Memory strategy

DeBERTa-v3 is **incompatible with gradient checkpointing + gradient accumulation** due to its
relative position embedding cache being part of the computational graph. With gradient accumulation,
micro-batch 2's backward tries to reuse cached position tensors already freed by micro-batch 1:
- `use_reentrant=False` → tensor count mismatch error
- `use_reentrant=True` → "backward through freed graph" error

**Solution:** disable gradient checkpointing and rely on **bitsandbytes 8-bit AdamW** alone.

| Component | Memory |
|---|---|
| Weights (fp32) | 1.68 GB |
| Gradients (fp32) | 1.68 GB |
| 8-bit optimizer states | ~0.84 GB |
| Activations (no checkpointing, bs=4, seq=256) | ~1.0 GB |
| CUDA overhead | ~0.3 GB |
| **Total** | **~5.5 GB** ✓ |
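
The table's arithmetic can be reproduced with a short sketch. The parameter count is an assumption inferred from the 1.68 GB fp32 weight figure (1.68e9 bytes / 4 bytes per parameter); the activation and overhead figures are copied from the table rather than measured, and the small per-block quantisation constants kept by 8-bit optimizers are ignored.

```python
# Rough VRAM budget for full fine-tuning, in decimal GB to match the table above.
# ASSUMPTION: N_PARAMS is inferred from the 1.68 GB fp32 weight figure (1.68e9 / 4 bytes).
N_PARAMS = 420_000_000
GB = 1e9


def training_memory_gb(n_params, optimizer_bytes_per_param,
                       activations_gb=1.0, overhead_gb=0.3):
    """Return a per-component memory estimate (GB) plus the total."""
    budget = {
        "weights_fp32": n_params * 4 / GB,
        "gradients_fp32": n_params * 4 / GB,
        "optimizer_states": n_params * optimizer_bytes_per_param / GB,
        "activations": activations_gb,
        "cuda_overhead": overhead_gb,
    }
    budget["total"] = sum(budget.values())
    return budget


# AdamW keeps two moment estimates per parameter:
# 1 byte each in 8-bit mode, 4 bytes each in fp32.
budget_8bit = training_memory_gb(N_PARAMS, optimizer_bytes_per_param=2)
budget_fp32 = training_memory_gb(N_PARAMS, optimizer_bytes_per_param=8)

for name, budget in [("8-bit AdamW", budget_8bit), ("fp32 AdamW", budget_fp32)]:
    print(f"{name}: {budget['total']:.2f} GB")
```

Under these assumptions the 8-bit total comes to about 5.5 GB, while fp32 optimizer states would push the estimate to roughly 8 GB, which leaves essentially no headroom on an 8 GB card.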

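The estimate assumes `MICRO_BATCH_SIZE=4` with `ACCUMULATION_STEPS=8` (effective batch=32), which
fits comfortably within 8 GB and gives much better GPU utilisation than batch=1. The configuration
cell below runs the more conservative `MICRO_BATCH_SIZE=2`, `ACCUMULATION_STEPS=16`, which keeps the
effective batch at 32 while lowering activation memory further.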
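
Why different micro-batch/accumulation splits give the same effective batch can be checked on a toy problem. The sketch below is purely illustrative: it uses a scalar least-squares model rather than the notebook's `train_model`, and it assumes each micro-batch loss is a mean that is scaled by `1 / n_micro` before accumulation, which is the usual convention but is not confirmed here for `utils.training_loop`.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1 / n_micro."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


# Toy data: 32 examples, i.e. one effective batch.
xs = [float(i) for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full_grad = grad_mse(w, xs, ys)            # one batch of 32
acc_2x16 = accumulated_grad(w, xs, ys, 2)  # 16 micro-batches of 2
acc_4x8 = accumulated_grad(w, xs, ys, 4)   # 8 micro-batches of 4

print(full_grad, acc_2x16, acc_4x8)
```

All three gradients agree up to floating-point rounding, so the choice between 2 × 16 and 4 × 8 mainly trades memory against throughput. The equivalence relies on equally sized micro-batches and on the model having no batch-dependent layers.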

**No extra features** — isolates the effect of the larger backbone.

**Fixed corrections:** VAL_FRACTION=0.15, effective BATCH_SIZE=32, NUM_EPOCHS=12, PATIENCE=4,
pooling fixed at CLS_MEAN (SCALAR_MIX excluded: it hardcodes 13 hidden states, whereas the 24-layer
large model exposes 25).

In [1]:
import os
import sys
import random
import logging
import gc
import json

import numpy as np
import torch
from transformers import AutoTokenizer
from sklearn.metrics import classification_report
import optuna
from optuna.visualization.matplotlib import (
    plot_optimization_history,
    plot_param_importances,
    plot_parallel_coordinate,
)
import matplotlib.pyplot as plt

sys.path.insert(0, "..")
from utils.data import load_data
from utils.split import split_train_val
from utils.dataloaders import make_dataloaders
from utils.pcl_deberta import PCLDeBERTa, PoolingStrategy
from utils.optim import compute_pos_weight
from utils.training_loop import train_model
from utils.eval import evaluate

SEED = 42
DATA_DIR = "../data"
OUT_DIR = "out"
MODEL_NAME = "microsoft/deberta-v3-large"
MAX_LENGTH = 256
VAL_FRACTION = 0.15
# Gradient checkpointing is disabled (incompatible with DeBERTa-v3 + grad accumulation).
# Rely on bitsandbytes 8-bit Adam for memory savings instead.
MICRO_BATCH_SIZE = 2       # no checkpointing; the ~5.5 GB estimate assumes bs=4, so bs=2 leaves extra headroom
ACCUMULATION_STEPS = 16    # effective batch size = 2 * 16 = 32
N_TRIALS = 10
NUM_EPOCHS = 12
PATIENCE = 4
N_EVAL_STEPS = 35          # in optimizer steps (not micro-steps)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s:\t%(message)s")
LOG = logging.getLogger(__name__)
LOG.info(f"Device: {DEVICE}")
os.makedirs(OUT_DIR, exist_ok=True)

# Check bitsandbytes availability
try:
    import bitsandbytes as bnb
    USE_8BIT_ADAM = True
    LOG.info(f"bitsandbytes {bnb.__version__} available — will use 8-bit AdamW")
except ImportError:
    USE_8BIT_ADAM = False
    LOG.warning("bitsandbytes not installed — using standard AdamW (may OOM on 8 GB)")
    LOG.warning("Install with: pip install 'bitsandbytes>=0.43'")  # quoted so the shell does not treat >= as a redirect

if torch.cuda.is_available():
    total_vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    LOG.info(f"GPU: {torch.cuda.get_device_name(0)} | VRAM: {total_vram:.1f} GB")

2026-02-28 23:15:32,788 INFO:	Device: cuda
2026-02-28 23:15:32,789 INFO:	bitsandbytes 0.49.2 available — will use 8-bit AdamW
2026-02-28 23:15:32,790 INFO:	GPU: NVIDIA GeForce RTX 3060 Ti | VRAM: 8.0 GB


In [2]:
train_df, dev_df = load_data(DATA_DIR)
train_sub_df, val_sub_df = split_train_val(train_df, val_frac=VAL_FRACTION, seed=SEED)
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME)
LOG.info(f"Train: {len(train_sub_df)}, Val: {len(val_sub_df)}, Dev: {len(dev_df)}")

2026-02-28 23:15:33,008 INFO:	Train/val split: 7118 train, 1257 val (val_frac=0.15)
2026-02-28 23:15:33,009 INFO:	Train, val positive count: 675, 119
2026-02-28 23:15:33,136 INFO:	HTTP Request: HEAD https://huggingface.co/microsoft/deberta-v3-large/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-02-28 23:15:33,145 INFO:	HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/microsoft/deberta-v3-large/64a8c8eab3e352a784c658aef62be1662607476f/config.json "HTTP/1.1 200 OK"
2026-02-28 23:15:33,247 INFO:	HTTP Request: HEAD https://huggingface.co/microsoft/deberta-v3-large/resolve/main/tokenizer_config.json "HTTP/1.1 307 Temporary Redirect"
2026-02-28 23:15:33,255 INFO:	HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/microsoft/deberta-v3-large/64a8c8eab3e352a784c658aef62be1662607476f/tokenizer_config.json "HTTP/1.1 200 OK"
2026-02-28 23:15:33,356 INFO:	HTTP Request: GET https://huggingface.co/api/models/microsoft/deberta-v3-large/tree/main/addit

## 2. Hyperparameter Search

With only 10 trials, the full search space (~54,000 discrete combinations) is intractable:
TPE degenerates to near-random sampling. We narrow it to 12 discrete combinations by:

| Parameter | Change | Reason |
|---|---|---|
| `warmup_fraction` | **Fixed at 0.10** | Standard reliable value; 18 options wasted |
| `label_smoothing` | **Fixed at 0.0** | Rarely decisive at this scale |
| `pooling` | **Fixed at `CLS_MEAN`** | Generally strong |
| `hidden_dim` | `[0, 256]` (drop 128) | 256 is the meaningful MLP choice |
| `dropout_rate` | `[0.1, 0.3]` step=0.1, sampled only when `hidden_dim > 0` | Tighter range; ≤0.05 too close to 0 |
| `head_lr_multiplier` | `[3, 5, 10]` | Three-way categorical, as sampled in `objective` |
| `lr` | `[2e-6, 2e-5]` (narrowed) | Large models prefer lower LR than base |
| `weight_decay` | `[1e-4, 5e-3]` (narrowed) | Tighter, well-motivated range |

Resulting discrete combinations: `(1 + 3) (hidden × dropout) × 3 (head_lr)` = **12**,
plus two continuous dims (`lr`, `weight_decay`). 10 TPE trials can meaningfully explore this.
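
As a cross-check, the discrete configurations that the `objective` function in this notebook can actually sample can be enumerated directly. The value lists below are copied from that function; treating `dropout_rate` as 0.0 whenever `hidden_dim == 0` mirrors its conditional expression. This is a sketch for counting only and does not call Optuna.

```python
from itertools import product

# Discrete choices as they appear in the objective function.
hidden_dims = [0, 256]
dropout_rates = [0.1, 0.2, 0.3]   # only sampled when hidden_dim > 0
head_lr_multipliers = [3, 5, 10]

configs = set()
for hidden_dim, dropout, head_mult in product(hidden_dims, dropout_rates, head_lr_multipliers):
    # With no hidden layer the objective forces dropout to 0.0,
    # so the three dropout values collapse into one configuration.
    effective_dropout = dropout if hidden_dim > 0 else 0.0
    configs.add((hidden_dim, effective_dropout, head_mult))

n_discrete = len(configs)
print(n_discrete)
```

The conditional dropout is what keeps the count below the naive product of the list lengths (2 × 3 × 3 = 18).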

In [3]:
POOLING_MAP = {
    "cls": PoolingStrategy.CLS,
    "mean": PoolingStrategy.MEAN,
    "max": PoolingStrategy.MAX,
    "cls_mean": PoolingStrategy.CLS_MEAN,
}
EXP_NAME = "F_large"

WARMUP_FRACTION = 0.10   # fixed — reliable standard value
LABEL_SMOOTHING = 0.0    # fixed — rarely decisive for this task
POOLING = PoolingStrategy.CLS_MEAN  # fixed — generally strong


def objective(trial: optuna.trial.Trial) -> float:
    lr            = trial.suggest_float("lr", 2e-6, 2e-5, log=True)
    hidden_dim    = trial.suggest_categorical("hidden_dim", [0, 256])
    dropout_rate  = trial.suggest_float("dropout_rate", 0.1, 0.3, step=0.1) if hidden_dim > 0 else 0.0
    weight_decay  = trial.suggest_float("weight_decay", 1e-4, 5e-3, log=True)
    head_lr_mult  = trial.suggest_categorical("head_lr_multiplier", [3, 5, 10])

    LOG.info(f"[{EXP_NAME}] Trial {trial.number}: lr={lr:.2e}, pool={POOLING}, "
             f"hidden={hidden_dim}, wd={weight_decay:.1e}, head_lr_mult={head_lr_mult}")

    train_loader, val_loader, dev_loader = make_dataloaders(
        train_sub_df, val_sub_df, dev_df, MICRO_BATCH_SIZE, MAX_LENGTH, tokeniser
    )

    model = PCLDeBERTa(
        hidden_dim=hidden_dim,
        dropout_rate=dropout_rate,
        pooling=POOLING,
        model_name=MODEL_NAME,
        gradient_checkpointing=False,  # incompatible with DeBERTa-v3 + grad accumulation
    ).to(DEVICE)

    pos_weight = compute_pos_weight(train_sub_df, DEVICE)

    results = train_model(
        model=model, device=DEVICE,
        train_loader=train_loader, val_loader=val_loader, dev_loader=dev_loader,
        pos_weight=pos_weight, lr=lr, weight_decay=weight_decay,
        num_epochs=NUM_EPOCHS, warmup_fraction=WARMUP_FRACTION,
        patience=PATIENCE, head_lr_multiplier=head_lr_mult,
        label_smoothing=LABEL_SMOOTHING,
        eval_every_n_steps=N_EVAL_STEPS,
        accumulate_grad_batches=ACCUMULATION_STEPS,
        use_8bit_adam=USE_8BIT_ADAM,
        trial=trial,
    )

    trial.set_user_attr("best_val_f1",    results["best_val_f1"])
    trial.set_user_attr("best_threshold", results["best_threshold"])
    trial.set_user_attr("dev_f1",         results["dev_metrics"]["f1"])
    trial.set_user_attr("dev_precision",  results["dev_metrics"]["precision"])
    trial.set_user_attr("dev_recall",     results["dev_metrics"]["recall"])

    try:
        prev_best = trial.study.best_value
    except ValueError:
        prev_best = -float("inf")
    if results["best_val_f1"] > prev_best:
        torch.save(
            {k: v.cpu() for k, v in model.state_dict().items()},
            os.path.join(OUT_DIR, f"exp_{EXP_NAME}_best_model.pt")
        )
        config = {
            **trial.params,
            "warmup_fraction": WARMUP_FRACTION,
            "label_smoothing": LABEL_SMOOTHING,
            "pooling": POOLING.name,
            "batch_size": MICRO_BATCH_SIZE,
            "accumulation_steps": ACCUMULATION_STEPS,
            "effective_batch_size": MICRO_BATCH_SIZE * ACCUMULATION_STEPS,
            "num_epochs": NUM_EPOCHS,
            "patience": PATIENCE,
            "best_threshold": results["best_threshold"],
            "model_name": MODEL_NAME,
        }
        with open(os.path.join(OUT_DIR, f"exp_{EXP_NAME}_best_params.json"), "w") as f:
            json.dump(config, f, indent=2)
        LOG.info(f"[{EXP_NAME}] New best saved (val F1={results['best_val_f1']:.4f})")

    del model, train_loader, val_loader, dev_loader
    gc.collect()
    torch.cuda.empty_cache()
    return results["best_val_f1"]

## 3. Run Experiment

In [4]:
gc.collect()
torch.cuda.empty_cache()

In [5]:
gc.collect()
torch.cuda.empty_cache()

study = optuna.create_study(
    direction="maximize",
    study_name=f"pcl_deberta_exp_{EXP_NAME}",
    sampler=optuna.samplers.TPESampler(seed=SEED),
    pruner=optuna.pruners.MedianPruner(n_startup_trials=4, n_warmup_steps=200),
)
study.optimize(objective, n_trials=N_TRIALS)

best = study.best_trial
LOG.info(f"Best trial: {best.number}")
LOG.info(f"Val F1: {best.user_attrs['best_val_f1']:.4f} | Dev F1: {best.user_attrs['dev_f1']:.4f}")
LOG.info(f"Best params: {best.params}")

[I 2026-02-28 23:15:34,742] A new study created in memory with name: pcl_deberta_exp_F_large
2026-02-28 23:15:34,745 INFO:	[F_large] Trial 0: lr=4.74e-06, pool=PoolingStrategy.CLS_MEAN, hidden=0, wd=1.0e-03, head_lr_mult=3
2026-02-28 23:15:38,434 INFO:	HTTP Request: HEAD https://huggingface.co/microsoft/deberta-v3-large/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-02-28 23:15:38,443 INFO:	HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/microsoft/deberta-v3-large/64a8c8eab3e352a784c658aef62be1662607476f/config.json "HTTP/1.1 200 OK"
2026-02-28 23:15:38,743 INFO:	HTTP Request: HEAD https://huggingface.co/microsoft/deberta-v3-large/resolve/main/model.safetensors "HTTP/1.1 404 Not Found"
2026-02-28 23:15:38,845 INFO:	HTTP Request: GET https://huggingface.co/api/models/microsoft/deberta-v3-large "HTTP/1.1 200 OK"
2026-02-28 23:15:38,965 INFO:	HTTP Request: GET https://huggingface.co/api/models/microsoft/deberta-v3-large/commits/main "HTTP/1

Loading weights:   0%|          | 0/390 [00:00<?, ?it/s]

DebertaV2Model LOAD REPORT from: microsoft/deberta-v3-large
Key                                     | Status     |  | 
----------------------------------------+------------+--+-
mask_predictions.dense.weight           | UNEXPECTED |  | 
mask_predictions.dense.bias             | UNEXPECTED |  | 
lm_predictions.lm_head.bias             | UNEXPECTED |  | 
mask_predictions.classifier.weight      | UNEXPECTED |  | 
mask_predictions.LayerNorm.weight       | UNEXPECTED |  | 
lm_predictions.lm_head.dense.weight     | UNEXPECTED |  | 
lm_predictions.lm_head.dense.bias       | UNEXPECTED |  | 
mask_predictions.LayerNorm.bias         | UNEXPECTED |  | 
lm_predictions.lm_head.LayerNorm.weight | UNEXPECTED |  | 
mask_predictions.classifier.bias        | UNEXPECTED |  | 
lm_predictions.lm_head.LayerNorm.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
2026-02-28 23:15:41,233 INFO:	Ba

2026-03-01 03:22:16,840 INFO:	Ba

2026-03-01 04:42:08,296 INFO:	HTTP Request: GET https://huggingface.co/api/models/microsoft/deberta-v3-large/commits/refs%2Fpr%2F13 "HTTP/1.1 200 OK"
2026-03-01 06:02:52,193 INFO:	HTTP Request: GET https://huggingface.co/api/models/microsoft/deberta-v3-large/commits/refs%2Fpr%2F13 "HTTP/1.1 200 OK"
2026-03-01 09:47:35,733 INFO:	Ba

KeyboardInterrupt: 

## 4. Results

In [None]:
for plot_fn, suffix in [
    (plot_optimization_history, "history"),
    (plot_param_importances, "importances"),
    (plot_parallel_coordinate, "parallel"),
]:
    plot_fn(study)
    plt.tight_layout()
    plt.savefig(f"{OUT_DIR}/{EXP_NAME}_optuna_{suffix}.png", dpi=300)
    plt.show()

best = study.best_trial
best_params = best.params
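# Pooling is fixed in this experiment, so "pooling" is absent from trial.params; fall back to POOLING.
pooling = POOLING_MAP[best_params["pooling"]] if "pooling" in best_params else POOLING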

model = PCLDeBERTa(
    hidden_dim=best_params["hidden_dim"],
    dropout_rate=best_params.get("dropout_rate", 0.0),
    pooling=pooling,
    model_name=MODEL_NAME,
).to(DEVICE)

state_dict = torch.load(
    os.path.join(OUT_DIR, f"exp_{EXP_NAME}_best_model.pt"), map_location=DEVICE
)
model.load_state_dict(state_dict)

_, _, dev_loader = make_dataloaders(
    train_sub_df, val_sub_df, dev_df, MICRO_BATCH_SIZE, MAX_LENGTH, tokeniser
)
dev_metrics = evaluate(model, DEVICE, dev_loader, threshold=best.user_attrs["best_threshold"])

print(f"\n{'='*60}")
print(f"{EXP_NAME.upper()} — Dev Set Results (threshold={best.user_attrs['best_threshold']:.3f})")
print(f"Model: {MODEL_NAME}")
print(f"Effective batch: {MICRO_BATCH_SIZE} × {ACCUMULATION_STEPS} = {MICRO_BATCH_SIZE*ACCUMULATION_STEPS}")
print(f"8-bit AdamW: {USE_8BIT_ADAM}")
print(f"{'='*60}")
print(classification_report(dev_metrics["labels"], dev_metrics["preds"], target_names=["Non-PCL", "PCL"]))
print("Best hyperparams:")
for k, v in best_params.items():
    print(f"  {k}: {v}")

del model
gc.collect()
torch.cuda.empty_cache()