# SemEval 2026 Task 5 — Transformers Training Notebook

This notebook trains a **Transformer** model to predict plausibility scores (1–5) for word senses in narrative contexts.

**Data**: `semeval26-05-scripts/data/train.json` and `semeval26-05-scripts/data/dev.json`

**Output**: writes `predictions.jsonl` to the project root, and also writes a copy to `semeval26-05-scripts/input/res/predictions.jsonl` so the official scorer can be run locally.

Metrics reported:
- Spearman correlation (integer predictions vs. gold average)
- Accuracy within standard deviation (same logic as the official scorer)
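
To make the second metric concrete, here is a minimal standard-library sketch of the acceptance rule that the notebook's `is_within_standard_deviation` implements later: a prediction counts as correct if it lies strictly inside one sample standard deviation of the mean rating, or strictly within 1 point of the mean. The helper name `within_sd` and the rating lists are made up for illustration and are not taken from the dataset.

```python
import statistics


def within_sd(prediction: int, ratings: list[int]) -> bool:
    """Mirror of the notebook's acceptance rule for a single item."""
    avg = sum(ratings) / len(ratings)
    sd = statistics.stdev(ratings)  # sample standard deviation (n - 1)
    # Accept if strictly inside (avg - sd, avg + sd),
    # or strictly closer than 1 point to the average.
    return (avg - sd) < prediction < (avg + sd) or abs(avg - prediction) < 1


# Hypothetical annotator ratings: mean 3.6, sample stdev about 1.67.
spread = [4, 5, 3, 1, 5]
# Hypothetical unanimous ratings: mean 3.0, stdev 0.0.
unanimous = [3, 3, 3, 3, 3]

examples = [(2, spread), (1, spread), (3, unanimous), (4, unanimous)]
results = [within_sd(p, r) for p, r in examples]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

With unanimous ratings the standard-deviation band collapses, so only the 1-point tolerance applies; a prediction exactly 1 point away from the mean is rejected because both comparisons are strict.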

## 1) Setup
Configure paths and import libraries.

In [25]:
# If you haven't installed dependencies yet, run:
# %pip install -q transformers datasets accelerate evaluate scipy

# GPU PyTorch (Windows + NVIDIA):
# Run this in a *terminal* (recommended), then RESTART the notebook kernel:
#   D:/Fac/Fac/RN/CARN_project/.venv/Scripts/python.exe -m pip install --upgrade --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

from __future__ import annotations

import json
import statistics
import sys
from pathlib import Path
from typing import Any

import numpy as np
from scipy.stats import spearmanr

import torch
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    set_seed,
)

PROJECT_ROOT = Path.cwd()
DATA_DIR = PROJECT_ROOT / 'semeval26-05-scripts' / 'data'
TRAIN_JSON = DATA_DIR / 'train.json'
DEV_JSON = DATA_DIR / 'dev.json'

assert TRAIN_JSON.exists(), f'Missing: {TRAIN_JSON}'
assert DEV_JSON.exists(), f'Missing: {DEV_JSON}'

print('Python:', sys.executable)
print('Torch:', torch.__version__)
print('Torch CUDA build:', torch.version.cuda)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))

print('Train file:', TRAIN_JSON)
print('Dev file:', DEV_JSON)

Python: d:\Fac\Fac\RN\CARN_project\.venv\Scripts\python.exe
Torch: 2.9.1+cu130
Torch CUDA build: 13.0
CUDA available: True
GPU: NVIDIA GeForce RTX 4060 Laptop GPU
Train file: d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\data\train.json
Dev file: d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\data\dev.json


## 2) Data loading
Load the JSON files and convert them into flat examples.

In [26]:
def load_split(path: Path) -> dict[str, dict[str, Any]]:
    with path.open('r', encoding='utf-8') as f:
        return json.load(f)

def iter_sorted_items(raw: dict[str, dict[str, Any]]):
    for k in sorted(raw.keys(), key=lambda x: int(x)):
        yield k, raw[k]

train_raw = load_split(TRAIN_JSON)
dev_raw = load_split(DEV_JSON)

print('Train samples:', len(train_raw))
print('Dev samples:', len(dev_raw))
print('Example fields:', list(next(iter(train_raw.values())).keys()))

Train samples: 2280
Dev samples: 588
Example fields: ['homonym', 'judged_meaning', 'precontext', 'sentence', 'ending', 'choices', 'average', 'stdev', 'nonsensical', 'sample_id', 'example_sentence']


## 3) Data preprocessing
Build the model input text and labels.

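We fine-tune a Transformer **classifier** to predict an integer score (1–5).
Each input is plain structured text with labeled fields (`Story`, `Word`, `Meaning`, `Example`). No instruction-style prompt or question is added, because the model is an encoder rather than a generative LLM. The gold average is rounded, clipped to 1–5, and shifted to a 0-indexed class label.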
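
The label mapping can be sketched with the standard library alone. This version uses `min`/`max` instead of `np.clip`, which is an equivalent stand-in for illustration; `class_to_score` is a hypothetical helper name for the inverse shift applied at prediction time. One thing worth noticing is that Python's built-in `round` rounds halves to the nearest even integer, so averages of 2.5 and 3.5 land on 2 and 4 respectively.

```python
def avg_to_class(avg: float) -> int:
    """Round the average rating, clip to 1..5, and shift to a 0-indexed class."""
    score = min(max(round(avg), 1), 5)
    return score - 1


def class_to_score(cls: int) -> int:
    """Inverse shift used at prediction time: class 0..4 -> score 1..5."""
    return cls + 1


for avg in (1.3333, 2.5, 3.5, 3.6, 5.0):
    cls = avg_to_class(avg)
    print(f"average={avg} -> class={cls} -> score={class_to_score(cls)}")
```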

In [27]:
def build_input_text(sample: dict[str, Any]) -> str:
    """Build structured input text - just facts, no instruction prompts."""
    # Extract context
    pre = str(sample.get('precontext', '')).strip()
    sent = str(sample.get('sentence', '')).strip()
    end = str(sample.get('ending', '')).strip()
    
    # Extract sense information
    hom = str(sample.get('homonym', '')).strip()
    meaning = str(sample.get('judged_meaning', '')).strip()
    ex = str(sample.get('example_sentence', '')).strip()
    
    # Simple concatenation - no LLM-style prompts since RoBERTa/DeBERTa are encoders, not LLMs
    return (
        f"Story: {pre} {sent} {end}\n"
        f"Word: {hom}\n"
        f"Meaning: {meaning}\n"
        f"Example: {ex}"
    )

def clip_round_to_1_5(x: float) -> int:
    """Round and clip float to integer in range [1, 5]."""
    return int(np.clip(int(round(float(x))), 1, 5))

def avg_to_class(avg: float) -> int:
    """Convert average score to class label (0-indexed)."""
    return clip_round_to_1_5(avg) - 1

# Prepare training data
train_ids: list[str] = []
train_texts: list[str] = []
train_labels_cls: list[int] = []

for k, s in iter_sorted_items(train_raw):
    avg = float(s['average'])
    train_ids.append(k)
    train_texts.append(build_input_text(s))
    train_labels_cls.append(avg_to_class(avg))

# Prepare dev data
dev_ids: list[str] = []
dev_texts: list[str] = []
dev_avg: list[float] = []
dev_labels_cls: list[int] = []
dev_choices: list[list[int]] = []

for k, s in iter_sorted_items(dev_raw):
    avg = float(s['average'])
    dev_ids.append(k)
    dev_texts.append(build_input_text(s))
    dev_avg.append(avg)
    dev_labels_cls.append(avg_to_class(avg))
    dev_choices.append(list(map(int, s['choices'])))

print('Train size:', len(train_ids), 'Dev size:', len(dev_ids))
print('Gold rounded dev distribution:', {i: sum(clip_round_to_1_5(a) == i for a in dev_avg) for i in range(1, 6)})
print('\nSample input:\n', train_texts[0][:300])


Train size: 2280 Dev size: 588
Gold rounded dev distribution: {1: 68, 2: 133, 3: 145, 4: 147, 5: 95}

Sample input:
 Story: The old machine hummed in the corner of the workshop. Clara examined its dusty dials with a furrowed brow. She wondered if it could be brought back to life. The potential couldn't be measured. She collected a battery reader and looked on earnestly, willing some life back into the old machine.


## 4) Tokenization
Tokenize the text with a pretrained Transformer tokenizer.
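
One detail of this step is that padding is deferred to batch time: the tokenizer call truncates to `MAX_LENGTH` but does not pad, and `DataCollatorWithPadding` later pads each batch to its longest sequence (rounded up to a multiple of 8 when CUDA is available). The following plain-Python sketch illustrates that idea only. `pad_batch`, the token ids, and the right-side padding are assumptions made for the example; this is not the library's implementation.

```python
def pad_batch(sequences, pad_id=0, multiple_of=8):
    """Pad token-id lists to the batch's longest length, rounded up to `multiple_of`."""
    target = max(len(seq) for seq in sequences)
    if multiple_of:
        # Round up to the next multiple, e.g. 11 -> 16 for multiple_of=8.
        target = -(-target // multiple_of) * multiple_of
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Two hypothetical tokenized examples of lengths 5 and 11.
batch = [[101, 7, 8, 9, 102], [101] + [5] * 9 + [102]]
ids, mask = pad_batch(batch)
print(len(ids[0]), len(ids[1]), sum(mask[0]), sum(mask[1]))
```

Because the target length depends on each batch rather than on `MAX_LENGTH`, batches of short examples stay short, which is where most of the speed benefit comes from.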

In [28]:
# Model and hyperparameters (smaller + faster)
# DeBERTa-v3-base is a strong mid-size encoder that trains faster and needs less VRAM than roberta-large.
MODEL_NAME = 'microsoft/deberta-v3-base'

# Speed knobs (biggest impact):
MAX_LENGTH = 256  # lower values (128/192) train much faster but may truncate longer stories
LEARNING_RATE = 1e-5
NUM_EPOCHS = 5  # kept small for speed; increase if you want to trade time for accuracy
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.06
SEED = 42

set_seed(SEED)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

train_ds = Dataset.from_dict({
    'id': train_ids,
    'text': train_texts,
    'labels': train_labels_cls,
})
dev_ds = Dataset.from_dict({
    'id': dev_ids,
    'text': dev_texts,
    'labels': dev_labels_cls,
})

def tokenize_batch(batch):
    return tokenizer(batch['text'], truncation=True, max_length=MAX_LENGTH)

train_tok = train_ds.map(tokenize_batch, batched=True, remove_columns=['text'])
dev_tok = dev_ds.map(tokenize_batch, batched=True, remove_columns=['text'])

# Helps Tensor Cores on NVIDIA GPUs (faster matmul when padding aligns)
USE_CUDA = torch.cuda.is_available()
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    pad_to_multiple_of=8 if USE_CUDA else None,
)

print(f'Model: {MODEL_NAME} | MAX_LENGTH={MAX_LENGTH} | epochs={NUM_EPOCHS} | lr={LEARNING_RATE}')
print(train_tok)
print(dev_tok)


Map: 100%|██████████| 2280/2280 [00:00<00:00, 19407.36 examples/s]
Map: 100%|██████████| 588/588 [00:00<00:00, 16122.24 examples/s]

Model: microsoft/deberta-v3-base | MAX_LENGTH=256 | epochs=5 | lr=1e-05
Dataset({
    features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2280
})
Dataset({
    features: ['id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 588
})





## 5) Model + training
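We fine-tune the pretrained model as a **5-class classifier**: class indices 0–4 correspond to scores 1–5.
Training uses a class-weighted cross-entropy loss and keeps the checkpoint with the best `acc_within_sd` on dev. The encoder is set by `MODEL_NAME` (here `microsoft/deberta-v3-base`); on a GPU, a larger encoder may improve both the loss and the spread of predicted classes, at the cost of training time and VRAM.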
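
The training cell weights the cross-entropy loss by inverse class frequency raised to the power 0.75, normalized so that the weights average to 1. Below is a small NumPy-free sketch of that weighting. The class counts are made up for illustration (the real ones come from the training labels), and `class_weights` is a hypothetical helper name.

```python
def class_weights(counts: list[int], power: float = 0.75) -> list[float]:
    """Inverse-frequency weights, softened by `power` and normalized to mean 1."""
    raw = [1.0 / (c ** power) for c in counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


# Hypothetical number of training examples per class 0..4.
counts = [100, 400, 500, 500, 300]
weights = class_weights(counts)
for cls, (c, w) in enumerate(zip(counts, weights)):
    print(f"class {cls}: count={c} weight={w:.3f}")
```

An exponent below 1 softens the correction: rare classes are still up-weighted, but less aggressively than with plain inverse frequency.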

In [31]:
# Train, selecting the best checkpoint by acc_within_sd (official-style metric)

# Assumes you already ran the earlier cells that define:
# - tokenizer, train_tok, dev_tok, data_collator
# - dev_avg, dev_choices (for metrics)
# - MODEL_NAME, LEARNING_RATE, NUM_EPOCHS, WEIGHT_DECAY, WARMUP_RATIO, SEED

import statistics
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
from scipy.stats import spearmanr
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

use_cuda = torch.cuda.is_available()
print('CUDA:', use_cuda)
if use_cuda:
    print('GPU:', torch.cuda.get_device_name(0))
    # TF32 speeds up matmul on Ampere+ GPUs with minimal accuracy impact
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    try:
        torch.set_float32_matmul_precision('high')
    except Exception:
        pass

# --- Metrics (same logic as official scorer)
def is_within_standard_deviation(prediction: int, labels: list[int]) -> bool:
    avg = sum(labels) / len(labels)
    stdev = statistics.stdev(labels)
    if (avg - stdev) < prediction < (avg + stdev):
        return True
    if abs(avg - prediction) < 1:
        return True
    return False

def compute_metrics(eval_pred):
    logits, _ = eval_pred
    pred_class = np.argmax(logits, axis=-1)
    pred_int = (pred_class + 1).tolist()
    spearman_corr, _ = spearmanr(pred_int, np.asarray(dev_avg, dtype=float))
    acc_within_sd = sum(
        is_within_standard_deviation(p, choices)
        for p, choices in zip(pred_int, dev_choices)
    ) / len(dev_choices)
    return {
        # spearmanr returns NaN when all predictions are identical; report 0.0 in that case
        'spearman_int_vs_avg': 0.0 if np.isnan(spearman_corr) else float(spearman_corr),
        'acc_within_sd': float(acc_within_sd),
    }

# --- Class-weighted loss (helps with label imbalance)
counts = Counter(train_tok['labels'])
freq = np.asarray([counts.get(i, 1) for i in range(5)], dtype=np.float32)
w = (1.0 / (freq ** 0.75))
w = w / w.mean()
class_weights_t = torch.tensor(w, dtype=torch.float32)

_class_weights_cache = {}
def compute_loss_func(outputs, labels, num_items_in_batch=None):
    logits = outputs.get('logits')
    device = logits.device
    w_dev = _class_weights_cache.get(device)
    if w_dev is None:
        w_dev = class_weights_t.to(device)
        _class_weights_cache[device] = w_dev
    loss_fct = nn.CrossEntropyLoss(weight=w_dev)
    return loss_fct(logits.view(-1, 5), labels.view(-1))

# --- Model
# IMPORTANT: do NOT load weights in float16 here. When fp16=True, Trainer uses AMP with a GradScaler,
# and loading fp16 weights can trigger: "Attempting to unscale FP16 gradients".
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=5,
    problem_type='single_label_classification',
)

# Batch settings (fast, but safe)
PER_DEVICE_TRAIN_BS = 8 if use_cuda else 2
GRAD_ACCUM = 1 if use_cuda else 4

training_args = TrainingArguments(
    output_dir=str(PROJECT_ROOT / 'transformer_runs_best'),
    overwrite_output_dir=True,
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=PER_DEVICE_TRAIN_BS,
    per_device_eval_batch_size=8 if use_cuda else 4,
    gradient_accumulation_steps=GRAD_ACCUM,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    lr_scheduler_type='cosine',
    fp16=use_cuda,
    bf16=False,
    tf32=use_cuda,
    optim='adamw_torch',
    dataloader_pin_memory=use_cuda,
    dataloader_num_workers=0,
    # Train based on acc_within_sd: evaluate/save each epoch and keep the best checkpoint
    eval_strategy='epoch',
    save_strategy='epoch',
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model='acc_within_sd',
    greater_is_better=True,
    logging_strategy='steps',
    logging_steps=200,
    report_to=[],
    seed=SEED,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=dev_tok,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_loss_func=compute_loss_func,
    compute_metrics=compute_metrics,
)

print(f"Training {MODEL_NAME} (select best by acc_within_sd) | bs={PER_DEVICE_TRAIN_BS} | grad_accum={GRAD_ACCUM} | epochs={NUM_EPOCHS} ...")
trainer.train()

# Evaluate best checkpoint (Trainer will have loaded it)
eval_result = trainer.evaluate()
print('\n=== Final Evaluation (best checkpoint) ===')
acc_pct = eval_result['eval_acc_within_sd'] * 100
print(f"Accuracy within SD: {eval_result['eval_acc_within_sd']:.4f} ({acc_pct:.1f}%)")
print(f"Spearman correlation: {eval_result['eval_spearman_int_vs_avg']:.4f}")


CUDA: True
GPU: NVIDIA GeForce RTX 4060 Laptop GPU


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.


Training microsoft/deberta-v3-base (select best by acc_within_sd) | bs=8 | grad_accum=1 | epochs=5 ...


| Epoch | Training Loss | Validation Loss | Spearman Int Vs Avg | Acc Within Sd |
|---|---|---|---|---|
| 1 | 1.61 | 1.605275 | 0.0 | 0.569728 |
| 2 | 1.6123 | 1.586541 | 0.256573 | 0.605442 |
| 3 | 1.4389 | 1.485641 | 0.464417 | 0.634354 |
| 4 | 1.2687 | 1.558559 | 0.500968 | 0.656463 |
| 5 | 1.0817 | 1.576891 | 0.504187 | 0.642857 |





=== Final Evaluation (best checkpoint) ===
Accuracy within SD: 0.6565 (65.6%)
Spearman correlation: 0.5010
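
The final numbers match epoch 4 rather than the last epoch because the checkpoint is selected by `acc_within_sd`. The sketch below replays that selection on the per-epoch values from the training log above. It only illustrates the selection rule, not the `Trainer` internals, and the variable names are hypothetical.

```python
# (epoch, validation loss, acc_within_sd) copied from the training log.
log = [
    (1, 1.605275, 0.569728),
    (2, 1.586541, 0.605442),
    (3, 1.485641, 0.634354),
    (4, 1.558559, 0.656463),
    (5, 1.576891, 0.642857),
]

# Selection used in this notebook: highest acc_within_sd.
best_by_acc = max(log, key=lambda row: row[2])
# For comparison: what selecting by lowest validation loss would pick.
best_by_loss = min(log, key=lambda row: row[1])

print("best by acc_within_sd: epoch", best_by_acc[0])
print("best by validation loss: epoch", best_by_loss[0])
```

The two criteria disagree here (epoch 4 versus epoch 3), which is why `metric_for_best_model` matters when the reported metric is not the training loss.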


## 6) Generate predictions.jsonl
Export dev predictions in the SemEval format: one JSON object per line, with a string `id` and an integer `prediction` in [1..5].
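
Before running the scorer, it can help to sanity-check the written file. The following is a minimal standard-library sketch that mirrors what this notebook writes (string `id`, integer `prediction` between 1 and 5). `validate_predictions` is a hypothetical helper, and the official scorer's own checks are not reproduced here.

```python
import json
import tempfile
from pathlib import Path


def validate_predictions(path: Path) -> int:
    """Return the number of valid lines; raise ValueError on the first bad one."""
    n = 0
    with path.open('r', encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            pred = record.get('prediction')
            if not isinstance(record.get('id'), str):
                raise ValueError(f'line {line_no}: id must be a string')
            # type(...) is int rejects booleans and floats such as 4.0.
            if type(pred) is not int or not 1 <= pred <= 5:
                raise ValueError(f'line {line_no}: prediction must be an integer in 1..5')
            n += 1
    return n


# Demo on a temporary file with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'predictions.jsonl'
    out_path.write_text(
        '{"id": "0", "prediction": 4}\n{"id": "1", "prediction": 3}\n',
        encoding='utf-8',
    )
    n_valid = validate_predictions(out_path)

print(n_valid)
```

In the notebook, the same function could be pointed at `out_predictions_root` after the file has been written.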

In [32]:
pred_out = trainer.predict(dev_tok)
logits = pred_out.predictions  # (N,5)
pred_class = np.argmax(logits, axis=-1)
pred_int = (pred_class + 1).tolist()

print('Logits stats (min/max):', float(np.min(logits)), float(np.max(logits)))
print('Prediction distribution:', {i: pred_int.count(i) for i in range(1, 6)})
print('Gold (rounded avg) distribution:', {i: sum(clip_round_to_1_5(a) == i for a in dev_avg) for i in range(1, 6)})

# Quick sanity-check: show first few (id, gold avg, pred)
for sample_id, gold_avg, pred in list(zip(dev_ids, dev_avg, pred_int))[:10]:
    print(sample_id, 'gold_avg=', gold_avg, 'pred=', pred)

def write_predictions_jsonl(ids: list[str], preds: list[int], out_path: Path) -> None:
    assert len(ids) == len(preds)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open('w', encoding='utf-8', newline='\n') as f:
        for sample_id, pred in zip(ids, preds):
            f.write(json.dumps({'id': str(sample_id), 'prediction': int(pred)}, ensure_ascii=False) + '\n')

# Primary output: predictions.jsonl in the project root
out_predictions_root = PROJECT_ROOT / 'predictions.jsonl'
write_predictions_jsonl(dev_ids, pred_int, out_predictions_root)
print('Wrote predictions:', out_predictions_root)

# Optional: also write where the SemEval scripts expect it
out_predictions_scorer = PROJECT_ROOT / 'semeval26-05-scripts' / 'input' / 'res' / 'predictions.jsonl'
write_predictions_jsonl(dev_ids, pred_int, out_predictions_scorer)
print('Also wrote predictions for scoring:', out_predictions_scorer)

Logits stats (min/max): -2.23828125 2.15234375
Prediction distribution: {1: 29, 2: 118, 3: 159, 4: 177, 5: 105}
Gold (rounded avg) distribution: {1: 68, 2: 133, 3: 145, 4: 147, 5: 95}
0 gold_avg= 3.6 pred= 4
1 gold_avg= 3.6 pred= 4
2 gold_avg= 3.8 pred= 3
3 gold_avg= 4.2 pred= 5
4 gold_avg= 3.0 pred= 3
5 gold_avg= 3.0 pred= 5
6 gold_avg= 4.6 pred= 5
7 gold_avg= 1.3333333333333333 pred= 4
8 gold_avg= 2.2 pred= 4
9 gold_avg= 3.8 pred= 4
Wrote predictions: d:\Fac\Fac\RN\CARN_project\predictions.jsonl
Also wrote predictions for scoring: d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\input\res\predictions.jsonl


## 7) (Optional) Run official scorer
This validates formatting and reports official metrics on dev.
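
The cell for this step prints the scorer's stdout and stderr but does not inspect its exit status. A hedged variant is to check `returncode` and fail loudly. Whether the official `scoring.py` signals problems through a nonzero exit code is an assumption, and the command below is a stand-in so that the sketch runs anywhere.

```python
import subprocess
import sys

# Stand-in command (assumption): in the notebook this would be the scorer `cmd` list.
demo_cmd = [sys.executable, '-c', "print('Accuracy: demo')"]

result = subprocess.run(demo_cmd, capture_output=True, text=True)
if result.returncode != 0:
    # Surface stderr so a formatting or path problem is not silently ignored.
    raise RuntimeError(
        f'command failed with exit code {result.returncode}:\n{result.stderr}'
    )
print(result.stdout.strip())
```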

In [33]:
import sys
import subprocess

scoring_script = PROJECT_ROOT / 'semeval26-05-scripts' / 'scoring.py'
gold = PROJECT_ROOT / 'semeval26-05-scripts' / 'input' / 'ref' / 'solution.jsonl'
preds = PROJECT_ROOT / 'semeval26-05-scripts' / 'input' / 'res' / 'predictions.jsonl'
scores_out = PROJECT_ROOT / 'semeval26-05-scripts' / 'output' / 'scores.json'
scores_out.parent.mkdir(parents=True, exist_ok=True)

cmd = [
    str(Path(sys.executable)),
    str(scoring_script),
    str(gold),
    str(preds),
    str(scores_out),
]

print('Running:', ' '.join(cmd))
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
print(result.stderr)
print('Scores JSON:', scores_out)

Running: d:\Fac\Fac\RN\CARN_project\.venv\Scripts\python.exe d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\scoring.py d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\input\ref\solution.jsonl d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\input\res\predictions.jsonl d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\output\scores.json
Importing...
Starting Scoring script...
Everything looks OK. Evaluating file d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\input\res\predictions.jsonl on d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\input\ref\solution.jsonl
----------
Spearman Correlation: 0.5009680615834642
Spearman p-Value: 1.1072815287587996e-38
----------
Accuracy: 0.6564625850340136 (386/588)
Results dumped into scores.json successfully.


Scores JSON: d:\Fac\Fac\RN\CARN_project\semeval26-05-scripts\output\scores.json
